Brisk guide to Mathematics
Jan Slovák
and
Martin Panák, Michal Bulant, Vladimir Ejov, Ray Booth
Brno, Adelaide, 2018
Authors:
Ray Booth Michal Bulant Vladimir Ezhov Martin Panák Jan Slovák
With further help of:
Aleš Návrat Michal Veselý
Graphics and illustrations:
Petra Rychlá
2018 Masaryk University, Flinders University
Contents - practice
Contents - theory
Chapter 1.   Initial warmup 4
A. Numbers and functions 4
B. Difference equations 10
C. Combinatorics 14
D. Probability 18
E. Plane geometry 28
F. Relations and mappings 41
G. Additional exercises for the whole chapter 48
Chapter 2.   Elementary linear algebra 71
A. Systems of linear equations and matrix manipulation 71
B. Permutations and determinants 86
C. Vector spaces, examples 95
D. Linear (in)dependence 98
E. Linear mappings 106
F. Inner products and linear maps 115
G. Eigenvalues and eigenvectors 119
H. Additional exercises for the whole chapter 127
Chapter 3.   Linear models and matrix calculus 142
A. Linear optimization 142
B. Difference equations 149
C. Population models 157
D. Markov processes 164
E. Unitary spaces 170
F. Matrix decompositions 174
G. Additional exercises for the whole chapter 204
Chapter 4.   Analytic geometry 231
A. Affine geometry 231
B. Euclidean geometry 240
C. Geometry of quadratic forms 256
D. Further exercise on this chapter 270
Chapter 5.   Establishing the ZOO 278
A. Polynomial interpolation 278
B. Topology of real numbers and their subsets 287
C. Limits 289
D. Continuity of functions 306
E. Derivatives 309
F. Extremal problems 315
G. L'Hospital's rule 329
H. Infinite series 335
I. Power series 341
J. Additional exercises for the whole chapter 346
Chapter 1.   Initial warmup 4
1. Numbers and functions 4
2. Difference equations 10
3. Combinatorics 14
4. Probability 18
5. Plane geometry 27
6. Relations and mappings 41
Chapter 2.   Elementary linear algebra 71
1. Vectors and matrices 71
2. Determinants 84
3. Vector spaces and linear mappings 95
4. Properties of linear mappings 115
Chapter 3.   Linear models and matrix calculus 142
1. Linear optimization 142
2. Difference equations 150
3. Iterated linear processes 160
4. More matrix calculus 169
5. Decompositions of the matrices and pseudoinversions 193
Chapter 4.   Analytic geometry 231
1. Affine and Euclidean geometry 231
2. Geometry of quadratic forms 253
3. Projective geometry 260
Chapter 5.   Establishing the ZOO 278
1. Polynomial interpolation 278
2. Real numbers and limit processes 290
3. Derivatives 313
4. Infinite sums and power series 327
Chapter 6.   Differential and integral calculus 372
1. Differentiation 372
2. Integration 392
3. Sequences, series and limit processes 418
Chapter 7.   Continuous tools for modelling 450
1. Fourier series 450
2. Integral operators 472
3. Metric spaces 483
Chapter 8.   Calculus with more variables 510
1. Functions and mappings on $\mathbb{R}^n$ 510
2. Integration for the second time 548
3. Differential equations 562
Chapter 6.   Differential and integral calculus 372
A. Derivatives of higher orders 372
B. Integration 392
C. Power series 426
D. Extra examples for the whole chapter 440
Chapter 7.   Continuous tools for modelling 450
A. Orthogonal systems of functions 450
B. Fourier series 454
C. Convolution and Fourier Transform 470
D. Laplace Transform 483
E. Metric spaces 485
F. Convergence 492
G. Topology 497
H. Additional exercises to the whole chapter 502
Chapter 8.   Calculus with more variables 510
A. Multivariate functions 510
B. The topology of En 513
C. Limits and continuity of multivariate functions 515
D. Tangent lines, tangent planes, graphs of multivariate functions 517
E. Taylor polynomials 526
F. Extrema of multivariate functions 527
G. Implicitly given functions and mappings 532
H. Constrained optimization 534
I. Volumes, areas, centroids of solids 549
J. First-order differential equations 567
K. Practical problems leading to differential equations 578
L.   Higher-order differential equations 580
M.   Applications of the Laplace transform 588
N.   Numerical solution of differential equations 591
O. Additional exercises to the whole chapter 595
Chapter 9.   Continuous models - further selected topics 606
A. Exterior differential calculus 606
B. Applications of Stokes' theorem 606
C. Equation of heat conduction 612
Chapter 10.   Statistics and probability methods 660
A. Dots, lines, rectangles 660
B. Visualization of multidimensional data 669
C. Classical and conditional probability 672
D. What is probability? 680
E. Random variables, density, distribution function 683
F. Expected value, correlation 694
G. Transformations of random variables 699
H. Inequalities and limit theorems 701
I. Testing samples from the normal distribution 707
J. Linear regression 718
K. Bayesian data analysis 720
L. Processing of multidimensional data 725
Chapter 11.   Number theory 737
A. Basic properties of divisibility 737
B. Congruences 742
Chapter 9.   Continuous models - further selected topics 606
1. Exterior differential calculus and integration 606
2. Remarks on Partial Differential Equations 630
3. Remarks on Variational Calculus 659
4. Complex Analytic Functions 659
Chapter 10.   Statistics and probability theory 660
1. Descriptive statistics 660
2. Probability 672
3. Mathematical statistics 718
Chapter 11.   Elementary number theory 737
1. Fundamental concepts 737
2. Primes 742
3. Congruences and basic theorems 749
4. Solving congruences and systems of them 759
5. Applications - calculation with large integers, cryptography 774
Chapter 12.   Algebraic structures 796
1. Posets and Boolean algebras 796
2. Polynomial rings 811
3. Groups 826
4. Coding theory 846
5. Systems of polynomial equations 853
Chapter 13.   Combinatorial methods, graphs, and algorithms 877
1. Elements of Graph theory 877
2. A few graph algorithms 903
3. Remarks on Computational Geometry 925
4. Remarks on more advanced combinatorial calculations 943
C. Solving congruences 759
D. Diophantine equations 777
E. Primality tests 785
F. Encryption 789
G. Additional exercises to the whole chapter 793
Chapter 12.   Algebraic structures 796
A. Boolean algebras and lattices 796
B. Rings 803
C. Polynomial rings 805
D. Rings of multivariate polynomials 811
E. Algebraic structures 816
F. Groups 819
G. Burnside's lemma 837
H. Codes 841
I. Extension of the stereographic projection 848
J. Elliptic curve 849
K. Gröbner bases 853
Chapter 13.   Combinatorial methods, graphs, and algorithms 877
A. Fundamental concepts 877
B. Fundamental algorithms 883
C. Minimum spanning tree 894
D. Flow networks 896
E. Classical probability and combinatorics 900
F. More advanced problems from combinatorics 904
G. Probability in combinatorics 906
H. Combinatorial games 914
I. Generating functions 917
J. Additional exercises to the whole chapter 957
Preface
The motivation for this textbook came from many years of lecturing Mathematics at the Faculty of Informatics at the Masaryk University in Brno. The programme requires an introduction to genuine mathematical thinking and precision. The endeavor has been pursued by Jan Slovák and Martin Panák since 2004, with further collaborators joining later. Our goal was to cover seriously, but quickly, about as much of mathematical methods as is usually seen in bigger courses in the classical Science and Technology programmes. At the same time, we did not want to give up the completeness and correctness of the mathematical exposition. We wanted to introduce and explain the more demanding parts of Mathematics together with elementary explicit examples of how to use the concepts and results in practice. But we did not want to decide how much theory or practice the reader should enjoy, and in which order.
All these requirements have led us to the two-column format of the textbook, where the theoretical explanation on one side and the practical procedures and exercises on the other side are split. This way, we want to encourage and help the readers to find their own way: either to go through the examples and algorithms first, and then to come to the explanations of why things work, or the other way round. We also hope to overcome the usual stress of readers horrified by the amount of material. With our text, they are not supposed to read through the book in a linear order. On the contrary, the readers should enjoy browsing through the text and finding their own thrilling paths through the new mathematical landscapes.
In both columns, we intend to present rather standard exposition of basic Mathematics, but focusing on the essence of the concepts and their relations. The exercises are addressing simple mathematical problems but we also try to show the exploitation of mathematical models in practice as much as possible.
We are aware that the text is written in a very compact and non-homogeneous way. A lot of details are left to the readers, in particular in the more difficult paragraphs, while we try to provide a lot of simple intuitive explanation when introducing new concepts or formulating important theorems. Similarly, the examples range from very simple ones to those requiring independent thinking.
We would very much like to help the reader:
• to formulate precise definitions of basic concepts and to prove simple mathematical results;
• to perceive the meaning of roughly formulated properties, relations and outlooks for exploring mathematical tools;
• to understand the instructions and algorithms underlying mathematical models and to appreciate their usage.
These goals are ambitious and there are no simple paths reaching them without failures on the way. This is one of the reasons why we come back to basic ideas and concepts several times with growing complexity and width of the discussions. Of course, this might also look chaotic, but we very much hope that this approach gives a better chance to those who will persist in their efforts. We also hope that this textbook will be a perfect beginning and help for everybody who is ready to think and who is ready to return to earlier parts again and again.
To make the task simpler and more enjoyable, we have added what we call "emotive icons". We hope they will enliven the dry mathematical text and indicate which parts should be read more carefully, or better left out in the first round.
The usage of the icons follows the feelings of the authors and we tried to use them in a systematic way. We hope the readers will assign the meaning to icons individually. Roughly speaking, we are using icons to indicate complexity, difficulty etc.:
Further icons indicate unpleasant technicality and the need for patience, or possible entertainment and pleasure:
Similarly, we use various icons in the practical column:
The practical column with the solved problems and exercises should be readable nearly independently of the theory. Without the ambition to know the deeper reasons why the algorithms work, it should be possible to read mainly just this column. In order to help such readers, some definitions and descriptions in the theoretical text are marked in order to catch the eyes easily when reading the exercises. The exercises and theory are partly coordinated to allow jumping there and back, but the links are not tight. The numbering in the two columns is distinguished by using the different numberings of sections, i.e. those like 1.2.1 belong to the theoretical column, while 1.B.4 points to the practical column. The equations are numbered within subsections and their quotes include the subsection numbers if necessary.
In general, our approach stresses the fact that the methods of so-called discrete Mathematics seem to be more important for mathematical models nowadays. They also seem simpler to perceive and grasp.
However, the continuous methods are strictly necessary too. First of all, the classical continuous mathematical analysis is essential for understanding of convergence and robustness of computations. It is hard to imagine how to deal with error estimates and computational complexity of numerical processes without it. Moreover, the continuous models are often the efficient and effectively computable approximations to discrete problems coming from practice.
The rough structure of the book and the dependencies between its chapters are depicted in the diagram below. The darker the color is, the more demanding is the particular chapter (or at least its essential parts). In particular, the chapters 7 and 9 include a lot of material which would perhaps not be covered in the regular course activities or required at exams in great detail. The solid arrows mean strong dependencies, while the dashed links indicate only partial dependencies. In particular, the textbook could support courses starting with any of the white boxes, i.e. aiming at standard linear algebra and geometry (chapters 2 through 4), discrete chapters of mathematics (11 through 13), and the rudiments of Calculus (5, 6, 8).
[Diagram: dependencies between the chapters. Boxes: 1. Initial warmup; 2. Elementary linear algebra; 3. Linear models and matrix calculus; 4. Analytic geometry; 5. Establishing the ZOO; 6. Differential and integral calculus; 7. Continuous tools for modelling; 8. Calculus with more variables; 9. Continuous models - further selected topics; 10. Probability and statistics; 11. Elementary number theory; 12. Algebraic structures; 13. Combinatorics, graphs, and algorithms.]
All topics covered in the book have been included (in more or less detail) in our teaching of large four-semester courses of Mathematics, complemented by numerical seminars, since 2005. In our teaching, the first semester covered chapters 1 and 2 and selected topics from chapters 3 and 4. The second semester fully included chapters 5 and 6 and selected topics from chapter 7. The third semester was split into two parts. The first one was covered by chapter 8, while the rest of the semester was devoted to chapter 10 (with only a few glimpses towards the more advanced topics from chapter 9). The last semester provided large parts of the content of chapters 11 through 13, although the entire graph theory was skipped (since it was taught elsewhere). Actually, the second semester could be offered in parallel with the first one, while the fourth semester could follow immediately after the first one. Indeed, some students were advised to go for the second and fourth semester simultaneously (those in the IT security programme).
CHAPTER 1
Initial warmup
"value, difference, position "
- what it is and how to comprehend it?
A. Numbers and functions
We can already work with natural, integer, rational and real numbers. We explain why rational numbers are not sufficient for us (although computers are actually not able to work with any other) and we recall the complex numbers (because even the real numbers are not adequate for some calculations).
1.A.1. Show that the integer 2 does not have a rational square root.
Solution. Already the ancient Greeks knew that if we prescribe the area of a square as $a^2 = 2$, then we cannot find a rational $a$ to satisfy it. Why?
Assume that $(p/q)^2 = 2$ for natural numbers $p$ and $q$ that have no common divisor greater than 1 (otherwise we can further reduce the fraction $p/q$). Then $p^2 = 2q^2$ is an even number. Therefore $p$ is even too, because an odd $p$ would imply the contradiction that $p^2$ is odd. Hence $p^2$ is divisible by 4, so $q^2 = p^2/2$ is even and $q$ must be even too. This implies that $p$ and $q$ both have 2 as a common factor, which is a contradiction. □
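The statement can be illustrated (though of course not proved) by a finite search. The following is a minimal sketch in Python; the function name `has_rational_sqrt_two` and the search bound are illustrative assumptions, not part of the text. It looks for natural numbers $p$, $q$ with $p^2 = 2q^2$, using exact integer arithmetic only.

```python
from math import isqrt


def has_rational_sqrt_two(max_q: int) -> bool:
    """Return True if some natural p, q with q <= max_q satisfy p*p == 2*q*q."""
    for q in range(1, max_q + 1):
        target = 2 * q * q
        p = isqrt(target)  # largest integer p with p*p <= target
        if p * p == target:
            return True
    return False


# No denominator up to the (arbitrarily chosen) bound yields a rational square root of 2.
found = has_rational_sqrt_two(10_000)
print(found)
```

The search uses only integers, so no rounding error can hide a solution; the proof above explains why the search must fail for every bound.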
The goal of this first chapter is to introduce the reader to the fascinating world of mathematical thinking.
The name of this chapter can be also understood as an encouragement for patience. Even the simplest tasks and ideas are easy only for those who have already seen similar ones. A full knowledge of mathematical thinking can be reached only through a long and complicated course of study.
We start with the simplest thing: numbers.
They will also serve as the first example of how mathematical objects and theories are built. The entire first chapter will become a quick tour through various mathematical landscapes (including germs of analysis, combinatorics, probability, geometry).
Perhaps our definitions and ideas will sometimes look too complicated and not practical enough. The simpler the objects and tasks are, the more difficult the mastering of depth and all nuances of the relevant tools and procedures might be. We shall come back to all of the notions again and again in the further chapters and hopefully this will be the crucial step in the ultimate understanding.
Thus the advice: do not worry if you find some particular part of the exposition too formal or otherwise difficult - come back later for another look.
1. Numbers and functions
Since the dawn of time, people have wanted to know "how much" of something they have, "how much" something is worth, "how long" a particular task will take, etc. The answer to such questions is usually some kind of "number". We consider something to be a number if it behaves according to the usual rules - either according to all the rules we accept, or maybe only to some of them. For instance, the result of multiplication does not depend on the order of multiplicands. We have the number zero whose addition to another number does not change the result. We have the number one whose product with another number does not change the result. And so on.
The simplest example of numbers are the positive integers, which we denote $\mathbb{Z}^+ = \{1, 2, 3, \ldots\}$. The natural numbers consist of either just the positive integers, or the positive integers together with the number zero. The number zero is
1.A.2. Remark. It can be proved that for all positive natural numbers $n$ and $x$ the $n$-th root $\sqrt[n]{x}$ of $x$ is either natural or it is not rational, see 1.G.1.
[Figure: diagonals of squares are irrational.]
Next, we work out some examples with complex numbers. If you are not familiar with the basic concepts and properties of complex numbers, consult the paragraphs 1.1.3 through 1.1.4 in the other column.
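1.A.3. Calculate $z_1 + z_2$, $z_1 \cdot z_2$, $\bar{z}_1$, $|z_2|$, $\frac{z_1}{z_2}$ for
a) $z_1 = 1 - 2i$, $z_2 = 4i - 3$;
b) $z_1 = 2$, $z_2 = i$.
Solution.
a) $z_1 + z_2 = 1 - 3 - 2i + 4i = -2 + 2i$, $z_1 \cdot z_2 = 1 \cdot (-3) - 8i^2 + 6i + 4i = 5 + 10i$, $\bar{z}_1 = 1 + 2i$, $|z_2| = \sqrt{4^2 + (-3)^2} = 5$, $\frac{z_1}{z_2} = \frac{z_1 \cdot \bar{z}_2}{|z_2|^2} = \frac{1 \cdot (-3) + 8i^2 + 6i - 4i}{25} = -\frac{11}{25} + \frac{2}{25} i$.
b) $z_1 + z_2 = 2 + i$, $z_1 \cdot z_2 = 2i$, $\bar{z}_1 = 2$, $|z_2| = 1$, $\frac{z_1}{z_2} = \frac{2}{i} = -2i$.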
□
1.A.4. Determine
$$\left| \frac{(2 + 3i)(1 + i\sqrt{3})}{1 - i\sqrt{3}} \right|.$$
Solution. Since the absolute value of the product (ratio) of any two complex numbers is the product (ratio) of their absolute values and every complex number has the same absolute value as its complex conjugate, we have that
either considered to be a natural number, as is usual in computer science, or not a natural number, as is usual in some other contexts. Thus the set of natural numbers is either $\mathbb{Z}^+$, or the set $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$. Counting "one, two, three, ..." is learned by children already at pre-school age. Later on, we meet all the integers $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and finally we get used to floating-point numbers. We know what a 1.19-multiple of the price means if we have a 19% tax.
1.1.1. Properties of numbers. In order to be able to work properly with numbers, we need to be careful with their definition and properties. In mathematics, the basic statements about properties of objects, whose validity is assumed without the need to prove them, are called axioms.
We list the basic properties of the operations of addition and multiplication for our calculations with numbers, which we denote by letters $a, b, c, \ldots$. Both operations work by taking two numbers $a$, $b$. By applying addition or multiplication we obtain the resulting values $a + b$ and $a \cdot b$.
Properties of numbers
Properties of addition:
(CG1) $(a + b) + c = a + (b + c)$, for all $a, b, c$
(CG2) $a + b = b + a$, for all $a, b$
(CG3) there exists $0$ such that for all $a$, $a + 0 = a$
(CG4) for all $a$ there exists $b$ such that $a + b = 0$.
The properties (CG1)-(CG4) are called the properties of a commutative group. They are called respectively associativity, commutativity, the existence of a neutral element (when speaking of addition we usually say zero element), and the existence of an inverse element (when speaking of addition we also say the negative of $a$ and denote it by $-a$).
Properties of multiplication:
(R1) $(a \cdot b) \cdot c = a \cdot (b \cdot c)$, for all $a, b, c$
(R2) $a \cdot b = b \cdot a$, for all $a, b$
(R3) there exists $1$ such that for all $a$, $1 \cdot a = a$
(R4) $a \cdot (b + c) = a \cdot b + a \cdot c$, for all $a, b, c$.
The properties (R1)-(R4) are called respectively associativity, commutativity, the existence of a unit element and distributivity of multiplication over addition. The sets with operations $+$, $\cdot$ that satisfy the properties (CG1)-(CG4) and (R1)-(R4) are called commutative rings. Two further properties of multiplication are:
(F) for every $a \neq 0$ there exists $b$ such that $a \cdot b = 1$.
(ID) if $a \cdot b = 0$, then either $a = 0$ or $b = 0$ or both.
The property (F) is called the existence of an inverse element with respect to multiplication (this element is then denoted by $a^{-1}$). In ordinary arithmetic, this is called the reciprocal of $a$, the same as $1/a$ or $\frac{1}{a}$.
$$\left| \frac{(2 + 3i)(1 + i\sqrt{3})}{1 - i\sqrt{3}} \right| = |2 + 3i| \cdot \frac{|1 + i\sqrt{3}|}{|1 - i\sqrt{3}|} = |2 + 3i| = \sqrt{2^2 + 3^2} = \sqrt{13}.$$
□
1.A.5. Simplify the expression $(5\sqrt{3} + 5i)^n$ for $n = 2$ and $n = 12$.
Solution. Using the binomial theorem for $n = 2$ we get
$$(5\sqrt{3} + 5i)^2 = 75 + 10\sqrt{3} \cdot 5i - 25 = 50 + 50\sqrt{3}\, i.$$
Taking powers one by one or expanding by the binomial theorem would be too time-consuming in the case $n = 12$. Let us rather write the number in polar form
$$5\sqrt{3} + 5i = 10 \left( \frac{\sqrt{3}}{2} + \frac{i}{2} \right) = 10 \left( \cos\frac{\pi}{6} + i \sin\frac{\pi}{6} \right),$$
and using the de Moivre theorem we easily obtain
$$(5\sqrt{3} + 5i)^{12} = 10^{12} \left( \cos\frac{12\pi}{6} + i \sin\frac{12\pi}{6} \right) = 10^{12}.$$
□
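The computation in 1.A.5 can be cross-checked numerically. The sketch below uses Python's standard `cmath` module; the variable names are illustrative, and the comparison is only up to floating-point rounding.

```python
import cmath
import math

z = 5 * math.sqrt(3) + 5j

# Polar form of z: modulus r and argument phi (should be close to 10 and pi/6).
r, phi = cmath.polar(z)

# de Moivre: z**n has modulus r**n and argument n*phi.
square = cmath.rect(r ** 2, 2 * phi)
twelfth = cmath.rect(r ** 12, 12 * phi)

print(r, phi, math.pi / 6)
print(square)    # close to 50 + 50*sqrt(3)*i
print(twelfth)   # close to 10**12, with a tiny imaginary rounding error
```

The small non-zero imaginary part of the last result comes from rounding in `12 * phi`, not from the mathematics.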
1.A.6. Determine the distance $d$ of the numbers $z$, $\bar{z}$ in the complex plane for
„• 3
2 2'
Solution. It is not difficult to realize that complex conjugates are symmetric in the complex plane with respect to the $x$-axis, and that the distance of a complex number from the $x$-axis equals the absolute value of its imaginary part. That gives $d = 2\,|\operatorname{Im}(z)| = 3$. □
1.A.7. Express the number $z_1 = 2 + 3i$ in polar form. Express the number $z_2 = 3(\cos(\pi/3) + i \sin(\pi/3))$ in algebraic form.
Solution. The absolute value $|z_1|$ (the distance of the point with Cartesian coordinates $[2, 3]$ in the plane from the origin) is $\sqrt{2^2 + 3^2} = \sqrt{13}$. From the right triangle in the diagram
The property (ID) then says that there exist no "divisors of zero". A divisor of zero is a number $a$, $a \neq 0$, such that there is a number $b$, $b \neq 0$, with $ab = 0$.
1.1.2. Remarks. The integers Z are a good example of a commutative group. The natural numbers are not such an example since they do not satisfy (CG4) (and possibly do not even contain the neutral element if one does not consider zero to be a natural number). If a commutative ring also satisfies the property (F), we speak of a field (often also about a commutative field).
The last stated property (ID) is automatically satisfied if (F) holds. However, the converse statement is false. Thus we say that the property (ID) is weaker than (F). For example, the ring of integers Z does not satisfy (F) but does satisfy (ID). In such a case we use the term integral domain.
Notice that the set of all non-zero elements in the field along with the operation of multiplication satisfies (R1), (R2), (R3), (F) and thus is also a commutative group. However in this case, instead of addition we speak of multiplication. As an example, the set of all non-zero real numbers forms a commutative group under multiplication.
The elements of some set with operations + and ■ satisfying (not necessarily all) stated properties (for example, a commutative field, an integral domain) may be called scalars. To denote them we usually use lowercase Latin letters, either from the beginning or from the end of the alphabet.
We will use only these properties of scalars and thus our results will hold for any objects with such properties. This is the true power of mathematical theories - they do not hold just for a specific solved example. Quite the opposite, when we build ideas in a rational way they are always universal. We will try to emphasise this aspect, although our ambitions are modest due to the limited size of this book.
Before coming to any use of scalars, we should make a short formal detour and pay attention to their existence. We shall come back to this at the very end of this chapter, when we shall deal with the formal language of Mathematics in general, cf. the constructions starting in 1.6.5. There we indicate how to get the natural numbers $\mathbb{N}$, integers $\mathbb{Z}$, and rational numbers $\mathbb{Q}$, while the real numbers $\mathbb{R}$ will be treated much later in chapter 5.
At this point, let us just remark that it is not enough to pose the axioms of objects. We have to be sure that the given conditions are not in conflict and such objects might exist.
We suppose the readers are sure about the existence of the domains N, Z, Q and can handle them easily. The real numbers are usually understood as a dense and better version of Q, but what about the domain of complex numbers?
As is usual in mathematics, we will use variables (letters of alphabet or other symbols) to denote numbers, and it does not matter whether we know their value beforehand or not.
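we compute $\sin(\varphi) = 3/\sqrt{13}$, $\cos(\varphi) = 2/\sqrt{13}$. Thus $\varphi = \arcsin(3/\sqrt{13}) = \arccos(2/\sqrt{13}) \approx 56.3^\circ$. In total,
$$z_1 = \sqrt{13} \left( \frac{2}{\sqrt{13}} + i \frac{3}{\sqrt{13}} \right) = \sqrt{13} \left( \cos\left(\arccos\frac{2}{\sqrt{13}}\right) + i \sin\left(\arcsin\frac{3}{\sqrt{13}}\right) \right).$$
Transition from polar form to algebraic form is even simpler:
$$z_2 = 3 \left( \cos\frac{\pi}{3} + i \sin\frac{\pi}{3} \right) = 3 \left( \frac{1}{2} + i \frac{\sqrt{3}}{2} \right) = \frac{3}{2} + i \frac{3\sqrt{3}}{2}.$$
□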
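Both conversions of 1.A.7 can be reproduced with Python's standard `math` module. This is a small sketch with illustrative variable names; it also confirms that the arcsine and arccosine expressions describe the same angle.

```python
import math

# Algebraic -> polar for z1 = 2 + 3i.
x, y = 2.0, 3.0
r = math.hypot(x, y)          # |z1| = sqrt(13)
phi_sin = math.asin(y / r)    # arcsin(3 / sqrt(13))
phi_cos = math.acos(x / r)    # arccos(2 / sqrt(13))
phi = math.atan2(y, x)        # argument computed directly from the coordinates

# Polar -> algebraic for z2 = 3 (cos(pi/3) + i sin(pi/3)).
z2 = complex(3 * math.cos(math.pi / 3), 3 * math.sin(math.pi / 3))

print(r, math.degrees(phi))   # modulus and argument of z1 in degrees
print(z2)                     # close to 1.5 + 2.598i
```

Using `math.atan2` avoids choosing between arcsine and arccosine, which matters for numbers outside the first quadrant.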
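1.A.8. Express $z = \cos 0 + \cos\frac{\pi}{3} + i \sin\frac{\pi}{3}$ in polar form.
Solution. To express the number $z$ in polar form, we need to find its absolute value and argument. First we calculate the absolute value:
$$|z| = \sqrt{\left( \cos 0 + \cos\frac{\pi}{3} \right)^2 + \sin^2\frac{\pi}{3}} = \sqrt{\left( 1 + \frac{1}{2} \right)^2 + \left( \frac{\sqrt{3}}{2} \right)^2} = \sqrt{3}.$$
For the argument $\varphi$, we have:
$$\cos\varphi = \frac{\operatorname{Re}(z)}{|z|} = \frac{1 + \frac{1}{2}}{\sqrt{3}} = \frac{\sqrt{3}}{2}, \qquad \sin\varphi = \frac{\operatorname{Im}(z)}{|z|} = \frac{\frac{\sqrt{3}}{2}}{\sqrt{3}} = \frac{1}{2},$$
therefore $\varphi = \pi/6$. Thus
$$z = \sqrt{3} \left( \cos\frac{\pi}{6} + i \sin\frac{\pi}{6} \right).$$
□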
1.A.9. Using the de Moivre theorem, calculate
$$\left( \cos\frac{\pi}{6} + i \sin\frac{\pi}{6} \right)^{31}.$$
1.1.3. Complex numbers. We are forced to extend the domain of real numbers as soon as we want to see solutions of equations like $x^2 = b$ for all real numbers $b$.
We know that this equation always has a solution x in the domain of real numbers, whenever b is non-negative. If b < 0, then such a real x cannot exist. Thus we need to find a larger domain, where this equation has a solution.
The crucial idea is to add the new number $i$ to the real numbers, the imaginary unit, for which we require $i^2 = -1$. Next we try to extend the definitions of addition and multiplication in order to preserve the usual behaviour of numbers (as summarised in 1.1.1).
Clearly we need to be able to multiply the new number $i$ by real numbers and sum it with real numbers. Therefore we need to work in our newly defined domain of complex numbers $\mathbb{C}$ with formal expressions of the form $z = a + ib$, called the algebraic form of $z$. The real number $a$ is called the real part of the complex number $z$, the real number $b$ is called the imaginary part of the complex number $z$, and we write $\operatorname{Re}(z) = a$, $\operatorname{Im}(z) = b$. It should be noted that if $z = a + ib$ and $w = c + id$, then $z = w$ implies both $a = c$ and $b = d$. In other words, we can equate both real and imaginary parts. For positive $x$ we then get $(i \cdot x)^2 = -x^2$ and thus we can solve the equations as requested.
In order to satisfy all the properties of associativity and distributivity, we define the addition so that we add independently the real parts and the imaginary parts. Similarly, we want the multiplication to behave as if we multiply the pairs of real numbers, with the additional rule that $i^2 = -1$, thus
$$(a + ib) + (c + id) = (a + c) + i(b + d),$$
$$(a + ib) \cdot (c + id) = (ac - bd) + i(bc + ad).$$
Next, we have to verify all the properties (CG1)-(CG4), (R1)-(R4) and (F) of scalars from 1.1.1. But this is an easy exercise: zero is the number $0 + i0$, one is the number $1 + i0$, and both these numbers are for simplicity denoted as before, that is, $0$ and $1$. For non-zero $z = a + ib$ we easily check that $z^{-1} = (a^2 + b^2)^{-1}(a - ib)$. All other properties are obtained by direct calculations.
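The definitions above can be tried out directly on pairs of real numbers. The following is a minimal sketch, assuming a complex number $a + ib$ is stored as a Python tuple `(a, b)` with integer or rational entries; the function names `add`, `mul` and `inverse` are illustrative choices, not notation from the text.

```python
from fractions import Fraction


def add(z, w):
    """(a + ib) + (c + id) = (a + c) + i(b + d)."""
    (a, b), (c, d) = z, w
    return (a + c, b + d)


def mul(z, w):
    """(a + ib)(c + id) = (ac - bd) + i(bc + ad)."""
    (a, b), (c, d) = z, w
    return (a * c - b * d, b * c + a * d)


def inverse(z):
    """z^{-1} = (a^2 + b^2)^{-1} (a - ib) for non-zero z, in exact arithmetic."""
    a, b = z
    norm_sq = a * a + b * b
    if norm_sq == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    return (Fraction(a, norm_sq), Fraction(-b, norm_sq))


i = (0, 1)
z = (1, -2)   # 1 - 2i
w = (-3, 4)   # -3 + 4i

print(mul(i, i))           # the pair representing i^2
print(add(z, w), mul(z, w))
print(mul(z, inverse(z)))  # the pair representing z * z^{-1}
```

Exact fractions are used in `inverse` so that the property (F) can be checked without rounding errors.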
1.1.4. The complex plane and polar form. A complex number is given by a pair of real numbers, therefore it corresponds to a point in the real plane $\mathbb{R}^2$. Our algebraic form of the complex numbers $z = x + iy$ corresponds in this picture to understanding the $x$-coordinate axis as the real part while the $y$-coordinate axis is the imaginary part of the number. The absolute value of the complex number $z$ is defined as its distance from the origin, thus $|z| = \sqrt{x^2 + y^2}$.
The reflection with respect to the real axis then corresponds to changing the sign of the imaginary part. We call this operation $z \mapsto \bar{z} = x - iy$ the complex conjugation.
Let us now consider complex numbers of the form $z = \cos\varphi + i \sin\varphi$, where $\varphi$ is a real parameter giving the angle between the real axis and the line from the origin to $z$ (measured in the positive, i.e. counter-clockwise sense). These
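Solution. We obtain
$$\left( \cos\frac{\pi}{6} + i \sin\frac{\pi}{6} \right)^{31} = \cos\frac{31\pi}{6} + i \sin\frac{31\pi}{6} = \cos\frac{7\pi}{6} + i \sin\frac{7\pi}{6} = -\frac{\sqrt{3}}{2} - \frac{1}{2} i.$$
□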
1.A.10. Is the "square root" a well-defined function on the complex numbers?
Solution. No, it is only defined as a function whose domain is the set of non-negative real numbers and whose image is the same set.
In the complex domain, for any complex number $z$ (except zero) there are two complex numbers whose square is equal to $z$. Both can be called a square root of $z$ and they differ by sign (according to this definition, the square root of $-1$ is $i$ as well as $-i$). □
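Numerical libraries resolve this ambiguity by convention. As a small sketch: Python's standard `cmath.sqrt` returns one particular root (the so-called principal one), and the second root is its negative. The helper name `square_roots` is an illustrative assumption.

```python
import cmath


def square_roots(z: complex):
    """Return both complex numbers whose square is z (for non-zero z)."""
    root = cmath.sqrt(z)  # the library picks one root by convention
    return root, -root


r1, r2 = square_roots(-1)
s1, s2 = square_roots(3 + 4j)

print(r1, r2)   # the two square roots of -1
print(s1, s2)   # the two square roots of 3 + 4i
```

Which of the two roots is returned is purely a matter of convention; both satisfy the defining equation equally well.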
1.A.11. Complex numbers are not just a tool to obtain "weird" solutions to quadratic equations. They are necessary to determine solutions to cubic equations, even if these solutions are real. How can we express the solutions to the cubic equation
$$x^3 + ax^2 + bx + c = 0$$
with real coefficients $a$, $b$, $c$? We show a method developed in the sixteenth century by Ferro, Cardano, Tartaglia and possibly others. Substitute $x := t - a/3$ (to remove the quadratic part from the equation) to obtain the equation:
$$t^3 + pt + q = 0,$$
where $p = b - a^2/3$ and $q = c + (2a^3 - 9ab)/27$. Now introduce unknowns $u$, $v$ satisfying the conditions $u + v = t$ and $3uv + p = 0$. Substitute the first condition into the previous equation to obtain
$$u^3 + v^3 + (3uv + p)(u + v) + q = 0.$$
Now use the second equation to eliminate $v$. This yields
$$u^6 + qu^3 - \frac{p^3}{27} = 0,$$
which is a quadratic equation in the unknown $s = u^3$. Thus
$$u = \sqrt[3]{-\frac{q}{2} \pm \sqrt{\frac{q^2}{4} + \frac{p^3}{27}}}.$$
By back substitution, we obtain
$$x = -\frac{p}{3u} + u - \frac{a}{3}.$$
In the expression for $u$ there is a cube root. In order to obtain all three solutions we need to work with complex roots. The
numbers describe all points on the unit circle in the complex plane. Every non-zero complex number z can be then written as
$$z = |z|(\cos\varphi + i\sin\varphi).$$
For a given z ≠ 0, the angle φ is unique if we require 0 ≤ φ < 2π. The number φ is called the argument of the complex number z, and this form of z is called the polar form of the complex number. This way of writing complex numbers is very convenient for understanding multiplication.
Consider the numbers $z = |z|(\cos\varphi + i\sin\varphi)$ and $w = |w|(\cos\psi + i\sin\psi)$ and calculate their product
$$\begin{aligned} z \cdot w &= |z|(\cos\varphi + i\sin\varphi)\,|w|(\cos\psi + i\sin\psi)\\ &= |z||w|\bigl(\cos\varphi\cos\psi - \sin\varphi\sin\psi + i(\cos\varphi\sin\psi + \sin\varphi\cos\psi)\bigr)\\ &= |z||w|\bigl(\cos(\varphi+\psi) + i\sin(\varphi+\psi)\bigr). \end{aligned}$$
The last equality is a result of the addition formulas for trigonometric functions (we shall deal with them in more detail later in our discussion of rotations in the plane, see page 38).
Division is equally easy. If $z = |z|(\cos\varphi + i\sin\varphi) \neq 0$, then $w = |z|^{-1}(\cos\varphi - i\sin\varphi)$ satisfies zw = wz = 1, hence we can write $w = z^{-1} = 1/z$.
We can summarize (and iterate the application of the previous formula on the product of the number z with itself):
Polar form and de Moivre Theorem
Consider two complex numbers $z = |z|(\cos\varphi + i\sin\varphi)$ and $w = |w|(\cos\psi + i\sin\psi)$ in polar form. Then, for every integer n, positive or negative (with z ≠ 0 when n is negative),
$$zw = |z||w|\bigl(\cos(\varphi+\psi) + i\sin(\varphi+\psi)\bigr),$$
$$z^n = |z|^n\bigl(\cos(n\varphi) + i\sin(n\varphi)\bigr).$$
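The two formulas can be checked numerically. The following Python sketch is only an illustration: the helper names `polar_product` and `polar_power` are our own choice. It multiplies absolute values and adds arguments, and then compares the result with ordinary complex arithmetic.

```python
import cmath

def polar_product(z, w):
    """Multiply z and w by multiplying absolute values and adding arguments."""
    rz, phi = cmath.polar(z)
    rw, psi = cmath.polar(w)
    return cmath.rect(rz * rw, phi + psi)

def polar_power(z, n):
    """de Moivre: z**n has absolute value |z|**n and argument n * arg(z)."""
    r, phi = cmath.polar(z)
    return cmath.rect(r ** n, n * phi)

z, w = 1 + 2j, 3 - 1j
print(abs(polar_product(z, w) - z * w))    # tiny rounding error only
print(abs(polar_power(z, -3) - z ** -3))   # tiny rounding error only
```

Both printed differences should be of the size of floating point rounding errors.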
1.1.5. Functions. In most tasks we do not deal just with numbers, i.e. with individual values of scalars. More often, values are associated with each of the elements of a set of objects.
Formally we talk about a mapping f : A → B assigning to each element x in the domain set A the value f(x) in the codomain set B. The set of all images f(x) ∈ B is called the range of f.
The sets A and B can be sets of numbers, but there is nothing to stop them being sets of other objects. The mapping f, however it is described, must unambiguously determine a unique member of B for each member of A.
In another terminology, the member x ∈ A is often called the independent variable. Then y = f(x) ∈ B is called the dependent variable. We also say that the value y = f(x) is a function of the independent variable x in the domain of f.
For now, we shall restrict ourselves to the case where the codomain B is a subset of scalars and we shall talk about scalar functions.
CHAPTER 1. INITIAL WARM UP
equation $x^3 = a$, a ≠ 0, in the unknown x has exactly three solutions in the domain of complex numbers (the fundamental theorem of algebra, see (12.2.8) on page 820). All three solutions are called cube roots of a. Therefore the expression $\sqrt[3]{a}$ has three meanings in the complex domain. If we want a single meaning for that expression, we usually consider it to be the solution with the smallest argument.
1.A.12. Show that the roots $\zeta_1, \zeta_2, \dots, \zeta_n$ of the equation $x^n = 1$ form the vertices of a regular n-gon in the plane of the complex numbers.
Solution. The arguments of the roots are determined by the de Moivre theorem: the argument multiplied by n has to be a multiple of 2π, and the absolute value has to be one. So the roots are
$$\zeta_k = \cos\frac{2\pi k}{n} + i\sin\frac{2\pi k}{n}, \quad k = 1, \dots, n,$$
which are indeed the vertices of a regular polygon. □
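1.A.13. Show that the roots $\zeta_1, \zeta_2, \dots, \zeta_n$ of the equation $x^n = 1$, n ≥ 2, satisfy
$$\sum_{k=1}^{n} \zeta_k = 0.$$
Solution. Let $\zeta_1$ be the root with the smallest positive argument. The other roots satisfy $\zeta_k = \zeta_1^k$ (see the previous example), thus
$$\sum_{k=1}^{n} \zeta_k = \sum_{k=1}^{n} \zeta_1^k = \zeta_1\,\frac{1 - \zeta_1^n}{1 - \zeta_1} = 0,$$
where we have summed the geometric sequence $\zeta_1^k$ and used $\zeta_1^n = 1$.
□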
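Both statements about the roots of unity are easy to test numerically. The sketch below is an illustration only (the function name `roots_of_unity` is our own). It builds the roots from the formula in 1.A.12 and then checks that neighbouring roots are equally far apart and that the roots sum to zero up to rounding errors.

```python
import cmath

def roots_of_unity(n):
    """The n solutions of x**n = 1, indexed k = 1, ..., n as in 1.A.12."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(1, n + 1)]

roots = roots_of_unity(7)
# side lengths of the polygon with consecutive roots as vertices
sides = [abs(roots[k] - roots[k - 1]) for k in range(7)]
print(max(sides) - min(sides))  # tiny: all sides have the same length
print(abs(sum(roots)))          # tiny: the roots sum to zero
```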
More examples about complex numbers can be found at the end of the chapter, starting at 1.G.1.
1.A.14. Solve the equation
$$x^3 + x^2 - 2x - 1 = 0.$$
Solution. This equation has no rational roots (methods to determine rational roots will be introduced later, see (??)). Substitution into the formulas obtained in 1.A.11 yields $p = b - a^2/3 = -7/3$, $q = -7/27$. It follows that
$$u = \frac{\sqrt[3]{28 \pm 12\sqrt{-147}}}{6}.$$
We can theoretically choose up to six possibilities for u (two for the choice of the sign and three independent choices of the
The simplest way to define a function appears if A is a finite set. Then we can describe the function f by a table or a listing showing the image of each member of A. We have certainly seen many examples of such functions:
Let f denote the pay of a worker in some company in a certain year. The values of the independent variable, that is, the elements of the domain of the function, are the individual workers x from the set of all considered workers. The value f(x) is their pay for the given year. Similarly we can talk about the age of students or their teachers in years, the litres of beer and wine consumed by individuals from a given group, etc.
Another example is a food dispensing machine. The domain of a function f would be the button pushed together with the money inserted to determine the selection of the food item.
Let A = {1, 2, 3} = B. The set of equalities f(1) = 1, f(2) = 3, f(3) = 3 defines a function f : A → B. Generally, as there are 3 possible values for f(1), and the same for f(2) and f(3), there are 27 possible functions from A into B in total.
But there are other ways to define a function than as a table. For example, the function f can denote the area of a planar region. Here, the domain consists of subsets of the plane (e.g. all triangles, circles or other planar regions with a defined area). The range of f consists of the respective areas of the regions. Rather than providing a list of areas for a finite number of regions, we hope for a formula allowing us to compute the functional value f(P) for any given planar region P from a suitable class.
Of course, there are many simple functions given by formulas, like the formula f(x) = 3x + 7 with A = B = R or A = B = N.
Not all functions can be given by a formula or list. For example, let f(t) denote the speed of a car at time t. For any given car and time t, we know there will be a functional value f(t) denoting its speed. This value can of course be measured approximately, but it is usually not given by a formula.
Another example: Let f(n) be the nth digit after the decimal point in the decimal expansion of π = 3.1415…. So for example f(4) = 5. The value of f(n) is defined but unknown if n is large enough.
The mathematical approach in modelling real problems often starts from the indication of certain dependencies between some quantities and aims at explicit formulas for functions which describe them. Often a full formula is not available but we may obtain the values f(x) at least for some instances of the independent variable x, or we may be able to find a suitable approximation.
We shall see all of the following types of expressions of the requested function f in this book:
• exact finite expression (like the function f(x) = 3x + 7 above);
• infinite expression (we shall come to that only much later in chapter 5 when introducing the limit processes);
• description of how the function's values change under a given change of the independent variable (this behaviour
cubic root). But we obtain only three distinct values for x. By substitution into the formulas, one of the roots is of the form
$$x = \frac{14}{3\sqrt[3]{28 - 84i\sqrt{3}}} + \frac{\sqrt[3]{28 - 84i\sqrt{3}}}{6} - \frac{1}{3} \approx 1.247,$$
and similarly for the other two (approximately −0.445 and −1.802). As noted before, we see that even though we have used complex numbers during the computation, all the solutions are real. □
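The method of 1.A.11 can be followed step by step on a computer. The Python sketch below is a minimal illustration under our own naming (`cardano_roots` is not taken from the text). It computes p and q, solves the quadratic equation for s = u³, takes the three complex cube roots of s, and substitutes back.

```python
import cmath

def cardano_roots(a, b, c):
    """Roots of x^3 + a x^2 + b x + c = 0 by the substitution of 1.A.11."""
    p = b - a * a / 3
    q = c + (2 * a ** 3 - 9 * a * b) / 27
    root = cmath.sqrt(q * q / 4 + p ** 3 / 27)
    # s = u^3 solves s^2 + q s - p^3/27 = 0; take the solution of larger modulus
    s = max(-q / 2 + root, -q / 2 - root, key=abs)
    if s == 0:                      # happens only when p = q = 0 (triple root)
        return [complex(-a / 3)] * 3
    r, phi = cmath.polar(s)
    roots = []
    for k in range(3):              # the three complex cube roots of s
        u = cmath.rect(r ** (1 / 3), (phi + 2 * cmath.pi * k) / 3)
        roots.append(u - p / (3 * u) - a / 3)
    return roots

for x in cardano_roots(1, -2, -1):
    print(round(x.real, 3), abs(x.imag) < 1e-9)
```

For the equation of 1.A.14 the three printed real parts should agree with the approximate roots given above, with imaginary parts vanishing up to rounding errors.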
B. Difference equations
Difference equations (also called recurrence relations) are relations between the elements of a sequence, where an element of the sequence depends on previous elements. To solve a difference equation means to find an explicit formula for the n-th (that is, arbitrary) element of the sequence.
If an element of the sequence is determined only by the previous element, we call it a first order difference equation. Such equations are common in real-world problems, for instance when we want to find out how long the repayment of a loan will take for a fixed monthly repayment, or when we want to know how much we should pay per month in order to repay a loan in a fixed time.
1.B.1. Michael wants to buy a new car. The car costs €30 000. Michael wants to take out a loan and repay it with a fixed monthly repayment. The car company offers him a loan to buy the car with a yearly interest rate of 6%. The repayment starts at the end of the first month of the loan. Michael would like to finish repaying the loan in three years. How much should he pay per month?
Solution. Let P denote the sum Michael has to pay per month. After the first month Michael pays P; part of it repays the loan, and part of it pays the interest. Let $d_k$ stand for the outstanding loan after k months, write C = 30 000 for the price of the car, and u = 0.06/12 for the monthly interest rate. We know $d_0 = C = 30\,000$, and after the first month there is
$$d_1 = C - P + u \cdot C.$$
In general, after the k-th month we have
(1) $d_k = d_{k-1} - P + u\,d_{k-1} = (1 + u)\,d_{k-1} - P.$
will be displayed under the name difference equation in a moment and under different circumstances later on);
• approximation of a non-computable function by a known one (usually including some error estimates; this could be the case with the car above: say we know that it travels at some known speed at the time t = 0, we brake as hard as possible on a known surface, and we compute the decrease of speed with the help of some mathematical model);
• finding only the probability of possible values of the function, for example the function giving the length of life of a given group of still living people, depending on some health related parameters.
1.1.6. Functions defined explicitly. Let us start with the most desirable case, when the function values are defined by a computable finite formula. Of course, we shall also be interested in the efficiency of the formulas, i.e. how fast the evaluations would be. In principle, real computations can involve only a finite number of summations and multiplications of numbers. This is how we define the polynomials, i.e. functions of the form
$$f(x) = a_n x^n + \dots + a_1 x + a_0,$$
where $a_0, \dots, a_n$ are known scalars and x is the unknown variable whose value we can insert. Here $x^n = 1 \cdot x \cdots x$ means the n-times repeated multiplication of the unit by x (in particular, $x^0 = 1$), and f(x) is the value of the indicated sum of products. This is a fairly well computable formula for each n ∈ N. The choice n = 0 provides the constant $a_0$.
The next example is more complicated.
Factorial function
Let A = Z+ be the set of positive integers. For each n ∈ Z+, define the factorial function by
$$n! = n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1.$$
For convenience we also define 0! = 1. (We will see why this is sensible later on.) It is easy to see that n! = n · (n − 1)! for all n ≥ 1.
So 1! = 1, 2! = 2 · 1 = 2, 3! = 3 · 2 · 1 = 6, 6! = 720 etc.
The latter example deserves more attention. Notice that we could have defined the factorial by setting A = B = N and giving the equation f(n) = n · f(n − 1) for all n ≥ 1. This does not yet define f, but for each n, it does determine what f(n) is in terms of its predecessor f(n − 1). This is sometimes called a recurrence relation. After choosing f(0) = 1, the recurrence now determines f(1) and hence successively f(2) etc., and so a function is defined. It is the factorial function as described above.
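The recurrent definition translates directly into a short program. The following Python sketch (with our own function name `factorial`) starts from the chosen value f(0) = 1 and applies f(n) = n · f(n − 1) successively.

```python
def factorial(n):
    """f(0) = 1 and f(n) = n * f(n - 1) for n >= 1, evaluated step by step."""
    value = 1                      # the chosen initial value f(0)
    for k in range(1, n + 1):
        value = k * value          # f(k) = k * f(k - 1)
    return value

print([factorial(n) for n in range(7)])  # [1, 1, 2, 6, 24, 120, 720]
```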
2. Difference equations
The factorial function is one example of a function which can be defined on the natural numbers by means of a recurrence relation.
Using the relation (1) from paragraph 1.2.3 we obtain $d_k$ given by (we write a = 1 + u)
$$d_k = d_0\,a^k - P\,\frac{a^k - 1}{a - 1}.$$
Repaying the loan in three years means $d_{36} = 0$, thus
$$P = 30\,000\,\frac{(1 + u)^{36}\,u}{(1 + u)^{36} - 1} = 30\,000\left(\frac{(12.06/12)^{36}\,(0.06/12)}{(12.06/12)^{36} - 1}\right) \approx 912.7.$$
□
Note that the recurrence relation (1) can be used for our case as long as all $d_k$ are positive, that is, as long as Michael still has to repay something.
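The computation of 1.B.1 can be repeated in a few lines. In the Python sketch below the function names `monthly_payment` and `remaining_debt` are our own. The first function evaluates the closed formula for P, and the second one applies the recurrence (1) month by month, so that we can check that the debt really vanishes after 36 payments.

```python
def monthly_payment(C, u, months):
    """Payment P for which the debt of relation (1) vanishes after `months` steps."""
    a = 1 + u
    return C * a ** months * u / (a ** months - 1)

def remaining_debt(C, u, P, months):
    """Apply d_k = (1 + u) * d_{k-1} - P repeatedly, starting from d_0 = C."""
    d = C
    for _ in range(months):
        d = (1 + u) * d - P
    return d

P = monthly_payment(30_000, 0.06 / 12, 36)
print(round(P, 1))                                          # 912.7
print(abs(remaining_debt(30_000, 0.06 / 12, P, 36)) < 1e-6)  # True
```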
1.B.2. Consider the case from the previous example. For how long would Michael have to pay if he repays €500 per month?
Solution. Setting as before a = 1 + 0.06/12 = 1.005 and C = 30 000, the condition $d_k = 0$ gives the equation
$$a^k = \frac{200P}{200P - C},$$
where we have used 1/(a − 1) = 200. By taking logarithms of both sides, we obtain
$$k = \frac{\ln(200P) - \ln(200P - C)}{\ln a},$$
which for P = 500 gives approximately k = 71.5. Thus Michael would be paying for 72 months (the last repayment would be less than €500). □
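As a quick check of 1.B.2, the formula for k can be evaluated directly. The sketch below uses our own function name `months_to_repay`; it writes the factor 200 as 1/u, so that the fraction 200P/(200P − C) becomes P/(P − uC). It assumes that P exceeds the monthly interest uC, otherwise the loan is never repaid and the logarithm is not defined.

```python
import math

def months_to_repay(C, u, P):
    """Solve d_k = 0 for k; requires P > u * C."""
    a = 1 + u
    return math.log(P / (P - u * C)) / math.log(a)

k = months_to_repay(30_000, 0.06 / 12, 500)
print(round(k, 1), math.ceil(k))  # 71.5 72
```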
1.B.3. Determine the sequence $\{y_n\}_{n=1}^{\infty}$ which satisfies the following recurrence relation
3y«
Vn+i = ~Y + 1, n > 1, yi = 1.
o
Linear recurrences can naturally appear in geometric problems:
1.B.4. Suppose n lines divide the plane into regions. What is the maximum number of regions that can be formed in this way?
Solution. Let the number of regions be $p_n$. If there is no line in the plane, then the whole plane is one region, thus $p_0 = 1$. If there are n lines, then adding an (n + 1)-st line increases the number of regions by the number of regions this new line intersects. If no lines are parallel and no three lines intersect at the same point, the number of regions the (n + 1)-st line crosses is one plus the number of its intersections with the previous lines (each crossed region is divided into two, thus the total number increases by one for every region crossed).
Such a situation can often be seen when formulating mathematical models that describe real systems in economy, biology, etc. We will observe here only a few simple examples and return to this topic in chapter 3.
1.2.1. Linear difference equations of first order. A general difference equation of the first order (or first order recurrence) is an expression of the form
$$f(n + 1) = F(n, f(n)),$$
where F is a known function with two arguments (independent variables). If we know the "initial" value f(0), we can compute f(1) = F(0, f(0)), then f(2) = F(1, f(1)) and so on. Using this, we can compute the value f(n) for arbitrary n ∈ N.
An example of such an equation is provided by the factorial function f(n) = n!, where
$$(n + 1)! = (n + 1) \cdot n!.$$
In this way, the value of f(n + 1) depends on both n and the value of f(n), and formally we would express this recurrence in the form F(x, y) = (x + 1)y.
A very simple example is f(n) = C for some fixed scalar C and all n. Another example is the linear difference equation of first order
(1) $f(n + 1) = a \cdot f(n) + b,$
where a and b are fixed numbers.
Such a difference equation is easy to solve if b = 0. Then it is the well-known recurrent definition of the geometric progression. We have
$$f(1) = a f(0), \quad f(2) = a f(1) = a^2 f(0), \quad \text{and so on.}$$
Hence for all n we have
$$f(n) = a^n f(0).$$
This is also the relation for the Malthusian population growth model. This is based on the assumption that population size grows with a constant rate when measured at a sequence of fixed time intervals.
We will prove a general result for first order equations with variable coefficients, namely:
(2) $f(n + 1) = a_n \cdot f(n) + b_n.$
We use the usual notation Σ for sums and the similar notation Π for products. We also use the convention that when the index set is empty, the sum is zero and the product is one.
1.2.2. Proposition. The general solution of the first order difference equation (2) from the previous paragraph with the initial condition $f(0) = y_0$ is, for n ∈ N, given by the formula
(1) $f(n) = \left(\prod_{i=0}^{n-1} a_i\right) y_0 + \sum_{j=0}^{n-2} \left(\prod_{i=j+1}^{n-1} a_i\right) b_j + b_{n-1}.$
The new line has at most n intersections with the n lines already present. Each part of the new line between two consecutive intersections (including the two unbounded ends) crosses exactly one region, thus the new line crosses at most n + 1 regions.
Thus we obtain the recurrence relation
$$p_{n+1} = p_n + (n + 1),$$
for which $p_0 = 1$. We obtain an explicit formula for $p_n$ either by applying the formula in 1.2.2 or directly:
$$\begin{aligned} p_n &= p_{n-1} + n = p_{n-2} + (n - 1) + n\\ &= p_{n-3} + (n - 2) + (n - 1) + n = \cdots\\ &= p_0 + \sum_{i=1}^{n} i = 1 + \frac{n(n + 1)}{2} = \frac{n^2 + n + 2}{2}. \end{aligned}$$
□
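The agreement between the recurrence and the explicit formula of 1.B.4 can be verified for many values of n at once. The sketch below (the name `max_regions` is our own) simply iterates the recurrence.

```python
def max_regions(n):
    """Iterate p_{k} = p_{k-1} + k from p_0 = 1 up to k = n."""
    p = 1
    for k in range(1, n + 1):
        p += k
    return p

# compare with the closed formula (n^2 + n + 2) / 2
print(all(max_regions(n) == (n * n + n + 2) // 2 for n in range(100)))  # True
```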
Recurrence relations can be more complex than those of first order. We show an example of a combinatorial problem whose solution can be found using a recurrence relation.
1.B.5. How many words of length 12 are there that consist only of the letters A and B but do not contain BBB as a sub-word?
Solution. Let $a_n$ denote the number of words of length n consisting of the letters A and B but without BBB as a sub-word. Then for $a_n$ (n > 3) the following recurrence holds:
$$a_n = a_{n-1} + a_{n-2} + a_{n-3},$$
since the words of length n that satisfy the given condition end either with an A, or with an AB, or with an ABB. There are $a_{n-1}$ words ending with an A (preceding the last A there
Proof. We use mathematical induction. The result clearly holds for n = 1 since $f(1) = a_0 y_0 + b_0$. Assuming that the statement holds for some fixed n, we compute:
$$\begin{aligned} f(n + 1) &= a_n \left( \left(\prod_{i=0}^{n-1} a_i\right) y_0 + \sum_{j=0}^{n-2} \left(\prod_{i=j+1}^{n-1} a_i\right) b_j + b_{n-1} \right) + b_n\\ &= \left(\prod_{i=0}^{n} a_i\right) y_0 + \sum_{j=0}^{n-1} \left(\prod_{i=j+1}^{n} a_i\right) b_j + b_n, \end{aligned}$$
as can be seen directly by multiplying out. □
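The formula 1.2.2(1) can also be tested against a direct iteration of the recurrence. The following Python sketch is an illustration with our own function names and arbitrarily chosen integer coefficients. It relies on the convention that empty products equal one and empty sums equal zero, which `math.prod` and `sum` follow.

```python
from math import prod

def solve_by_iteration(a, b, y0, n):
    """Apply f(k + 1) = a[k] * f(k) + b[k] step by step, starting from f(0) = y0."""
    f = y0
    for k in range(n):
        f = a[k] * f + b[k]
    return f

def solve_by_formula(a, b, y0, n):
    """Formula 1.2.2(1), used here for n >= 1."""
    head = prod(a[0:n]) * y0
    middle = sum(prod(a[j + 1:n]) * b[j] for j in range(n - 1))
    return head + middle + b[n - 1]

a = [2, 3, 5, 7, 11]   # sample coefficients a_0, ..., a_4 (our own choice)
b = [1, -4, 0, 6, 2]   # sample coefficients b_0, ..., b_4 (our own choice)
print(all(solve_by_iteration(a, b, 10, n) == solve_by_formula(a, b, 10, n)
          for n in range(1, 6)))  # True
```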
Note that for the proof, we did not use anything about the numbers except for the properties of a commutative ring.
1.2.3. Corollary. The general solution of the linear difference equation (1) from 1.2.1 with a ≠ 1 and initial condition $f(0) = y_0$ is
(1) $f(n) = a^n y_0 + \frac{1 - a^n}{1 - a}\, b.$
Proof. If we set $a_i$ and $b_i$ to be constants and use the general formula 1.2.2(1), we obtain
$$f(n) = a^n y_0 + b \left(1 + \sum_{j=0}^{n-2} a^{n-j-1}\right).$$
We observe that the expression in the bracket is $(1 + a + \dots + a^{n-1})$. The sum of this geometric progression follows from
$$1 - a^n = (1 - a)(1 + a + \dots + a^{n-1}).$$
□
The proof of the former proposition is a good example of a mathematical result, where the verification is quite easy, as soon as someone tells us the theorem. Mathematical induction is a natural method of proof.
Note that for calculating the sum of a geometric progression we required the existence of the inverse element for non-zero scalars. We could not do that with integers only. Thus the last result holds for fields of scalars and we can thus use it for linear difference equations where the coefficients a, b and the initial condition $f(0) = y_0$ are rational, real or complex numbers. This last result also holds in the ring of remainder classes $\mathbb{Z}_k$ with prime k (we will define remainder classes in paragraph 1.6.7).
It is noteworthy that the formula (1) is valid with integer coefficients and integer initial conditions. Here, we know in advance that each f(n) is an integer, and the integers are a subset of the rational numbers. Thus our formula necessarily gives correct integer solutions.
Observing the proof in more detail, we see that $1 - a^n$ is always divisible by 1 − a, thus the last paragraph should not have surprised us. However, it can be seen that with scalars from $\mathbb{Z}_4$ and, say, a = 3, we fail, since 1 − a = 2 is a divisor of zero and as such does not have an inverse in $\mathbb{Z}_4$.
can be an arbitrary word of length n − 1 satisfying the condition). Analogously for the two remaining groups. Further, it is easily shown that $a_1 = 2$, $a_2 = 4$, and $a_3 = 7$. Using the recurrence relation we can then compute
$$a_{12} = 1705.$$
We could also derive an explicit formula for the n-th element of the sequence using the theory which we will develop in chapter 3. □
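For such a small problem the recurrence of 1.B.5 can be compared with a brute-force enumeration of all words. The Python sketch below (with our own function names) does both; the enumeration goes through all 2^n words, so it is meant only for small n.

```python
from itertools import product

def count_by_recurrence(n):
    """a_n = a_{n-1} + a_{n-2} + a_{n-3} with a_1 = 2, a_2 = 4, a_3 = 7 (n >= 1)."""
    a = [None, 2, 4, 7]
    for m in range(4, n + 1):
        a.append(a[m - 1] + a[m - 2] + a[m - 3])
    return a[n]

def count_by_enumeration(n):
    """Count the words over {A, B} of length n without the sub-word BBB."""
    return sum("BBB" not in "".join(w) for w in product("AB", repeat=n))

print(count_by_recurrence(12), count_by_enumeration(12))  # 1705 1705
```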
1.B.6. Partial difference equations. The recurrence relation in the next problem has a more complex form than the ones we have dealt with in our theory, so we cannot evaluate an arbitrary member $P_{(k,l)}$ of our sequence explicitly. We can only evaluate it by successive computation from previous elements. Such an equation is called a partial difference equation, since the terms of the equation are indexed by two independent variables (k, l).
The score of a basketball match between the teams of the Czech Republic and Russia after the first quarter is 12 : 9 for the Russian team. In how many ways could the score have developed?
Solution. We can divide all possible evolutions of the quarter with the final score k : l into six mutually exclusive possibilities, according to which team scored last and how many points that score was worth (1, 2 or 3 points). If we denote by $P_{(k,l)}$ the number of ways in which the score could have developed for a quarter that ended with k : l, then for k, l ≥ 3 the following recurrence relation holds:
$$P_{(k,l)} = P_{(k-3,l)} + P_{(k-2,l)} + P_{(k-1,l)} + P_{(k,l-1)} + P_{(k,l-2)} + P_{(k,l-3)}.$$
Using the symmetry of the problem, $P_{(k,l)} = P_{(l,k)}$. Further, for k ≥ 3:
$$\begin{aligned} P_{(k,2)} &= P_{(k-3,2)} + P_{(k-2,2)} + P_{(k-1,2)} + P_{(k,1)} + P_{(k,0)},\\ P_{(k,1)} &= P_{(k-3,1)} + P_{(k-2,1)} + P_{(k-1,1)} + P_{(k,0)},\\ P_{(k,0)} &= P_{(k-3,0)} + P_{(k-2,0)} + P_{(k-1,0)}, \end{aligned}$$
The linear difference equation 1.2.1(1) can be neatly interpreted as a mathematical model for finance, e.g. savings or loan payoff with a fixed interest rate a and fixed repayment b. (The cases of savings and loans differ only in the sign of b.) With varying parameters a and b we obtain a similar model with varying interest rate and repayment. We can imagine for instance that n is the number of months, $a_n$ is the interest rate in the nth month, and $b_n$ the repayment in the nth month.
1.2.4. A nonlinear example. When discussing linear difference equations, we mentioned a very primitive population growth model which depends directly on the momentary population size p. At first sight, it is clear that such a model with a > 1 leads to a very rapid and unbounded growth.
A more realistic model has such a population change Δp(n) = p(n + 1) − p(n) only for small values of p, that is, Δp/p ≈ r > 0. Thus if we want the population to grow by 5% per time interval only for small p, then we choose r to be 0.05. For some limiting value p = K > 0 the population may not grow at all. For even greater values it may even decrease, for instance if the resources for feeding the population are limited, or if individuals in a large population are obstacles to each other, etc.
Assume that the values $y_n = \Delta p(n)/p(n)$ change linearly in p(n). Graphically we can imagine this dependence as a line in the plane of the variables p and y. This line passes through the point [0, r], so that y = r when p = 0. This line also passes through [K, 0], since this gives the second condition, namely that when p = K the population does not change. Thus we set
$$y = -\frac{r}{K}\,p + r.$$
By setting $y = y_n = \Delta p(n)/p(n)$ and p = p(n) we obtain
$$\frac{p(n + 1) - p(n)}{p(n)} = -\frac{r}{K}\,p(n) + r.$$
By multiplying, we obtain a difference equation of first order with p(n) present as both a first and a second power:
(1) $p(n + 1) = p(n)\left(1 - \frac{r}{K}\,p(n) + r\right).$
Try to think through the behaviour of this model for various values of r and K. In the diagram we can see the results for the parameters r = 0.05 (that is, five percent growth in the ideal state), K = 100 (resources limit the population to the size 100), and p(0) = 2 (we have initially two individuals).
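One way to explore the model is to iterate the equation (1) numerically. The sketch below is only an illustration (the function name `logistic_orbit` and the number of steps are our own choices). It produces the sequence p(0), p(1), … for the parameters r = 0.05, K = 100 and p(0) = 2 mentioned above.

```python
def logistic_orbit(p0, r, K, steps):
    """Iterate p(n + 1) = p(n) * (1 - r / K * p(n) + r), starting from p(0) = p0."""
    values = [p0]
    for _ in range(steps):
        p = values[-1]
        values.append(p * (1 - r / K * p + r))
    return values

orbit = logistic_orbit(2, 0.05, 100, 400)
# early values grow almost geometrically, late values level off near K
print([round(p, 2) for p in orbit[:5]])
print(round(orbit[-1], 2))
```

Changing r, K or p(0) in the last lines is an easy way to experiment with other regimes of the model.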
which, along with the initial condition $P_{(0,0)} = 1$, gives $P_{(1,0)} = 1$, $P_{(2,0)} = 2$, $P_{(3,0)} = 4$, $P_{(1,1)} = 2$, $P_{(2,1)} = P_{(1,1)} + P_{(0,1)} + P_{(2,0)} = 5$, $P_{(2,2)} = P_{(0,2)} + P_{(1,2)} + P_{(2,1)} + P_{(2,0)} = 14$. Hence by repeatedly using the above equations, we eventually obtain
$$P_{(12,9)} = 497178513.$$
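The repeated use of the equations is tedious by hand but easy to automate. The following Python sketch (the name `score_paths` is our own) treats every score with a negative entry as impossible, so that the special cases for small k and l are covered by one rule, and it stores already computed values. Its output for the score 12 : 9 can be compared with the value quoted above.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def score_paths(k, l):
    """Number of ways in which the score k : l could have developed."""
    if k < 0 or l < 0:
        return 0          # impossible score
    if k == 0 and l == 0:
        return 1          # initial condition
    # the last score was worth s = 1, 2 or 3 points for one of the two teams
    return sum(score_paths(k - s, l) + score_paths(k, l - s) for s in (1, 2, 3))

print(score_paths(2, 1), score_paths(2, 2))  # 5 14
print(score_paths(12, 9))
```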
We will discuss recurrent formulas (difference equations) of higher order with constant coefficients in chapter 3.
C. Combinatorics
In this section we use natural numbers to describe some indivisible items located in real life space, and deal with questions as how to compute the number of their (pre)orderings, choices, and so on. In many of these problems, "common sense" is sufficient. We just need to use the rules of product and sum in the right way, as we show in the following examples:
1.C.1. Mother wants to give John and Mary five pears and six apples. In how many ways can she divide the fruits among them? (We consider the pears to be indistinguishable. We consider the apples to be indistinguishable. The possibility that one of the children gets nothing is not excluded.)
Solution. The five pears can be divided in six ways (the division is determined by the number of pears given to John, the rest goes to Mary). The six apples can be divided in seven ways. These divisions are independent. Using the rule of product, the total number is 6 · 7 = 42. □
1.C.2. Determine the number of four-digit numbers which either start with the digit 1 and do not end with the digit 2, or end with the digit 2 but do not start with the digit 1 (of course, the first digit must not be zero).
Solution. The set of numbers described in the statement consists of two disjoint sets. The total number is then obtained by summing the numbers of elements of these two sets. In the first set there are numbers of the form "1XXY" where X is an arbitrary digit and Y is any digit except 2. Thus we can choose the second digit in ten ways, independently of that the third digit in ten ways, and again independently the fourth digit in nine ways. These three choices then uniquely determine a number. By multiplication, there are 10 · 10 · 9 = 900 such numbers. Similarly, in the second set we have 8 · 10 · 10 = 800 numbers of the form "YXX2" (for the first digit we have only
[Figure: "Non-linear dependence", a plot of the population size p(n) against n.]
Note that the original almost exponential growth slows down later. The population size approaches the desired limit of 100 individuals. For p close to one and K much greater than r, the right side of the equation (1) is approximately p(n)(l + r). That is, the behaviour is similar to that of the Malthusian model. On the other hand, if p is almost equal to K, the right side of the equation is approximately p(n). For an initial value of p greater than K the population size will decrease. For an initial value of p less than K the population size will increase.1
3. Combinatorics
A typical "combinatorial" problem is to count in how many ways something can happen. For instance, in how many ways can we choose two different sandwiches from the daily offering in a grocery shop?
In this situation we need first to decide what we mean by different. Do we then allow the choice of two "identical" sandwiches? Many such questions occur in the context of card games and other games.
The solution of particular problems usually involves either some multiplication of partial results (if the individual possibilities are independent) or some addition (if their appearance is disjoint). This is demonstrated in many examples in the problem column (cf. several problems starting with 1.C.1).
1.3.1. Permutations. Suppose we have a set of n (distinguishable) objects, and we wish to arrange them in some order. We can choose a first object in n ways, then a second in n − 1 ways, a third in n − 2 ways, and so on, until we choose the last object, for which there is only one choice. The total number of possible arrangements is the product of these, hence there are exactly n! = n(n − 1)(n − 2) ⋯ 3 · 2 · 1 distinct orders of the objects. Each ordering of the elements of a set S is called a permutation of the elements of S. The number of permutations on a set with n elements is n!.
1 This model is called the discrete logistic model. Its continuous version was introduced already in 1845 by Pierre François Verhulst. Depending on the proportions of the parameters r, K and p(0), the behaviour can be very diverse, including chaotic dynamics. There is much literature on this model.
eight ways, since the number cannot start with zero and one is forbidden). By addition, the solution is 900 + 800 = 1700 numbers. □
In the following examples we will use the notions of combinations and permutations (possibly with repetitions).
1.C.3. During a conference, 8 speakers are scheduled. Determine the number of all possible orderings in which two given speakers do not speak one right after the other.
Solution. Denote the two given speakers by A and B. If B follows directly after the speaker A, we can consider it as a speech by a single speaker AB. The number of all orderings where B speaks directly after A is therefore 7!, the number of permutations of seven elements. By symmetry, the number of all orderings where A speaks directly after B is also 7!. Since the number of all possible orderings of eight speakers is 8!, the solution is 8! − 2 · 7!. □
1.C.4. How many rearrangements of the letters of the word PROBLEM are there, such that
a) the letters B and R are next to each other,
b) the letters B and R are not next to each other?
Solution. a) The pair of letters B and R can be regarded as a single indivisible "double-letter". In total we then have six distinct letters, and there are 6! words of six indivisible letters. We have to multiply this by two, since the double-letter can be either BR or RB. Thus the solution is 2 · 6!.
b) The rearrangements in b) form the complement of those in a) in the set of all rearrangements of the seven letters. The solution is therefore 7! − 2 · 6!. □
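Answers of this kind are small enough to be confirmed by listing all orderings. The Python sketch below (with our own function name `count_adjacent`) goes through all permutations and counts those in which two given elements are neighbours; it is a brute-force check, not an efficient method.

```python
from itertools import permutations
from math import factorial

def count_adjacent(items, x, y):
    """Number of orderings of `items` in which x and y are next to each other."""
    count = 0
    for order in permutations(items):
        if abs(order.index(x) - order.index(y)) == 1:
            count += 1
    return count

adjacent = count_adjacent("PROBLEM", "B", "R")
print(adjacent, factorial(7) - adjacent)  # 1440 3600
```

The same function applied to eight speakers gives the 2 · 7! orderings excluded in 1.C.3.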
1.C.5. In how many ways can an athlete place 10 distinct cups on 5 shelves, given that all 10 cups fit on any shelf?
Solution. Add 4 indistinguishable items, say separators, to the cups. The number of all distinct orderings of cups and separators is 14!/4! (the separators are indistinguishable). Each placement of cups on the shelves corresponds to exactly one ordering of cups and separators: the cups before the first separator in the ordering are placed on the first shelf (preserving the order), the cups between the first and the second separator on the second shelf, and so on. Thus the required number is 14!/4!. □
1.C.6. Determine the number of four-digit numbers with exactly two distinct digits. (Recall that the first digit must not be 0.)
We can identify the elements in S by numbering them (using the numbers from one to n), that is, we identify S with the set S = {1, …, n} of n natural numbers. Then the permutations correspond to the possible orderings of the numbers from one to n. Thus we have an example of a simple mathematical theorem, and this discussion can be considered to be its proof.
Number of permutations
Proposition. The number p(n) of distinct orderings of a finite set with n elements is given by the factorial function:
(1) p(n) = n!
Suppose S is a set with n elements. Suppose we wish to choose and arrange in order just k of the members of S, where 1 ≤ k ≤ n. This is called a k-permutation without repetition of the n elements. The same reasoning as above shows that this can be done in
$$v(n, k) = n(n - 1)(n - 2) \cdots (n - k + 1) = \frac{n!}{(n - k)!}$$
ways. The right side of this result also makes sense for k = 0 (there is just one way of choosing nothing), and for k = n, since 0! = 1.
Now we modify the problem, this time where the order of selection is immaterial.
1.3.2. Combinations. Consider a set S with n elements. A k-combination of the elements of S is a selection of k elements of S, 0 ≤ k ≤ n, when order does not matter. For k ≥ 1, the number of possible results of successively choosing our k elements is n(n − 1)(n − 2) ⋯ (n − k + 1) (a k-permutation). We obtain the same k-tuple in k! distinct orders. Hence the number of k-combinations is
$$\frac{n(n - 1)(n - 2) \cdots (n - k + 1)}{k!} = \frac{n!}{(n - k)!\,k!}.$$
If k = 0, the same formula is still true, since 0! = 1 and there is just one way to select nothing.
Combinations
Proposition. The number c(n, k) of combinations of k-th degree among n elements, where 0 ≤ k ≤ n, is
(1) $c(n, k) = \binom{n}{k} = \frac{n(n - 1) \cdots (n - k + 1)}{k(k - 1) \cdots 1} = \frac{n!}{(n - k)!\,k!}.$
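The counting formulas for k-permutations and k-combinations can be compared with an explicit listing of all selections. In the following Python sketch the functions `v` and `c` are our own implementations of the two formulas, while `itertools` generates the selections themselves.

```python
from itertools import combinations, permutations
from math import factorial

def v(n, k):
    """Number of k-permutations without repetition: n! / (n - k)!."""
    return factorial(n) // factorial(n - k)

def c(n, k):
    """Number of k-combinations: n! / ((n - k)! k!)."""
    return factorial(n) // (factorial(n - k) * factorial(k))

n, k = 6, 3
print(v(n, k), len(list(permutations(range(n), k))))   # 120 120
print(c(n, k), len(list(combinations(range(n), k))))   # 20 20
```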
We pronounce the binomial coefficient $\binom{n}{k}$ as "n over k" or "n choose k". The name stems from the binomial expansion, which is the expansion of $(a + b)^n$. If we expand $(a + b)^n$, the coefficient of $a^k b^{n-k}$ is the number of ways to choose a
Solution. First solution. If 0 is one of the digits, then there are 9 choices for the other digit, which must also be the first digit. There are three numbers with a single 0, three numbers with two 0's, and just one number with three 0's. Thus there are 9 · (3 + 3 + 1) = 63 numbers which contain the digit 0. Otherwise, choose the first digit, for which there are 9 choices. There are then 8 choices for the other digit and 3 + 3 + 1 numbers for each choice, making 9 · 8 · (3 + 3 + 1) = 504 numbers which do not contain the digit 0. The solution is 504 + 63 = 567 numbers.
Second solution. The two distinct digits used for the number can be chosen in $\binom{10}{2}$ ways. From the two chosen digits we can compose $2^4 - 2$ distinct four-digit numbers (we subtract 2 for the two four-digit numbers which use only one of the chosen digits). In total we have $\binom{10}{2}(2^4 - 2) = 630$ numbers. But in this way, we have also counted the numbers that start with zero. Of these there are $\binom{9}{1}(2^3 - 1) = 63$. Thus the solution is 630 − 63 = 567 numbers. □
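Since there are only 9000 four-digit numbers, the result of 1.C.6 can be confirmed by examining each of them. The sketch below (the function name is our own) counts the numbers whose decimal notation uses exactly two distinct digits.

```python
def count_two_digit_kinds():
    """Count four-digit numbers written with exactly two distinct digits."""
    return sum(1 for m in range(1000, 10000) if len(set(str(m))) == 2)

print(count_two_digit_kinds())  # 567
```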
1.C.7. There are 677 people at a concert. Do some of them have the same (ordered) pair of name initials?
Solution. There are 26 letters in the alphabet. Thus the number of all possible pairs of name initials is \(26^2 = 676\). Thus at least two people have the same initials. □
1.C.8. New players meet in a volleyball team (6 people). How many handshakes are there when everybody shakes hands once with everybody else? How many handshakes are there if everybody shakes hands once with each opponent after playing a match?
Solution. Each pair of players shakes hands at the introduction. The number of handshakes is then the number of combinations \(c(6,2) = \binom{6}{2} = 15\). After a match each of the six players shakes hands six times (with each of the six opponents). Thus the required number is \(6^2 = 36\). □
1.C.9. In how many ways can five people be seated in a car for five people, if only two of them have a driving licence? In how many ways can 20 passengers and two drivers be seated in a bus for 25 people?
Solution. For the driver's place we have two choices and the other places are then arbitrary, that is, for the second seat we have four choices, for the third three choices, then two and then one. That makes \(2 \cdot 4! = 48\) ways. Similarly in the bus we have two choices for the driver, and then the other driver plus the passengers can be seated among the 24 seats arbitrarily.
k-tuple from n parentheses in the product (from these parentheses, we take a, from the others, we take b). Therefore we have
\[(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \tag{2}\]
Note that only distributivity, commutativity and associativity of multiplication and summation were necessary. The formula (2) therefore holds in every commutative ring.
We present a few simple propositions about binomial coefficients - another simple example of a mathematical proof. If needed, we define \(\binom{n}{k} = 0\) whenever k < 0 or k > n.
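1.3.3. Proposition. For all non-negative integers n, we have
(1) \(\binom{n}{k} = \binom{n}{n-k}\), \(0 \le k \le n\),
(2) \(\binom{n+1}{k+1} = \binom{n}{k} + \binom{n}{k+1}\), \(0 \le k \le n-1\),
(3) \(\sum_{k=0}^{n} \binom{n}{k} = 2^n\),
(4) \(\sum_{k=0}^{n} k\binom{n}{k} = n2^{n-1}\).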
Proof. The first formula in the proposition is immediate directly from the formula 1.3.2(1). If we expand the right-hand side of (2), we obtain
\[\binom{n}{k} + \binom{n}{k+1} = \frac{n!}{k!\,(n-k)!} + \frac{n!}{(k+1)!\,(n-k-1)!} = \frac{(k+1)\,n! + (n-k)\,n!}{(k+1)!\,(n-k)!} = \frac{(n+1)!}{(k+1)!\,(n-k)!},\]
which is the left-hand side of (2).
In order to prove (3), we use mathematical induction again. Mathematical induction consists of two steps. In the initial step, we establish the claim for n = 0 (in general, for the smallest n for which the claim should hold). In the inductive step we assume that the claim holds for some n (and all smaller numbers). We use this to prove the claim for n + 1. The principle of mathematical induction then asserts that the claim holds for every n.
The claim (3) clearly holds for n = 0, since \(\binom{0}{0} = 1 = 2^0\). It also holds for n = 1. Now assume that the claim holds for some \(n \ge 1\). We must prove the corresponding claim for n + 1 using the claims (2) and (3). We calculate
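\[\sum_{k=0}^{n+1}\binom{n+1}{k} = \sum_{k=0}^{n+1}\left[\binom{n}{k-1} + \binom{n}{k}\right] = \sum_{k=-1}^{n}\binom{n}{k} + \sum_{k=0}^{n+1}\binom{n}{k} = 2^n + 2^n = 2^{n+1}.\]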
Note that the formula (3) gives the number of all subsets of an n-element set, since \(\binom{n}{k}\) is the number of all subsets of size k. Note also that (3) follows from 1.3.2(2) by choosing a = b = 1.
To prove (4) we again employ induction, as we did in (3). For n = 0 the claim clearly holds. The inductive assumption
First choose the seats to be occupied, that is, \(\binom{24}{21}\). Among these seats the people can be seated in 21! ways. The solution is \(2 \cdot \binom{24}{21} \cdot 21! = \frac{24!}{3}\) ways. □
1.C.10. Determine the number of distinct arrangements which can arise by permuting the letters in each individual word in the sentence "Pull up if I pull up" (the arising arrangements and words do not have to make any sense).
Solution. Let us first compute the number of rearrangements of letters in the individual words. From the word "pull" we obtain 4!/2 distinct anagrams (permutations with repetition P(1, 1, 2)); similarly "up" and "if" yield two each. Therefore, using the rule of product, we have \(\frac{4!}{2} \cdot 2 \cdot 2 \cdot 1 \cdot \frac{4!}{2} \cdot 2 = 1152\). Notice that if the resulting arrangement should be a palindromic one again, there would be only four possibilities. □
1.C.11. In how many ways can we insert five golf balls into five holes (into every hole one ball), if we have four identical white balls, four identical blue balls and three identical red balls?
Solution. First solve the problem in the case that we have five balls of every colour. In this case it amounts to a free choice of five elements from three possibilities (there is a choice out of three colours for every hole), that is permutations with repetitions (see). We have
\[V(3,5) = 3^5.\]
Now subtract the configurations where there are either balls of one colour only (there are three such), or exactly four red balls (there are \(2 \cdot 5 = 10\); we first choose the colour of the non-red ball, two ways, and then the hole it is in, five ways). Thus we can do it in
\[3^5 - 3 - 10 = 230\]
ways. □
1.C.12. In how many ways can we insert into three distinct envelopes five identical 10-bills and five identical 100-bills such that no envelope stays empty?
Solution. First compute the number of insertions ignoring the non-emptiness condition. Distributing five identical bills among three envelopes is an example of 5-combinations with repetition from 3 elements, and since we insert the 10-bills and 100-bills independently, we have \(c(7,2)^2 = \binom{7}{2}^2 = 441\) ways. Now subtract the insertions such that exactly one envelope is empty and then the insertions such that two are empty. We have
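says that (4) holds for some n. We calculate the corresponding sum for n + 1 using (2) and the inductive assumption. We obtain
\[\sum_{k=0}^{n+1} k\binom{n+1}{k} = \sum_{k=0}^{n+1} k\left[\binom{n}{k-1} + \binom{n}{k}\right] = \sum_{k=-1}^{n}(k+1)\binom{n}{k} + \sum_{k=0}^{n+1} k\binom{n}{k}\]
\[= \sum_{k=0}^{n}\binom{n}{k} + \sum_{k=0}^{n} k\binom{n}{k} + \sum_{k=0}^{n} k\binom{n}{k} = 2^n + n2^{n-1} + n2^{n-1} = (n+1)2^n.\]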
This completes the inductive step and the claim is proven for all natural n. □
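The four claims of Proposition 1.3.3 are easy to test numerically for small n. This is a sanity check, not a proof; the helper name `check_identities` is ours. Claim (4) is tested in the doubled form \(2\sum k\binom{n}{k} = n2^n\) so that the case n = 0 stays within the integers.

```python
from math import comb

def check_identities(n):
    """Return True if claims (1)-(4) of Proposition 1.3.3 hold for this n."""
    symmetry = all(comb(n, k) == comb(n, n - k) for k in range(n + 1))
    pascal = all(comb(n + 1, k + 1) == comb(n, k) + comb(n, k + 1)
                 for k in range(n))
    row_sum = sum(comb(n, k) for k in range(n + 1)) == 2 ** n
    # Claim (4), multiplied by 2 on both sides to avoid 2**(-1) when n = 0.
    weighted = 2 * sum(k * comb(n, k) for k in range(n + 1)) == n * 2 ** n
    return symmetry and pascal and row_sum and weighted

print(all(check_identities(n) for n in range(20)))  # True
```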
The second property from above allows us to write down all the binomial coefficients into the Pascal triangle.² Here, every coefficient is obtained as a sum of the two coefficients situated right "above" it:
n = 0:   1
n = 1:   1 1
n = 2:   1 2 1
n = 3:   1 3 3 1
n = 4:   1 4 6 4 1
n = 5:   1 5 10 10 5 1
Note that in individual rows we have the coefficients of individual powers in the expression (2). For instance the last given row says
\[(a+b)^5 = a^5 + 5a^4b + 10a^3b^2 + 10a^2b^3 + 5ab^4 + b^5.\]
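The rows of the Pascal triangle can be generated using only the rule that every coefficient is the sum of the two coefficients above it. A minimal sketch follows; the function name `pascal_rows` is ours, and it assumes at least one row is requested.

```python
def pascal_rows(count):
    """Return the first `count` rows (count >= 1) of the Pascal triangle."""
    rows = [[1]]
    while len(rows) < count:
        prev = rows[-1]
        # Each inner entry is the sum of the two entries right above it.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows

for n, row in enumerate(pascal_rows(6)):
    print(f"n = {n}:", *row)
```

The last printed row, 1 5 10 10 5 1, gives the coefficients of the expansion of the fifth power above.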
1.3.4. Choice with repetitions. The ordering of n elements, where some of them are indistinguishable, is called a permutation with repetitions.
Among n given elements, suppose there are \(p_1\) elements of the first kind, \(p_2\) elements of the second kind, ..., \(p_k\) of the k-th kind, where \(p_1 + p_2 + \cdots + p_k = n\). Then the number of permutations with repetitions of these elements is denoted as \(P(p_1, \dots, p_k)\).
We consider the orderings which differ only in the order of indistinguishable elements to be identical. Elements of the i-th kind can be ordered in \(p_i!\) ways, thus we have
Permutations with repetitions
The number of permutations with repetitions is
\[P(p_1, \dots, p_k) = \frac{n!}{p_1! \cdots p_k!}.\]
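As a small illustration of the formula for \(P(p_1, \dots, p_k)\), the sketch below compares it with a brute-force count of the distinct orderings of the letters of a word. The function name `perms_with_repetition` and the sample word are our own choices.

```python
from itertools import permutations
from math import factorial

def perms_with_repetition(*counts):
    """P(p_1, ..., p_k) = n! / (p_1! ... p_k!) with n = p_1 + ... + p_k."""
    result = factorial(sum(counts))
    for p in counts:
        result //= factorial(p)  # every partial quotient is an integer
    return result

# The word "pull" has letter multiplicities 1, 1, 2.
distinct = len(set(permutations("pull")))
print(perms_with_repetition(1, 1, 2), distinct)  # 12 12
```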
Let S be a set with n distinct elements. We wish to select k elements, \(0 \le k \le n\), from S with repetition permitted. This is called a k-permutation with repetition. Since the first selection can be done in n ways, and similarly the second can
² Although the name goes back to Blaise Pascal's treatise from 1653, such a neat triangle configuration of the numbers c(n, k) was known centuries earlier in China, India, Greece, etc.
\[C(2,7)^2 - 3\left(C(1,6)^2 - 2\right) - 3 = \binom{7}{2}^2 - 3(6^2 - 2) - 3 = 336.\]
□
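The result of exercise 1.C.12 can be confirmed by listing all distributions directly. The sketch below is a brute-force check under our own naming (`splits`, `count_fillings`); it enumerates how many 10-bills and how many 100-bills go into each envelope and keeps the cases with no empty envelope.

```python
from itertools import product

def splits(total, parts):
    """All ways to split `total` identical items into `parts` labelled boxes."""
    return [s for s in product(range(total + 1), repeat=parts)
            if sum(s) == total]

def count_fillings(tens=5, hundreds=5, envelopes=3):
    count = 0
    for t in splits(tens, envelopes):
        for h in splits(hundreds, envelopes):
            # An envelope is non-empty if it holds at least one bill of any kind.
            if all(a + b > 0 for a, b in zip(t, h)):
                count += 1
    return count

print(len(splits(5, 3)), count_fillings())  # 21 336
```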
1.C.13. For any fixed \(n \in \mathbb{N}\), determine the number of all solutions to the equation
\[x_1 + x_2 + \cdots + x_k = n\]
in the set of non-negative integers.
Solution. Every solution \((r_1, \dots, r_k)\), \(\sum_{i=1}^{k} r_i = n\), can be uniquely encoded as a sequence of separators and ones, where we first write \(r_1\) ones, then a separator, then \(r_2\) ones, then another separator, and so on. Such a sequence then clearly contains n ones and k − 1 separators. Every such sequence clearly determines some solution of the given equation. Thus there are exactly as many solutions as there are sequences, that is, \(\binom{n+k-1}{k-1}\). □
1.C.14. In how many ways could the English Premier League have finished, if we know that no two of the three teams Newcastle United, Crystal Palace and Tottenham Hotspur are "adjacent" in the final table? (There are 20 teams in the league.)
Solution. First approach. We use the inclusion-exclusion principle. From the number of all possible resulting tables we subtract the tables where some two of the three teams are adjacent and then add the tables where all three teams are adjacent. The number is then
\[20! - \binom{3}{2} \cdot 2! \cdot 19! + 3! \cdot 18! = 18! \cdot 16 \cdot 17.\]
Second approach. Let us consider the three teams to be "separators". The remaining teams have to be divided such that between any two separators there is at least one team. The remaining teams can be arbitrarily permuted, as can the separators. Thus we have
\[\binom{18}{3} \cdot 17! \cdot 3! = 18! \cdot 17 \cdot 16\]
ways. □
D. Probability
We present a few simple exercises for classical probability, where we are dealing with some experiment with only a finite number of outcomes ("all cases") and we are interested in whether or not the outcome of the experiment belongs to a subset of possible outcomes ("favourable outcomes"). The probability we are trying to determine then equals the number
also be done in n ways etc. The total number V(n, k) of k-permutations with repetitions is \(n^k\). Hence
k-permutations with repetitions
\[V(n,k) = n^k.\]
If we are interested in a choice of k elements without taking care of order, we speak of k-combinations with repetitions. At first sight, it does not seem to be easy to determine the number. We reduce the problem to another problem we have already solved, namely combinations without repetitions:
Combinations with repetitions
Theorem. The number of k-combinations with repetitions from n elements equals, for every \(k \ge 0\) and \(n \ge 1\),
\[\binom{n+k-1}{k}.\]
Proof. Label the n elements as \(a_1, a_2, \dots, a_n\). Suppose each element labelled \(a_i\) is selected \(k_i\) times, \(0 \le k_i \le k\), so that \(k_1 + k_2 + \cdots + k_n = k\). Each such selection can be paired with a sequence of symbols * and |, where each * represents one selection of an element and the individual boxes are separated by | (therefore there are n − 1 of them).
The number of * in the i-th box is equal to \(k_i\), so we obtain the sequence
\[\underbrace{*\cdots*}_{k_1} \mid \underbrace{*\cdots*}_{k_2} \mid \cdots \mid \underbrace{*\cdots*}_{k_n}.\]
The other way around, from any such sequence we can determine the number of selections of any element (e.g. the number of * before the first | determines \(k_1\)).
Having altogether k symbols * and n − 1 separators | we see that there are
\[\binom{n+k-1}{n-1} = \binom{n+k-1}{k}\]
possible sequences and therefore also the same number of the required selections. □
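The count from the theorem can be compared with an explicit listing of all selections with repetition for small n and k. A minimal sketch, with a function name of our choosing and Python's `itertools.combinations_with_replacement` doing the enumeration:

```python
from itertools import combinations_with_replacement
from math import comb

def combos_with_repetition(n, k):
    """Number of k-combinations with repetitions from n elements."""
    return comb(n + k - 1, k)

for n in range(1, 6):
    for k in range(0, 6):
        listed = sum(1 for _ in combinations_with_replacement(range(n), k))
        assert listed == combos_with_repetition(n, k)

print(combos_with_repetition(3, 5))  # 21
```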
4. Probability
Now we are going to discuss the last type of function description, as listed in the very end of the subsection 1.1.5. Thus, instead of assigning explicit values of a function, we shall try to describe the probabilities of the individual options.
1.4.1. What is probability? As a simple example we can use common six-sided dice throwing, with sides labelled as
1, 2, 3, 4, 5, 6.
of favourable outcomes divided by the total number of all outcomes. Classical probability can be used when we assume, or know, that each possible outcome has the same probability of happening (for instance, fair dice throwing).
1.D.1. What is the probability that the roll of a dice results in a number greater than 4?
Solution. There are six possible outcomes (the set {1,2,3,4,5,6}). Two are favourable ({5,6}). Thus the probability is 2/6 = 1/3. □
1.D.2. We choose randomly a group of five people from a group of eight men and four women. What is the probability that there are at least three women in the chosen group?
Solution. We divide the favourable cases according to the number of men in the chosen group: there can be either two or one. There are eight groups with five people of which one is a man (all women have to be present in such groups, thus it depends only on which man is chosen). There are \(c(8,2) \cdot c(4,3) = \binom{8}{2} \cdot \binom{4}{3}\) groups with two men (we choose two men from eight and then independently three women from four; these two choices can be independently combined, and thus using the rule of product we obtain the number of such groups). The total number of groups with five people is \(c(12,5) = \binom{12}{5}\). The probability, being the quotient of the number of favourable outcomes to the total number of outcomes, is then
\[\frac{8 + \binom{8}{2}\binom{4}{3}}{\binom{12}{5}} = \frac{5}{33}.\]
1.D.3. From a deck with 108 cards (2 × 52 + 4 jolly jokers) we randomly draw 4 cards without returning them. What is the probability that at least one of them is an ace or a joker?
Solution. We can easily determine the probability of the complementary event, that is, that none of the 4 drawn cards is one of the 12 cards (8 aces and 4 jokers). This probability is given by the ratio of the number of choices of 4 cards from 96 and the number of choices of 4 cards from 108, that is, \(\binom{96}{4} / \binom{108}{4}\). The event in question thus has the probability
\[1 - \frac{\binom{96}{4}}{\binom{108}{4}} \approx 0.380.\]
□
We give an example for which the use of classical probability is not suitable:
If we describe the mathematical model of such throwing with a "fair" dice, we expect by symmetry that every side occurs with the same frequency. We say that "every side occurs with the probability 1/6".
But throwing some less symmetric version of a dice with six faces, the actual probabilities of the individual results might be quite different. Let us build a simple mathematical model for this. We shall work with the parameters \(p_i\) for the probabilities of individual sides with two requirements. These probabilities have to be non-negative real numbers and their sum is one, i.e.
\[p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1.\]
At this time, we are not concerned about the particular choice of the specific values \(p_i\); they are given to us. Later on, in chapter 10, we shall link probability with mathematical statistics and then we shall introduce methods to discuss the reliability of such a model for a specific real dice.
1.4.2. Classical probability. Let us come back to the mathematical model for the fair dice. We consider the sample space \(\Omega = \{1, 2, 3, 4, 5, 6\}\) of all possible elementary events (each of them corresponding to one possible result of the experiment of throwing the dice). Then we can consider any event as a given subset A of Ω. For example \(A = \{1, 3, 5\}\) describes the result of getting an odd number on the resulting side (we count the labels on the sides of the dice). Similarly, the set \(B = A^c = \{2, 4, 6\} = \Omega \setminus A\) is the complementary event of getting an even number of points. The probability of both A and B will be 1/2. Indeed, \(|A| / |\Omega| = |B| / |\Omega| = 1/2\), where \(|A|\) means the number of elements of a set A.
This leads to the following obvious generalization:
Classical probability
Let Ω be a finite set with \(n = |\Omega|\) elements. The classical probability of the event corresponding to any subset \(A \subset \Omega\) is defined as
\[P(A) = \frac{|A|}{|\Omega|}.\]
Such a definition immediately allows us to solve problems related to throwing several fair dice simultaneously. Indeed, we may treat this as throwing independently one dice many times and thus multiplying the probabilities. For example, the event of getting an odd sum of points on two dice is given by adding the probabilities of having an even number on the first one and an odd number on the second one and vice versa. Thus the probability will be twice \(1/2 \cdot 1/2\), which is 1/2 as expected.
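The two-dice example can also be checked by enumerating the sample space of all 36 ordered pairs and counting the favourable outcomes. The helper `classical_probability` below is a hypothetical name for a direct transcription of the definition \(P(A) = |A| / |\Omega|\); it is meant only as an illustration.

```python
from fractions import Fraction
from itertools import product

def classical_probability(event, sample_space):
    """|A| / |Omega| for the event A given as a predicate on outcomes."""
    outcomes = list(sample_space)
    favourable = [w for w in outcomes if event(w)]
    return Fraction(len(favourable), len(outcomes))

two_dice = list(product(range(1, 7), repeat=2))

def odd_sum(w):
    return (w[0] + w[1]) % 2 == 1

print(classical_probability(odd_sum, two_dice))  # 1/2
```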
1.4.3. Probability space. Next, we formulate a more general concept of probability covering also the unfair dice example above.
We shall need a finite set Ω of all possible states of a system (e.g. results of an experiment), which we call the sample space.
1.D.4. What is the probability that the reader of this exercise wins at least 25 million euro in EuroLotto during the next week?
Solution. Such a formulation is incomplete; it does not give us enough information. We present a "wrong" solution. The sample space of possible outcomes has two elements: either the reader wins or not. There is one favourable event (a win), thus the probability is 1/2. This is clearly a wrong answer. □
Remark. In the previous exercise the basic condition for the use of classical probability was violated: every elementary event must have the same probability. In fact, the elementary event has not been defined. EuroLotto has a daily draw with a jackpot of €25 000 000 for choosing 5 correct numbers from 1, ..., 50. There is no other way to win €25 000 000 than to win a jackpot on one of the days during the week. The elementary event would be that a single lotto card with 5 numbers wins a jackpot. Assuming that the reader submits k lotto cards every day of the week, the probability of winning at least one jackpot during the week is \(\frac{7k}{\binom{50}{5}} = \frac{7k}{2118760}\).
1.D.5. There are 2n seats in a row in a cinema. We randomly seat n men and n women in the row. What is the probability that no two persons of the same sex sit next to each other?
Solution. There are (2n)! possible seatings. The number of seatings satisfying the given condition is \(2(n!)^2\), for we have two ways of choosing the positions for men (thus also for women): either all men sit on odd-numbered places (thus the women sit on even-numbered places), or vice versa. Among these places, both men and women are seated arbitrarily. The resulting probability is thus
\[p(n) = \frac{2(n!)^2}{(2n)!}.\]
In particular, p(2) ≈ 0.33, p(5) ≈ 0.0079, p(8) ≈ 0.00016. □
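The formula for p(n) can be compared with a brute-force count over all seatings for small n. The sketch below uses our own names `p` and `p_brute`; persons 0, ..., n − 1 stand for the men and n, ..., 2n − 1 for the women, which is just a labelling assumption.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def p(n):
    """Closed formula 2 (n!)^2 / (2n)! from the solution."""
    return Fraction(2 * factorial(n) ** 2, factorial(2 * n))

def p_brute(n):
    """Count the seatings in which neighbours always differ in sex."""
    good = total = 0
    for seating in permutations(range(2 * n)):
        total += 1
        is_man = [person < n for person in seating]
        if all(is_man[i] != is_man[i + 1] for i in range(2 * n - 1)):
            good += 1
    return Fraction(good, total)

assert p_brute(2) == p(2) and p_brute(3) == p(3)
print(p(2), round(float(p(5)), 4))  # 1/3 0.0079
```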
1.D.6. Five persons enter an elevator in a building with eight floors. Each of them leaves the elevator at any floor with the same probability. What is then the probability that
i) all of them leave at the sixth floor,
ii) all of them leave at the same floor,
iii) each of them leaves at a different floor?
Solution. The sample space of possible events is the space of all possible ways of leaving the elevator by 5 people. There are \(8^5\) of them.
Further, the space of all possible events is given as the set \(\mathcal{A}\) of all subsets in Ω. Finally, we need the function describing the probabilities of occurrence of individual events:
Probability function
Let us consider a non-empty fixed sample space Ω with the set \(\mathcal{A}\) of all events. The probability function \(P : \mathcal{A} \to \mathbb{R}\) satisfies
(1) \(P(\Omega) = 1\),
(2) \(0 \le P(A)\) for all events A,
(3) \(P(A \cup B) = P(A) + P(B)\) whenever \(A \cap B = \emptyset\).
Notice that the intersection \(A \cap B\) describes the simultaneous appearance of both events, while the union \(A \cup B\) means that at least one of the events A and B appears. The event \(A^c = \Omega \setminus A\) is called the complementary event.
There are some further straightforward consequences of the definition for all events A, B:
(4) \(P(A) = 1 - P(A^c)\),
(5) \(P(\emptyset) = 0\),
(6) \(P(A) \le 1\) for all events A,
(7) \(P(A) \le P(B)\) whenever \(A \subset B\),
(8) \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).
The proofs are all elementary. For example, \(A \cup A^c = \Omega\) and thus (3), together with (1), implies (4).
Similarly, we can write \(A = (A \setminus B) \cup (A \cap B)\) and \(A \cup B = (A \setminus B) \cup (B \setminus A) \cup (A \cap B)\) with disjoint unions of sets on the right hand sides. Thus, \(P(A) = P(A \setminus B) + P(A \cap B)\) and \(P(A \cup B) = P(A \setminus B) + P(B \setminus A) + P(A \cap B)\) by (3), which implies the last equality. The remaining three claims are even simpler.
All these properties correspond exactly to our intuition how probability should behave. Probability should be always
drawn balls back). What is the probability that the fifth drawn ball is black?
Solution. We will solve a more general problem, the probability that the i-th drawn ball is black. This probability is the same for all i, \(1 \le i \le 16\): we can imagine that we draw all balls one by one, and every such sequence (from the first drawn ball to the last one) consisting of five white, five red and six black balls has the same probability of being drawn. Thus we can use classical probability. There are \(P(5,5,6) = \frac{16!}{5!\,5!\,6!}\) such sequences. The number of sequences where there is a black ball on the i-th place, the rest arbitrary, equals the number of arbitrary sequences of five white, five red and five
are counted twice. But we must then add in the number of elements in the intersection of all three.
black balls. That is, \(P(5,5,5) = \frac{15!}{5!\,5!\,5!}\). Thus the probability is
\[\frac{P(5,5,5)}{P(5,5,6)} = \frac{15!}{5!\,5!\,5!} \cdot \frac{6!\,5!\,5!}{16!} = \frac{3}{8}.\]
□
1.D.10. Inclusion-exclusion principle. A secretary has to send six letters to six different people. She puts the letters in the envelopes randomly. What is the probability that at least one person receives the correct intended letter?
Solution. We compute the probability of the complementary event: no person receives the correct letter.
The sample space corresponds to all possible orderings of six envelopes. If we denote both the letters and the envelopes by numbers from one to six, then all the favourable events (no letter is assigned to the corresponding envelope) correspond to such orderings of six elements, where the i-th element is not at the i-th place (i = 1, ..., 6).
These are the orderings without a fixed point. We compute the number of such orderings using the inclusion-exclusion principle. If we denote by \(M_i\) the set of permutations such that i is a fixed point (note that permutations in \(M_i\) can also have other fixed points), then the resulting number d of permutations without a fixed point is
\[d = 6! - |M_1 \cup \cdots \cup M_6|.\]
We shall now follow the same idea in order to write down the formula in the following theorem. It seems plausible that such a formula should work with proper coefficients of the sums of probabilities of intersections of more and more events among A1,..., Ak, at least in the case of classical probability. The reader will perhaps appreciate that a quite straightforward mathematical induction will verify the theorem in full generality.
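1.4.5. Theorem. Let \(A_1, \dots, A_k \in \mathcal{A}\) be arbitrary events over the sample space Ω with a set of events \(\mathcal{A}\). Then
\[P\Big(\bigcup_{i=1}^{k} A_i\Big) = \sum_{i=1}^{k} P(A_i) - \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} P(A_i \cap A_j) + \sum_{i=1}^{k-2}\sum_{j=i+1}^{k-1}\sum_{\ell=j+1}^{k} P(A_i \cap A_j \cap A_\ell) - \cdots + (-1)^{k-1} P(A_1 \cap A_2 \cap \cdots \cap A_k).\]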
Proof. For k = 1 the claim is obvious. The case k = 2 is the same as the equality 1.4.3(8), which we have already proved.
Assume that the theorem holds for any number of events up to k, where k > 1. Now we can work in the induction step with the formula for k + 1 events, where the union of the first k of them is considered to be the A in the equation 1.4.3(8) and the remaining event is considered to be the B:
\[P\Big(\bigcup_{i=1}^{k+1} A_i\Big) = P\Big(\Big(\bigcup_{i=1}^{k} A_i\Big) \cup A_{k+1}\Big) = \sum_{j=1}^{k}\Big((-1)^{j+1} \sum_{1 \le i_1 < \cdots < i_j \le k} P(A_{i_1} \cap \cdots \cap A_{i_j})\Big) + P(A_{k+1}) - P\big((A_1 \cup \cdots \cup A_k) \cap A_{k+1}\big).\]
This already resembles the formula for k + 1 summed events. But in the first term, expressions containing \(A_{k+1}\) are missing. Also absent is a term allowing for the probability that all the
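The number of elements in the intersection \(M_{i_1} \cap \cdots \cap M_{i_k}\), k = 1, ..., 6, is (6 − k)! (the order of the elements \(i_1, \dots, i_k\) is fixed, the remaining 6 − k can be ordered arbitrarily). Using the inclusion-exclusion principle we have
\[|M_1 \cup \cdots \cup M_6| = \sum_{k=1}^{6} (-1)^{k+1}\binom{6}{k}(6-k)!\]
and thus for the number d we obtain the relation
\[d = 6! - \sum_{k=1}^{6}(-1)^{k+1}\binom{6}{k}(6-k)! = \sum_{k=0}^{6}(-1)^{k}\binom{6}{k}(6-k)! = 6!\sum_{k=0}^{6}\frac{(-1)^k}{k!}.\]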
The probability that no person receives "his" letter is then
\[\frac{d}{6!} = \sum_{k=0}^{6}\frac{(-1)^k}{k!} = \frac{53}{144}.\]
The probability we were asked for is
\[1 - \sum_{k=0}^{6}\frac{(-1)^k}{k!} = \frac{91}{144}.\]
□
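The number of orderings without a fixed point in exercise 1.D.10 is small enough to be counted directly. The following sketch (function names are ours) compares the inclusion-exclusion formula with an enumeration of all 6! permutations.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def derangements_formula(n):
    """n! * sum_{k=0}^{n} (-1)^k / k!, evaluated in integers."""
    return sum((-1) ** k * factorial(n) // factorial(k) for k in range(n + 1))

def derangements_brute(n):
    """Count the permutations of 0..n-1 with no fixed point."""
    return sum(1 for perm in permutations(range(n))
               if all(perm[i] != i for i in range(n)))

d = derangements_formula(6)
assert d == derangements_brute(6)
print(d, 1 - Fraction(d, factorial(6)))  # 265 91/144
```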
Remark. Notice that the answer does not change much with a growing number of letters. For n letters, the probability that the secretary does not put any of them into the correct envelope is
\[\sum_{k=0}^{n}\frac{(-1)^k}{k!}.\]
As we will see later, the sum converges to the value 1/e. In a similar way the exercise 1.G.45 can be solved.
The following exercise is a simple model, which estimates the probability of death of a person in a traffic accident.
1.D.11. Approximately 1200 persons die per year on the roads of the Czech Republic. Determine the probability that some person of a chosen group of 500 people dies in the following ten years in a traffic accident. For simplicity, assume that every person has the same "chance" of dying in a traffic accident in one year and that this probability is \(1200/10^7\).
Solution. Let us first compute the probability that one randomly chosen person does not die in ten years in a traffic accident. The probability that he/she does not die in a year is \(1 - \frac{12}{10^5}\). The probability that he/she does not die in ten years is then \(\left(1 - \frac{12}{10^5}\right)^{10}\). The probability that in ten years none of the given 500 people dies is, again using the product rule (the events are independent), \(\left(1 - \frac{12}{10^5}\right)^{5000}\). The probability of the
events happen. On the other hand, the last expression should not be there. We can replace it by the expression
\[-P\big((A_1 \cap A_{k+1}) \cup \cdots \cup (A_k \cap A_{k+1})\big)\]
and for this we can again use the induction, that is, the formula in the statement of the theorem. With a little patience (and a piece of paper long enough to write down all the expressions) we can check that this adds all the missing pieces. □
1.4.6. Inclusion-exclusion principle. As we have mentioned already, a special case of the previous theorem is the one of classical probability. There the probability of an event A is strictly proportional to the number of elements in A (which is just divided by the total size n of the sample space). Thus, in the formula from the previous theorem, all the probabilities give the sizes of the subsets involved, up to a common factor \(\frac{1}{n}\).
In this way we can extract from the theorem 1.4.5 the following claim for the size of a general finite set M and its subsets \(A_1, \dots, A_k\). As usual we let \(|M|\) denote the number of elements of the set M.
Of course for every finite set M and its subsets,
\[\Big|M \setminus \Big(\bigcup_{i=1}^{k} A_i\Big)\Big| = |M| - \Big|\bigcup_{i=1}^{k} A_i\Big|.\]
Now we use the previous theorem to express the size of the union on the right side, and we obtain the theorem that is usually called
Principle of inclusion-exclusion
\[\Big|M \setminus \Big(\bigcup_{i=1}^{k} A_i\Big)\Big| = |M| + \sum_{j=1}^{k}\Big((-1)^{j}\sum_{1 \le i_1 < \cdots < i_j \le k} |A_{i_1} \cap \cdots \cap A_{i_j}|\Big).\]
The meaning of this result for the special case n = 3 can be visualized easily, see the diagram before the theorem 1.4.5.
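The principle of inclusion-exclusion translates directly into a short computation over all j-element families of the given subsets. The sketch below is illustrative only; the function name `complement_size` and the sample sets (numbers from 1 to 100 divisible by 2, 3 or 5) are our own assumptions.

```python
from itertools import combinations

def complement_size(universe, subsets):
    """Size of M minus (A_1 u ... u A_k), computed by inclusion-exclusion."""
    total = len(universe)
    for j in range(1, len(subsets) + 1):
        for chosen in combinations(subsets, j):
            # Sign (-1)^j times the size of the j-fold intersection.
            total += (-1) ** j * len(set.intersection(*chosen))
    return total

M = set(range(1, 101))
A = [{m for m in M if m % d == 0} for d in (2, 3, 5)]

# Compare with the direct computation of the set difference.
print(complement_size(M, A), len(M - set.union(*A)))  # 26 26
```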
1.4.7. Independent events. Next, we wish to express possible dependencies among events in a given sample space Ω with the probability function P. We say that the events A and B are stochastically independent if
\[P(A \cap B) = P(A) \cdot P(B).\]
This definition may remind us of our experiences in combinatorics when counting possibilities for independent choices. For example, dealing with a fair dice, we can define events A "odd number occurs", B "the result is at least 3" and C "the result is at most 3". The probabilities are \(P(A) = \frac{1}{2}\), \(P(B) = \frac{2}{3}\), \(P(C) = \frac{1}{2}\), \(P(A \cap B \cap C) = \frac{1}{6} = P(A) \cdot P(B) \cdot P(C)\), and taking pairs we have \(P(A \cap C) = \frac{1}{3} \ne \frac{1}{2} \cdot \frac{1}{2}\), \(P(A \cap B) = \frac{1}{3} = \frac{1}{2} \cdot \frac{2}{3}\), \(P(B \cap C) = \frac{1}{6} \ne \frac{2}{3} \cdot \frac{1}{2}\). Notice that the stochastic dependence of the pairs A, C and B, C corresponds well to our
complementary event, that is, that some of the chosen people dies, is then
\[1 - \left(1 - \frac{12}{10^5}\right)^{5000} \approx 0.4512.\]
□
Remark. The model we have used in the previous exercise to describe the given situation is only approximate. The complication is in the condition that every person in the sample has the same probability of dying, which is derived from the total number of deaths per year. But the number of deaths changes yearly and even if it did not, the population changes. We show one of the possible inaccuracies by a different approach to the solution: if 1200 persons die per year, then in ten years 12000 persons die. The probability that a certain person dies in ten years can thus be estimated by \(12000/10^7\). The probability that a specific person does not die in ten years is then \(1 - \frac{12}{10^4}\) (the first two terms of the binomial expansion of \(\left(1 - \frac{12}{10^5}\right)^{10}\)). In total we analogously obtain the estimate of the probability
\[1 - \left(1 - \frac{12}{10^4}\right)^{500} \approx 0.4514.\]
We see that both estimates are very close to each other.
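Both estimates are quick to reproduce in floating-point arithmetic. The function below is a hypothetical helper of ours; the flag `linearised` switches between the ten-year survival probability \((1 - r)^{10}\) of the exercise and its two-term approximation \(1 - 10r\) from the remark.

```python
def some_death_probability(yearly_risk, people, years, linearised=False):
    """Probability that at least one of `people` persons dies within `years`."""
    if linearised:
        survive_one = 1 - years * yearly_risk      # first two binomial terms
    else:
        survive_one = (1 - yearly_risk) ** years   # independent years
    return 1 - survive_one ** people

risk = 1200 / 10**7
print(round(some_death_probability(risk, 500, 10), 4))                   # 0.4512
print(round(some_death_probability(risk, 500, 10, linearised=True), 4))  # 0.4514
```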
The effort to use mathematical knowledge for winning in various gambling games is very old. We look at a very simple example.
1.D.12. Alex has $2500 left over from organizing a summer camp. Alex added $50 from his savings and decided to go playing roulette. Alex bets only on colour. The probability of winning when betting on colour is 18/37. He begins with a bet of $10, and if he loses, in the next bet he doubles the bet that he made in the previous round. (This is only if he has enough money. If not, he ends the game even if he has some money left.) If he wins, in the next round he bets $10 again. What is the probability that using this strategy he wins another $2550? As soon as he has won such an amount, he ends the game.
Solution. First count how many times in a row Alex can lose. If he begins with a bet of $10, then for n bets he needs
\[10 + 20 + \cdots + 10 \cdot 2^{n-1} = 10 \cdot \sum_{i=0}^{n-1} 2^i = 10 \cdot (2^n - 1).\]
The number 2550 is of the form \(10(2^n - 1)\) for n = 8. Alex can thus bet eight times in a row no matter what the result is. For nine bets he would need \(10(2^9 - 1) = 5110\) dollars, and during the game he will never have such an amount, since as soon
intuition (e.g. there are more odd numbers between the values 1,2,3 than between the numbers 4,5, 6).
This example also shows that we have to be careful with more events. In general, mutually independent sets are defined in this way:
Definition. Consider an arbitrary probability space \((\Omega, \mathcal{A}, P)\) and k events \(A_1, \dots, A_k\) in that space. We say that these events are stochastically independent (with respect to the probability function P), if for any chosen events \(A_{i_1}, \dots, A_{i_\ell}\), \(1 \le \ell \le k\), we have
\[P(A_{i_1} \cap \cdots \cap A_{i_\ell}) = P(A_{i_1}) \cdots P(A_{i_\ell}).\]
Every subset of a set of stochastically independent events is also stochastically independent. Further, for any two stochastically independent events we compute
\[P(A \cap B^c) = P(A \setminus B) = P(A) - P(A \cap B) = P(A)(1 - P(B)) = P(A)P(B^c).\]
From there we can show that by exchanging one or more events in a set of stochastically independent events by their complements, we again obtain a set of stochastically independent events.
Sometimes we need to compute the probability that at least one of a set of stochastically independent events occurs. That is, we want to compute \(P(A_1 \cup \cdots \cup A_k)\). In such a situation we can use the De Morgan laws for sets,
\[\Big(\bigcup_{i \in I} A_i\Big)^c = \bigcap_{i \in I} A_i^c, \qquad \Big(\bigcap_{i \in I} A_i\Big)^c = \bigcup_{i \in I} A_i^c.\]
We obtain
\[P(A_1 \cup \cdots \cup A_k) = 1 - P(A_1^c \cap \cdots \cap A_k^c) = 1 - (1 - P(A_1)) \cdots (1 - P(A_k)) = 1 - \prod_{i=1}^{k}(1 - P(A_i)).\]
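As an illustration of the last formula, consider the hypothetical example (not taken from the text) of throwing a fair dice four times and asking for at least one six. The four events "a six in the i-th throw" are independent, so the product formula applies, and the result can be compared with a count over all \(6^4\) outcomes. The function name `prob_at_least_one` is ours.

```python
from fractions import Fraction
from itertools import product
from math import prod

def prob_at_least_one(probabilities):
    """1 - prod(1 - P(A_i)) for stochastically independent events A_i."""
    return 1 - prod(1 - p for p in probabilities)

by_formula = prob_at_least_one([Fraction(1, 6)] * 4)

# Direct count over the sample space of four throws.
throws = list(product(range(1, 7), repeat=4))
with_six = sum(1 for t in throws if 6 in t)

assert by_formula == Fraction(with_six, len(throws))
print(by_formula)  # 671/1296
```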
1.4.8. Conditional probability. Often we want to restrict our attention only to events which lie in a subspace \(H \subset \Omega\). This means that the events in question will be the intersections \(A \cap H\) of the original events A with the subset H. Thus our new probabilities should be proportional to \(P(A \cap H)\). We would like to have H in the role of the new sample space.
As an example, we might look again at the model of a fair dice and ask the question "what is the probability that by throwing two dice the result is twice 5, if we know that the sum of the results is 10?". Of course, we now have only the possibilities 4 + 6 (two times) and 5 + 5 (once). So the probability should be 1/3, much greater than the probability 1/36 of the same event without any further condition.
Similar situations are reflected in the following definition:
as he has $5100, he stops. Thus in order for him to fail, he must lose eight times in a row. The probability of losing on a single bet is 19/37. So the probability of losing eight times in a row is \((19/37)^8\). The probability that in these eight games he wins $10 (using his strategy) is thus \(1 - (19/37)^8\). In order to win $2550, he needs to win $10 255 times. Again using the product rule, the probability of winning is
\[\left(1 - \left(\frac{19}{37}\right)^8\right)^{255} \approx 0.29.\]
Thus the probability of winning is lower than betting everything at once on colour. □
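The computation in the solution of 1.D.12 can be reproduced in a few lines. The sketch follows the simplified model of the solution (a fixed limit of eight consecutive losses and 255 independent rounds, each won with probability \(1 - (19/37)^8\)); the function names and default parameters are our own.

```python
def max_losing_streak(bankroll, base_bet):
    """Largest n with base_bet * (2^n - 1) <= bankroll."""
    n = 0
    while base_bet * (2 ** (n + 1) - 1) <= bankroll:
        n += 1
    return n

def success_probability(bankroll=2550, base_bet=10, p_lose=19 / 37):
    streak = max_losing_streak(bankroll, base_bet)
    rounds = bankroll // base_bet  # number of base-bet gains needed to double up
    return (1 - p_lose ** streak) ** rounds

print(max_losing_streak(2550, 10), round(success_probability(), 2))  # 8 0.29
```

For comparison, a single bet of the whole amount on colour succeeds with probability 18/37, which is roughly 0.49.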
1.D.13. On your own, you can try to solve the previous exercise assuming that Alex has the same strategy as before, but ends only when he has no money (if he cannot afford to double the bet when he lost the previous bet but still has some money, he begins again with $10).
Now we consider "conditional" probability (see (1.4.8)).
1.D.14. What is the probability that when rolling two dice the sum is 7, if we know that neither of the rolls resulted in a 2?
Solution. Let B be the event that neither of the rolls results in a 2, and let A be the event "sum is 7". The set of all possible outcomes is again denoted by Ω. Then
\[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{|A \cap B| / |\Omega|}{|B| / |\Omega|} = \frac{|A \cap B|}{|B|}.\]
The number 7 can appear as a sum in four ways if there is no 2, that is, \(|A \cap B| = 4\), \(|B| = 5 \cdot 5 = 25\). Thus
\[P(A|B) = \frac{4}{25}.\]
Note that \(P(A) = \frac{1}{6}\), that is, A and B are not independent.
□
1.D.15. Michael has two mailboxes, one at gmail.com and the other at hotmail.com. His username is the same at both servers, but the passwords are different. He does not remember which password corresponds to which server. When typing in the password for accessing his mailbox, he makes a typo with probability 5% (that is, if he tries to type in a specific password, he types what he intended with probability 95%). At the server hotmail.com, Michael typed in the username and a password, but the server told him that something is wrong. What is the probability that he chose the correct password but
Conditional probability
Definition. Let $H$ be an event with non-zero probability in the sample space $\Omega$ with the probability function $P$. The conditional probability $P(A \mid H)$ of the event $A$ given $H$ is defined by the formula
$$P(A \mid H) = \frac{P(A \cap H)}{P(H)}.$$
The event H is sometimes called the hypothesis.
As is obvious from the definition, the hypothesis $H$ with non-zero probability and the event $A$ are (stochastically) independent if and only if $P(A) = P(A \mid H)$. The definition also directly implies the "theorem for product of probabilities": if we have two events $A_1$, $A_2$ satisfying $P(A_1 \cap A_2) > 0$, then
$$P(A_1 \cap A_2) = P(A_2)\,P(A_1 \mid A_2) = P(A_1)\,P(A_2 \mid A_1).$$
All these numbers express (in a different manner) the probability that both events $A_1$ and $A_2$ occur. For instance, in the last case we first look whether the first event occurred. Then, assuming that the first has occurred, we look whether the second also occurs. Similarly, for three events $A_1$, $A_2$, $A_3$ satisfying $P(A_1 \cap A_2 \cap A_3) > 0$ we obtain
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2).$$
The probability that three events occur simultaneously can be computed as follows. Compute the probability that the first occurs, then compute the probability that the second occurs under the assumption that the first has occurred. Then compute the probability that the third occurs under the assumption that both the first and the second have occurred. Finally, multiply the results together.
In general, if we have $k$ events $A_1, \dots, A_k$ satisfying $P(A_1 \cap \dots \cap A_k) > 0$, then the theorem says
$$P(A_1 \cap \dots \cap A_k) = P(A_1)\,P(A_2 \mid A_1) \cdots P(A_k \mid A_1 \cap \dots \cap A_{k-1}).$$
Notice that our condition $P(A_1 \cap \dots \cap A_k) > 0$ implies that all the hypotheses in the latter formula have non-zero probabilities and thus all the conditional probabilities make sense. Indeed, each $A_i$ is at least as big as the intersection and thus its probability is at least as big, thus non-zero, see 1.4.3(7).
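The product formula can be verified on a small finite sample space. The sketch below multiplies the conditional probabilities step by step and compares the result with the directly counted probability of the intersection. The three events on two dice and the function name `chain_rule` are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def chain_rule(events):
    """Multiply P(A1), P(A2|A1), ..., P(Ak|A1 and ... and A(k-1))."""
    result = Fraction(1)
    seen = []                     # hypotheses collected so far
    for event in events:
        hyp = prob(lambda w: all(e(w) for e in seen))
        both = prob(lambda w: all(e(w) for e in seen) and event(w))
        result *= both / hyp      # conditional probability of this step
        seen.append(event)
    return result

events = [
    lambda w: w[0] % 2 == 0,      # first die is even
    lambda w: w[0] + w[1] >= 8,   # sum is at least 8
    lambda w: w[1] == 6,          # second die shows 6
]
direct = prob(lambda w: all(e(w) for e in events))
print(chain_rule(events), direct)  # 1/12 1/12
```

The division by `hyp` is legitimate here because the intersection of all three events has non-zero probability, exactly as required in the statement.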
1.4.9. Geometric probability.
In practical problems, the sample space may not be a finite set. The set $\mathcal{A}$ of all events may not be the entire set of all subsets in $\Omega$. To generalise probability to such situations is beyond our scope now, but we can at least give a simple illustration.
Consider the plane $\mathbb{R}^2$ of pairs of real numbers and a subset $\Omega$ with known area. Events are represented by subsets $A \subset \Omega$. For the event set $\mathcal{A}$ we consider some suitable system of subsets for which we can determine the area. An event $A$ then occurs if a randomly chosen point from $\Omega$ belongs to
just mistyped? (Assume that the username is always typed correctly and that a typo cannot turn a wrong password into the correct one.)
Solution. Let A be the event that Michael typed in a wrong password at hotmail.com. This event is the union of two disjoint events:
$A_1$: he wanted to type in the correct password and mistyped,
$A_2$: he wanted to type in the wrong password (the one from gmail.com) and either mistyped it or not.
We are looking for the conditional probability $P(A_1 \mid A)$ which, according to the formula for conditional probability, is
$$P(A_1 \mid A) = \frac{P(A_1 \cap A)}{P(A)} = \frac{P(A_1)}{P(A_1 \cup A_2)} = \frac{P(A_1)}{P(A_1) + P(A_2)},$$
where we have used $P(A_1 \cup A_2) = P(A_1) + P(A_2)$, since $A_1$ and $A_2$ are disjoint. We just need to determine the probabilities $P(A_1)$ and $P(A_2)$. The event $A_1$ is the intersection of two independent events: Michael wanted to type in the correct password, and Michael mistyped. According to the problem statement, the probability of the first event is 1/2 and the probability of the second event is 1/20. In total $P(A_1) = \frac{1}{2} \cdot \frac{1}{20} = \frac{1}{40}$ (we multiply the probabilities, since the events are independent). Further we have (directly from the problem statement) $P(A_2) = \frac{1}{2}$. In total $P(A) = P(A_1) + P(A_2) = \frac{1}{40} + \frac{1}{2} = \frac{21}{40}$. We can evaluate
$$P(A_1 \mid A) = \frac{P(A_1)}{P(A)} = \frac{1/40}{21/40} = \frac{1}{21}.$$
□
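The arithmetic of the solution of 1.D.15 can be repeated with exact fractions. This is only a sketch of the computation above; the variable names are illustrative.

```python
from fractions import Fraction

p_typo = Fraction(1, 20)          # 5% chance of a typo
p_pick_correct = Fraction(1, 2)   # either password is tried with equal chance

p_a1 = p_pick_correct * p_typo    # correct password chosen, but mistyped
p_a2 = 1 - p_pick_correct         # wrong password chosen, rejected either way
p_a = p_a1 + p_a2                 # the server reports an error

print(p_a)          # 21/40
print(p_a1 / p_a)   # 1/21
```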
The method of geometric probability can be used when the given sample space is a region of a line, of a plane, or of a space (where we can determine length, area, or volume, respectively). We assume that the probability of an event is equal to the ratio of the measure (length, area, volume) of the corresponding subregion to the measure of the sample space.
1.D.16. From Edinburgh Waverley station trains depart every hour (in the direction of Aberdeen). From Aberdeen to Edinburgh they also depart every hour. Assume that the trains move between these two stations with a uniform speed of 72 km/h and are 100 meters long. The trip takes 2 hrs in either direction. The trains meet each other somewhere along the route. After visiting an Edinburgh pub, John, who lives in Aberdeen, takes the train home and falls asleep at the departure. During the trip from Edinburgh to Aberdeen he wakes up and
the subregion determined by A, otherwise the event does not occur.
Consider the problem of randomly choosing two numbers $a < b$ in the interval $[0,1] \subset \mathbb{R}$. All values $a$ and $b$ are chosen with equal probability. The question is "what is the probability that the interval $(a, b)$ has length at least one half?" The choice of the pair $(a, b)$ is actually the choice of a point $[a, b]$ inside the triangle $\Omega$ with vertices $[0,0]$, $[0,1]$, $[1,1]$ (see the diagram).
We can imagine this as a description of a problem where a very tired guest at a party tries to divide a sausage with two cuts into three pieces for himself and his two friends. What is the probability that the middle part will be at least half of the sausage?
Thus we need to determine the area of the subset which corresponds to points with $b > a + \frac{1}{2}$, that is, the interior of the triangle $A$ bounded by the points $[0, \frac{1}{2}]$, $[0, 1]$, $[\frac{1}{2}, 1]$. We find $P(A) = (1/8)/(1/2) = 1/4$.
Similarly, if we ask for the probability that some of the three guests will get at least half of the sausage, then we have to add the probabilities of two other events: $B$ saying $a > 1/2$ and $C$ given as $b < 1/2$. Clearly they correspond to the top right and the lowest triangles, respectively, and thus they have probabilities 1/4, too. Thus the requested probability is 3/4. Equivalently, we could have asked for the complementary event "all of them get less than a half", which clearly corresponds to the middle triangle and thus has probability 1/4.
Try to answer on your own the question "what is the minimal prescribed length $\ell$ such that the probability of choosing an interval $(a, b)$ of length at least $\ell$ is one half?"
1.4.10. Monte Carlo methods. One efficient method for computing approximate values is simulation by the relative occurrence of a chosen event.
We present an example. Let $\Omega$ be the unit square with vertices at $[0,0]$, $[1,0]$, $[0,1]$, and $[1,1]$. Let $A$ be the intersection of $\Omega$ with the unit disk centred at the origin. Then the area of $A$ is $\frac{1}{4}\pi$. Suppose we have a reliable generator of random numbers $a$ and $b$ between zero and one. We then compute relative frequencies of how often $a^2 + b^2 < 1$.
randomly sticks his head out of the train for five seconds, on the side of the train where the trains travel in the opposite direction. What is the probability that he loses his head? (We are assuming that there are no other trains involved.)
Solution. The mutual speed of the oncoming trains is 40 metres per second, so an oncoming train passes John's window in two and a half seconds. The sample space of all outcomes is thus the interval (0, 7200) of the seconds of the trip. During John's trip two trains pass by John's window in the opposite direction. Any overlap of the 2.5 seconds of the passing time interval with the 5 second time interval when John's head might be sticking out is fatal. Thus, for each train, the space of "favourable" outcomes is an interval of length 7.5 seconds somewhere in the sample space. For two trains, it is double this amount. Thus the probability of losing the head is 15/7200, which is approximately 0.002. □
1.D.17. In a certain country, a bus departs from town A to town B once a day at a random time between eight a.m. and eight p.m. Once a day in the same time interval another bus departs in the opposite direction. The trip in either direction takes five hours. What is the probability that the buses meet, assuming they use the same route?
Solution. The sample space is a square 12 x 12. If we denote the times of the departure of the buses by $x$ and $y$ respectively, then they meet on the route if and only if $|x - y| < 5$. This inequality determines the region in the square of "favourable events". This is the complement of the union of two right-angled isosceles triangles with legs of length 7. The total area of the two triangles is 49, so the area of the "favourable part" is 144 - 49 = 95. The probability is
$$p = \frac{95}{144} \doteq 0.66.$$
□
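The area argument of 1.D.17 works for any length of the departure window and any trip duration shorter than the window. A minimal sketch follows; the function name and its parameters are illustrative.

```python
from fractions import Fraction

def meeting_probability(window_hours, trip_hours):
    """P(|x - y| < trip) for x, y uniform on an interval of the given length."""
    gap = Fraction(window_hours - trip_hours)
    missed = gap * gap                  # two right triangles with legs `gap`
    total = Fraction(window_hours) ** 2
    return 1 - missed / total

p = meeting_probability(12, 5)
print(p, round(float(p), 2))  # 95/144 0.66
```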
1.D.18. A rod of length two meters is randomly divided into three parts. Determine the probability that at least one part is at most 20 cm long.
Solution. Random division of a rod into three parts is given by two points of the cut, x and y (we first cut the rod in the
That is, that $[a, b] \in A$. Then the result (after a large number of attempts) should approximate the area of a quarter unit circle, that is $\pi/4$, quite well.
Of course, the well-known formula for the area of a circle with radius $r$ is $\pi r^2$, where $\pi = 3.14159\ldots$. It is an interesting question: why should the area of a circle be a constant multiple of the square of its radius? We will be able to prove this later. Experimentally, we can hint at this by the approach as above using squares of different sizes.
Numerical approaches based on such probabilistic principles are called Monte Carlo methods.
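The experiment with the quarter disk can be sketched in a few lines. The code below assumes that Python's standard pseudo-random generator is a good enough stand-in for the "reliable generator of random numbers"; the function name, the number of samples, and the fixed seed are illustrative choices.

```python
import random

def estimate_quarter_disk_area(samples, seed=0):
    """Relative frequency of random points [a, b] of the unit square
    which satisfy a^2 + b^2 < 1."""
    rng = random.Random(seed)   # fixed seed, so that runs are repeatable
    hits = 0
    for _ in range(samples):
        a, b = rng.random(), rng.random()
        if a * a + b * b < 1:
            hits += 1
    return hits / samples

estimate = estimate_quarter_disk_area(200_000)
print(estimate, 4 * estimate)   # approximations of pi/4 and of pi
```

The error of such an estimate typically decreases only like the reciprocal of the square root of the number of samples, so many samples are needed for each further correct digit.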
5. Plane geometry
So far we have been using elementary notions from the geometry of the real plane in an intuitive way. Now we will investigate in more detail how to deal with the need to describe "position in the plane" and to find some relation between positions of distinct points in the plane.
Our tools will be mappings. We will consider only mappings which, to (ordered) pairs of values (x, y), assign pairs (w, z) = F(x, y). Such a mapping will consist of two functions w(x, y) and z(x, y), each depending on two arguments x, y. This will also serve as a gentle introduction to the part of mathematics called Linear algebra, with which we will deal in the subsequent three chapters.
1.5.1. Vector space $\mathbb{R}^2$. We view the "plane" as a set of pairs of real numbers $(x, y) \in \mathbb{R}^2$. We will call these pairs vectors in $\mathbb{R}^2$. For such vectors we can define addition "coordinate-wise", that is, for vectors $u = (x, y)$ and $v = (x', y')$ we set
u + v = (x + x', y + y').
Since all the properties of commutative groups hold for individual coordinates, these hold for our new vector addition too. In particular there exists a zero vector 0 = (0,0), such that v + 0 = v. We use the same symbol 0 for the vector and for the number zero on purpose. The context will always make it clear which "zero" it is.
distance x from the origin, we do not move it and again cut it in the distance y from the origin). The sample space is a square C with side 2 m. If we place the square C so that its two sides lie on axes in the plane, then the condition that at least one part is at most 20 cm determines in the square a subregion O (with x and y measured in centimetres):
$$O = \{(x, y) \in C \mid x < 20 \vee x > 180 \vee y < 20 \vee y > 180 \vee |x - y| < 20\}.$$
As we observe, the complement of this subregion consists of two right-angled triangles with legs of length 140, so the subregion has area $1 - (140/200)^2 = 51/100$ times the area of the square.
□
E. Plane geometry
Let us start with several standard problems related to lines in the plane:
1.E.1. Write down the general equation of the line $p: x = 2 - t,\ y = 1 + 3t,\ t \in \mathbb{R}$.
Solution. By eliminating $t$, we obtain $3x + y - 7 = 0$.
□
1.E.2. We are given a line
$p: [2, 0] + t(3, 2),\ t \in \mathbb{R}$.
Determine the general equation of this line. Determine its intersection with the line
$q: [-1, 2] + s(1, 3),\ s \in \mathbb{R}$.
Solution. The coordinates of the points on the first line are given by the parametric equations as $x = 2 + 3t$ and $y = 0 + 2t$. By eliminating $t$ from the equations we obtain the equation:
$2x - 3y - 4 = 0$.
Next we define scalar multiplication of vectors. For $a \in \mathbb{R}$ and $u = (x, y) \in \mathbb{R}^2$, we set
$a \cdot u = (ax, ay)$.
Usually we will omit the symbol $\cdot$ and use the juxtaposition of the symbols $a\,v$ to denote the scalar multiple of a vector.
We can directly check other properties for scalar multiplication by $a$ or $b$ and addition of vectors $u$ and $v$. For instance
$a(u + v) = au + av$, $(a + b)u = au + bu$, $a(bu) = (ab)u$.
We use the same symbol + for both vector addition and scalar addition.
Now we take a very important step. Define vectors $e_1 = (1, 0)$ and $e_2 = (0, 1)$. Every vector can then be written uniquely as
$u = (x, y) = x e_1 + y e_2$.
The expression on the right is called a linear combination of the vectors $e_1$ and $e_2$. The pair of vectors $e = (e_1, e_2)$ is called a basis of the vector space $\mathbb{R}^2$.
If we choose two non-zero vectors u, v such that neither of them is a multiple of the other, then they too form a basis of R2.
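To see what it means that two such vectors $u$, $v$ form a basis, we can compute the coefficients $a$, $b$ with $w = au + bv$ for a given vector $w$. The sketch below solves the two coordinate equations by elimination; the function name is illustrative, and the quantity `d` used in the formulas vanishes exactly when one of the vectors is a multiple of the other.

```python
from fractions import Fraction

def coordinates_in_basis(w, u, v):
    """Return (a, b) with w = a*u + b*v, for u, v not multiples of each other."""
    d = u[0] * v[1] - u[1] * v[0]
    if d == 0:
        raise ValueError("u and v are multiples of each other: not a basis")
    a = Fraction(w[0] * v[1] - w[1] * v[0], d)
    b = Fraction(u[0] * w[1] - u[1] * w[0], d)
    return a, b

a, b = coordinates_in_basis((5, 1), (1, 1), (1, -1))
print(a, b)  # 3 2
```

Indeed, $3 \cdot (1, 1) + 2 \cdot (1, -1) = (5, 1)$.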
[Figure: linear combination]
These operations are easy to imagine if we consider the vectors v to be arrows starting at the origin 0 = (0,0) and ending at the position (x, y) in the plane.
The addition of two such arrows is then given by the parallelogram law: Given two arrows starting at the origin, their sum is the arrow given by the diagonal arrow (also starting at the origin) of the parallelogram with the two given arrows as adjacent sides. Multiplication by a scalar $a$ corresponds to stretching the arrow to its $a$-multiple. This includes negative scalars, where the direction of the vector is reversed.
1.5.2. Points in the plane. In geometry, we should distinguish between the points in the plane (as for instance the chosen origin O above), and the vectors as the arrows describing the difference between two such points. We will work in fixed standard coordinates, that is, with pairs of real numbers, but for better usage we will always strictly distinguish vectors written in parentheses and denoted for a moment by bold face
We obtain the intersection of p with the line q by substituting the points of q in parametric form into the equation for p:
$2(-1 + s) - 3(2 + 3s) - 4 = 0$.
Here we obtain $s = -12/7$ and from the parametric equation of q we obtain the coordinates of the intersection P:
$P = \left[-\frac{19}{7}, -\frac{22}{7}\right]$.
□
1.E.3. Determine the intersection of the lines
$p: x + y - 4 = 0$, $q: x = -1 + 2t,\ y = 2 + t,\ t \in \mathbb{R}$.
Solution. Eliminate $t$ to obtain $q: x - 2y = -5$. Then solve for $x$ and $y$. The intersection has coordinates $x = 1$, $y = 3$.
□
1.E.4. Find the equation of the line p which goes through the point [2, 3] and is parallel to the line $x - 3y + 2 = 0$. Find a parametric equation of the line q which goes through the points [1, 3] and [-2, 1].
Solution. Every line parallel to the line $x - 3y + 2 = 0$ is given by the equation
$x - 3y + c = 0$
for some $c \in \mathbb{R}$. Since the line p goes through the point [2, 3], putting $x = 2$ and $y = 3$ gives $c = 7$. We can immediately give a parametric equation of the line q:
$q: [1, 3] + t(1 - (-2), 3 - 1) = [1, 3] + t(3, 2),\ t \in \mathbb{R}$.
□
1.E.5. Consider the following five lines. Determine if any two of the lines are parallel to each other.
$p_1: 2x + 3y - 4 = 0$, $p_2: x - y + 3 = 0$,
$p_3: -2x + 2y = -6$, $p_4: -x - \frac{3}{2}y + 2 = 0$,
$p_5: x = 2 + t,\ y = -2 - t,\ t \in \mathbb{R}$.
Solution. It is clear that
$-2 \cdot (-x - \frac{3}{2}y + 2) = 2x + 3y - 4$.
Thus $p_1$ and $p_4$ describe the same line. The equation of $p_2$ can be rewritten as $-2x + 2y - 6 = 0$, thus the lines $p_2$ and $p_3$ are parallel and distinct. By eliminating $t$, the line $p_5$ has the equation $x + y = 0$, which is not parallel to any other line. □
letters like u, v, instead of brackets (which we use for coordinates of points in the plane; points are denoted by capital Latin letters).
Even if we view the entire plane as pairs of real numbers in $\mathbb{R}^2$, we may understand adding two such couples as follows. The first couple of coordinates describes a point $P = [x, y]$, while the other one denotes a vector $u = (u_1, u_2)$. Their sum $P + u$ corresponds to adding the (arrow) vector u to the point P. If we fix the vector u, we call the resulting mapping
$P = [x, y] \mapsto P + u = [x + u_1, y + u_2]$
the shift of the plane (or translation) by the vector u.
Thus, the vectors in $\mathbb{R}^2$ can be understood in a more abstract way as the shifts in the plane (sometimes called the free vectors in elementary geometry texts).
The standard coordinates on $\mathbb{R}^2$, understood as pairs of real numbers, are not the only ones. We can put a coordinate system on the plane of our choosing.
Coordinates in the plane R2
Choose any point in the plane, and call it the origin O. All other points P in the plane can be identified with the vectors (arrows) $\overrightarrow{OP}$ with their tails at the origin.
Choose any point other than O and call it $E_1$. This defines the vector $e_1 = \overrightarrow{OE_1} = (1, 0)$. Choose any other point $E_2$ so that $O$, $E_1$, $E_2$ are distinct and not collinear. This defines the vector $e_2 = \overrightarrow{OE_2} = (0, 1)$.
Then every point $P = (a, b)$ in the plane can be described uniquely as $P = O + a e_1 + b e_2$ for real $a$, $b$, or in vector notation, $\overrightarrow{OP} = a e_1 + b e_2$.
Translation, by adding a fixed vector, can be used either to shift the coordinate system (including the origin), or to shift sets of points in the plane. Notice that the vector corresponding to the shift of the point P into the point Q is given as the difference $Q - P$ (in any coordinates). Thus we shall also use this notation for the vector $\overrightarrow{PQ} = Q - P$.
For each choice of coordinates, we have two distinct lines for the two axes. The origin is their point of intersection. Conversely, each choice of two non-parallel lines, together with the scales on each of them, defines coordinates in the plane. They are called affine coordinates.
Clearly each nontrivial triangle in the plane with vertices $O$, $E_1$, $E_2$ defines coordinates where this triangle is defined
1.E.6. Determine the line p which is perpendicular to the line $q: 6x - 7y + 13 = 0$ and which goes through the point $[-6, 7]$.
Solution. Since the direction vector of p is perpendicular to q, we can take for it the vector $(6, -7)$ of the coefficients of q and write the result immediately as
$p: x = -6 + 6t,\ y = 7 - 7t,\ t \in \mathbb{R}$.
□
1.E.7. Give an example of numbers $a, b \in \mathbb{R}$, such that the vector u is a normal to AB where $A = [1, 2]$, $B = [2b, b]$, $u = (a - b, 3)$.
Solution. The direction of AB is $(2b - 1, b - 2)$ (this vector is always nonzero), and therefore the vector $(2 - b, 2b - 1)$ is normal to AB. Setting
$2 - b = a - b$, $2b - 1 = 3$,
we obtain $a = b = 2$. □
1.E.8. Determine the relative position of the lines p, q in the plane for $p: 2x - y - 5 = 0$, $q: x + 2y - 5 = 0$. If they are not parallel, determine the coordinates of the intersection.
Solution. Eliminating $y$ yields $(4x - 2y - 10) + (x + 2y - 5) = 0$, from which $x = 3$, and hence $y = 1$. Hence [3, 1] is the (unique) intersection and the lines are not parallel. □
1.E.9. A planar soccer player shoots a ball from the point $F = [1, 0]$ in the direction (3, 4) hoping to hit the goal which is a line segment from the point $A = [23, 36]$ to $B = [26, 30]$. Does the ball fly towards the goal?
Solution. The ball travels along the line $[1, 0] + t(3, 4)$. The line AB has the parametrization $[23, 36] + u(3, -6)$, where $B = [23, 36] + 1 \cdot (3, -6)$. The intersection of these lines is given by the equations $1 + 3t = 23 + 3u$ and $4t = 36 - 6u$, with the solution $t = 8$, $u = 2/3$. As $0 < 2/3 < 1$, the intersection is in the segment AB. The ball hits the goal.
Another solution. It is sufficient to consider only the slopes of the vectors FA, (3, 4), FB. Since $\frac{36}{22} > \frac{4}{3} > \frac{30}{25}$, the player scores. □
1.E.10. Consider the plane $\mathbb{R}^2$ with the standard coordinate system. A laser ray is sent from the origin [0, 0] in the direction (3, 1). It hits the mirror line p given by the equation
p: [4, 3] + t(-2,1)
by points [0, 0], [1, 0], [0, 1]. Thus we may say that in the geometry of the plane, "all nontrivial triangles are the same, up to a choice of coordinates".
1.5.3. Lines in the plane. Every line is parallel to a (unique) line through the origin. To define a line, we therefore need two ingredients. One is a nonzero vector which describes the direction of the line. Call it $v = (v_1, v_2)$. The other is a point $P_0 = [x_0, y_0]$ on the line. Every point on the line is then of the form
$P(t) = P_0 + t v,\ t \in \mathbb{R}$.
Parametric description of a line
We may understand the line p as the set of all multiples of the vector v, shifted by the vector $(x_0, y_0)$. This is called the parametric description of the line:
$p = \{P \in \mathbb{R}^2;\ P = P_0 + t v,\ t \in \mathbb{R}\}$.
The vector v is called the direction vector of the line p.
[Figure: line equation]
In the chosen coordinates, the point $P(t) = [x(t), y(t)]$ is given as
$x = x(t) = x_0 + t v_1$, $y = y(t) = y_0 + t v_2$.
We can eliminate $t$ from these two equations to obtain
$-v_2 x + v_1 y = -v_2 x_0 + v_1 y_0$.
Since the vector $v = (v_1, v_2)$ is non-zero, at least one of the numbers $v_1$, $v_2$ is non-zero. If one of the coordinates $v_1$ or $v_2$ is zero, then the line is parallel to one of the coordinate axes.
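The elimination of the parameter can be written as a short function which returns the coefficients of an equation of the form $ax + by = c$ for the line through $P_0$ with direction $v$. The function name is illustrative; the formulas are those obtained above.

```python
def implicit_from_parametric(p0, v):
    """Return (a, b, c) such that the line P0 + t*v is a*x + b*y = c."""
    x0, y0 = p0
    v1, v2 = v
    if v1 == 0 and v2 == 0:
        raise ValueError("direction vector must be non-zero")
    return -v2, v1, -v2 * x0 + v1 * y0

# The line x = 2 - t, y = 1 + 3t from exercise 1.E.1.
print(implicit_from_parametric((2, 1), (-1, 3)))  # (-3, -1, -7)
```

The output corresponds to $-3x - y = -7$, which is the equation $3x + y - 7 = 0$ multiplied by $-1$.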
Implicit description of a line
The general equation of the line in the plane is
(1) $ax + by = c$,
with $a$ and $b$ not both zero. The relation between the pair of numbers $(a, b)$ and the direction vector of the line $v = (v_1, v_2)$ is
(2) $a v_1 + b v_2 = 0$.
We can view the left hand side of the equation (1) as a function $z = f(x, y)$ mapping each point $[x, y]$ of the plane to a scalar and the line corresponds to the prescribed constant value of this function. We shall see soon that the vector $(a, b)$ is perpendicular to the direction of the line.
and then it is reflected (the angle of incidence equals the angle of reflection). At which points does the ray meet the line q, given by
q: [7,-10] +t(-1,6)?
Solution. In principle, there could be none, one or two intersections of the ray with a line. First, we inspect a possible intersection of the line q with the ray before the ray touches the mirror line p. In the standard way, we find the intersection of the line q with the line of the initial movement of the ray, $[0, 0] + t(3, 1)$. The intersection point is $[0, 0] + \frac{32}{19}(3, 1) = [\frac{96}{19}, \frac{32}{19}]$. The ray meets the mirror at the point [6, 2], that is $[0, 0] + 2(3, 1)$. As $0 < \frac{32}{19} < 2$, we conclude that the ray meets the line q before the reflection point.
Let us concentrate now on the rebound ray. The angle between the line p and the direction of the ray can be calculated using 1.5.7 as
$$\cos\varphi = \frac{(-2, 1) \cdot (3, 1)}{\sqrt{5}\,\sqrt{10}} = -\frac{\sqrt{2}}{2},$$
therefore the two direction vectors form the angle 135°, that is, the line p and the ray meet at the angle 45°. The rebounded ray is thus perpendicular to the entering ray and its direction is (1, -3). (Be careful with the orientation! The vector of the direction can also be obtained via reflection (axial symmetry) of the vector perpendicular to the line p.)
The ray meets the mirror at the point [6, 2], thus the reflected ray has the equation
$[6, 2] + t(1, -3),\ t \geq 0$.
The intersection of the line given by the rebounded ray with the line q is at the point [4, 8]. This point lies on the opposite side of the line p to both the incident and reflected rays ($t = -2$). Thus the rebound ray does not meet the line q.
All together, there is one intersection of the ray with the line q, namely $[\frac{96}{19}, \frac{32}{19}]$. □
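The direction of the rebounded ray in 1.E.10 can be cross-checked by a direct formula. The sketch below uses the standard expression $2\,\frac{d \cdot m}{m \cdot m}\,m - d$ for the mirror image of a vector $d$ in a line with direction $m$; this formula is not derived in the text and is used here only as an independent check. The function name is illustrative.

```python
from fractions import Fraction

def reflect_direction(d, m):
    """Mirror image of the direction d in a line with direction vector m."""
    dot_dm = d[0] * m[0] + d[1] * m[1]
    dot_mm = m[0] * m[0] + m[1] * m[1]
    k = Fraction(2 * dot_dm, dot_mm)
    return (k * m[0] - d[0], k * m[1] - d[1])

# Incoming direction (3, 1), mirror line p with direction (-2, 1).
r = reflect_direction((3, 1), (-2, 1))
print(*r)  # 1 -3
```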
Remark. The reflection of a ray in three-dimensional space is studied in the exercise 3.F.4.
1.E.11. A line segment of length 1 started moving at noon with a constant speed of 1 meter per second in the direction (3, 2) from the point [-2, 0]. Another line segment of length 1 started moving also at noon from the point [5, -2] in the direction (-1, 1), but with double speed. Will they collide? (Segments are oriented in the direction of their movements.)
Suppose we have two lines p and q. We ask about their intersection $p \cap q$. That is a point $[x, y]$ which satisfies the equations of both lines simultaneously. We write them as
(3) $ax + by = r$, $cx + dy = s$.
Again, we can view the left side as a mapping F, which to every pair of coordinates $[x, y]$ of the point P in the plane assigns a vector of values of two scalar functions $f_1$ and $f_2$ given by the left sides of the particular equations (3). Hence we can write our two scalar equations as one vector equation $F(v) = w$, where $v = (x, y)$ and $w = (r, s)$.
Notice that the two lines are not parallel if and only if they have a unique point in their intersection.
1.5.4. Linear mappings and matrices. The mappings F with which we have worked when describing the intersection of lines have one very important property in common: they preserve the operations of addition of vectors and multiplication of vectors by scalars, that is, they preserve linear combinations:
$F(a \cdot v + b \cdot w) = a \cdot F(v) + b \cdot F(w)$
for all $a, b \in \mathbb{R}$, $v, w \in \mathbb{R}^2$. We say that F is a linear mapping from $\mathbb{R}^2$ to $\mathbb{R}^2$, and write $F: \mathbb{R}^2 \to \mathbb{R}^2$. This can also be described in words: a linear combination of vectors maps to the same linear combination of their images, that is, linear mappings are those mappings which preserve linear combinations.
We have already encountered the same behaviour in the equation 1.5.3(1) for the line, where the linear mapping in question was $f: \mathbb{R}^2 \to \mathbb{R}$ and its prescribed value c. That is also the reason why the values of the mapping $z = f(x, y)$ are depicted in the image as a plane in $\mathbb{R}^3$.
We can write such a mapping using matrices. By a matrix we mean a rectangular array of numbers, for instance
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \quad\text{or}\quad v = \begin{pmatrix} x \\ y \end{pmatrix}.$$
Solution. Lines along which the segments are moving can be described parametrically:
$p: [-2, 0] + r(3, 2)$,
$q: [5, -2] + s(-1, 1)$.
The equation of the line p is
$2x - 3y + 4 = 0$.
Substituting the parametric equation of the line q yields the intersection point $P = [1, 2]$.
Now we choose a single parameter t for both lines so that the corresponding point on p and on q respectively, describes the position of the initial point of the first and second line segment respectively at the time t. At time 0 the initial point of the first line segment is at [-2, 0], the second is at [5, -2]. During time t (measured in seconds) the first segment travels t units of length in the direction (3, 2), the second segment travels 2t units of length in the direction (-1, 1). Thus the corresponding parameterisations are:
$$p: [-2, 0] + t\,\frac{(3, 2)}{\sqrt{3^2 + 2^2}}, \qquad q: [5, -2] + 2t\,\frac{(-1, 1)}{\sqrt{(-1)^2 + 1^2}}.$$
The initial point of the first segment enters the point [1, 2] at time $t_1 = \sqrt{13}$ s, the initial point of the second segment at time $t_2 = 2\sqrt{2}$ s, more than half a second sooner. At the time $t_2 + \frac{1}{2} = 2\sqrt{2} + \frac{1}{2} < t_1$ the ending point of the second segment moves away from P. Thus when the initial point of the first segment enters the point P, the ending point of the second segment is already away and the segments do not collide. □
We return for a while to complex numbers. The complex plane is basically a "normal" plane, where we have something extra:
1.E.12. Interpret multiplication by the imaginary unit i and complex conjugation as geometrical transformations in the plane.
Solution. The imaginary unit i corresponds to the point (0, 1). Notice that multiplying any number $z = a + ib$ by the imaginary unit i gives the result
$i \cdot (a + ib) = -b + ia$.
Under the interpretation in the plane, this is a rotation around the origin of the segment joining the origin to the point z through a right angle counterclockwise (cf. 1.1.4).
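We speak of a (square 2 x 2) matrix A and a (column) vector v. Multiplication of matrices, row by column, is defined as follows:
$$A \cdot v = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \end{pmatrix}.$$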
We introduce some more tools for vectors and matrices. Our goal is to compute with matrices in a similar way as we do it with scalars.
We define the product C = A ■ B of two square matrices A and B applying the above formulas to individual columns of the matrix B and writing the resulting column vectors again as columns in the matrix C.
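In order to multiply two vectors $v = (x, y)$ and $w = (r, s)$ in a similar way, we can write the vector w as a row of numbers (the transposed vector) $w^T$. Then the product of $w^T$ and v is
$$w^T \cdot v = \begin{pmatrix} r & s \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} = rx + sy.$$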
We call this the scalar product of vectors v and w.
We can easily check the associativity of multiplication (do it for general matrices A, B and a vector v in detail):
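$(A \cdot B) \cdot v = A \cdot (B \cdot v)$.
Instead of a vector v we can write any matrix C of correct size. In a similar way, distributivity also holds:
$A \cdot (B + C) = A \cdot B + A \cdot C$.
But commutativity does not hold. For example,
$$\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$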
This last product also shows the existence of divisors of zero.
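The row by column rule is easy to try out. The following minimal sketch represents a 2 x 2 matrix as a pair of rows (an illustrative choice, as is the function name `matmul`) and reproduces the two products from the example.

```python
def matmul(a, b):
    """Row-by-column product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

m = ((0, 1), (0, 0))
n = ((0, 0), (0, 1))
print(matmul(m, n))  # ((0, 1), (0, 0))
print(matmul(n, m))  # ((0, 0), (0, 0))
```

Neither factor of the second product is the zero matrix, yet the product is zero.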
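Notice that the mapping defined by multiplication of vectors with a fixed matrix is a linear mapping, i.e. it respects linear combinations. With matrices and vectors we can write the equations for lines and points respectively as
$$A \cdot v = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} r \\ s \end{pmatrix}.$$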
1.5.5. Determinant of matrix. The procedure of finding the intersection of lines described in 1.5.3 fails in some special cases. For instance the intersection of two parallel lines is either empty (when the lines are parallel but distinct) or the line itself (when the lines are identical). This condition occurs when the ratios a/c and b/d are the same, that is
(1) ad-bc = 0.
Note that this expression already takes care of the cases, where either c or d is zero.
The expression on the left in (1) is called the determinant of the matrix A. We write it as
$$\det A = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc.$$
Taking the complex conjugate is a reflection through the axis of real numbers:
$z = a + ib \mapsto a - ib = \bar{z}$.
□
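Both transformations of 1.E.12 can be tried out with the complex numbers built into Python, where the imaginary unit is written `1j`. The sample point is chosen only for illustration.

```python
z = complex(3, 2)          # the point (3, 2) of the plane

rotated = 1j * z           # multiplication by the imaginary unit
mirrored = z.conjugate()   # complex conjugation

print(rotated)    # (-2+3j)
print(mirrored)   # (3-2j)

# The rotated point is perpendicular to the original one:
print(z.real * rotated.real + z.imag * rotated.imag)  # 0.0
```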
1.E.13. Determine the sum of the three angles, which are between the vectors (1,1), (2,1) and (3,1) respectively and the x-axis in the plane R2.
Solution. If we view the plane $\mathbb{R}^2$ as the Gauss plane (of complex numbers), then the given vectors correspond to the complex numbers $1 + i$, $2 + i$ and $3 + i$. We are to find the sum of their arguments. According to de Moivre's formula this equals the argument of their product. Their product is $(1 + i)(2 + i)(3 + i) = (1 + 3i)(3 + i) = 10i$, which is a purely imaginary number with argument $\pi/2$. So the sum we are looking for is $\pi/2$. □
Next, we shall exercise the matrix calculus in the plane.
We refer to 1.5.4 for the basic concepts. First we experience the operations of addition and multiplication on matrices, then we come to geometric tasks.
1.E.14.   Simplify (A - B)T ■ 2C ■ u, where
A = C
-2   2J ' B ~ \-\ f2   -2\ /3
Solution. By substituting in
'4
and by matrix multiplication we obtain
'-2   -1\   A -4
2C =
Bf
-4 10
(A — B)T ■ 2C ■ u =
1
8 10
-52 64
□
1.E.15. Give an example of matrices A and B for which
(a) $(A + B) \cdot (A - B) \neq A \cdot A - B \cdot B$;
(b) $(A + B) \cdot (A + B) \neq A \cdot A + 2A \cdot B + B \cdot B$.
Solution. For any two square matrices A and B we have
$(A + B) \cdot (A - B) = A \cdot A - A \cdot B + B \cdot A - B \cdot B$.
The identity
$(A + B) \cdot (A - B) = A \cdot A - B \cdot B$
Our discussion can be now expressed as follows:
Proposition. The determinant is a real valued function $\det A$ defined for all square 2 x 2 matrices A. The (vector) equation $A \cdot v = u$ has a unique solution for v if and only if $\det A \neq 0$.
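The proposition can be illustrated by a small solver for the system $ax + by = r$, $cx + dy = s$. The sketch below uses Cramer's rule, which is not named in the text but follows from eliminating one unknown at a time; the function names are illustrative. Exact fractions are used, in line with the remark that rational coefficients lead to rational solutions.

```python
from fractions import Fraction

def det2(a, b, c, d):
    """Determinant of the matrix with rows (a, b) and (c, d)."""
    return a * d - b * c

def solve_2x2(a, b, c, d, r, s):
    """Solve a*x + b*y = r, c*x + d*y = s; return None if det is zero."""
    det = det2(a, b, c, d)
    if det == 0:
        return None   # parallel or identical lines: no unique solution
    x = Fraction(det2(r, b, s, d), det)
    y = Fraction(det2(a, r, c, s), det)
    return x, y

# Lines from exercise 1.E.8: 2x - y = 5 and x + 2y = 5.
print(solve_2x2(2, -1, 1, 2, 5, 5))   # (Fraction(3, 1), Fraction(1, 1))
# Two parallel lines: x + 2y = 1 and 2x + 4y = 0.
print(solve_2x2(1, 2, 2, 4, 1, 0))    # None
```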
So far, we have worked with pairs of real numbers in the plane. Equally well we might pose exactly the same questions for points with integer coordinates and lines with equations with integer coefficients. Notice that the latter requirement is equivalent to considering rational coefficients in the equations. We have to be careful which properties of the scalars we exploit.
In fact, we needed all the properties of the field of scalars when discussing the solvability of the system of two equations — try to think it through. At least, we can be sure that the intersection of two non-parallel lines with rational coefficients is a point with rational coefficients again. The case of integer coefficients and coordinates is more difficult. We shall come back to this in the next chapter. In particular we shall see that the equation (1) with fixed integer coefficients a, b, c, d has a unique integer solution for all integer values (r, s) if and only if the determinant is ±1.
1.5.6. Affine mappings. We now investigate how the matrix notation allows us to work with simple mappings in the affine plane. We have seen that matrix multiplication defines a linear mapping. Shifting $\mathbb{R}^2$ by a fixed vector $w = (r, s) \in \mathbb{R}^2$ in the affine plane can also be easily written in matrix notation:
$$v = \begin{pmatrix} x \\ y \end{pmatrix} \mapsto v + w = \begin{pmatrix} x + r \\ y + s \end{pmatrix}.$$
If we add a fixed vector to the result of a linear mapping, then we have the expression
$$v = \begin{pmatrix} x \\ y \end{pmatrix} \mapsto A \cdot v + w = \begin{pmatrix} ax + by + r \\ cx + dy + s \end{pmatrix}.$$
In this way we have described all affine mappings of the plane to itself.
Such mappings allow us to recompute coordinates which arise by different choices of origins or bases. We shall come back to this in detail later.
1.5.7. The distance and angle. Now we consider distance. We define the length of the vector v = (x, y) to be
$$\|v\| = \sqrt{x^2 + y^2}.$$
holds if and only if $-A \cdot B + B \cdot A$ is the zero matrix, that is, if and only if the matrices A and B commute. The matrices we are looking for are thus exactly the pairs of matrices which do not commute (the product changes when we change the order of the factors). We can choose for instance
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad B = \begin{pmatrix} 4 & 3 \\ 2 & 1 \end{pmatrix},$$
since with this choice
$$A \cdot B = \begin{pmatrix} 8 & 5 \\ 20 & 13 \end{pmatrix}, \qquad B \cdot A = \begin{pmatrix} 13 & 20 \\ 5 & 8 \end{pmatrix}.$$
Notice that for any pair of square matrices A, B
$$(A + B) \cdot (A + B) = A \cdot A + A \cdot B + B \cdot A + B \cdot B.$$
It follows that
$$(A + B) \cdot (A + B) = A \cdot A + 2A \cdot B + B \cdot B$$
if and only if $A \cdot B = B \cdot A$, as in the first case. □
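The two products in this solution can be checked numerically. The following is a minimal sketch in plain Python, assuming the matrices A = (1 2; 3 4) and B = (4 3; 2 1) from the solution; the helper names `matmul`, `add` and `sub` are illustrative only.

```python
def matmul(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(p, q):
    """Entrywise sum of two 2x2 matrices."""
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]

def sub(p, q):
    """Entrywise difference of two 2x2 matrices."""
    return [[p[i][j] - q[i][j] for j in range(2)] for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[4, 3], [2, 1]]

AB = matmul(A, B)
BA = matmul(B, A)
lhs = matmul(add(A, B), sub(A, B))      # (A + B)(A - B)
rhs = sub(matmul(A, A), matmul(B, B))   # A*A - B*B

print("AB =", AB)
print("BA =", BA)
print("(A+B)(A-B) =", lhs)
print("AA - BB    =", rhs)
```

Since AB and BA differ, the two sides of identity (a) differ as well, which is exactly what the solution claims.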
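1.E.16. Decide whether the mappings $F, G : \mathbb{R}^2 \to \mathbb{R}^2$ given by
$$F\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 7x - 3y \\ -2x + 5y \end{pmatrix}, \qquad G\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2x + 2y - 4 \\ 4x - 9y + 3 \end{pmatrix}$$
are linear.
Solution. For any vector $(x, y)^T \in \mathbb{R}^2$ we can express
$$F\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 7 & -3 \\ -2 & 5 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad G\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 4 & -9 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} -4 \\ 3 \end{pmatrix}.$$
This implies that both mappings are affine. Recall that an affine mapping is a linear one if and only if the zero vector maps to zero. Since
$$F\begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad G\begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} -4 \\ 3 \end{pmatrix},$$
the mapping F is linear, the mapping G is not. Let us mention that $f(x) = Ax$ and $g(x) = Ax + b$, where $x, b \in \mathbb{R}^n$ and A is a square $n \times n$ matrix, are the general forms of a linear and an affine mapping, respectively. □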
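The criterion used above (an affine mapping is linear exactly when it sends the zero vector to zero) is easy to test in code. The sketch below assumes the formulas for F and G as read off from the matrices in the solution; the function names `maps_zero_to_zero` and `is_additive_on` are hypothetical helpers, not part of the text.

```python
def F(x, y):
    # Assumed from the solution: matrix (7 -3; -2 5), no constant term.
    return (7 * x - 3 * y, -2 * x + 5 * y)

def G(x, y):
    # Assumed from the solution: matrix (2 2; 4 -9) plus the shift (-4, 3).
    return (2 * x + 2 * y - 4, 4 * x - 9 * y + 3)

def maps_zero_to_zero(h):
    """The criterion from the solution: does h send (0, 0) to (0, 0)?"""
    return h(0, 0) == (0, 0)

def is_additive_on(h, u, v):
    """Check h(u + v) == h(u) + h(v) for one sample pair of vectors."""
    hu, hv = h(*u), h(*v)
    return h(u[0] + v[0], u[1] + v[1]) == (hu[0] + hv[0], hu[1] + hv[1])

print(maps_zero_to_zero(F), is_additive_on(F, (1, 2), (3, -1)))
print(maps_zero_to_zero(G), is_additive_on(G, (1, 2), (3, -1)))
```

A single sample pair cannot prove additivity, of course; it only illustrates how the constant term of G breaks it.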
1.E.17. Compute the lengths of the sides of the triangle with vertices A = [2,2], B = [3,0], C = [4,3].
Solution. Using the formula for the length of a vector
$$\|u\| = \sqrt{u_1^2 + u_2^2}, \qquad u = (u_1, u_2) \in \mathbb{R}^2,$$
Immediately we can define notions of distance, angle and rotation in the plane.
Distance in the plane
The distance between the points P, Q in the plane is given as the length of the vector $\overrightarrow{PQ}$, i.e. $\|Q - P\|$. Obviously, the distance does not depend on the ordering of P and Q and it is invariant under shifts of the plane by any fixed vector w.
The Euclidean plane is an affine plane with distance defined as given above.
[Figure: Euclidean distance]
Angles are a matter of vectors rather than points in Euclidean geometry. Let u be a vector of length 1, at angle $\varphi$ measured counter-clockwise from the vector (1,0). In coordinates, u lies on the unit circle and has first and second coordinates $\cos\varphi$, $\sin\varphi$ respectively (this is one of the elementary definitions of the sine and cosine functions). That is,
$$u = (\cos\varphi, \sin\varphi).$$
This is compatible with $-1 \le \sin\varphi \le 1$ and $(\cos\varphi)^2 + (\sin\varphi)^2 = 1$.
Angle between vectors
The angle between two vectors v and v' can in general be described using their coordinates $v = (x, y)$, $v' = (x', y')$ like this:
$$\cos\varphi = \frac{xx' + yy'}{\|v\| \, \|v'\|}. \tag{1}$$
In our special case $v = (1, 0)$, the more general equation gives
$$\cos\varphi = \frac{x'}{\|v'\|},$$
which is just the definition of the function $\cos\varphi$. The general case can always be reduced to this special one. First we notice that the angle $\varphi$ between two vectors u, v is always the same as the angle between the normalized vectors $\frac{1}{\|u\|}u$ and $\frac{1}{\|v\|}v$.
we obtain the results
$$|AB| = \|A - B\| = \sqrt{(2 - 3)^2 + (2 - 0)^2} = \sqrt{5},$$
$$|BC| = \|B - C\| = \sqrt{(3 - 4)^2 + (0 - 3)^2} = \sqrt{10},$$
$$|AC| = \|A - C\| = \sqrt{(2 - 4)^2 + (2 - 3)^2} = \sqrt{5}.$$
□
1.E.18. Determine the angle between the two vectors
(a) $u = (-3, -2)$, $v = (-2, 3)$;
(b) $u = (2, 6)$, $v = (-3, -9)$.
Solution. The sought angle $0 \le \varphi \le \pi$ can of course be computed from the formula (1) in 1.5.7. But note that the vector $(-3, -2)$ can be obtained by interchanging the coordinates of the vector $(-2, 3)$ and multiplying one of them by the number $-1$. These are exactly the operations we use when we want to obtain a normal vector from a direction vector of a given line (or vice versa). The vectors in the case (a) are thus perpendicular, that is $\varphi = \pi/2$. In the case (b), since $-3 \cdot (2, 6) = 2 \cdot (-3, -9)$, the vector u is a multiple of the vector v. If one vector is a positive multiple of another, the angle between the two is zero. If it is a negative multiple, as in our case, the angle is $\pi$.
□
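Formula (1) from 1.5.7 translates directly into a few lines of code. This is a minimal sketch using only the standard `math` module; the function name `angle` and the clamping of the cosine to [-1, 1] (a guard against rounding errors) are choices made for this illustration.

```python
import math

def angle(u, v):
    """Angle in [0, pi] between plane vectors u and v, via formula (1)."""
    dot = u[0] * v[0] + u[1] * v[1]
    norms = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    # Clamp to [-1, 1] so that rounding errors cannot break acos.
    cosine = max(-1.0, min(1.0, dot / norms))
    return math.acos(cosine)

print(angle((-3, -2), (-2, 3)))   # case (a): perpendicular vectors
print(angle((2, 6), (-3, -9)))    # case (b): negative multiple
```

The shortcut arguments in the solution avoid the computation altogether; the code simply confirms them.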
1.E.19. Determine the angle $\varphi$ between the two diagonals $A_3A_7$ and $A_5A_{10}$ of a regular dodecagon (polygon with twelve sides) $A_0A_1A_2 \ldots A_{11}$.
Solution. The angle depends neither on the size nor on the position of the given dodecagon. Choose the dodecagon inscribed in a circle with radius 1.
We can put $A_0$ at [1,0], and then the vertices can be identified with the twelfth roots of 1 in the complex plane. We can write $A_k = \cos(2k\pi/12) + i\sin(2k\pi/12)$. In particular,
$$A_3 = \cos(\pi/2) + i\sin(\pi/2) = i \sim [0, 1],$$
$$A_5 = \cos(5\pi/6) + i\sin(5\pi/6) = -\tfrac{\sqrt{3}}{2} + \tfrac{1}{2}i \sim \left[-\tfrac{\sqrt{3}}{2}, \tfrac{1}{2}\right],$$
$$A_7 = \cos(7\pi/6) + i\sin(7\pi/6) = -\tfrac{\sqrt{3}}{2} - \tfrac{1}{2}i \sim \left[-\tfrac{\sqrt{3}}{2}, -\tfrac{1}{2}\right],$$
$$A_{10} = \cos(5\pi/3) + i\sin(5\pi/3) = \tfrac{1}{2} - \tfrac{\sqrt{3}}{2}i \sim \left[\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2}\right].$$
Using the formula (1) in 1.5.7 for the vectors $A_7 - A_3$ and $A_{10} - A_5$, we finish the computation:
$$\cos\varphi = \frac{1}{2\sqrt{2 + \sqrt{3}}},$$
that is $\varphi = 75^\circ$.
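The computation with the twelfth roots of unity can be replayed numerically. The sketch below assumes the dodecagon inscribed in the unit circle with $A_0$ at [1,0], as in the solution; `vertex` is a hypothetical helper name.

```python
import cmath
import math

def vertex(k, n=12):
    """k-th vertex of the regular n-gon on the unit circle, as a complex number."""
    return cmath.exp(2j * math.pi * k / n)

d1 = vertex(7) - vertex(3)     # direction of the diagonal A3A7
d2 = vertex(10) - vertex(5)    # direction of the diagonal A5A10

# Scalar product of the two directions, read as plane vectors.
dot = d1.real * d2.real + d1.imag * d2.imag
cos_phi = dot / (abs(d1) * abs(d2))
phi_deg = math.degrees(math.acos(cos_phi))

print(cos_phi, phi_deg)
```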
Alternative solution. This problem can also be solved purely by the methods of synthetic geometry. Denote the centre of the regular dodecagon by S and the intersection of the diagonals $A_3A_7$
Thus we can restrict ourselves to two vectors on the unit circle. Then we can rotate our coordinates in such a way that the first of the vectors becomes (1, 0). This means that it is enough to show that the scalar product is invariant with respect to rotations.
We have already seen the expression $xx' + yy'$ in the definition of the angle. We called it the scalar product of vectors. In the special case when the scalar product is zero, we say that the vectors are perpendicular. Of course, the best example of perpendicular vectors of length 1 are the standard basis vectors (1,0) and (0,1).
Notice that our formula for the angle between the vectors is symmetric in the two vector arguments, thus the angle $\varphi$ is always between 0 and $\pi$.
We can easily imagine that not all affine coordinates are adequate for expressing the distance and thus for use in the Euclidean plane. Indeed, although we may again choose any point O as the origin, we also want the basis vectors $e_1 = \overrightarrow{OE_1}$ and $e_2 = \overrightarrow{OE_2}$ to be perpendicular and of length one. Such a basis will be called orthonormal. We shall see that the angles and distances computed in such coordinates will always be the same, no matter which orthonormal coordinates are used.
1.5.8. Rotation around a point in the plane. The matrix of any given linear mapping $F : \mathbb{R}^2 \to \mathbb{R}^2$ is easy to guess. If the matrix of the mapping has columns (a, c) and (b, d), then the first column (a, c) is obtained by multiplying this matrix with the basis vector (1,0), and the second column is the evaluation at the second basis vector (0,1).
[Figure: rotation around a point in the plane]
We can see from the picture that the columns of the matrix corresponding to rotating counter-clockwise through the angle $\psi$ are computed as follows:
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} \cos\psi \\ \sin\psi \end{pmatrix}, \qquad \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -\sin\psi \\ \cos\psi \end{pmatrix}.$$
The counter-clockwise direction is called the positive direction, the other direction is the negative direction.
and $A_5A_{10}$ by T. Now $|\angle A_7A_5A_{10}| = 45^\circ$ (this is the inscribed angle which corresponds to the central angle $A_7SA_{10}$, which is a right angle); furthermore $|\angle A_5A_7A_3| = 30^\circ$ (again the inscribed angle corresponding to the central angle $A_5SA_3$, which is $60^\circ$). Thus the angle $A_5TA_7$ complements the sum of these two angles to $180^\circ$, that is, it equals $105^\circ$. The angle we are looking for is then $180^\circ - 105^\circ = 75^\circ$. □
1.E.20. Consider a regular hexagon ABCDEF with vertices labeled in the positive direction, centre at the point S = [1,0] and the vertex A at [0,2]. Determine the coordinates of the vertex C.
Solution. The coordinates of the vertex C can be obtained by rotating the point A around the centre S of the hexagon through the angle 120° in the positive direction:
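$$C = \begin{pmatrix} \cos 120^\circ & -\sin 120^\circ \\ \sin 120^\circ & \cos 120^\circ \end{pmatrix} (A - S) + S = \begin{pmatrix} -\tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} \\ \tfrac{\sqrt{3}}{2} & -\tfrac{1}{2} \end{pmatrix} \begin{pmatrix} -1 \\ 2 \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \left[\tfrac{3}{2} - \sqrt{3},\; -1 - \tfrac{\sqrt{3}}{2}\right].$$
□
Rotation matrix
Rotation through a given angle $\psi$ in the positive direction about the origin is given by the matrix $R_\psi$:
$$v = \begin{pmatrix} x \\ y \end{pmatrix} \mapsto R_\psi \cdot v = \begin{pmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$$
Now that we know what the matrix of a rotation in the plane looks like, we can check that rotation preserves distances and angles (defined by the equation (1) in 1.5.7). Denote the image of a vector v by
$$v' = \begin{pmatrix} x' \\ y' \end{pmatrix} = R_\psi \cdot v = \begin{pmatrix} x\cos\psi - y\sin\psi \\ x\sin\psi + y\cos\psi \end{pmatrix},$$
and similarly $w' = R_\psi \cdot w$ for $w = (r, s)^T$ and $w' = (r', s')^T$. We can check that
$$\|v'\| = \|v\| \quad \text{and} \quad x'r' + y's' = xr + ys.$$
The previous expression can be written using vectors and matrices as follows:
$$(R_\psi \cdot w)^T (R_\psi \cdot v) = w^T v.$$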
1.E.21. An equilateral triangle with vertices [1,0] and [0,1] lies entirely in the first quadrant. Find the coordinates of its third vertex.
Solution. The third vertex is $\left[\tfrac{1}{2} + \tfrac{\sqrt{3}}{2},\; \tfrac{1}{2} + \tfrac{\sqrt{3}}{2}\right]$ (we are rotating the point [1,0] through $60^\circ$ around [0,1] in the positive direction). □
1.E.22. An equilateral triangle has vertices at A = [1,1] and B = [2,3]. Its other vertex lies in the same half-plane as the point S = [0,0]. The triangle is rotated by 60° in the positive direction around the point S, to produce a new triangle. Determine the coordinates of the vertices of the new triangle.
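Solution. The points we are looking for have coordinates
$$\left[-\tfrac{3}{2}\sqrt{3},\; \sqrt{3} - \tfrac{1}{2}\right], \quad \left[\tfrac{1}{2} - \tfrac{1}{2}\sqrt{3},\; \tfrac{1}{2}\sqrt{3} + \tfrac{1}{2}\right], \quad \left[1 - \tfrac{3}{2}\sqrt{3},\; \sqrt{3} + \tfrac{3}{2}\right].$$
□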
The transposed vector $(R_\psi \cdot w)^T$ equals $w^T \cdot R_\psi^T$, where $R_\psi^T$ is the so-called transpose of the matrix $R_\psi$. That is a matrix whose rows consist of the columns of the original matrix, and similarly whose columns consist of the rows of the original matrix. Therefore we see that the rotation matrices satisfy the relation $R_\psi^T \cdot R_\psi = I$, where the matrix I (sometimes we denote this matrix just as 1 and mean by this the unit in the ring of matrices) is the unit matrix
$$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
This leads us to a remarkable claim: the matrix F with the property that $F \cdot R_\psi = I$ (we will call such a matrix the inverse matrix to the rotation matrix $R_\psi$) is the transpose of the original matrix. This makes sense, since the inverse mapping to the rotation through the angle $\psi$ is again a rotation, but through the angle $-\psi$. That is, the inverse matrix of $R_\psi$ equals the matrix
$$R_{-\psi} = \begin{pmatrix} \cos(-\psi) & -\sin(-\psi) \\ \sin(-\psi) & \cos(-\psi) \end{pmatrix} = \begin{pmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{pmatrix}.$$
1.E.23. Find two matrices A such that
$$A^2 = \begin{pmatrix} \tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} \\ \tfrac{\sqrt{3}}{2} & \tfrac{1}{2} \end{pmatrix}.$$
Hint: which geometric transformation in the plane is given by the matrix $A^2$?
Solution. $A^2$ is the matrix of rotation through $60^\circ$ in the positive direction, thus the matrices we are looking for are
$$A = \pm \begin{pmatrix} \tfrac{\sqrt{3}}{2} & -\tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{\sqrt{3}}{2} \end{pmatrix},$$
which are the matrices of rotation through $30^\circ$ or through $210^\circ$. □
It is easy to write the rotation around a point $P = O + w$, $P = [r, s]$, again using a matrix. One just has to note that instead of rotating around the given point P, we can first shift P into the origin, then do the rotation and then do the inverse shift. We calculate:
$$v \mapsto v - w \mapsto R_\psi \cdot (v - w) \mapsto R_\psi \cdot (v - w) + w = \begin{pmatrix} \cos\psi\,(x - r) - \sin\psi\,(y - s) + r \\ \sin\psi\,(x - r) + \cos\psi\,(y - s) + s \end{pmatrix}.$$
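The shift, rotate, shift back recipe can be sketched as a small function. The name `rotate_about` is an assumption made for this illustration; the two calls reproduce the data of exercises 1.E.20 and 1.E.21.

```python
import math

def rotate_about(point, centre, psi):
    """Rotate `point` around `centre` through the angle psi (radians, counter-clockwise)."""
    x, y = point
    r, s = centre
    c, sn = math.cos(psi), math.sin(psi)
    # Shift the centre to the origin, rotate, shift back.
    return (c * (x - r) - sn * (y - s) + r,
            sn * (x - r) + c * (y - s) + s)

# 1.E.20: vertex A = [0, 2] rotated through 120 degrees around S = [1, 0].
C = rotate_about((0, 2), (1, 0), math.radians(120))
# 1.E.21: point [1, 0] rotated through 60 degrees around [0, 1].
V = rotate_about((1, 0), (0, 1), math.radians(60))

print(C)
print(V)
```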
1.E.24. Reflection. Find the matrix of reflection in the plane through the line y = x (that is, find the matrix of the axial symmetry).
Solution. The given reflection sends the x-axis to the y-axis and vice versa. Thus the reflection applied to a vector just interchanges its coordinates, and therefore the sought matrix is
$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
The matrix of any linear mapping in $\mathbb{R}^2$ can also be computed in the standard way: it is given by the images of the vectors (1,0) (first column) and (0,1) (second column). In our case the images are (0,1) and (1,0). □
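1.E.25. Determine which linear mappings from $\mathbb{R}^2$ to $\mathbb{R}^2$ are given by the following matrices (that is, describe the geometrical meaning of the matrices):
$$A_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad A_3 = \begin{pmatrix} \tfrac{\sqrt{2}}{2} & -\tfrac{\sqrt{2}}{2} \\ \tfrac{\sqrt{2}}{2} & \tfrac{\sqrt{2}}{2} \end{pmatrix}.$$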
1.5.9. Reflection. Another well-known example of a length preserving mapping is reflection through a line. It is enough to understand reflection through a line that goes through the origin O. All other reflections can be derived using shifts and rotations.
We look first for a matrix $Z_\psi$ of reflection with respect to the line through the origin and through the point $(\cos\psi, \sin\psi)$. Notice that
$$Z_0 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$
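Solution. Let $(x, y)^T$ stand for an arbitrary real vector. For the matrix $A_1$ we have
$$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ 0 \end{pmatrix},$$
which means that the linear mapping given by this matrix is the projection on the x-axis. Similarly we can see that the matrix $A_2$ determines the reflection with respect to the y-axis, since
$$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -x \\ y \end{pmatrix}.$$
The matrix $A_3$ can be expressed in the form
$$\begin{pmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{pmatrix}$$
for $\psi = \pi/4$, thus it gives the rotation of the plane around the origin through the angle $\pi/4$ (in the positive direction, that is counter-clockwise). □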
Any line going through the origin can be rotated so that it has the direction (1,0), and thus we can write the general reflection matrix as
$$Z_\psi = R_\psi \cdot Z_0 \cdot R_{-\psi},$$
where we first rotate via the matrix $R_{-\psi}$ so that the line is in the "zero" position, then reflect with the matrix $Z_0$, and finally return back with the rotation $R_\psi$.
1.E.26. Show that the composition of an odd number of point reflections in the plane is again a point reflection.
Solution. The point reflection in the plane across the point S is given by the formula $X \mapsto S - (X - S)$, that is $X \mapsto 2S - X$. By successive application of three point reflections across the points S, T and U respectively we obtain
$$X \mapsto 2S - X \mapsto 2T - (2S - X) \mapsto 2U - (2T - (2S - X)) = 2(U - T + S) - X,$$
that is $X \mapsto 2(U - T + S) - X$, which is the point reflection across the point $U - T + S$. The composition of any odd number of point reflections can thus be reduced successively to a single point reflection. (In principle, this is done by mathematical induction; try to formulate it by yourself.) □
1.E.27. Construct a $(2n + 1)$-gon, if the middle points of all its sides are given.
Solution. We use the fact that the composition of an odd number of point reflections is again a point reflection (see the previous exercise). Denote the vertices of the $(2n + 1)$-gon we are looking for by $A_1, A_2, \ldots, A_{2n+1}$ and the middle points of the sides (starting from the middle point of $A_1A_2$) by $S_1, S_2, \ldots, S_{2n+1}$. If we carry out the point reflections across the middle points (from $S_1$ to $S_{2n+1}$), then clearly the point $A_1$ is a fixed point of the resulting point reflection, thus it is its centre. In order to find it, it is enough to apply the resulting point reflection to any point X of the plane. The point $A_1$ then lies in the middle of the line segment $XX'$, where $X'$ is the image of X in that point reflection. The rest of the vertices $A_2, \ldots, A_{2n+1}$ can be obtained by successively reflecting the point $A_1$ across the points $S_1, \ldots, S_{2n}$. □
In the next exercises, we exploit the properties of the determinant of a matrix, cf. 1.5.5 and 1.5.7.
1.E.28. Determine the area of the triangle ABC, if A = [-8,1], B = [-2,0], C = [5,9].
Solution. We know that the area equals one half of the absolute value of the determinant of the matrix whose first column is given by the vector B - A and whose second column is given by the vector C - A, that is, the determinant of the matrix
$$\begin{pmatrix} -2 - (-8) & 5 - (-8) \\ 0 - 1 & 9 - 1 \end{pmatrix}.$$
A simple calculation yields the result
$$\tfrac{1}{2} \left| (-2 - (-8)) \cdot (9 - 1) - (5 - (-8)) \cdot (0 - 1) \right| = \tfrac{61}{2}.$$
Let us add that the change of the order of the vectors leads to change in the sign of the determinant (but the absolute value
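Therefore we can calculate (by associativity of matrix multiplication):
$$\begin{aligned} Z_\psi &= \begin{pmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{pmatrix} \\ &= \begin{pmatrix} \cos\psi & \sin\psi \\ \sin\psi & -\cos\psi \end{pmatrix} \begin{pmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{pmatrix} \\ &= \begin{pmatrix} \cos^2\psi - \sin^2\psi & 2\sin\psi\cos\psi \\ 2\sin\psi\cos\psi & -(\cos^2\psi - \sin^2\psi) \end{pmatrix} = \begin{pmatrix} \cos 2\psi & \sin 2\psi \\ \sin 2\psi & -\cos 2\psi \end{pmatrix}. \end{aligned}$$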
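The last equality follows from the usual formulas for trigonometric functions:
$$\sin 2\psi = 2\sin\psi\cos\psi, \qquad \cos 2\psi = \cos^2\psi - \sin^2\psi. \tag{1}$$
Notice that the product $Z_\psi \cdot Z_0$ gives:
$$\begin{pmatrix} \cos 2\psi & \sin 2\psi \\ \sin 2\psi & -\cos 2\psi \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} = \begin{pmatrix} \cos 2\psi & -\sin 2\psi \\ \sin 2\psi & \cos 2\psi \end{pmatrix}.$$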
This observation can be formulated as follows:
Proposition. A rotation through the angle $\psi$ can be obtained by two subsequent reflections through lines that have the angle $\tfrac{1}{2}\psi$ between them.
In fact we can prove the previous proposition purely by geometrical argumentation, as shown in the above picture (try to be a "synthetic geometer"). If we believe in this proof "by picture", then the above computational derivation of the proposition provides the proof of the standard double angle formulas (1).
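The matrix identities behind the proposition can also be confirmed numerically for a sample angle. The following sketch builds the reflection as the product of a rotation, the reflection through the x-axis and the inverse rotation, and compares the results with the closed forms; all function names here are illustrative.

```python
import math

def matmul(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def rotation(psi):
    """Matrix of the rotation through psi in the positive direction."""
    c, s = math.cos(psi), math.sin(psi)
    return [[c, -s], [s, c]]

Z0 = [[1.0, 0.0], [0.0, -1.0]]   # reflection through the x-axis

def reflection(psi):
    """Reflection through the line with direction (cos psi, sin psi)."""
    return matmul(matmul(rotation(psi), Z0), rotation(-psi))

def close(p, q, tol=1e-12):
    """Entrywise comparison of two 2x2 matrices up to a tolerance."""
    return all(abs(p[i][j] - q[i][j]) <= tol
               for i in range(2) for j in range(2))

psi = 0.7  # arbitrary sample angle in radians
closed_form = [[math.cos(2 * psi), math.sin(2 * psi)],
               [math.sin(2 * psi), -math.cos(2 * psi)]]

print(close(reflection(psi), closed_form))
print(close(matmul(reflection(psi), Z0), rotation(2 * psi)))
```

A check at one angle is of course no proof; it only guards against sign errors in the derivation.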
The following is a recapitulation of previous ideas.
is unchanged) and that the value of the determinant would not change at all if we wrote the vertices as rows (preserving the order). Moreover, the determinant formed by vectors B — A and C — A is always positive if the vertices A, B, C are in the anti-clockwise direction. □
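The determinant formula for the area is short enough to state as code. This is a minimal sketch with exact rational arithmetic; the names `det2` and `oriented_area` are chosen for this illustration, and the data are those of exercise 1.E.28.

```python
from fractions import Fraction

def det2(u, v):
    """Determinant of the 2x2 matrix with columns u and v."""
    return u[0] * v[1] - v[0] * u[1]

def oriented_area(a, b, c):
    """Half of det(b - a, c - a); positive for counter-clockwise a, b, c."""
    u = (b[0] - a[0], b[1] - a[1])
    v = (c[0] - a[0], c[1] - a[1])
    return Fraction(det2(u, v), 2)

A, B, C = (-8, 1), (-2, 0), (5, 9)
area = abs(oriented_area(A, B, C))
print(area)                      # area of the triangle ABC
print(oriented_area(A, C, B))    # swapping two vertices flips the sign
```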
1.E.29. Compute the area S of the quadrilateral given by its vertices [1,1], [6,1], [11,4], [2,4].
Solution. First, denote the vertices (in the counter-clockwise direction) as
A = [1,1], B = [6,1], C = [11,4], D = [2,4].
If we divide the quadrilateral ABCD into the triangles ABC and ACD, we can obtain its area as the sum of the areas of these two triangles, by evaluating the determinants
$$d_1 = \begin{vmatrix} 6 - 1 & 11 - 1 \\ 1 - 1 & 4 - 1 \end{vmatrix} = \begin{vmatrix} 5 & 10 \\ 0 & 3 \end{vmatrix}, \qquad d_2 = \begin{vmatrix} 11 - 1 & 2 - 1 \\ 4 - 1 & 4 - 1 \end{vmatrix} = \begin{vmatrix} 10 & 1 \\ 3 & 3 \end{vmatrix},$$
where the columns are the vectors B - A, C - A (for $d_1$) and C - A, D - A (for $d_2$). Then
$$S = \frac{d_1}{2} + \frac{d_2}{2} = \frac{5 \cdot 3 - 10 \cdot 0}{2} + \frac{10 \cdot 3 - 1 \cdot 3}{2} = \frac{15 + 27}{2} = 21$$
(thanks to the chosen order of the vectors, all the determinants are greater than zero). The correctness of the result is easy to confirm, since the quadrilateral ABCD is a trapezoid with bases of lengths 5 and 9 and height v = 3. □
In the following exercises, we consider non-transparent figures (triangle, quadrangle) in the plane $\mathbb{R}^2$.
We will illustrate the power of the concept of determinant and the oriented area on practical visibility issues in the plane.
1.E.30. Visibility of the sides of a triangle. Let the triangle with the vertices A = [5,6], B = [7,8], C = [5,8] be given. Determine which of its sides are visible from the point P = [0,1].
Solution. Order the vertices in the positive direction, that is counter-clockwise: [5, 6], [7,8], [5,8]. Using the corresponding determinants we can determine whether the point [0,1] lies to the "left" or to the "right" of the sides of the triangle when we view them as oriented line segments.
Mappings that preserve length
1.5.10. Theorem. A linear mapping of the Euclidean plane is composed of one or more reflections if and only if it is given by a matrix R which satisfies
$$R = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad ab + cd = 0, \qquad a^2 + c^2 = b^2 + d^2 = 1.$$
This happens if and only if the mapping preserves length. Rotation is such a mapping if and only if the determinant of the matrix R equals one, which corresponds to an even number of reflections. When there is an odd number of reflections, the determinant equals $-1$.
Proof. We calculate how a general matrix A might look when the corresponding mapping preserves length. That is, we have a mapping
$$\begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \end{pmatrix}.$$
Preserving length thus means that for every x and y, we have
$$x^2 + y^2 = (ax + by)^2 + (cx + dy)^2 = (a^2 + c^2)x^2 + (b^2 + d^2)y^2 + 2(ab + cd)xy.$$
Since this equation is to hold for every x and y, the coefficients of the individual powers x2, y2 and xy on the left and right side of the equation must be equal. Thus we have calculated that the conditions put on the matrix R in the first part of the theorem we are proving are equivalent to the property that the given mapping preserves length.
Because $a^2 + c^2 = 1$, we can assume that $a = \cos\varphi$ and $c = \sin\varphi$ for a suitable angle $\varphi$. As soon as we choose the first column of the matrix R, the relation $ab + cd = 0$ determines the second column up to a multiple. But we also know that the length of the vector in the second column is one, and thus we have only two possibilities for the matrix R, namely:
$$\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}, \qquad \begin{pmatrix} \cos\varphi & \sin\varphi \\ \sin\varphi & -\cos\varphi \end{pmatrix}.$$
In the first case, we have a rotation through the angle $\varphi$; in the second case we have a rotation composed with the reflection through the first coordinate axis. As we have seen in the proposition in 1.5.9, every rotation corresponds to two reflections. The determinant of the matrix R is either one or minus one in these two cases, and it distinguishes between them by the parity of the number of reflections. □
Notice that we have now proved our earlier claim on the invariance of the formulae for distance and angle in any orthonormal coordinates. Moreover, we have seen that all Euclidean affine mappings are generated by translations and reflections.
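$$\begin{vmatrix} B - P & C - P \end{vmatrix} = \begin{vmatrix} 7 & 5 \\ 7 & 7 \end{vmatrix} > 0, \qquad \begin{vmatrix} C - P & A - P \end{vmatrix} = \begin{vmatrix} 5 & 5 \\ 7 & 5 \end{vmatrix} < 0, \qquad \begin{vmatrix} A - P & B - P \end{vmatrix} = \begin{vmatrix} 5 & 7 \\ 5 & 7 \end{vmatrix} = 0.$$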
Not all the determinants are positive, that means P is outside the triangle. In that case, if it is left of some oriented segment (a side of the triangle), the segment is not visible from P (think this over).
Because the last determinant is zero, the points [0,1], [5, 6] and [7, 8] lie on a line, the side AB is thus not visible. The side BC is also not visible, unlike the side AC for which the determinant is negative. □
1.E.31. Which sides of the quadrangle given by the vertices [-2,-2], [1,4], [3,3] and [2,1] are "visible" from the position of the point $X = [3, \pi - 2]$?
Solution. In the first step we order the vertices so that their order corresponds to the counter-clockwise direction. We choose the vertex A = [-2,-2]; the order of the remaining vertices is then B = [2,1], C = [3,3], D = [1,4] (think how to order the points without a picture; you can actually use a procedure similar to what follows). First consider the side AB. Together with the point $X = [3, \pi - 2]$ it determines the matrix
$$\begin{pmatrix} -2 - 3 & 2 - 3 \\ -2 - (\pi - 2) & 1 - (\pi - 2) \end{pmatrix},$$
whose first column is the difference A - X and whose second column is B - X. Whether the side can be "seen" from the point $[3, \pi - 2]$ (i.e. whether the point is to the left or to the right of the oriented line AB, see 1.5.12) is then determined by the sign of the determinant
$$\begin{vmatrix} -2 - 3 & 2 - 3 \\ -2 - (\pi - 2) & 1 - (\pi - 2) \end{vmatrix} = \begin{vmatrix} -5 & -1 \\ -\pi & 3 - \pi \end{vmatrix} = -5 \cdot (3 - \pi) - (-1)(-\pi) < 0.$$
For the side BC we analogously obtain
$$\begin{vmatrix} 2 - 3 & 3 - 3 \\ 1 - (\pi - 2) & 3 - (\pi - 2) \end{vmatrix} = \begin{vmatrix} -1 & 0 \\ 3 - \pi & 5 - \pi \end{vmatrix} = -1 \cdot (5 - \pi) - 0 < 0.$$
And for the sides CD and DA we obtain
$$\begin{vmatrix} 3 - 3 & 1 - 3 \\ 3 - (\pi - 2) & 4 - (\pi - 2) \end{vmatrix} = \begin{vmatrix} 0 & -2 \\ 5 - \pi & 6 - \pi \end{vmatrix} = 0 - (-2) \cdot (5 - \pi) > 0,$$
$$\begin{vmatrix} 1 - 3 & -2 - 3 \\ 4 - (\pi - 2) & -2 - (\pi - 2) \end{vmatrix} = \begin{vmatrix} -2 & -5 \\ 6 - \pi & -\pi \end{vmatrix} = -2 \cdot (-\pi) - (-5) \cdot (6 - \pi) > 0.$$
The determinants differ in signs, thus the point X is outside the given quadrangle, and a side is not visible (from X) if X is to the left of the side. According to our convention of putting vectors
1.5.11. Area of a triangle. At the end of our little trip to geometry we will focus on the area of planar objects. For us, triangles will be sufficient. Every triangle is determined by a pair of vectors v and w, which, if translated so that they start from one vertex P of the triangle, determine the remaining two vertices. We would like to find a formula (a scalar function area), which assigns the number $\mathrm{area}\,\Delta(v, w)$ equal to the area of the triangle $\Delta(v, w)$ defined in the aforementioned way. By translating, we can place P at the origin, since translation does not change the area.
We can see from the statement that the desired value is half of the area of the parallelogram spanned by the vectors v and w. It is easy to calculate (using the well-known formula: base times corresponding height), or simply observe from the diagram, that the following holds:
$$\mathrm{area}\,\Delta(v + v', w) = \mathrm{area}\,\Delta(v, w) + \mathrm{area}\,\Delta(v', w),$$
$$\mathrm{area}\,\Delta(av, w) = a\,\mathrm{area}\,\Delta(v, w).$$
[Figure: same areas; linearity in the arguments]
Finally we add to the formulation of our problem the condition
$$\mathrm{area}\,\Delta(v, w) = -\mathrm{area}\,\Delta(w, v),$$
which corresponds to the idea that we give a sign to the area, according to the order in which we are taking the vectors.
If we write the vectors v and w into the columns of a matrix A, then the mapping
$$A = (v, w) \mapsto \det A$$
satisfies all three conditions we wanted. How many such mappings could there possibly be? Every vector can be expressed using the two basis vectors $e_1 = (1, 0)$ and $e_2 = (0, 1)$. By linearity, $\mathrm{area}\,\Delta$ is uniquely determined by its value on these vectors. We want
$$\mathrm{area}\,\Delta(e_1, e_2) = \tfrac{1}{2}.$$
In other words, we have chosen the orientation and the scale
XA, XB, XC, XD into the determinants, the side is visible if the corresponding determinant is negative (i.e. X is to the right of the oriented side). Thus exactly the sides determined by the pairs of vertices A = [-2,-2], B = [2,1] and B = [2,1], C = [3,3] are visible from the point X. □
1.E.32. Give the sides of the pentagon with vertices at the points [-2,-2], [-2,2], [1,4], [3,1] and [2,-11/6] which are visible from the point [300,1].
Solution. To simplify the notation, put
A = [-2,-2], B = [2,-11/6], C = [3,1], D = [1,4], E = [-2,2].
The sides BC and CD are clearly visible from the position of the point [300,1]. On the other hand, DE and EA cannot be seen. For the side AB, we compute
$$\begin{vmatrix} -2 - 300 & 2 - 300 \\ -2 - 1 & -\tfrac{11}{6} - 1 \end{vmatrix} = -302 \cdot \left(-\tfrac{17}{6}\right) - (-298) \cdot (-3) < 0.$$
This implies that the side can be seen from the point [300,1].
□
F. Relations and mappings
We conclude this chapter by considering briefly some aspects of the language of mathematics. We advise the reader to have a quick look at the definitions of the basic concepts of various relations and their properties, beginning in 1.6.1.
1.F.1. Determine whether the following relations on the set M are equivalence relations:
i) $M = \{f : \mathbb{R} \to \mathbb{R}\}$, where $f \sim g$ if $f(0) = g(0)$.
ii) $M = \{f : \mathbb{R} \to \mathbb{R}\}$, where $f \sim g$ if $f(0) = g(1)$.
iii) M is the set of lines in the plane, where two lines are related if they do not intersect.
iv) M is the set of lines in the plane, where two lines are related if they are parallel.
v) $M = \mathbb{N}$, where $m \sim n$ if $S(m) + S(n) = 20$, where $S(n)$ stands for the sum of the digits of the integer n.
vi) $M = \mathbb{N}$, where $m \sim n$ if $C(m) = C(n)$, where $C(n) = S(n)$ if the sum of the digits $S(n)$ is less than 10, and otherwise we define $C(n) = C(S(n))$. (Thus always $C(n) < 10$.)
Solution.
i) We check the three properties of equivalence:
a) Reflexivity: for any real function f, $f(0) = f(0)$.
b) Symmetry: if $f(0) = g(0)$, then also $g(0) = f(0)$.
through the choice of basis vectors, and we choose the unit square to have area equal to one.
Thus we see that the determinant gives the area of a parallelogram determined by the columns of the matrix A. The area of the triangle is thus one half of the parallelogram.
1.5.12. Visibility in the plane. The previous description of the value for oriented area gives us an elegant tool for determining the position of a point relative to oriented line segments. By an oriented line segment we mean two points in the plane $\mathbb{R}^2$ with a selected order. We can imagine it as an arrow from one point to the other. Such an oriented line segment divides the plane into two half-planes. Let us call them "left" and "right". We want to be able to determine whether a given point is in the left or right half-plane.
Such tasks are often met in computer graphics when dealing with visibility of objects. We can imagine that an oriented line segment can be "seen" from the points to the right of it and cannot be seen from the points to the left of it.
We have the line segment AB and are given some point C. We calculate the oriented area of the corresponding triangle determined by the vectors C — A and B — A. If the point C is to the left of the line segment, then with the usual positive orientation (counter-clockwise) we obtain the negative sign of the oriented area (showing the non-visibility), while the positive sign corresponds to the points to the right.
This approach is often used for testing relative positions in 2D graphics.
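The left/right test can be sketched as follows. The sketch assumes the convention used in the exercises of this chapter: for a polygon with vertices listed counter-clockwise, the determinant with columns A - P and B - P is negative exactly when P lies to the right of the oriented side AB, and then the side is visible from P. The function names `side_sign` and `visible_sides` are illustrative only.

```python
import math

def side_sign(a, b, p):
    """Determinant with columns a - p and b - p.

    Positive: p is to the left of the oriented segment a -> b.
    Negative: p is to the right. Zero: p lies on the line through a and b.
    """
    u = (a[0] - p[0], a[1] - p[1])
    v = (b[0] - p[0], b[1] - p[1])
    return u[0] * v[1] - v[0] * u[1]

def visible_sides(polygon_ccw, p):
    """Sides (as vertex pairs) of a counter-clockwise polygon that are visible from p."""
    n = len(polygon_ccw)
    sides = [(polygon_ccw[i], polygon_ccw[(i + 1) % n]) for i in range(n)]
    return [(a, b) for a, b in sides if side_sign(a, b, p) < 0]

# Data of exercise 1.E.30: triangle A, B, C and the viewpoint P.
triangle = [(5, 6), (7, 8), (5, 8)]
print(visible_sides(triangle, (0, 1)))

# Data of exercise 1.E.31: quadrangle A, B, C, D and the viewpoint X.
quadrangle = [(-2, -2), (2, 1), (3, 3), (1, 4)]
print(visible_sides(quadrangle, (3, math.pi - 2)))
```

For a convex polygon this sign test is all that is needed; for non-convex figures other sides may block the view, which this sketch does not handle.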
6. Relations and mappings
In the final part of this introductory chapter, we return to the formal description of mathematical structures. We will try to illustrate them on examples we already know. We can consider this part to be an exercise in a formal approach to the objects and concepts of mathematics.
1.6.1. Relations between sets. First we define the Cartesian product $A \times B$ of two sets A and B. It is the set of all ordered pairs (a, b) such that $a \in A$ and $b \in B$. A binary relation between the two sets A and B is then a subset R of the Cartesian product $A \times B$.
c) Transitivity: if $f(0) = g(0)$ and $g(0) = h(0)$, then also $f(0) = h(0)$. We conclude that the relation is an equivalence relation.
ii) No. The relation is not reflexive, since for instance for the function sin we have $\sin 0 \neq \sin 1$. It is not transitive.
iii) No. The relation is not reflexive (every line intersects itself). It is not transitive.
iv) Yes. The equivalence classes then correspond to unoriented directions in the plane.
v) No. The relation is not reflexive: $S(1) + S(1) = 2$. It is not transitive.
vi) Yes. □
1.F.2. Let the relation R be defined over $\mathbb{R}^2$ such that $((a, b), (c, d)) \in R$ for arbitrary $a, b, c, d \in \mathbb{R}$ if and only if $b = d$. Determine whether or not this is an equivalence relation. If it is, describe geometrically the partitioning it determines.
Solution. Since $((a, b), (a, b)) \in R$ for all $a, b \in \mathbb{R}$, the relation is reflexive. It is equally easy to see that the relation is symmetric, since in the equality of the second coordinates we can interchange the left and the right side. If $((a, b), (c, d)) \in R$ and $((c, d), (e, f)) \in R$, that is, $b = d$ and $d = f$, we easily get the transitivity condition $((a, b), (e, f)) \in R$, that is $b = f$. The relation R is thus an equivalence relation, where two points in the plane are related if and only if they have the same second coordinate (the line they determine is perpendicular to the y-axis). The corresponding partition then divides the plane into the lines parallel to the x-axis. □
l.F.3. Determine how many distinct binary relations can be defined between the set X and the set of all subsets of X, if the set X has exactly 3 elements.
Solution. First, notice that the set of all subsets of X has exactly $2^3 = 8$ elements, and thus its Cartesian product with X has $8 \cdot 3 = 24$ elements. The possible binary relations then correspond to subsets of this Cartesian product, and there are $2^{24}$ of those. □
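The count can be reproduced by building the sets explicitly. In this sketch the three elements of X are given the hypothetical names 'a', 'b', 'c'; only their number matters.

```python
from itertools import combinations

X = ('a', 'b', 'c')   # any three-element set will do

# All subsets of X, from the empty set up to X itself.
subsets = [frozenset(c) for k in range(len(X) + 1) for c in combinations(X, k)]

# The Cartesian product of X with the set of its subsets.
pairs = [(x, s) for x in X for s in subsets]

# Every subset of the Cartesian product is one binary relation.
number_of_relations = 2 ** len(pairs)

print(len(subsets), len(pairs), number_of_relations)
```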
We write $a \sim_R b$ to mean $(a, b) \in R$, and say that $a$ is related to $b$. The domain of the relation is the subset
$$D = \{a \in A : \exists b \in B,\ (a, b) \in R\}.$$
Here the symbol $\exists b$ means that there is at least one such $b$ satisfying the rest of the claim.
Similarly, the codomain of the relation is the subset
$$I = \{b \in B : \exists a \in A,\ (a, b) \in R\}.$$
A special case of a relation between sets is a mapping from the set $A$ to the set $B$. This is the case when every element of the domain of the relation is related to exactly one element of the codomain. Examples of mappings known to us are all functions, where the codomain of the mapping is a set of numbers, for instance the set of integers or the set of real numbers, or the linear mappings in the plane given by matrices. We write
$$f : D \subseteq A \to I \subseteq B, \quad f(a) = b$$
to express the fact that $(a, b)$ belongs to the relation, and we say that $b$ is the value of $f$ at $a$. Furthermore we say that
• a mapping $f$ of the set $A$ to the set $B$ is surjective (or onto), if $D = A$ and $I = B$,
• a mapping $f$ of the set $A$ to the set $B$ is injective (or one-to-one), if $D = A$ and for every $b \in I$ there exists exactly one preimage $a \in A$, $f(a) = b$.
Expressing a mapping $f : A \to B$ as a relation
$$f \subseteq A \times B, \quad f = \{(a, f(a));\ a \in A\}$$
is also known as the graph of the mapping $f$.
[Figure: sketches of a mapping that is not injective and of a mapping that is not surjective.]
1.F.4. Give the domain $D$ and the codomain $I$ of the relation
$$R = \{(a, v), (b, x), (c, x), (c, u), (d, v), (f, y)\}$$
between the sets $A = \{a, b, c, d, e, f\}$ and $B = \{x, y, u, v, w\}$. Is the relation $R$ a mapping?
1.6.2. Composition of relations and functions. For mappings, the concept of composition is clear. Suppose we have two mappings $f : A \to B$ and $g : B \to C$. Then their composition $g \circ f : A \to C$ is defined as
$$(g \circ f)(a) = g(f(a)).$$
Solution. Directly from the definition of the domain and the codomain of a relation we obtain
$$D = \{a, b, c, d, f\} \subseteq A, \quad I = \{x, y, u, v\} \subseteq B.$$
It is not a mapping since $(c, x), (c, u) \in R$, that is, $c \in D$ has two images. □
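The reasoning in 1.F.4 can be replayed mechanically. The following is a minimal sketch in Python, assuming the relation is stored as a set of ordered pairs; the helper names `domain`, `codomain` and `is_mapping` are chosen here for illustration and do not come from the text.

```python
# Relation R from exercise 1.F.4, stored as a set of ordered pairs.
A = {"a", "b", "c", "d", "e", "f"}
B = {"x", "y", "u", "v", "w"}
R = {("a", "v"), ("b", "x"), ("c", "x"), ("c", "u"), ("d", "v"), ("f", "y")}


def domain(rel):
    """Elements of A that are related to at least one element of B."""
    return {a for (a, _) in rel}


def codomain(rel):
    """Elements of B to which at least one element of A is related."""
    return {b for (_, b) in rel}


def is_mapping(rel):
    """True if every element of the domain has exactly one image.

    The pairs are distinct, so this holds exactly when the number of pairs
    equals the number of distinct first coordinates.
    """
    return len(domain(rel)) == len(rel)


print(sorted(domain(R)), sorted(codomain(R)), is_mapping(R))
```

The element c appears as a first coordinate twice, which is what makes `is_mapping(R)` fail.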
1.F.5. Determine for each of the following relations over the set $\{a, b, c, d\}$ whether it is an ordering and whether it is complete:
$R_1 = \{(a, a), (b, b), (c, c), (d, d), (b, a), (b, c), (b, d)\}$,
$R_2 = \{(a, a), (b, b), (c, c), (d, d), (d, a), (a, d)\}$,
$R_3 = \{(a, a), (b, b), (c, c), (d, d), (a, b), (b, c), (b, d)\}$,
$R_4 = \{(a, a), (b, b), (c, c), (a, b), (a, c), (a, d), (b, c), (b, d), (c, d)\}$,
$R_5 = \{(a, a), (b, b), (c, c), (d, d), (a, b), (a, c), (a, d), (b, c), (b, d), (c, d)\}$.
Solution. $R_1$ is an ordering which is not complete (for instance neither $(a, c) \in R_1$ nor $(c, a) \in R_1$).
The relation $R_2$ is not antisymmetric, since both $(a, d) \in R_2$ and $(d, a) \in R_2$; therefore it is not an ordering (it is an equivalence).
The relation $R_3$ is not an ordering, since it is not transitive (for instance $(a, b), (b, c) \in R_3$, but $(a, c) \notin R_3$). The relation $R_4$ is not an ordering either, since it is not reflexive ($(d, d) \notin R_4$).
The relation $R_5$ is a complete ordering (if we interpret $(a, b) \in R_5$ as $a \leq b$, then $a \leq b \leq c \leq d$). □
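The four defining properties can also be tested directly from the definitions. The sketch below is in Python, with the relations typed in from the statement of 1.F.5; the helper names `pairs` and `properties` are illustrative assumptions, not notation from the text.

```python
from itertools import product

M = "abcd"
diag = {(x, x) for x in M}


def pairs(s):
    """Turn 'ba bc bd' into {('b', 'a'), ('b', 'c'), ('b', 'd')}."""
    return {(p[0], p[1]) for p in s.split()}


# The relations as printed in the exercise statement.
R1 = diag | pairs("ba bc bd")
R2 = diag | pairs("da ad")
R3 = diag | pairs("ab bc bd")
R4 = (diag - {("d", "d")}) | pairs("ab ac ad bc bd cd")
R5 = diag | pairs("ab ac ad bc bd cd")


def properties(R):
    """Return (reflexive, antisymmetric, transitive, complete) for a relation on M."""
    reflexive = diag <= R
    antisymmetric = all(x == y for (x, y) in R if (y, x) in R)
    transitive = all((x, w) in R for (x, y) in R for (z, w) in R if y == z)
    complete = all((x, y) in R or (y, x) in R for x, y in product(M, repeat=2))
    return reflexive, antisymmetric, transitive, complete


for name, R in [("R1", R1), ("R2", R2), ("R3", R3), ("R4", R4), ("R5", R5)]:
    print(name, properties(R))
```

A relation is an ordering when the first three entries are all true, and a complete ordering when the fourth is true as well.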
1.F.6. Determine whether or not the mapping $f$ is injective (one-to-one) or surjective (onto), when
(a) $f : \mathbb{Z} \times \mathbb{Z} \to \mathbb{Z}$, $f((x, y)) = x + y - 10x^2$;
(b) $f : \mathbb{N} \to \mathbb{N} \times \mathbb{N}$, $f(x) = (2x, x^2 + 10)$.
Solution. In case (a) the mapping is surjective (it is enough to set $x = 0$) but not injective (it is enough to set $(x, y) = (0, -9)$ and $(x, y) = (1, 0)$, which both have the value $-9$). In case (b) the mapping is injective (both its coordinates, that is, the functions $y = 2x$ and $y = x^2 + 10$, are clearly increasing over $\mathbb{N}$). The mapping is not surjective (for instance the pair $(1, 1)$ has no preimage). □
1.F.7. In the following three figures, icons are connected with lines such that people in different parts of the world could have assigned them. Determine whether the connection is a
Composition can also be expressed with the notation used for a relation as
$$f \subseteq A \times B, \quad f = \{(a, f(a));\ a \in A\},$$
$$g \subseteq B \times C, \quad g = \{(b, g(b));\ b \in B\},$$
$$g \circ f \subseteq A \times C, \quad g \circ f = \{(a, g(f(a)));\ a \in A\}.$$
The composition of relations is defined in a very similar way. We just add existential quantifiers to the statements, since we have to consider all possible "preimages" and all possible "images". Let $R \subseteq A \times B$, $S \subseteq B \times C$ be relations. Then
$$S \circ R \subseteq A \times C, \quad S \circ R = \{(a, c);\ \exists b \in B,\ (a, b) \in R,\ (b, c) \in S\}.$$
A special case of a relation is the identity relation
$$\mathrm{id}_A = \{(a, a) \in A \times A;\ a \in A\}$$
on the set $A$. It is a neutral element with respect to composition with any relation that has $A$ as its codomain or domain.
[Figure: composition of relations. Points which can be reached by a path from left to right are in the relation.]
For every relation $R \subseteq A \times B$, we define the inverse relation
$$R^{-1} = \{(b, a);\ (a, b) \in R\} \subseteq B \times A.$$
Beware, the same term is used with mappings in a more specific situation. Of course, for every mapping there is its inverse relation, but this relation is in general not a mapping. Therefore we speak about the existence of an inverse mapping if every element $b \in B$ is an image of exactly one element in $A$. In such a case the inverse mapping is exactly the inverse relation.
Note that the composition of a mapping and its inverse mapping (if it exists) is the identity mapping. In general, this is not so for relations.
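To make the last remark concrete, here is a small Python sketch of composition and inversion for relations stored as sets of pairs. The function names and the example mapping `f` are hypothetical and serve only as an illustration.

```python
def compose(S, R):
    """S o R = {(a, c) : there is b with (a, b) in R and (b, c) in S}."""
    return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}


def inverse(R):
    """R^{-1} = {(b, a) : (a, b) in R}."""
    return {(b, a) for (a, b) in R}


def identity(A):
    """The identity relation on the set A."""
    return {(a, a) for a in A}


# Hypothetical example: a mapping from {1, 2, 3} to {"x", "y"} that is not injective.
f = {(1, "x"), (2, "x"), (3, "y")}

# The inverse relation of f is not a mapping, and composing it with f
# relates 1 and 2 to each other, so the result is larger than the identity.
print(sorted(compose(inverse(f), f)))
print(compose(inverse(f), f) == identity({1, 2, 3}))
```

For an injective mapping the same composition would give exactly the identity relation on its domain.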
1.6.3. Relation on a set. In the case when $A = B$ we speak about a relation on the set $A$. We say that the relation $R$ is:
• reflexive, if $\mathrm{id}_A \subseteq R$, that is, $(a, a) \in R$ for every $a \in A$,
• symmetric, if $R^{-1} = R$, that is, if $(a, b) \in R$, then also $(b, a) \in R$,
• antisymmetric, if $R^{-1} \cap R \subseteq \mathrm{id}_A$, that is, if $(a, b) \in R$ and also $(b, a) \in R$, then $a = b$,
mapping, and whether it is injective, surjective or bijective.
Solution. In the first figure the connection is a mapping which is surjective but not injective, because both the snake and the spider are labeled as poisonous. The second figure is not a mapping but only a relation, since the dog is labeled both as a pet and as a meal. The third connection is again a mapping. This time it is neither injective nor surjective. □
1.F.8. Determine the number of mappings from the set $\{1, 2\}$ to the set $\{a, b, c\}$. How many of them are surjective and how many are injective?
Solution. To the element 1 we can assign any of the elements $a, b, c$. Similarly, to the element 2 we can assign any of the elements $a, b, c$. Thus there are exactly $3^2 = 9$ mappings of the set $\{1, 2\}$ to the set $\{a, b, c\}$. None of them can be surjective, since the set $\{a, b, c\}$ has more elements than the set $\{1, 2\}$. A mapping is injective if and only if the elements 1 and 2 are mapped to different elements. There are three possibilities for the image of 1, and after the image of 1 is given, there remain two possibilities for the image of 2. Thus the number of injective mappings of the set $\{1, 2\}$ to the set $\{a, b, c\}$ is 6. □
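The counts in 1.F.8 are small enough to confirm by listing every mapping. A minimal Python sketch, assuming a mapping is represented as a dictionary from $\{1, 2\}$ to $\{a, b, c\}$:

```python
from itertools import product

X = [1, 2]
Y = ["a", "b", "c"]

# A mapping X -> Y is a choice of one image for every element of X.
mappings = [dict(zip(X, images)) for images in product(Y, repeat=len(X))]

# Injective: distinct elements have distinct images.
injective = [f for f in mappings if len(set(f.values())) == len(X)]

# Surjective: every element of Y is an image.
surjective = [f for f in mappings if set(f.values()) == set(Y)]

print(len(mappings), len(injective), len(surjective))
```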
1.F.9. Determine the number of surjective mappings of the set $\{1, 2, 3, 4\}$ to the set $\{1, 2, 3\}$.
Solution. We can determine the number by subtracting the number of non-surjective mappings from the number of all mappings. The number of all mappings is $V(3, 4) = 3^4$. Non-surjective mappings have either a one element, or a two
• transitive, if $R \circ R \subseteq R$, that is, if $(a, b) \in R$ and $(b, c) \in R$ implies $(a, c) \in R$.
A relation is called an equivalence relation if it is reflexive, symmetric and transitive.
A relation is called an ordering if it is reflexive, transitive and antisymmetric. Orderings are usually denoted by the symbol $\leq$, that is, the fact that element $a$ is in relation with element $b$ is written as $a \leq b$.
Notice that the relation $<$, that is "to be strictly smaller than", is not an ordering on the set of real numbers, since it is not reflexive.
A good example of an ordering is set inclusion. Consider the set $2^A$ of all subsets of a finite set $A$. We have a relation $\subseteq$ on the set $2^A$ given by the property "being a subset". Thus $X \subseteq Z$ if $X$ is a subset of $Z$. Clearly all three conditions from the definition of ordering are satisfied: if $X \subseteq Y$ and $Y \subseteq X$ then necessarily $X$ and $Y$ must be identical. If $X \subseteq Y \subseteq Z$ then also $X \subseteq Z$, and reflexivity is clear from the definition.
We say that an ordering $\leq$ on a set $A$ is complete, if every two elements $a, b \in A$ are comparable, that is, either $a \leq b$ or $b \leq a$.
If $A$ contains more than one element, there exist subsets $X$ and $Y$ where neither $X \subseteq Y$ nor $Y \subseteq X$, so the ordering $\subseteq$ is not complete on the set of all subsets of $A$.
The set of real numbers with the usual $\leq$ is completely ordered. Thus the subdomains $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$ come equipped with a complete ordering, too. On the other hand, there is no such natural ordering on $\mathbb{C}$. The absolute value is only a partial ordering there (comparing the radii of the circles in the complex plane).
1.6.4. Partitions of an equivalence. Every equivalence relation $R$ on a set $A$ also defines a partition of the set $A$, consisting of subsets of mutually equivalent elements, namely equivalence classes. For any $a \in A$ we consider the set of elements which are equivalent to $a$, that is
$$[a] = R_a = \{b \in A;\ (a, b) \in R\}.$$
Clearly $a \in R_a$ by reflexivity. If $(a, b) \in R$, then $R_a = R_b$ by symmetry and transitivity. Furthermore, if $R_a \cap R_b \neq \emptyset$ then there is an element $c$ in both $R_a$ and $R_b$, so that $R_a = R_c = R_b$. It follows that for every pair $a, b$, either $R_a = R_b$, or $R_a$ and $R_b$ are disjoint. That is, the equivalence classes are pairwise disjoint. Finally, $A = \bigcup_{a \in A} R_a$. That is, the set $A$ is partitioned into equivalence classes. We sometimes write $[a] = R_a$, and by the above, we can represent an equivalence class by any one of its elements.
1.6.5. Existence of scalars. As before, we assume we know what sets are, and indicate the construction of the natural numbers.
We denote the empty set by $\emptyset$ (notice the difference between the symbol 0 for zero and the symbol $\emptyset$ for the empty set) and define
$$(1) \qquad 0 := \emptyset, \qquad n + 1 := n \cup \{n\},$$
element codomain. There are just three mappings with a one-element codomain. The number of mappings with a two-element codomain is $\binom{3}{2}(2^4 - 2)$ (there are $\binom{3}{2}$ ways to choose the codomain, and for a fixed two-element codomain there are $2^4 - 2$ ways to map four elements onto it). Thus the number of surjective mappings is
$$3^4 - \binom{3}{2}(2^4 - 2) - 3 = 36.$$
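The subtraction argument of 1.F.9 can be compared with a direct enumeration. The following Python sketch is an illustration only; the function name `count_surjections` is an assumption made here.

```python
from itertools import product
from math import comb


def count_surjections(n, m):
    """Count surjective mappings from an n-element set onto an m-element set by enumeration."""
    targets = range(m)
    return sum(1 for images in product(targets, repeat=n) if set(images) == set(targets))


by_enumeration = count_surjections(4, 3)
# All mappings, minus those with a two-element image, minus those with a one-element image.
by_formula = 3**4 - comb(3, 2) * (2**4 - 2) - 3

print(by_enumeration, by_formula)
```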
1.F.10. Write down all the relations over a two-element set {1,2}, which are symmetric but are neither reflexive nor transitive.
Solution. The reflexive relations are exactly those which contain both pairs (1,1), (2,2). This excludes relations
{(1,1), (2,2)}, {(1,1),(2,2),(1,2)}, {(1,1),(2,2),(2,1)},{(1,1),(2,2),(1,2),(2,1)}.
We claim that the remaining relations, which are symmetric but not transitive, must contain both $(1, 2)$ and $(2, 1)$. If such a relation contains one of these two (ordered) pairs, it must by symmetry also contain the other. If it contains neither of these pairs, then it is clearly transitive. From the total number of 16 relations over a two-element set we have thus selected
{(1,2), (2,1)}, {(1,2),(2,1),(1,1)}, {(1,2),(2,1),(2,2)}.
It is clear that each of these 3 relations is symmetric but neither reflexive nor transitive. □
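Since there are only 16 relations on a two-element set, the selection in 1.F.10 can be reproduced by checking all of them. A short Python sketch, with helper names chosen for this illustration:

```python
from itertools import combinations, product

M = [1, 2]
all_pairs = list(product(M, repeat=2))


def is_reflexive(R):
    return all((x, x) in R for x in M)


def is_symmetric(R):
    return all((y, x) in R for (x, y) in R)


def is_transitive(R):
    return all((x, w) in R for (x, y) in R for (z, w) in R if y == z)


# Every relation on M is a subset of the four ordered pairs.
relations = [set(c) for r in range(len(all_pairs) + 1) for c in combinations(all_pairs, r)]
selected = [R for R in relations
            if is_symmetric(R) and not is_reflexive(R) and not is_transitive(R)]

for R in selected:
    print(sorted(R))
```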
1.F.11. Consider the set of numbers that have five digits in binary notation and a relation such that two numbers are related whenever their digit sums have the same parity. Write down the corresponding equivalence classes.
Solution. We have two equivalence classes (of eight members each): $[10000] = \{10000, 10011, 10101, 10110, 11001, 11010, 11100, 11111\}$, which corresponds to the set
$$\{16, 19, 21, 22, 25, 26, 28, 31\},$$
and $[10001] = \{10001, 10010, 10100, 11000, 10111, 11011, 11101, 11110\}$, which corresponds to the set
in other words
$$0 := \emptyset,\ 1 := \{0\},\ 2 := \{0, 1\},\ \dots,\ n + 1 := \{0, 1, \dots, n\}.$$
This notation says that if we have already defined the numbers $0, 1, 2, \dots, n$, then the number $n + 1$ is defined as the set of all previous numbers.
We have defined the set of natural numbers $\mathbb{N}$.³ Next, we should construct the operations $+$ and $\cdot$ and deduce their required properties. In order to do that in detail, we would have to pay more attention to the basic understanding of sets. For example, once we know what a disjoint union of sets is, we may define the natural number $c = a + b$ as the unique natural number $c$ having the same number of elements as the disjoint union of $a$ and $b$.
Of course, formally speaking, we need to explain what it means for two sets to have the same number of elements. Let us notice that in general, two sets $A$ and $B$ having the same "size" should mean that there exists a bijection $A \to B$. This is completely in accordance with our intuition for finite sets. However, it is much less intuitive with infinite sets. For example, there are as many natural numbers as there are natural numbers with natural square roots (the bijection $a \mapsto a^2$), although the example 1.G.1 could be read as "most natural numbers do not have a rational square root". We say that each set which is bijective to the natural numbers $\mathbb{N}$ is countable. Sets bijective to some natural number $n$ (as defined above) are called finite (with number of elements $n$), while the sets which are neither finite nor countable are called uncountable.
We can also define a relation $\leq$ on $\mathbb{N}$ as follows: $m \leq n$, if either $m \in n$ or $m = n$. Clearly this is a complete ordering. For instance $2 \leq 4$, since
$$2 = \{\emptyset, \{\emptyset\}\} \in \{\emptyset, \{\emptyset\}, \{\emptyset, \{\emptyset\}\}, \{\emptyset, \{\emptyset\}, \{\emptyset, \{\emptyset\}\}\}\} = 4.$$
In other words, the recurrent definition itself gives the relation $n \leq n + 1$, and transitivity then gives $n \leq k$ for all $k$ obtained in this manner later.
This ordering of the positive integers or natural numbers (the number $a$ is strictly smaller than $b$ if $a \in b$) obviously has the following striking property: every nonempty subset of $\mathbb{N}$ or $\mathbb{Z}^+$ has a smallest element.
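The set-theoretic construction of the natural numbers can be imitated with nested immutable sets. The sketch below uses Python's `frozenset`; the function name `von_neumann` is a label chosen for this illustration and is not used in the text.

```python
def von_neumann(n):
    """Return the set representing n, built from 0 := {} by the step k + 1 := k | {k}."""
    current = frozenset()
    for _ in range(n):
        current = current | {current}
    return current


two, four = von_neumann(2), von_neumann(4)

# The number n is a set with exactly n elements, namely all smaller numbers.
print(len(four))

# m is strictly smaller than n exactly when m is an element of n.
print(two in four, four in two)
```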
1.6.6. Integers and rational numbers. With the set $\mathbb{N}$ of positive integers together with zero, we can always add two numbers together. Also, adding zero to a number does not change it. We can also define subtraction, but the result does not always belong to $\mathbb{N}$. The basic idea of the construction of the integers from the natural numbers or positive integers is to add to $\mathbb{N}$ these missing results. This can be done as follows: instead of subtraction, we will work with ordered pairs of numbers. It just remains to define which such pairs are equivalent (with respect
{17,18,20,24,23,27,29,30}.
□
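The two classes of 1.F.11 can be generated by grouping the numbers according to the parity of their binary digit sum. A minimal Python sketch:

```python
from collections import defaultdict

classes = defaultdict(list)
for n in range(0b10000, 0b100000):        # all numbers with exactly five binary digits
    digit_sum = bin(n).count("1")
    classes[digit_sum % 2].append(n)

for parity in (1, 0):
    print(parity, classes[parity], [format(n, "b") for n in classes[parity]])
```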
³ The concept of natural numbers based on the principle of "increasing by one" was known to all ancient civilisations; however, they always took one as the smallest natural number. The set-theoretical approach was developed in the 19th century, and there zero became the logical smallest natural number as the counterpart of the empty set.
1.F.12. Consider the set of numbers that have three digits in the ternary notation and a relation such that two numbers are in the relation whenever they
i) begin with the same two digits in this notation,
ii) end with the same two digits in this notation.
Write down the corresponding equivalence classes.
Solution.
i) We obtain six three-element classes
[100] = {100, 101, 102} corresponds to {9, 10, 11}
[110] = {110, 111, 112} corresponds to {12, 13, 14}
[120] = {120, 121, 122} corresponds to {15, 16, 17}
[200] = {200, 201, 202} corresponds to {18, 19, 20}
[210] = {210, 211, 212} corresponds to {21, 22, 23}
[220] = {220, 221, 222} corresponds to {24, 25, 26}.
ii) In this case we have nine two-element classes
[100] = {100, 200} corresponds to {9, 18}
[101] = {101, 201} corresponds to {10, 19}
[102] = {102, 202} corresponds to {11, 20}
[110] = {110, 210} corresponds to {12, 21}
[111] = {111, 211} corresponds to {13, 22}
[112] = {112, 212} corresponds to {14, 23}
[120] = {120, 220} corresponds to {15, 24}
[121] = {121, 221} corresponds to {16, 25}
[122] = {122, 222} corresponds to {17, 26}.
□
1.F.13. Determine the number of equivalence relations over the set $\{1, 2, 3, 4\}$.
Solution. We divide the sought equivalences according to the types of corresponding partitions (given by number and cardinality of equivalence classes), and we count the number of partitions of a given type:
Type of partition	Number of partitions of this type
1,1,1,1	$1$
2,1,1	$\binom{4}{2} = 6$
2,2	$\frac{1}{2}\binom{4}{2} = 3$
3,1	$\binom{4}{1} = 4$
4	$1$
In total we have 15 different equivalences. □
to the result of subtraction). The necessary relation is then:
$$(a, b) \sim (a', b') \iff a - b = a' - b' \iff a + b' = a' + b.$$
Note that the expression in the middle equation may not belong to N, but the expression on the right always does. It is easy to check that it really is an equivalence, and we denote its classes as the integers Z. We define addition and subtraction on Z using representatives. For instance
$$[(a, b)] + [(c, d)] = [(a + c, b + d)],$$
which is clearly independent of the choice of representatives.
It is always possible to choose a representative (a, 0) for natural numbers a, and a representative (0, a) for negative numbers —a. This is probably the simplest and easiest choice.
If we define multiplication of integers similarly to the addition, we have all the properties (CG1)-(CG4) and (R1)-(R4), see the paragraph 1.1.1. For multiplication, the neutral element is one, but for all numbers $a$ other than zero and $\pm 1$ there does not exist an integer $a^{-1}$ with the property $a \cdot a^{-1} = 1$. Thus, for multiplication, we are missing the inverse elements. However, the property of the integral domain (ID) holds. This means that if the product of two integers equals zero, then at least one of them has to be zero.
We can construct the rational numbers $\mathbb{Q}$ by adding all the missing multiplicative inverses by a method analogous to the construction of $\mathbb{Z}$ from $\mathbb{N}$. On the set of all ordered pairs $(p, q)$, $q \neq 0$, of integers, we define a relation $\sim$ so that it models our expectation of the fractions $p/q$:
$$(p, q) \sim (p', q') \iff p/q = p'/q' \iff p \cdot q' = p' \cdot q.$$
Again, we are not able to formulate the expected behaviour in the middle equation when we work in $\mathbb{Z}$, but for the equation on the right this is indeed possible. This relation is a well-defined equivalence (think it through!). If we formally write $p/q$ instead of pairs $(p, q)$, we can define the operations of multiplication and addition by the well-known formulas
$$\frac{p}{q} \cdot \frac{r}{s} = \frac{pr}{qs}, \qquad \frac{p}{q} + \frac{r}{s} = \frac{ps}{qs} + \frac{qr}{qs} = \frac{ps + qr}{qs}.$$
1.6.7. Remainder classes. Another example of equivalence classes is the remainder classes of integers. For a fixed natural number $k$ we define an equivalence $\sim_k$ so that two numbers $a, b \in \mathbb{Z}$ are equivalent if they have the same remainder when divided by $k$. The resulting set of equivalence classes is denoted by $\mathbb{Z}_k$. This procedure is simplest for $k = 2$. This yields $\mathbb{Z}_2 = \{[0], [1]\}$, where zero stands for even numbers and one for odd numbers. It is easy to see that using representatives we can correctly define addition and multiplication for each $\mathbb{Z}_k$.
Remark. In general, the number of partitions of a given $n$-element set is given by the Bell number $B_n$, satisfying the recurrence formula
$$B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k$$
(divide the partitions according to the cardinality of the set to which one fixed element belongs).
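The recurrence for the Bell numbers translates directly into a few lines of Python. This is a sketch only; the function name `bell_numbers` is an assumption made here, and the starting value $B_0 = 1$ (the empty set has exactly one partition) is the usual convention.

```python
from math import comb


def bell_numbers(count):
    """Return [B_0, ..., B_{count-1}] using B_{n+1} = sum over k of C(n, k) * B_k."""
    bell = [1]                      # B_0 = 1
    while len(bell) < count:
        n = len(bell) - 1
        bell.append(sum(comb(n, k) * bell[k] for k in range(n + 1)))
    return bell[:count]


print(bell_numbers(6))
```

The entry $B_4$ of the list is the number of equivalence relations on a four-element set found in 1.F.13.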
1.F.14.   Determine the number of orderings of a four-element set.
Solution. We will consider all possible Hasse diagrams of orderings over a four-element set M. We count how many different orderings (recall that an ordering is a subset of a set M x M) the given Hasse diagram has. See the diagram:
[Figure: the Hasse diagrams of orderings over a four-element set, each with the number of orderings it represents.]
Remainder classes rings and fields
Theorem. The set of remainder classes $\mathbb{Z}_k$ is always a commutative ring of scalars. It is a commutative field of scalars (that is, the property (F) from the paragraph 1.1.1 is also satisfied) if and only if $k$ is a prime.
If $k$ is not prime, then $\mathbb{Z}_k$ contains a divisor of zero, thus it is not an integral domain.
Proof. The second part is easy to see: if $x \cdot y = k$ for natural numbers $x, y$ smaller than $k$, then the classes $[x]$ and $[y]$ are nonzero, while their product $[x] \cdot [y]$ is zero.
On the other hand, if $x$ and $k$ are relatively prime, then according to the Bezout equality (which we derive later, see 11.1.2), there are integers $a$ and $b$ satisfying
$$a x + b k = 1,$$
which for the corresponding equivalence classes gives
$$[a] \cdot [x] + [0] = [a] \cdot [x] = [1],$$
and thus $[a]$ is the inverse element to $[x]$. □
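The proof is constructive: the coefficient $a$ from the Bezout equality is the inverse of $[x]$. The Python sketch below finds it with the extended Euclidean algorithm, a standard technique that the text only refers to; the names `inverse_mod` and `is_field` are hypothetical.

```python
def inverse_mod(x, k):
    """Return a with a * x = 1 in Z_k, or None if x and k are not relatively prime."""
    # Invariant: old_r = old_a * x (mod k) and r = a * x (mod k).
    old_r, r = x % k, k
    old_a, a = 1, 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_a, a = a, old_a - q * a
    # At the end old_r is the greatest common divisor of x and k.
    return old_a % k if old_r == 1 else None


def is_field(k):
    """Z_k is a field exactly when every nonzero class has an inverse."""
    return all(inverse_mod(x, k) is not None for x in range(1, k))


print(inverse_mod(3, 7), is_field(7), is_field(6))
```

For $k = 6$ the classes $[2]$ and $[3]$ are divisors of zero, so neither of them has an inverse.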
In total, there are 219 orderings over a four-element set. □
There are many combinatorics problems which refer to relations. You can find some of them in the additional exercises after this chapter, starting with 1.G.71.
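The total of 219 orderings in 1.F.14 can be cross-checked without Hasse diagrams by testing every reflexive relation on a four-element set for antisymmetry and transitivity. This brute-force Python sketch is a different route from the one used in the solution; the function name `count_orderings` is an assumption.

```python
from itertools import combinations, product


def count_orderings(n):
    """Count reflexive, antisymmetric and transitive relations on an n-element set."""
    elements = range(n)
    diagonal = {(x, x) for x in elements}
    off_diagonal = [(x, y) for x, y in product(elements, repeat=2) if x != y]
    count = 0
    for r in range(len(off_diagonal) + 1):
        for extra in combinations(off_diagonal, r):
            R = diagonal | set(extra)
            antisymmetric = all((y, x) not in R for (x, y) in extra)
            transitive = all((x, w) in R for (x, y) in R for (z, w) in R if y == z)
            if antisymmetric and transitive:
                count += 1
    return count


print(count_orderings(4))
```

For $n = 4$ only $2^{12} = 4096$ candidate relations have to be examined, so the search finishes quickly.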
G. Additional exercises for the whole chapter
1.G.1. Let $t$ and $m$ be positive integers. Show that the number $\sqrt[m]{t}$ is either an integer or is not rational.
Solution. We shall exploit the basic divisibility rules of integers, which we shall discuss in detail later in chapter 11. We show that if the number is not an integer, then it cannot be rational. If $\sqrt[m]{t}$ is not an integer, then there exist a prime $r$ and an integer $s$ such that $r^s$ divides $t$, $r^{s+1}$ does not divide $t$ (we write this as $\mathrm{ord}_r\, t = s$) and $m$ does not divide $s$. Assume that $\sqrt[m]{t} = p/q$, $p, q \in \mathbb{Z}$, in other words $t \cdot q^m = p^m$. Consider $\mathrm{ord}_r L$ and $\mathrm{ord}_r R$ and their divisibility by the number $m$ ($L$ and $R$ denote the left-hand and right-hand side of the equation respectively). □
1.G.2. Find the algebraic form of the expressions:
i)
ii)
(W)' (1+V2i]
o
1.G.3. In the complex plane draw the solutions of the equations:
i) $z = |z|$,
ii) $|z^2 + 1| = 1$,
iii) $\mathrm{Re}\, z = \mathrm{Re}(z + 1)$.
Solution.
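1.G.4. Mark the following sets in the complex plane:
i) $\{z \in \mathbb{C} \mid |z - 1| = |z + 1|\}$,
ii) $\{z \in \mathbb{C} \mid 1 < |z - i| < 2\}$,
iii) $\{z \in \mathbb{C} \mid \mathrm{Re}(z^2) = 1\}$,
iv) $\{z \in \mathbb{C} \mid \mathrm{Re}(1/z) < 1/2\}$.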
Solution.
(i) the imaginary axis,
(ii) the annulus around $i$,
(iii) the hyperbola $a^2 - b^2 = 1$ (where $z = a + bi$),
(iv) the exterior of the unit disc centered at 1.
□
1.G.5. Consider an "assignment" which sends every real number $a$ to a root of $x^2 + x + a = 0$. Does it give a function $\mathbb{R} \to \mathbb{C}$?
Solution. No, the prescription is not unique, there is always a choice from two numbers, except for a = 1/4.
□
1.G.6. Determine the number of ways of placing a white rook and a black rook on a chessboard (of size $8 \times 8$) so that they are neither in the same column nor in the same row.
Solution. First, we can place the white rook in any of $8^2$ positions. Then we have at our disposal $7^2$ positions in which to place the black rook. The total number of ways is $8^2 \cdot 7^2 = 3136$. □
1.G.7. There were six men in the meeting. If all of them shook hands with each other, how many handshakes took place?
Solution. The number of handshakes equals the number of ways of choosing an unordered pair among 6 elements, thus the result is $c(6, 2) = \binom{6}{2} = 15$. □
1.G.8. Determine in how many ways a four-member committee can be chosen among 15 deputies, if it is not allowed for two certain deputies to work together.
Solution. The result is
$$\binom{15}{4} - \binom{13}{2} = 1287.$$
It can be obtained by first calculating the number of all four-member committees and then subtracting the number of those committees where the given two deputies are chosen together (in that case, we only choose two more members among the remaining 13 deputies). □
1.G.9. In how many ways can we divide 8 women and 4 men into two six-member groups (which are considered unordered) in such a way that there is at least one man in each group?
Solution. If we forget the last condition, the division of 12 people into two six-member groups can be done by just choosing 6 people and putting them into the first group, which can be done in $\binom{12}{6}$ ways. The groups are not distinguishable (we do not know which one is the first one), thus the total number is rather $\frac{1}{2} \binom{12}{6}$. In $\binom{8}{2}$ cases all men are in one group (we choose two women among eight to complete the group). The correct answer is thus
$$\frac{1}{2}\binom{12}{6} - \binom{8}{2} = 434.$$
□
1.G.10.   Determine the number of even four-digit numbers composed of exactly two distinct digits.
Solution. Analogously to 1.C.6, we first ignore the peculiarities of the digit zero. We obtain $\binom{5}{2}(2^4 - 2) + 5 \cdot 5 (2^3 - 1)$ numbers (in the first summand, we count the numbers that consist only of even digits; in the second summand we count the even four-digit numbers with one digit even and one digit odd). Again we have to subtract the numbers that start with zero, of which there are $(2^3 - 1) \cdot 4 + (2^2 - 1) \cdot 5$. The final number is thus
$$\binom{5}{2}(2^4 - 2) + 5 \cdot 5 (2^3 - 1) - (2^3 - 1) \cdot 4 - (2^2 - 1) \cdot 5 = 272.$$
□
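The result of 1.G.10 is easy to confirm by running through all even four-digit numbers. A one-line Python check, given here as an illustration:

```python
from math import comb

# Even four-digit numbers whose decimal notation uses exactly two distinct digits.
by_enumeration = sum(1 for n in range(1000, 10000, 2) if len(set(str(n))) == 2)

# The count from the solution: all admissible digit strings minus those starting with zero.
by_formula = comb(5, 2) * (2**4 - 2) + 5 * 5 * (2**3 - 1) - (2**3 - 1) * 4 - (2**2 - 1) * 5

print(by_enumeration, by_formula)
```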
1.G.11. What is the number of 4-digit numbers composed of the digits 1, 3, 5, 6, 7 and 9, where no digit occurs more than once?
Solution. We have 6 distinct digits at our disposal. We ask: how many distinct ordered 4-tuples can be chosen from them? The result is $v(6, 4) = 6 \cdot 5 \cdot 4 \cdot 3 = 360$. □
1.G.12. The Greek alphabet consists of 24 letters. How many words of exactly five letters can be composed in it? (Disregarding whether the words have some actual meaning or not.)
Solution. For each of the five positions in the word we have 24 possibilities, since the letters can repeat. The result is then
$$V(24, 5) = 24^5.$$ □
1.G.13. In a long-distance race, where the racers start one after another in given time intervals, there were k racers, among them 3 friends. Determine the number of starting schedules in which no two of the 3 friends start next to each other. For simplicity assume k > 5.
Solution. The remaining $k - 3$ racers can be ordered in $(k - 3)!$ ways. For the three friends there are then $k - 2$ places (the start, the end and the $k - 4$ gaps) where we can put them in $v(k - 2, 3)$ ways. Using the rule of (combinatorial) product, we obtain
$$(k - 3)! \cdot (k - 2)(k - 3)(k - 4) = (k - 2)! \cdot (k - 3)(k - 4).$$
1.G.14. There are 32 participants of a tournament. The organisers have stated that the participants must divide arbitrarily into four groups, such that the first one has size 10, the second and the third 8, and the fourth 6. In how many ways can this be done?
Solution. We can imagine that from 32 participants we create a row, where first 10 are the first group, next 8 are the second group and so on. There are 32! orderings of all participants. Note that the division into groups is not influenced if we change the order of the people in the same group. Therefore the number of distinct divisions equals
$$P(10, 8, 8, 6) = \frac{32!}{10! \cdot 8! \cdot 8! \cdot 6!}.$$ □
1.G.15. We need to accommodate 9 people in one four-bed room, one three-bed room and one two-bed room. In how many ways can this be done?
Solution. If we assign to the people in the four-bed room the number 1, in the three-bed room number 2 and in the two-bed room number 3, then we create permutations with repetitions from the elements 1, 2, 3, where 1 occurs four times, 2 three times and 3 two times. Number of such permutations is
$$P(4, 3, 2) = \frac{9!}{4! \cdot 3! \cdot 2!} = 1260.$$ □
1.G.16. Determine the number of ways how to divide among three people A, B and C 33 distinct coins such that A and B together have twice as many coins as C.
Solution. From the problem statement it is clear that C must receive 11 coins. That can be done in $\binom{33}{11}$ ways. Each of the remaining 22 coins can be given either to A or to B, which gives $2^{22}$ ways. Using the rule of product we obtain the result
$$\binom{33}{11} \cdot 2^{22}.$$ □
1.G.17.   In how many ways can we divide 40 identical balls among 4 boys?
Solution. Let us add three matches to the 40 balls. If we order the balls and matches in a row, the matches divide the balls into 4 sections. We order the boys at random, give the first boy all the balls from the first section, give the second boy all the balls from the second section and so on. It is now evident that the result is $\binom{43}{3} = 12341$. □
1.G.18. According to quality, we divide food products into groups I, II, III, IV. Determine the number of all possible divisions of 9 food products into these groups, where two divisions are considered distinct if they differ in the number of products in at least one group.
Solution. If we directly write down the groups I, II, III, IV to which the considered products belong, we create combinations with repetition of order nine from four elements. The number of such combinations is $\binom{12}{9} = 220$. □
1.G.19. In how many ways could the table of the first soccer league have ended, if we know only that at least one of the teams Ostrava and Olomouc is placed in the table below the team of Brno (there are 16 teams in the league)?
Solution. Let us first determine the three places where the teams of Brno, Olomouc and Ostrava ended. Those can be chosen in $c(16, 3) = \binom{16}{3}$ ways. From the 6 possible orderings of these three teams on the given three places, only four satisfy the given condition. After that, we can independently choose the order of the remaining 13 teams at the remaining places of the table. Using the rule of product, we have the solution
$$\binom{16}{3} \cdot 4 \cdot 13! = 13948526592000.$$
□
1.G.20. How many distinct orderings (in a row) are there for a picture of a volleyball team (6 players), if
i) Gouald and Bamba want to stand next to each other;
ii) Gouald and Bamba want to stand next to each other and in the middle;
iii) Gouald and Kamil do not want to stand next to each other.
Solution.
i) In this case Gouald and Bamba can be considered a single person; we then just multiply by two to account for their relative order. Thus we have $2 \cdot 5! = 240$ orderings.
ii) Here it is similar, except that the position of Gouald and Bamba is fixed. We have $2 \cdot 4! = 48$ orderings.
iii) Probably the simplest approach is to subtract the cases where Kamil and Gouald stand next to each other (see (i)). We get $6! - 2 \cdot 5! = 720 - 240 = 480$.
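With only $6! = 720$ possible rows, all three answers of 1.G.20 can be verified by enumeration. In the Python sketch below the three remaining players carry placeholder names, since the exercise does not name them.

```python
from itertools import permutations

# "P4", "P5", "P6" are placeholder names for the unnamed players.
players = ["Gouald", "Bamba", "Kamil", "P4", "P5", "P6"]


def adjacent(row, a, b):
    """True if players a and b stand next to each other in the row."""
    return abs(row.index(a) - row.index(b)) == 1


rows = list(permutations(players))
case_i = sum(adjacent(r, "Gouald", "Bamba") for r in rows)
# The two middle positions of a row of six are the positions with index 2 and 3.
case_ii = sum({r.index("Gouald"), r.index("Bamba")} == {2, 3} for r in rows)
case_iii = sum(not adjacent(r, "Gouald", "Kamil") for r in rows)

print(case_i, case_ii, case_iii)
```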
1.G.21. Coin flipping. We flip a coin six times.
i) How many distinct sequences of heads and tails are there?
ii) How many sequences with exactly four heads are there?
iii) How many sequences with at least two heads are there?
o
1.G.22. How many distinct anagrams (rearrangements of letters) of the word "krakatit" are there, such that between the letters "k" there is exactly one other letter?
Solution. In the considered anagrams there are exactly six possibilities for the placement of the two letters "k", since the first of the two "k" can be placed at any of the positions 1 to 6. If we fix the spots for the two "k", then the other letters can be placed arbitrarily, that is, in $P(1, 1, 2, 2)$ ways. Using the rule of product, we have
$$6 \cdot P(1, 1, 2, 2) = 6 \cdot \frac{6!}{2! \cdot 2!} = 1080.$$
□
1.G.23. How many anagrams of the word BASILICA are there, such that there are no two vowels next to each other and no two consonants next to each other?
Solution. Since there are four vowels and four consonants in the word, each such anagram is either of the type BABABABA or ABABABAB. On the given four places we can permute the vowels in $P(2, 2) = \frac{4!}{2! \cdot 2!}$ ways, and independently of that also the consonants ($4!$ ways). Using the rule of product, the result is then $2 \cdot 4! \cdot \frac{4!}{2! \cdot 2!} = 288$. □
1.G.24. In how many ways can we divide 9 girls and 6 boys into two groups such that each group contains at least two boys?
Solution. We divide the boys and the girls independently: $2^9 (2^5 - 7) = 12800$. □
1.G.25. Material is composed of five layers, each of them has fibres in one of the possible six directions. How many of such materials are there? How many of them have no two neighbouring layers which have fibres in the same direction?
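Solution. $6^5$ and $6 \cdot 5^4$ (the first layer can have any of the six directions, and each of the four following layers any of the five directions different from that of the previous layer). □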
1.G.26. For any fixed $n \in \mathbb{N}$ determine the number of all solutions to the equation
$$x_1 + x_2 + \cdots + x_k = n$$
in the set of positive integers.
Solution. If we look for a solution in the domain of positive integers, then we note that the natural numbers $x_1, \dots, x_k$ are a solution to the equation if and only if the non-negative integers $y_i = x_i - 1$, $i = 1, \dots, k$, are a solution to the equation
$$y_1 + y_2 + \cdots + y_k = n - k.$$
Using 1.C.13, there are $\binom{n-1}{k-1}$ of them. □
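For small $n$ and $k$ the number of solutions in 1.G.26 can be compared with the binomial coefficient $\binom{n-1}{k-1}$ by enumeration. A Python sketch, with the function name `count_solutions` chosen for illustration:

```python
from itertools import product
from math import comb


def count_solutions(n, k):
    """Count k-tuples of positive integers with sum n by trying all candidates."""
    return sum(1 for xs in product(range(1, n + 1), repeat=k) if sum(xs) == n)


for n, k in [(5, 2), (6, 3), (7, 4)]:
    print(n, k, count_solutions(n, k), comb(n - 1, k - 1))
```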
1.G.27. There are $n$ forts on a circle ($n > 3$), numbered consecutively with the numbers $1, \dots, n$. At one moment each of the forts shoots at one of its neighbours (fort 1 is also a neighbour of fort $n$). Denote by $P(n)$ the number of all possible results of the shooting (a result of the shooting is the set of numbers of those forts that were hit, regardless of the number of hits taken). Prove that $P(n)$ and $P(n + 1)$ are relatively prime.
Solution. If we denote the forts that were hit by a black dot and the unhit ones by a white dot, the task is equivalent to determining the number of all possible colourings of $n$ dots on a circle with black and white, such that no two white dots have "distance" one. For odd $n$ this number is equal to $K(n)$, the number of colourings with black and white such that no two white dots are adjacent (we reorder the dots such that we start with the dot one and proceed increasingly with odd numbers, and then increasingly with even numbers). For even $n$ this number equals $K(n/2)^2$, the square of the number of colourings of $n/2$ dots on a circle such that no two white dots are adjacent (we colour independently the dots on even positions and on odd positions).
For $K(n)$ we easily derive the recurrent formula $K(n) = K(n - 1) + K(n - 2)$. Furthermore, we can easily compute that $K(2) = 3$, $K(3) = 4$, $K(4) = 7$, that is, $K(2) = F(4) - F(0)$, $K(3) = F(5) - F(1)$, $K(4) = F(6) - F(2)$, and using induction we can easily prove that $K(n) = F(n + 2) - F(n - 2)$, where $F(n)$ denotes the $n$-th member of the Fibonacci sequence ($F(0) = 0$, $F(1) = F(2) = 1$). Since $(K(2), K(3)) = 1$, we have for $n > 3$, similarly as in the Fibonacci sequence,
$$(K(n), K(n-1)) = (K(n) - K(n-1), K(n-1)) = (K(n-2), K(n-1)) = \cdots = 1.$$
Let us now show that for every even $n = 2a$ the number $P(n) = K(a)^2$ is relatively prime to both $P(n + 1) = K(2a + 1)$ and $P(n - 1) = K(2a - 1)$. For this the following is enough: for $a > 2$ we have
$$\begin{aligned}
(K(a), K(2a+1)) &= (K(a), F(2)K(2a) + F(1)K(2a-1)) \\
&= (K(a), F(3)K(2a-1) + F(2)K(2a-2)) = \cdots \\
&= (K(a), F(a+1)K(a+1) + F(a)K(a)) = (K(a), F(a+1)) \\
&= (F(a+2) - F(a-2), F(a+1)) = (F(a+2) - F(a+1) - F(a-2), F(a+1)) \\
&= (F(a) - F(a-2), F(a+1)) = (F(a-1), F(a+1)) = (F(a-1), F(a)) = 1,
\end{aligned}$$
and
$$\begin{aligned}
(K(a), K(2a-1)) &= (K(a), F(2)K(2a-2) + F(1)K(2a-3)) \\
&= (K(a), F(3)K(2a-3) + F(2)K(2a-4)) = \cdots \\
&= (K(a), F(a)K(a) + F(a-1)K(a-1)) = (K(a), F(a-1)) \\
&= (F(a+2) - F(a-2), F(a-1)) = (F(a+2) - F(a), F(a-1)) \\
&= (F(a+2) - F(a+1), F(a-1)) = (F(a), F(a-1)) = 1.
\end{aligned}$$
This proves the claim. □
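For small n the claim of 1.G.27 can be checked by brute force. The following Python sketch is only a sanity check, not part of the proof; the function names `shooting_results` and `K` are chosen here for illustration. It enumerates all 2^n ways the forts can shoot, collects the distinct sets of hit forts, and compares the count with K(n) for odd n and with K(n/2)^2 for even n.

```python
from itertools import product
from math import gcd

def shooting_results(n: int) -> int:
    """Count distinct sets of hit forts when each of n forts on a circle shoots at a neighbour."""
    results = set()
    for shots in product((-1, 1), repeat=n):      # -1: shoot at one neighbour, +1: at the other
        hit = frozenset((i + s) % n for i, s in enumerate(shots))
        results.add(hit)
    return len(results)

def K(n: int) -> int:
    """K(2) = 3, K(3) = 4 and K(n) = K(n - 1) + K(n - 2), as in the solution."""
    a, b = 3, 4
    for _ in range(n - 2):
        a, b = b, a + b
    return a

def check(limit: int = 12) -> bool:
    """Verify the formula for P(n) and the coprimality of P(n), P(n + 1) for 3 <= n <= limit."""
    for n in range(3, limit + 1):
        expected = K(n) if n % 2 else K(n // 2) ** 2
        if shooting_results(n) != expected:
            return False
        if gcd(shooting_results(n), shooting_results(n + 1)) != 1:
            return False
    return True
```

Calling `check()` runs the comparison for n from 3 to 12; larger limits quickly become slow because the enumeration grows like 2^n.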
1.G.28. How much money do I save in a building savings account in five years if I deposit 3000 Kč monthly (on the first day of the month), the yearly interest rate is 3 % and once a year I obtain a state donation of 1500 Kč (the donation arrives on the first of May)?
Solution. Let $x_n$ be the amount of money in the account after n years. Then (for n > 2) we obtain the following recurrence (assuming that every month is exactly one twelfth of a year):
\[ x_{n+1} = 1.03\,x_n + 36000 + 1500 + \underbrace{0.03 \cdot 3000 \left(1 + \tfrac{11}{12} + \cdots + \tfrac{1}{12}\right)}_{\text{interest from deposits this year}} + \underbrace{0.03 \cdot \tfrac{8}{12} \cdot 1500}_{\text{interest from the state donation credited this year}} = 1.03\,x_n + 38115. \]
Therefore
\[ x_n = 38115 \sum_{i=0}^{n-2} (1.03)^i + (1.03)^{n-1} x_1 + 1500, \]
while $x_1 = 36000 + 0.03 \cdot 3000 \left(1 + \tfrac{11}{12} + \cdots + \tfrac{1}{12}\right) = 36585$; in total
\[ x_5 = 38115 \cdot \frac{(1.03)^4 - 1}{0.03} + (1.03)^4 \cdot 36585 + 1500 \approx 202136. \] □
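The recurrence of 1.G.28 is easy to iterate numerically. The following Python sketch follows the simplified model of the solution (every month is one twelfth of a year, the first year brings no donation, the last donation is added without interest); the function name `savings_after` is an illustrative choice.

```python
def savings_after(years: int) -> float:
    """Balance after `years` years under the simplified model of 1.G.28."""
    monthly, rate, donation = 3000, 0.03, 1500
    # The deposit made at the start of month m (m = 0..11) earns interest
    # for (12 - m)/12 of the year.
    deposit_interest = sum(rate * monthly * (12 - m) / 12 for m in range(12))
    # The donation arrives on 1 May and earns interest for 8 months of that year.
    donation_interest = rate * donation * 8 / 12
    x = 12 * monthly + deposit_interest           # x_1 = 36585
    for _ in range(years - 1):
        # x_{n+1} = 1.03 x_n + 38115
        x = (1 + rate) * x + 12 * monthly + donation + deposit_interest + donation_interest
    return x + donation                           # final donation, as in the formula for x_n
```

For five years, `round(savings_after(5))` agrees with the value 202136 obtained above.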
Remark. In reality, interest is computed according to the number of days the money is in the account. You should obtain a real bank statement of a building savings account, determine its interest rate and try to compute the interest credited in a year. Compare the result with the sum that was credited in reality. Compute until the numbers agree ...
1.G.29.   What is the maximum number of areas the plane can be divided into by n circles?
Solution. For the maximum number $p_n$ of areas we derive the recurrence
\[ p_{n+1} = p_n + 2n. \]
Note that the (n + 1)-th circle intersects the n previous circles in at most 2n points (and this can really occur), so it is cut into at most 2n arcs, each of which splits one existing area into two. Clearly $p_1 = 2$. Thus for $p_n$ we obtain
\[ p_n = p_{n-1} + 2(n-1) = p_{n-2} + 2(n-2) + 2(n-1) = \cdots = p_1 + \sum_{i=1}^{n-1} 2i = n^2 - n + 2. \]
1.G.30.   What is the maximum number of areas a 3-dimensional space can be divided into by n planes?
Solution. Let the number be $r_n$. We see that $r_0 = 1$. Similarly to exercise 1.B.4, we consider n planes in space, add another plane and ask what is the maximum number of new areas. Again, it is exactly the number of areas that the new plane intersects. How many can that be? The number of areas intersected by the (n + 1)-th plane equals the number of areas into which this new plane is divided by its lines of intersection with the n planes already situated in space. However, there are at most $\frac{1}{2}(n^2 + n + 2)$ of those (according to the exercise in the plane), thus we obtain the recurrence
\[ r_{n+1} = r_n + \frac{n^2 + n + 2}{2}. \]
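This equation can again be solved directly:
\[ \begin{aligned}
r_n &= r_{n-1} + \frac{(n-1)^2 + (n-1) + 2}{2} = r_{n-1} + \frac{n^2 - n + 2}{2} \\
&= r_{n-2} + \frac{(n-1)^2 - (n-1) + 2}{2} + \frac{n^2 - n + 2}{2} \\
&= r_{n-2} + \frac{n^2}{2} + \frac{(n-1)^2}{2} - \frac{n}{2} - \frac{n-1}{2} + 1 + 1 \\
&= r_{n-3} + \frac{n^2}{2} + \frac{(n-1)^2}{2} + \frac{(n-2)^2}{2} - \frac{n}{2} - \frac{n-1}{2} - \frac{n-2}{2} + 1 + 1 + 1 \\
&= \cdots = r_0 + \frac{1}{2} \sum_{i=1}^{n} i^2 - \frac{1}{2} \sum_{i=1}^{n} i + \sum_{i=1}^{n} 1 \\
&= 1 + \frac{n(n+1)(2n+1)}{12} - \frac{n(n+1)}{4} + n = \frac{n^3 + 5n + 6}{6},
\end{aligned} \]
where we have used the known relation
\[ \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}, \]
which can be easily proved by mathematical induction. □
1.G.31. What is the maximum number of areas a 3-dimensional space can be divided into by n balls? ○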
1.G.32. What is the number of areas a 3-dimensional space is divided into by n mutually distinct planes which all intersect a given point?
Solution. For the number $x_n$ of areas we derive the recurrence
\[ x_n = x_{n-1} + 2(n-1), \]
furthermore $x_1 = 2$, that is,
\[ x_n = n(n-1) + 2. \]
□
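The closed forms in 1.G.29, 1.G.30 and 1.G.32 can be cross-checked against their recurrences. The sketch below, with illustrative function names, simply iterates each recurrence; comparing with $n^2 - n + 2$, $(n^3 + 5n + 6)/6$ and $n(n-1) + 2$ for a range of n is then a one-line loop.

```python
def circles(n: int) -> int:
    """1.G.29: p_1 = 2, p_{k+1} = p_k + 2k."""
    p = 2
    for k in range(1, n):
        p += 2 * k
    return p

def planes(n: int) -> int:
    """1.G.30: r_0 = 1, r_{k+1} = r_k + (k^2 + k + 2)/2."""
    r = 1
    for k in range(n):
        r += (k * k + k + 2) // 2      # k^2 + k is always even, so the division is exact
    return r

def planes_through_point(n: int) -> int:
    """1.G.32: x_1 = 2, x_k = x_{k-1} + 2(k - 1)."""
    x = 2
    for k in range(2, n + 1):
        x += 2 * (k - 1)
    return x

def closed_forms_agree(limit: int = 50) -> bool:
    """Compare each recurrence with its closed form for n = 1..limit."""
    return all(
        circles(n) == n * n - n + 2
        and planes(n) == (n ** 3 + 5 * n + 6) // 6
        and planes_through_point(n) == n * (n - 1) + 2
        for n in range(1, limit + 1)
    )
```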
1.G.33.   From a deck of 52 cards we randomly draw 16 cards. Express the probability that we choose exactly 10 red and 6 black cards.
Solution. We first realize that we do not have to care about the order of the cards. (With ordered choices, both the numerator and the denominator of the resulting fraction would be multiplied by 16!.) The number of all possible (unordered) choices of 16 cards from 52 is $\binom{52}{16}$. Similarly, the number of all choices of 10 cards from 26 is $\binom{26}{10}$, and of 6 cards from 26 it is $\binom{26}{6}$. Since we are choosing independently 10 cards from the 26 red ones and 6 cards from the 26 black ones, the (combinatorial) rule of product gives the result
\[ \frac{\binom{26}{10}\binom{26}{6}}{\binom{52}{16}} \approx 0.118. \] □
1.G.34. In a box there are 7 white, 6 yellow and 5 blue balls. We draw (without returning) 3 balls randomly. Determine the probability that exactly 2 of them are white.
Solution. In total there are $\binom{7+6+5}{3}$ ways to choose 3 balls. Choosing exactly two white balls allows $\binom{7}{2}$ choices of the two white balls and simultaneously $\binom{11}{1}$ choices of the third ball. By the rule of product, the number of ways to choose exactly two white balls is $\binom{7}{2} \cdot \binom{11}{1}$. Thus the result is
\[ \frac{\binom{7}{2}\binom{11}{1}}{\binom{18}{3}} \approx 0.283. \] □
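Both 1.G.33 and 1.G.34 count draws without replacement from a population split into two kinds of items. A minimal Python sketch of this count, using `math.comb` from the standard library (the helper name `hypergeom` and its parameter names are chosen here), could look as follows.

```python
from math import comb

def hypergeom(marked_drawn: int, draws: int, marked_total: int, population: int) -> float:
    """P(exactly `marked_drawn` marked items among `draws` items drawn without replacement)."""
    return (comb(marked_total, marked_drawn)
            * comb(population - marked_total, draws - marked_drawn)
            / comb(population, draws))

p_cards = hypergeom(10, 16, 26, 52)   # 1.G.33: 10 red cards among 16 drawn from 52
p_balls = hypergeom(2, 3, 7, 18)      # 1.G.34: 2 white balls among 3 drawn from 18
```

The two values are approximately 0.118 and 0.283, matching the results above.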
1.G.35. When throwing a die, the result was 4 eleven times in a row. Determine the probability that the twelfth roll results in 4.
Solution. The previous results (according to our assumptions) do not influence the result of further rolls. Thus the probability is 1/6. □
1.G.36. From a deck of 32 cards we randomly draw 6 cards. What is the probability that all of them have the same colour?
Solution. In order to obtain the result
\[ \frac{4\binom{8}{6}}{\binom{32}{6}} \approx 1.236 \cdot 10^{-4}, \]
we first choose one of the 4 colours and realize that there are $\binom{8}{6}$ ways to choose 6 cards from the 8 cards of this colour. □
1.G.37. Three players are given 10 cards each and two remain (from a deck of 32 cards, where 4 of them are aces). Is it more likely, that somebody receives seven, eight and nine of spades; or that two aces remain?
Solution. Since the probability that some of the players receives the three mentioned cards equals
\[ \frac{3\binom{29}{7}}{\binom{32}{10}}, \]
while the probability that two aces remain equals
\[ \frac{\binom{4}{2}}{\binom{32}{2}}, \]
it is more likely that some of the players receives the three mentioned cards. Let us note that the inequality
\[ \frac{3\binom{29}{7}}{\binom{32}{10}} > \frac{\binom{4}{2}}{\binom{32}{2}} \]
can be proved by transforming both sides: by repeated cancellation (after expanding the binomial coefficients according to their definition) we easily obtain 6 > 1. □
1.G.38. We throw n dice. What is the probability that among the numbers that appeared the values 1, 3 and 6 are not present?
Solution. We can reformulate the exercise so that we throw one die n times. The probability that the first roll does not result in 1, 3 or 6 is 1/2. The probability that neither the first nor the second roll results in 1, 3 or 6 is clearly 1/4 (the result of the first roll does not influence the result of the second roll). Since the event determined by the result of a given roll and the event determined by the result of another roll are always (stochastically) independent, the required probability is $1/2^n$. □
1.G.39. Two friends are shooting independently of each other at one target - one shoots, then the second shoots, then the first, and so on. The probability that the first hits is 0.4, the second friend has the probability of hitting 0.3. Determine the probability P of the event that after shooting there will be exactly one hit of the target.
Solution. We determine the result by summing the probabilities of two mutually exclusive events: the first friend hits the target and the second does not; and the second friend hits the target and the first does not. Since the events of hitting are independent (note that independence is preserved when taking complements), the probability of each of these events is the product of the probabilities of the corresponding elementary events. That is,
P = 0.4 · (1 − 0.3) + (1 − 0.4) · 0.3 = 0.46. □
1.G.40.   We flip three coins twelve times. What is the probability that at least one flipping results in three tails?
Solution. If we realize that, when the flipping is repeated, the individual results are independent, and for $i \in \{1, \dots, 12\}$ denote by $A_i$ the event "the i-th flipping results in three tails", we are determining
\[ P\Big(\bigcup_{i=1}^{12} A_i\Big) = 1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_{12})). \]
For every $i \in \{1, \dots, 12\}$ we have $P(A_i) = 1/8$, since each of the three coins shows tails with probability 1/2 independently of the results of the other coins. Now we can write the final probability
\[ 1 - \left(\frac{7}{8}\right)^{12}. \] □
1.G.41. In a particular state there is a parliament with 200 members. Two major political parties in this state flip a coin during an "election" for every seat in the parliament. Each of the parties has associated one side of the coin. What is the probability that each of the parties gains 100 seats? (The coin is "fair".)
Solution. There are $2^{200}$ possible results of the elections (considered as sequences of 200 results of flips). If each party is to obtain 100 seats, then there are exactly 100 tails and 100 heads in the sequence. There are $\binom{200}{100}$ such sequences (since the sequence is uniquely determined by choosing the 100 seats out of 200 which will result in, say, tails). The resulting probability is
\[ \frac{\binom{200}{100}}{2^{200}} = \frac{200!}{100! \cdot 100! \cdot 2^{200}} \approx 0.056. \]
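The numbers in 1.G.41 are far too large for a pocket calculator, but Python's exact integers handle them directly. A short sketch (the variable name `p_tie` is an illustrative choice):

```python
from math import comb

# Exact integer numerator and denominator; the division yields a float.
p_tie = comb(200, 100) / 2 ** 200
```

The value is roughly 0.056, in line with the result above.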
1.G.42. Seven Czechs and five English are randomly divided into two (nonempty) groups. What is the probability that one group consists of Czechs only?
Solution. There are $2^{12} - 1$ possible divisions. If one group consists of Czechs only, it means that all the English are in one group (either in the first or in the second). It remains to divide the Czechs into two nonempty groups, which can be done in $2^7 - 1$ ways. In the end we must add 1 for the division which puts all the English in one group and all the Czechs in the other. The probability is
\[ \frac{2 \cdot (2^7 - 1) + 1}{2^{12} - 1}. \]
□
1.G.43. From ten cards, where exactly one is an ace, we randomly draw a card and put it back. How many times must we do this, so that the probability that the ace is drawn at least once, is greater than 0.9?
Solution. Let $A_i$ be the event "the ace was drawn at the i-th drawing". Since the individual events $A_i$ are (stochastically) independent, we know that
\[ P\Big(\bigcup_{i=1}^{n} A_i\Big) = 1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_n)) \]
for every $n \in \mathbb{N}$. We are looking for an $n \in \mathbb{N}$ such that
\[ P\Big(\bigcup_{i=1}^{n} A_i\Big) = 1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_n)) > 0.9. \]
Clearly $P(A_i) = 1/10$ for any $i \in \mathbb{N}$. Thus it is enough to solve the inequality
\[ 1 - \left(\frac{9}{10}\right)^n > 0.9, \]
from which we can express
\[ n > \frac{\log_a 0.1}{\log_a 0.9}, \quad \text{where } a > 1. \]
Evaluating, we obtain that we must do the drawing at least twenty-two times. □
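Instead of evaluating the logarithms in 1.G.43, the smallest admissible n can be found by simply multiplying. A minimal Python sketch (the function name `draws_needed` is chosen here):

```python
def draws_needed(p_single: float, target: float) -> int:
    """Smallest n with 1 - (1 - p_single)**n > target."""
    n = 0
    miss = 1.0                       # probability of no success in n independent draws
    while 1 - miss <= target:
        n += 1
        miss *= 1 - p_single
    return n
```

With `p_single = 0.1` and `target = 0.9` the function returns 22.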
1.G.44. Texas hold'em. Let us now solve a couple of simple exercises concerning the popular card game Texas hold'em, whose rules we will not state (if the reader does not know them, she can look them up on the Internet). What is the probability that
i) the starting combination is a tuple of the same symbols?
ii) in my starting tuple of cards there is an ace?
iii) in the end I have one of the six best combinations of cards?
iv) I win, if I hold in my hand ace and a triple of twos (of any colour), on the flop there is ace and two twos and on the turn there is a third three and all these four cards have distinct colour? (The last card river is not yet turned)
Solution.
i) The number of distinct symbols is 13 and there are always four cards of each (one of each colour). Thus the number of tuples with the same symbols is $13\binom{4}{2} = 78$. The number of all possible tuples is $\binom{52}{2} = 1326$. The probability of having the same symbols is then $\frac{78}{1326} \approx 0.06$.
ii) One card is an ace, that is four choices, and the second is arbitrary, that is 51 choices. But we have counted twice the tuples with two aces, of which there are $\binom{4}{2} = 6$. Thus we obtain $4 \cdot 51 - 6 = 198$ tuples and the probability is $\frac{198}{1326} \approx 0.15$.
iii) Let us compute the probabilities of the individual best combinations:
ROYAL FLUSH: There are exactly four such combinations, one of each colour. The number of combinations of five cards is $\binom{52}{5} = 2598960$. The probability is thus equal to $1.5 \cdot 10^{-6}$. Very small :)
STRAIGHT FLUSH: A sequence which ends with the highest card in the range 6 to K, that is eight choices for every colour. We obtain $\frac{32}{2598960} \approx 1.2 \cdot 10^{-5}$.
POKER: Four identical symbols, 13 choices (one for every symbol). The fifth card can be arbitrary, that is 48 choices. That makes $\frac{624}{2598960} \approx 2.4 \cdot 10^{-4}$.
FULL HOUSE: Three identical symbols make $13\binom{4}{3} = 52$ choices and two identical symbols make $12\binom{4}{2} = 72$ choices. The probability is $\frac{3744}{2598960} \approx 1.4 \cdot 10^{-3}$.
FLUSH: All five cards of the same colour means $4\binom{13}{5} = 5148$ choices and the probability is then $\frac{5148}{2598960} \approx 2 \cdot 10^{-3}$.
STRAIGHT: The highest card of the sequence is in the range from 6 to Ace, that is 9 choices. The colour of every card is arbitrary, that makes $9 \cdot 4^5 = 9216$ choices. But we have counted both the straight flush and the royal flush, which we would have to subtract.
For determining the probability of one of the six best combinations we do not have to do that; we just do not count the first two combinations. Therefore we obtain the probability approximately $3.5 \cdot 10^{-3} + 2 \cdot 10^{-3} + 1.4 \cdot 10^{-3} + 2.4 \cdot 10^{-4} = 7.14 \cdot 10^{-3}$.
iv) The situation is clearly pretty good, and therefore it is easier to count the bad situations, that is, those in which the opponent has an even better combination. At this moment I have a full house of two aces and three twos. The only combinations that could beat me at this moment are a full house of three aces and two twos, or a poker of twos. That means that the opponent must hold either an ace or the last two. If he has the two and any other card, then he clearly wins no matter what card comes on the river. How many ways are there for this other card in his hand? $3 + 4 + \cdots + 4 + 2 = 45$ (one triple and two aces cannot be in his hand since I have them). There are $\binom{46}{2} = 1035$ remaining combinations, and the probability of such a loss is then 0.043. If he has an ace in his hand, then the following can happen. If he holds two aces, then he again wins if a two is not on the river (then I would have a split poker). The probability of my (conditional) loss is then approximately $10^{-3}$. If the opponent holds an ace and some card other than 2 and A, then it is a draw no matter what is on the river. The total probability of the win is thus almost 96 %. □
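The counts used in parts i) to iii) of 1.G.44 can be reproduced with `math.comb`. The sketch below follows the counting conventions of the solution (for example, eight straight flushes per colour); the dictionary keys and variable names are illustrative.

```python
from math import comb

total = comb(52, 5)                       # 2598960 five-card combinations
counts = {
    "royal flush":    4,
    "straight flush": 8 * 4,              # highest card 6..K, as counted in the solution
    "poker":          13 * 48,
    "full house":     13 * comb(4, 3) * 12 * comb(4, 2),
    "flush":          4 * comb(13, 5),    # this count includes straight and royal flushes
    "straight":       9 * 4 ** 5,         # this count includes flushes in sequence
}

same_symbols = 13 * comb(4, 2) / comb(52, 2)        # part i)
with_ace = (4 * 51 - comb(4, 2)) / comb(52, 2)      # part ii)
best_six = sum(counts[k] for k in ("poker", "full house", "flush", "straight")) / total
```

Summing the unrounded fractions gives `best_six` close to 7.2 · 10^-3; the value 7.14 · 10^-3 in the solution comes from adding the rounded probabilities.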
1.G.45. A volleyball team (with a libero, that is, 7 people) sits in a pub after a match and drinks beer. But there are not enough mugs, so the publican keeps reusing the same seven. What is the probability that
i) exactly one person does not receive the mug he had last round,
ii) nobody receives the mug he had last round,
iii) exactly three receive the mug they had last round.
Solution.
i) If six people receive the mug they had last round, then clearly the seventh person also receives the mug he had last round, the probability is thus zero.
ii) Let M be the set of all orderings, and let the event $A_i$ occur when the i-th person receives his mug from the last round. We want to calculate $|M \setminus \bigcup_i A_i|$. By the inclusion-exclusion principle we obtain $7! \sum_{k=0}^{7} \frac{(-1)^k}{k!} = 1854$, and the probability is $\frac{1854}{5040} = \frac{103}{280} \approx 0.37$.
iii) We choose which three receive the mug they had last round: $\binom{7}{3} = 35$ choices. The remaining four must receive mugs from somebody else. That is again the formula from the previous part; specifically, there are $4! \sum_{k=0}^{4} \frac{(-1)^k}{k!} = 9$ choices. In total we have $9 \cdot 35 = 315$ choices and the probability is $\frac{315}{5040} = \frac{1}{16}$. □
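The three answers of 1.G.45 can be confirmed by listing all 7! = 5040 assignments of mugs. The following Python sketch (function names chosen here) computes the number of permutations without a fixed point from the inclusion-exclusion formula and, independently, counts permutations with a prescribed number of fixed points by brute force.

```python
from itertools import permutations
from math import comb, factorial

def derangements(n: int) -> int:
    """n! * sum_{k=0}^{n} (-1)^k / k!, evaluated in exact integer arithmetic."""
    return sum((-1) ** k * factorial(n) // factorial(k) for k in range(n + 1))

def exactly_fixed(n: int, m: int) -> int:
    """Brute-force count of permutations of n items with exactly m fixed points."""
    return sum(1 for p in permutations(range(n))
               if sum(i == x for i, x in enumerate(p)) == m)
```

For instance, `exactly_fixed(7, 6)` is 0, `exactly_fixed(7, 0)` equals `derangements(7)`, and `exactly_fixed(7, 3)` equals `comb(7, 3) * derangements(4)`.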
1.G.46. In how many ways can we place n identical rooks on a chessboard n x n such that every non-occupied position is threatened by some of the rooks?
Solution. Such placements form a union of two sets: the set of placements where there is at least one rook in every row (and therefore exactly one in every row; this set has $n^n$ elements, since in every row we choose independently one position for the rook), and the set of placements where in every column there is at least one (that is, exactly one) rook (as before, this set has $n^n$ elements). The intersection of these sets has $n!$ elements (the places for the rooks are chosen sequentially starting in the first row: there we have n choices, in the second only n − 1, since one column is already occupied, and so on). Using the inclusion-exclusion principle, we obtain
\[ 2n^n - n!. \]
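For small boards the result of 1.G.46 can be verified by exhaustive search. The Python sketch below (the function name `covering_placements` is an illustrative choice) tries every set of n squares and keeps those placements in which each free square shares a row or a column with some rook.

```python
from itertools import combinations
from math import factorial

def covering_placements(n: int) -> int:
    """Brute-force count of placements of n rooks on an n x n board attacking every free square."""
    squares = [(r, c) for r in range(n) for c in range(n)]
    count = 0
    for rooks in combinations(squares, n):
        rows = {r for r, _ in rooks}
        cols = {c for _, c in rooks}
        # A free square (r, c) is attacked iff its row or its column contains a rook.
        if all(r in rows or c in cols for r, c in squares if (r, c) not in rooks):
            count += 1
    return count
```

For n = 1, 2, 3, 4 the counts can be compared with `2 * n**n - factorial(n)`; beyond that the number of candidate placements grows too quickly for this naive search.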
1.G.47.   Determine the probability that when throwing two dice at least one resulted in four, if the sum is 7.
Solution. We solve this exercise using the classical probability, where the condition is interpreted as restriction of the probability space. The space has due to the condition 6 elements, and exactly 2 of those are favourable to the given event. The answer is thus 2/6 = 1/3. □
1.G.48. We throw two dice. Determine the conditional probability that the first die resulted in five under the condition that the sum is 9. Based on this result, decide whether the events "the first die results in five" and "the sum is 9" are independent.
Solution. If we denote the event "the first die resulted in five" by A and the event "the sum is 9" by H, then
\[ P(A \mid H) = \frac{P(A \cap H)}{P(H)} = \frac{1/36}{4/36} = \frac{1}{4}. \]
Note that the sum 9 occurs when the first die is 3 and the second 6, the first is 4 and the second 5, the first is 5 and the second 4, or the first is 6 and the second 3. Of those four results (which have the same probability) only one is favourable to the event A. Since the probability of A is clearly 1/6 ≠ 1/4, the events are not independent. □
1.G.49. Let us have a deck of 32 cards. If we draw one card twice, what is the probability that the second drawn card is an ace if we return the first card, and what is it if we do not return the first card (then there are 31 cards in the deck)?
Solution. If we return the card to the deck, we are just repeating the experiment, which has 32 possible results (all with the same probability), and exactly four of them are favourable. Thus we see that the probability is 1/8. In the second case, when we do not return the card, the probability is the same. It is enough to consider that, when drawing all the cards one by one, the probability that an ace is the first card is identical to the probability that an ace is the second card. We could also use conditional probability, which results in
\[ \frac{4}{32} \cdot \frac{3}{31} + \frac{28}{32} \cdot \frac{4}{31} = \frac{1}{8}. \] □
1.G.50. Consider families with two children and for simplicity assume that all choices in the set Ω = {bb, bg, gb, gg}, where b stands for "boy" and g stands for "girl" (taking the age of the children into account), have the same probability. Choose the random events
$H_1$: the family has a boy,   $A_1$: the family has two boys.
Compute $P(A_1 \mid H_1)$.
Similarly consider families with three children, where
Ω = {bbb, bbg, bgb, gbb, bgg, gbg, ggb, ggg}.
If
$H_2$: the family has both a boy and a girl,   $A_2$: the family has at most one girl,
decide whether the events $A_2$ and $H_2$ are independent.
Solution. Considering which of the four elements of the set Ω are (not) favourable to the event $A_1$ or $H_1$, we easily obtain
\[ P(A_1 \mid H_1) = \frac{P(A_1 \cap H_1)}{P(H_1)} = \frac{P(A_1)}{P(H_1)} = \frac{1/4}{3/4} = \frac{1}{3}. \]
Further we have to determine whether the following holds:
\[ P(A_2 \cap H_2) = P(A_2) \cdot P(H_2). \]
Again we just have to realize that exactly the elements bbb, bbg, bgb, gbb of the set Ω are favourable to the event $A_2$; to the event $H_2$ the elements bbg, bgb, gbb, bgg, gbg, ggb are favourable; and to the event $A_2 \cap H_2$ the elements bbg, bgb, gbb. Therefore
\[ P(A_2 \cap H_2) = \frac{3}{8} = \frac{1}{2} \cdot \frac{3}{4} = P(A_2) \cdot P(H_2), \]
which means that the events $A_2$ and $H_2$ are independent. □
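Because the sample spaces in 1.G.50 are tiny, both answers can be checked by enumeration. The sketch below uses exact fractions; the helper `prob` and the event names are illustrative and mirror the notation of the exercise.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Classical probability of `event` (a predicate) on a finite uniform sample space."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

two = ["".join(w) for w in product("bg", repeat=2)]      # bb, bg, gb, gg
three = ["".join(w) for w in product("bg", repeat=3)]    # bbb, ..., ggg

def H1(w): return "b" in w                  # the family has a boy
def A1(w): return w == "bb"                 # the family has two boys
def H2(w): return "b" in w and "g" in w     # both a boy and a girl
def A2(w): return w.count("g") <= 1         # at most one girl

p_A1_given_H1 = prob(lambda w: A1(w) and H1(w), two) / prob(H1, two)
independent = prob(lambda w: A2(w) and H2(w), three) == prob(A2, three) * prob(H2, three)
```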
1.G.51. We flip a coin five times. For every head we put a white ball in a hat; for every tail we put a black ball in the same hat. Express the probability that there are more black balls than white balls in the hat, given that there is at least one black ball in the hat.
Solution. Consider the following two events:
A: there are more black balls than white balls in the hat,   H: there is at least one black ball in the hat.
We want to express $P(A \mid H)$. Note that the probability $P(H^c)$ of the complementary event to H is $2^{-5}$, and that the probability of the event A is the same as the probability $P(A^c)$ of the complementary event (there are more white balls in the hat). Necessarily, $P(H) = 1 - 2^{-5}$, $P(A) = 1/2$. Furthermore $P(A \cap H) = P(A)$, since the event H contains the event A (the event A has H as a consequence). Thus we have obtained
\[ P(A \mid H) = \frac{P(A \cap H)}{P(H)} = \frac{1/2}{1 - 2^{-5}} = \frac{16}{31}. \] □
1.G.52. In a box there are 9 red and 7 white balls. Sequentially we draw three balls (without returning). Determine the probability that the first two are red and the third is white.
Solution. We solve this exercise using the theorem about multiplication of probabilities. First we require a red ball, that happens with the probability 9/16. If a red ball was drawn, then in the second round we draw a red ball with the probability 8/15 (there are 15 balls in the box, 8 of them are red). Finally, if two red balls were drawn, the probability that a white ball is drawn is 7/14 (there are 7 white balls and 7 red balls in the box). Thus we obtain
\[ \frac{9}{16} \cdot \frac{8}{15} \cdot \frac{7}{14} = 0.15. \] □
1.G.53. In the box there are 10 balls, 5 of them are black and 5 are white. We will sequentially draw the balls, and we do not return them back. Determine the probability that first we draw a white ball, then a black, then a white and in the last, fourth turn again a white.
Solution. We use the theorem about multiplication of probabilities. In the first round we draw a white ball with the probability 5/10, then a black ball with probability 5/9, then a white ball with probability 4/8 and in the end a white ball with probability 3/7. That gives
\[ \frac{5}{10} \cdot \frac{5}{9} \cdot \frac{4}{8} \cdot \frac{3}{7} = \frac{5}{84}. \] □
1.G.54.   From a deck of 32 cards we randomly draw six cards. Compute the probability that the first king will be chosen as the sixth card (that is, the previous five cards do not contain any king).
Solution. Using the theorem about multiplication of probabilities we have
\[ \frac{28}{32} \cdot \frac{27}{31} \cdot \frac{26}{30} \cdot \frac{25}{29} \cdot \frac{24}{28} \cdot \frac{4}{27} \approx 0.0723. \] □
1.G.55.   What is the probability that the sum of two randomly chosen positive numbers smaller than 1 is smaller than 3/7?
Solution. It is clear that this is a simple exercise on geometrical probability, where the basic space Ω is the square with vertices at [0,0], [1,0], [1,1], [0,1] (we are choosing two numbers in [0,1]). We are interested in the probability of the event that a randomly chosen point [x,y] in this square satisfies x + y < 3/7, that is, the probability that the point lies in the triangle A with vertices at [0,0], [3/7,0], [0,3/7]. Now we can easily compute
\[ P(A) = \frac{\operatorname{vol} A}{\operatorname{vol} \Omega} = \frac{(3/7)^2 / 2}{1} = \frac{9}{98}. \] □
1.G.56. Let a pole be randomly broken into three parts. Determine the probability that the length of the second (middle) part is greater than two thirds of the length of the pole before the breaking.
Solution. Let d stand for the length of the pole. The breaking of the pole is given by the choice of the two points where we split it. Let x be the first point (the one closer to the left end of the pole), and x + y the point where the second splitting occurs. Thus the basic space is the set $\{[x,y];\ x \in (0,d),\ y \in (0, d-x)\}$, that is, a triangle with vertices at [0,0], [d,0], [0,d]. The length of the middle part is given by the value of y. The condition from the exercise statement can now be restated as y > 2d/3, which corresponds to the triangle with vertices at [0, 2d/3], [d/3, 2d/3], [0, d].
The areas of the considered triangles are $d^2/2$ and $(d/3)^2/2$, therefore the probability is
\[ \frac{d^2 / (3^2 \cdot 2)}{d^2 / 2} = \frac{1}{9}. \] □
1.G.57. A pole of length 2 m is randomly divided into three parts. Determine the probability of the event that the third part is shorter than 1.5 m.
Solution. This is an exercise in geometrical probability: we are looking for the probability that the sum of the lengths of the first two parts is greater than one fourth of the length of the pole. We determine the probability of the complementary event, that is, the probability that if we randomly choose two points on the pole, both of them lie in the first quarter of the pole. The probability of this event is $1/4^2$, since the probability of picking a point in the first quarter of the pole is clearly 1/4 and this choice is independently repeated (once). Thus the probability of the original event is 15/16. □
1.G.58. Mirek and Marek have lunch at the school canteen. The canteen is open from 11 to 14. Each of them takes 30 minutes to eat lunch, and their arrival times are random. What is the probability that they meet on a given day, if they always sit at the same table?
Solution. The space of all possible events is a square 3 × 3. Denote by x the arrival time of Mirek and by y the arrival time of Marek; the two meet if and only if |x − y| < 1/2. In the square of possible events this inequality determines a region whose area is 11/36 of the area of the whole square. That is therefore also the probability of the event. □
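The geometric answer of 1.G.58 can be compared with a simulation. The sketch below is only an illustration under the assumptions of the solution (both arrival times uniform on an interval of length 3, measured in hours); the function names, the number of trials and the fixed seed are choices made here.

```python
import random

def meeting_probability(window: float, stay: float,
                        trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(|x - y| < stay) for x, y uniform on [0, window]."""
    rng = random.Random(seed)
    hits = sum(abs(rng.uniform(0, window) - rng.uniform(0, window)) < stay
               for _ in range(trials))
    return hits / trials

def exact(window: float, stay: float) -> float:
    """Exact value: the square minus two corner triangles with legs window - stay."""
    return 1 - ((window - stay) / window) ** 2
```

Here `exact(3, 0.5)` equals 11/36, and `meeting_probability(3, 0.5)` should land close to it; the estimate fluctuates with the seed and the number of trials.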
1.G.59. Honza drives from Brno to Prague, departing at a random time between 12 and 16, and in the same time interval Martin drives from Prague to Brno. Both stop at a rest stop in the middle of the trip for thirty minutes. What is the probability that they meet there, if Honza's speed is 150 km/h and Martin's is 100 km/h? (The distance between Prague and Brno is 200 km.)
Solution. Denote the departure time of Martin by x and the departure time of Honza by y, and, in order to have fewer fractions in the following calculations, choose ten minutes as the time unit; then the base space is a square 24 × 24. Martin arrives at the rest stop at the time x + 6, Honza at the time y + 4. As in the previous exercise, the event that they meet at the rest stop is equivalent to the event that their arrival times do not differ by more than thirty minutes, that is, |(x + 6) − (y + 4)| < 3. This condition determines a region with area $24^2 - \frac{1}{2}(23^2 + 19^2)$ (see the figure), and the probability is
\[ P = \frac{24^2 - \frac{1}{2}(23^2 + 19^2)}{24^2} = \frac{131}{576} \approx 0.227. \]
□
1.G.60. Mirek departs randomly between 10 and 20 o'clock from Brno to Prague. Marek departs randomly in the same interval from Prague to Brno. The trip takes 2 hours. What is the probability that they meet on the road (they use the same road)?
Solution. We proceed analogously to the previous exercise. The space of all events is a square 10 × 10; Mirek, departing at the time x, meets Marek, departing at the time y, if and only if |x − y| < 2. The probability is $p = \frac{10^2 - 8^2}{10^2} = \frac{36}{100} = 0.36$.
□
1.G.61. A two-metre-long pole is randomly divided into three pieces. Determine the probability that a triangle can be built from the pieces.
Solution. As in the previous exercises, the division of the pole is given by the cutting points x and y, and the probability space is again a square 2 × 2. In order to be able to build a triangle from the pieces, the lengths of the parts must satisfy the triangle inequalities, that is, the sum of the lengths of any two parts must be greater than the length of the third part. Since the sum of the lengths is 2 metres, this condition is equivalent to the condition that each part must be shorter than 1 metre. In terms of the cutting points x and y, this means that we can have neither x < 1 and y < 1 simultaneously nor x > 1 and y > 1 simultaneously (this corresponds to the condition that the border parts of the pole are shorter than 1), and also |x − y| < 1 (the middle part is shorter than one). These conditions are satisfied by the shaded area in the picture, which makes up 1/4 of the square.
□
1.G.62.   Do the systems of equations
(a) $4x_1 - \sqrt{3}\,x_2 = 3$, $\quad x_1 - 2\sqrt{7}\,x_2 = -2$;
(b) $4x_1 - \sqrt{3}\,x_2 = 16$, $\quad x_1 - 2\sqrt{7}\,x_2 = -7$;
(c) $4x_1 + 2x_2 = 7$, $\quad -2x_1 - x_2 = -3$
have a unique solution (that is, exactly one)?
Solution. A system of equations is uniquely solvable if and only if the determinant of the matrix given by the left-hand side coefficients is nonzero. Therefore, the coefficients on the right-hand side do not influence the uniqueness of the solution. Thus we must have the same answer in (a) and (b). Since
\[ \begin{vmatrix} 4 & -\sqrt{3} \\ 1 & -2\sqrt{7} \end{vmatrix} = 4 \cdot (-2\sqrt{7}) - (-\sqrt{3} \cdot 1) \neq 0, \qquad \begin{vmatrix} 4 & 2 \\ -2 & -1 \end{vmatrix} = 4 \cdot (-1) - (2 \cdot (-2)) = 0, \]
for (a) and (b) there is a unique solution and in (c) there is not. If we multiply the second equation in (c) by −2, we see that it has no solution at all. □
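The determinant test of 1.G.62 takes only a few lines of Python. This is a minimal sketch with an illustrative helper `det2`; note that the first determinant is computed in floating point, so it is compared with zero only in the sense of being clearly far from it.

```python
from math import sqrt

def det2(a: float, b: float, c: float, d: float) -> float:
    """Determinant of the 2 x 2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

det_ab = det2(4, -sqrt(3), 1, -2 * sqrt(7))   # coefficient matrix of systems (a) and (b)
det_c = det2(4, 2, -2, -1)                    # coefficient matrix of system (c)
```

Here `det_ab` equals $\sqrt{3} - 8\sqrt{7}$, which is about −19.4, while `det_c` is exactly 0.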
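1.G.63.   Determine A · A for
\[ A = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}, \quad \text{where } \varphi \in \mathbb{R}. \]
Solution. We know that the mapping
\[ \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \quad x, y \in \mathbb{R}, \]
is the rotation of the plane $\mathbb{R}^2$ around the origin through the angle φ in the positive direction. Since matrix multiplication is associative, we obtain that the mapping
\[ \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \]
is a rotation through the angle 2φ. That means that
\[ A \cdot A = \begin{pmatrix} \cos 2\varphi & -\sin 2\varphi \\ \sin 2\varphi & \cos 2\varphi \end{pmatrix}. \]
Note that we could have multiplied A · A directly (and applied the formulas for the sine and cosine of the double angle). But repeating the aforementioned method (or using mathematical induction) yields
\[ A^n = \begin{pmatrix} \cos n\varphi & -\sin n\varphi \\ \sin n\varphi & \cos n\varphi \end{pmatrix}, \quad n = 2, 3, \dots, \]
more easily (we set $A^2 = A \cdot A$, $A^3 = A \cdot A \cdot A$, etc.). □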
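The formula for the powers of a rotation matrix in 1.G.63 can be tested numerically for a sample angle. The sketch below multiplies 2 × 2 matrices by hand, so it needs nothing beyond the standard library; all function names are illustrative.

```python
from math import cos, sin, isclose

def rotation(phi: float):
    """Matrix of the rotation of the plane through the angle phi."""
    return [[cos(phi), -sin(phi)], [sin(phi), cos(phi)]]

def matmul(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def power(X, n: int):
    """X multiplied by itself n times in total (n >= 1)."""
    result = X
    for _ in range(n - 1):
        result = matmul(result, X)
    return result

def close(X, Y) -> bool:
    """Entrywise comparison up to rounding error."""
    return all(isclose(X[i][j], Y[i][j], abs_tol=1e-12) for i in range(2) for j in range(2))
```

For example, `close(power(rotation(0.7), 5), rotation(5 * 0.7))` compares the fifth power with the rotation through five times the angle.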
1.G.64. The parallelogram identity. Calculation in coordinates can be useful in plane geometry. Let us demonstrate this on the proof of the "parallelogram identity": if $u, v \in \mathbb{R}^2$, then
\[ 2(\|u\|^2 + \|v\|^2) = \|u + v\|^2 + \|u - v\|^2. \]
Thus the sum of the squares of the lengths of the diagonals of a parallelogram equals the sum of the squares of the lengths of its four sides.
Solution. Writing both sides of the equation in the coordinates $u = (u_1, u_2)$, $v = (v_1, v_2)$ yields
\[ \begin{aligned}
\|u + v\|^2 + \|u - v\|^2 &= (u_1 + v_1)^2 + (u_2 + v_2)^2 + (u_1 - v_1)^2 + (u_2 - v_2)^2 \\
&= u_1^2 + 2u_1v_1 + v_1^2 + u_2^2 + 2u_2v_2 + v_2^2 + u_1^2 - 2u_1v_1 + v_1^2 + u_2^2 - 2u_2v_2 + v_2^2 \\
&= 2(u_1^2 + u_2^2 + v_1^2 + v_2^2) = 2(\|u\|^2 + \|v\|^2).
\end{aligned} \]
□
1.G.65.   Compute the area S of a quadrilateral given by the vertices
[0,-2],   [-1,1],   [1,5], [1,-1].
Solution. In the usual notation
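A = [0,−2],   B = [1,−1],   C = [1,5],   D = [−1,1]
and the usual division of the quadrilateral into the triangles ABC and ACD with areas $S_1$ and $S_2$, we obtain
\[ S = S_1 + S_2 = \frac{1}{2} \begin{vmatrix} 1-0 & 1-0 \\ -1+2 & 5+2 \end{vmatrix} + \frac{1}{2} \begin{vmatrix} 1-0 & -1-0 \\ 5+2 & 1+2 \end{vmatrix} = \frac{1}{2}(7 - 1) + \frac{1}{2}(3 + 7) = 8. \]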
□
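1.G.66.   Determine the area of the quadrilateral ABCD with vertices A = [1,0], B = [11,13], C = [2,5] and D = [−2,−5].
Solution. We divide the quadrilateral into two triangles ABC and ACD. We compute their areas by computing the absolute values of the determinants, see 1.5.11:
\[ S = \frac{1}{2} \left| \begin{vmatrix} 1 & 5 \\ 10 & 13 \end{vmatrix} \right| + \frac{1}{2} \left| \begin{vmatrix} 1 & 5 \\ -3 & -5 \end{vmatrix} \right| = \frac{37}{2} + \frac{10}{2} = \frac{47}{2}. \]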
1.G.67.   Compute the area of the parallelogram with vertices at [5,5], [6,8] and [6,9].
Solution. Although such a parallelogram is not uniquely determined (the fourth vertex is not given), the triangle with vertices at [5,5], [6,8] and [6,9] must necessarily be one half of every parallelogram with these three vertices (one of the sides of the triangle becomes a diagonal of the parallelogram). Therefore the area equals the determinant
\[ \begin{vmatrix} 6-5 & 6-5 \\ 8-5 & 9-5 \end{vmatrix} = \begin{vmatrix} 1 & 1 \\ 3 & 4 \end{vmatrix} = 1 \cdot 4 - 1 \cdot 3 = 1. \] □
1.G.68. Give the area of a meadow, which is determined on the area map by the points at positions [—7,1], [—1,0], [29,0], [25,1], [24,2] and [17,5]. (Ignore the measurement units. They are determined by the ratio of the area map to the reality.)
Solution. The given hexagon can be divided into four triangles with vertices at
[-7,1], [-1,0], [17, 5];       [-1,0], [24, 2], [17, 5];
[-1,0], [25,1], [24, 2];       [-1, 0], [29, 0], [25,1].
The areas are 24, 89/2, 27/2 and 15 respectively, which gives the total area as
\[
24 + 44\tfrac12 + 13\tfrac12 + 15 = 97.
\]
□
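The triangle-by-triangle computation above can be cross-checked with a short sketch. It sums the same 2x2 determinants over consecutive vertices (often called the shoelace formula) and assumes that the vertices are listed in order along the boundary; the names `polygon_area`, `meadow` and `quadrilateral` are illustrative only.

```python
from fractions import Fraction

def polygon_area(vertices):
    """Area of a simple polygon whose vertices are listed in boundary order."""
    twice_area = 0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # 2x2 determinant built from two consecutive vertices
        twice_area += x1 * y2 - x2 * y1
    return Fraction(abs(twice_area), 2)

# The hexagon of 1.G.68 and the quadrilateral A, B, C, D of 1.G.65.
meadow = [(-7, 1), (-1, 0), (29, 0), (25, 1), (24, 2), (17, 5)]
quadrilateral = [(0, -2), (1, -1), (1, 5), (-1, 1)]
print(polygon_area(meadow), polygon_area(quadrilateral))
```

Exact fractions are used so that half-integer areas such as 89/2 are not rounded.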
1.G.69. Determine the area of the triangle $A_2A_3A_{11}$, where $A_0A_1\dots A_{11}$ are the vertices of a regular dodecagon inscribed in a circle of radius 1.
Solution. The vertices of the dodecagon can be identified with the twelfth roots of 1 in the complex plane. As in ?? we find $A_2 = \cos(\pi/3) + i\sin(\pi/3) = 1/2 + i\sqrt{3}/2$, $A_3 = \cos(\pi/2) + i\sin(\pi/2) = i$, $A_{11} = \cos(-\pi/6) + i\sin(-\pi/6) = \sqrt{3}/2 - i/2$, which means that the coordinates of these points in the complex plane are $A_2 = [1/2, \sqrt{3}/2]$, $A_3 = [0,1]$, $A_{11} = [\sqrt{3}/2, -1/2]$. According to the formula for the area of a triangle, the area $S$ of the triangle is
\[
S = \frac12 \left| \det \begin{pmatrix} A_2 - A_{11} \\ A_3 - A_{11} \end{pmatrix} \right|
= \frac12 \left| \det \begin{pmatrix} \frac12 - \frac{\sqrt3}{2} & \frac{\sqrt3}{2} + \frac12 \\ -\frac{\sqrt3}{2} & \frac32 \end{pmatrix} \right|
= \frac{3 - \sqrt3}{4}.
\]
□
1.G.70. Determine which sides of the quadrilateral with vertices A = [95,99], B = [130,106], C = [40,60], D = [130,120] are visible from the point [2,0]. O
1.G.71.   Determine the number of relations over the set {1,2,3,4}, which are both symmetric and transitive.
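Solution. A relation with the given properties is an equivalence over some subset of the set {1,2,3,4}. Summing the numbers of equivalences over all subsets, grouped by the size of the subset, we obtain in total
\[
1 + \binom{4}{1} + \binom{4}{2}\cdot 2 + \binom{4}{3}\cdot 5 + 15 = 52.
\]
□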
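The count can be confirmed by brute force. The following is a minimal sketch that enumerates all relations on a small set and keeps those that are symmetric and transitive; the function name `count_symmetric_transitive` is hypothetical and the approach is only feasible for very small sets.

```python
def count_symmetric_transitive(n):
    """Count relations on {0, ..., n-1} that are both symmetric and transitive."""
    elems = range(n)
    pairs = [(a, b) for a in elems for b in elems]
    count = 0
    for mask in range(1 << len(pairs)):
        # decode the bit mask into a set of ordered pairs
        rel = {p for i, p in enumerate(pairs) if (mask >> i) & 1}
        if not all((b, a) in rel for (a, b) in rel):
            continue  # not symmetric
        transitive = all(
            (a, d) in rel
            for (a, b) in rel
            for (c, d) in rel
            if b == c
        )
        if transitive:
            count += 1
    return count

print(count_symmetric_transitive(4))
```

The symmetry test is done first because it discards most candidates cheaply before the more expensive transitivity test.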
1.G.72. Determine the number of ordering relations over a three-element set. O
1.G.73. Determine the number of ordering relations over the set {1,2,3,4} such that the elements 1 and 2 are not comparable (that is, neither $1 \prec 2$ nor $2 \prec 1$, where $\prec$ stands for the ordering relation). O
1.G.74.   Determine the number of surjective mappings $f$ from the set {1,2,3,4,5} to the set {1,2,3} such that $f(1) = f(2)$.
Solution. Every such mapping is uniquely given by the images of the elements 1, 3, 4, 5. Hence there are exactly as many such mappings as there are surjective mappings of the set {1,3,4,5} to the set {1,2,3}, that is, 36, as we know from the previous exercise. □
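As a sanity check of this reduction, here is a small brute-force sketch that lists all mappings and counts the surjective ones, optionally under an extra condition. The helper `count_surjections` and its `condition` parameter are illustrative names, not part of the text.

```python
from itertools import product

def count_surjections(domain, codomain, condition=lambda f: True):
    """Count surjective mappings domain -> codomain that satisfy `condition`."""
    count = 0
    for images in product(codomain, repeat=len(domain)):
        f = dict(zip(domain, images))  # the mapping as a dictionary
        if set(images) == set(codomain) and condition(f):
            count += 1
    return count

with_constraint = count_surjections(
    [1, 2, 3, 4, 5], [1, 2, 3], condition=lambda f: f[1] == f[2]
)
four_to_three = count_surjections([1, 3, 4, 5], [1, 2, 3])
print(with_constraint, four_to_three)
```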
1.G.75.   Give all the elements in $S \circ R$, if
$R = \{(2,4), (4,4), (4,5)\} \subset \mathbb{N}\times\mathbb{N}$, $S = \{(3,1), (3,2), (3,5), (4,1), (4,4)\} \subset \mathbb{N}\times\mathbb{N}$.
Solution. Considering all choices of two ordered pairs
(2,4), (4,1);    (2,4), (4,4);    (4,4), (4,1);    (4,4), (4,4)
such that the second element of the first ordered pair (a member of $R$) equals the first element of the second ordered pair (a member of $S$), we obtain
$S \circ R = \{(2,1), (2,4), (4,1), (4,4)\}$.
□
1.G.76.   Let a binary relation be given
$R = \{(0,4), (-3,0), (5,\pi), (5,2), (0,2)\}$ between the sets $A = \mathbb{Z}$ and $B = \mathbb{R}$. Express $R^{-1}$ and $R \circ R^{-1}$.
Solution. We can immediately see that
$R^{-1} = \{(4,0), (0,-3), (\pi,5), (2,5), (2,0)\}$.
Furthermore,
$R \circ R^{-1} = \{(4,4), (0,0), (\pi,\pi), (2,2), (4,2), (\pi,2), (2,\pi), (2,4)\}$.
1.G.77.   Decide whether the relation R determined by the condition:
(a) $(a,b) \in R \iff |a| < |b|$;
(b) $(a,b) \in R \iff |a| = |2b|$
over the set of integers $\mathbb{Z}$ is transitive.
Solution. In the first case $R$ is transitive, because
$|a| < |b|,\ |b| < |c| \implies |a| < |c|$.
In the second case $R$ is not transitive. For instance, consider
$(4,2), (2,1) \in R$, $(4,1) \notin R$.
□
□
1.G.78.   Find all relations over M = {1,2}, which are not antisymmetric. Which of them are transitive?
Solution. There are four relations that are not antisymmetric. They are exactly subsets of the set {1,2} x {1,2}, which contain the elements (1,2), (2,1) (otherwise the condition of antisymmetry is satisfied). Of these four only the relation
{(1,1), (1,2), (2,1), (2, 2)} = MxM,
is transitive, because not containing tuples (1,1) and (2,2) in a transitive relation means that the relation cannot contain both
(1,2) and (2,1). □
1.G.79. We have a set {3,4,5, 6,7}. Write explicitly the relations
i) a divides b,
ii) Either a divides b or b divides a,
iii) a and b have a common divisor greater than one,
and examine their properties. O
1.G.80.   Is there an equivalence relation, which is also an ordering, over the set of all lines in the plane?
Solution. An equivalence relation (or ordering relation) must be reflexive, therefore every line must be in relation with itself. Furthermore we require that the relation is both symmetric (equivalence) and antisymmetric (ordering). That means that a line can be in relation only with itself. If we define the relation such that two lines are in relation if and only if they are identical, we obtain "very natural" relation which is both equivalence relation and ordering. We just need to check that it is transitive, which it trivially is. Thus the only relation satisfying the problem statement is the identity over the set of all lines in the plane. □
1.G.81.   Determine whether the relation
$R = \{(k,l) \in \mathbb{Z}\times\mathbb{Z};\ |k| \ge |l|\}$ over the set $\mathbb{Z}$ is an equivalence and/or an ordering.
Solution. The relation $R$ is not an equivalence: it is not symmetric (take $(6,2) \in R$, $(2,6) \notin R$); it is not an ordering: it is not antisymmetric (take $(2,-2) \in R$, $(-2,2) \in R$). □
1.G.82. Show that the intersection of any equivalence relations over a set $X$ is again an equivalence relation, and that the union of two ordering relations over a set $X$ does not have to be an ordering.
Solution. We see that the intersection of equivalence relations is reflexive, symmetric and transitive: all the equivalence relations must contain the tuple $(x,x)$ for every $x \in X$, therefore the intersection contains that tuple too. If the element $(x,y)$ is in the intersection, then the element $(y,x)$ is also in the intersection (just use the fact that every equivalence is symmetric). If the tuples $(x,y)$ and $(y,z)$ are in the intersection, then both are in each of the equivalences as well. Since the equivalences are transitive, they all contain the element $(x,z)$ and thus that element is also in the intersection. If we choose $X = \{1,2\}$ and the ordering relations
$R_1 = \{(1,1), (2,2), (1,2)\}$,   $R_2 = \{(1,1), (2,2), (2,1)\}$
over $X$, we obtain the relation
$R_1 \cup R_2 = \{(1,1), (2,2), (1,2), (2,1)\}$, which is not antisymmetric, thus not an ordering. □
1.G.83. Over the set $M = \{1,2,\dots,19,20\}$ there is an equivalence relation $\sim$ such that $a \sim b$ for any $a, b \in M$ if and only if the first digits of the numbers $a$, $b$ are the same. Construct the partition given by this equivalence.
Solution. Two numbers from the set M are in the same equivalence class if and only if they are in the relation (first digit is the same). Therefore the partition consists of the sets
{1,10,11,..., 18,19}, {2,20}, {3}, {4}, {5}, {6}, {7}, {8}, {9}.
1.G.84. We are given a partition of the set $X = \{a,b,c,d,e\}$ into two classes $\{b,c\}$, $\{a,d,e\}$. Write down the equivalence relation $R$ over the set $X$ which gives this partition.
Solution. The equivalence $R$ is determined by the fact that two elements are in relation if and only if they are in the same partition class (note also that $R$ must be symmetric), and every element is in relation with itself ($R$ must be reflexive). Therefore $R$ contains exactly
(a,a), (b,b), (c,c), (d,d), (e,e), (b,c), (c,b), (a,d), (a,e), (d,a), (d,e), (e,a), (e,d).
1.G.85.   Let $\{a,b,c,d\}$ be a set with a relation
{(a,a), (b,b), (a,b), (b,c), (c,b)}.
What is the minimal number of elements we have to add to the relation in order to make it an equivalence?
Solution. Let us successively ensure the three properties that define an equivalence. First comes reflexivity: we must add the tuples (c,c) and (d,d). Second is symmetry: we must add (b,a). For the third step we must form the so-called transitive closure. Since a is in relation with b and b is in relation with c, we must add (a,c) and (c,a). Altogether we have to add five elements. □
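1.G.86. What is the maximal domain $D \subset \mathbb{R}$ and codomain $H \subset \mathbb{R}$ such that the following mappings are bijective, and what is then the inverse function?
i) $x \mapsto x^4$
ii) $x \mapsto x^3$
iii) $x \mapsto \frac{1}{x+1}$
Solution.
i) $D = [0,\infty)$ and $H = [0,\infty)$, or also $D = (-\infty,0]$ and $H = [0,\infty)$. The inverse function is then $x \mapsto \sqrt[4]{x}$ (respectively $x \mapsto -\sqrt[4]{x}$ for the second choice of $D$).
ii) $D = H = \mathbb{R}$ and the inverse function is $x \mapsto \sqrt[3]{x}$.
iii) $D = \mathbb{R}\setminus\{-1\}$ and $H = \mathbb{R}\setminus\{0\}$. The inverse function is $x \mapsto \frac{1}{x} - 1$. □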
1.G.87.   Consider a relation in $\mathbb{R}\times\mathbb{R}$. A point $(x,y)$ is in the relation whenever it holds that
$(x-1)^2 + (y+1)^2 = 1$. Can we describe the points using a function $y = f(x)$? Depict the points in the relation.
Solution. We cannot, because for instance $y = -1$ has two preimages: $x = 0$ and $x = 2$. The points lie on a circle with the centre at the point $(1,-1)$ and radius 1. □
1.G.88. For any two integers $k$, $l$ let $(k,l) \in R$ whenever the number $4k - 4l$ is an integral multiple of 7. Is such a relation over $\mathbb{Z}$ an equivalence? Is it an ordering?
Solution. Note that two integers are in the relation $R$ if and only if they have the same remainder under division by 7. Therefore it is an example of the so-called residue classes of integers, and we know that the relation $R$ is an equivalence relation. Its symmetry (for instance, $(3,10), (10,3) \in R$, $3 \ne 10$) implies that it is not an ordering. □
1.G.89. Let a relation $R$ be defined over the set $N = \{3,4,5,\dots,n,n+1,\dots\}$, such that two numbers are in the relation whenever they are relatively prime (that is, the prime decompositions of the numbers do not contain any common number). Determine whether this relation is reflexive, symmetric, antisymmetric, transitive.
Solution. For a tuple of the same numbers it holds that $(n,n) \notin R$. Therefore the relation is not reflexive. Whether two numbers are relatively prime or not does not depend on how they are ordered; it is a property of unordered tuples. Therefore, $R$ is symmetric. From the symmetry we have that it is not antisymmetric (for instance, $(3,5) \in R$, $3 \ne 5$). Since $R$ is symmetric and $(n,n) \notin R$ for any number $n \in N$, a choice of two distinct numbers which are in the relation gives that $R$ is not transitive. □
1.G.90.   Determine the number of injective mappings of the set {1,2,3} to the set {1,2,3,4}.
Solution. Any injective mapping among the given sets is given by choosing an (ordered) triple from the set {1,2,3,4} (the elements in the chosen triple will correspond in order to images of the numbers 1,2,3) and vice versa. Every injective mapping gives such a triple. Thus the number of injective mappings equals the number of ordered triples among four elements, that is
$v(3,4) = 4\cdot 3\cdot 2 = 24$. □
1.G.91.   How many relations are there over an n-element set?
Solution. A relation is an arbitrary subset of the cartesian product of the set with itself. This cartesian product has $n^2$ elements, thus the number of all relations over an n-element set is $2^{n^2}$. □
1.G.92.   How many reflexive relations are there over an n-element set?
Solution. The relation over the set $M$ is reflexive if and only if it has the diagonal relation $\Delta_M = \{(a,a),\ a \in M\}$ as a subset. As for the rest of the $n^2 - n$ ordered pairs in the cartesian product $M \times M$, we have an independent choice whether or not the pair belongs to the relation. In total we have $2^{n^2-n}$ different reflexive relations over an n-element set. □
1.G.93.   How many symmetric relations are there over an n-element set?
Solution. A relation $R$ over the set $M$ is symmetric if and only if the intersection of $R$ with each set $\{(a,b), (b,a)\}$, where $a \ne b$, $a, b \in M$, is either the whole two-element set or is empty. There are $\binom{n}{2}$ two-element subsets of the set $M$. If we also declare what the intersection of $R$ and the diagonal relation $\Delta_M = \{(a,a),\ a \in M\}$ should be, then $R$ is completely determined. In total we are to make $\binom{n}{2} + n$ independent choices between two alternatives: each set of the type $\{(a,b), (b,a)\}$, where $a, b \in M$, $a \ne b$, is either a subset of $R$ or it is disjoint with $R$. Every pair $(a,a)$, $a \in M$, is either in $R$ or not. In total we have $2^{\binom{n}{2}+n}$ symmetric relations over an n-element set. □
1.G.94.   How many anti-symmetric relations over an n-element set are there?
Solution. A relation $R$ over the set $M$ is anti-symmetric if and only if the intersection of $R$ with each set $\{(a,b), (b,a)\}$, $a \ne b$, $a, b \in M$, is either empty or one-element (which means that it is either $\{(a,b)\}$ or $\{(b,a)\}$ but not both). The intersection of $R$ with the diagonal relation is arbitrary. By declaring what these intersections are, the relation $R$ is completely determined. In total we have $3^{\binom{n}{2}} 2^n$ anti-symmetric relations over an n-element set. □
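The four counting formulas of 1.G.91 to 1.G.94 can be compared against a direct enumeration for small $n$. The sketch below is only an illustration under that assumption (the enumeration has $2^{n^2}$ cases, so it is practical only up to $n = 3$ or so); the names `count_relations` and `formulas` are hypothetical.

```python
from math import comb

def count_relations(n):
    """Enumerate all relations on {0, ..., n-1} and count them by property."""
    elems = range(n)
    pairs = [(a, b) for a in elems for b in elems]
    counts = {"all": 0, "reflexive": 0, "symmetric": 0, "antisymmetric": 0}
    for mask in range(1 << len(pairs)):
        rel = {p for i, p in enumerate(pairs) if (mask >> i) & 1}
        counts["all"] += 1
        if all((a, a) in rel for a in elems):
            counts["reflexive"] += 1
        if all((b, a) in rel for (a, b) in rel):
            counts["symmetric"] += 1
        if all(a == b or (b, a) not in rel for (a, b) in rel):
            counts["antisymmetric"] += 1
    return counts

def formulas(n):
    """Closed formulas derived in the solutions above."""
    return {
        "all": 2 ** (n * n),
        "reflexive": 2 ** (n * n - n),
        "symmetric": 2 ** (comb(n, 2) + n),
        "antisymmetric": 3 ** comb(n, 2) * 2 ** n,
    }

for n in (1, 2, 3):
    print(n, count_relations(n) == formulas(n))
```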
1.G.95. Determine the number of ordering relations over the set {1,2,3,4,5} such that exactly two pairs of elements are incomparable. O
Solution to the exercises
l.B.3. yn = 2(f)" - 2. I.G.2.
l>    34   ' 34
ii) 2^*- The first result is obtained by expanding the fraction by 5 — 3i.
/.G.2/.
i) $2^6 = 64$
ii) $\binom{6}{2} = 15$
iii) No head is one possibility, $\binom{6}{0} = 1$; one head gives $\binom{6}{1} = 6$ possibilities. Thus there are 7 sequences with at most one head and the result is $64 - 7 = 57$.
1.G.31. The maximum number $y_n$ of areas a plane can be divided into by $n$ circles satisfies $y_n = y_{n-1} + 2(n-1)$, $y_1 = 2$, that is, $y_n = n^2 - n + 2$.
For the maximum number $p_n$ of areas a space can be divided into by $n$ balls we obtain the recurrent formula $p_{n+1} = p_n + y_n$, $p_1 = 2$, that is, $p_n = \frac{n}{3}(n^2 - 3n + 8)$.
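The two closed formulas can be checked against their recurrences numerically. This is a minimal sketch with illustrative function names; it simply iterates the recurrences stated above and compares the values with the closed forms.

```python
def y_closed(n):
    """Closed form for the number of plane regions cut out by n circles."""
    return n * n - n + 2

def p_closed(n):
    """Closed form for the number of space regions cut out by n balls."""
    value = n * (n * n - 3 * n + 8)
    assert value % 3 == 0  # the product is always divisible by 3
    return value // 3

def y_recurrent(n):
    y = 2                      # y_1 = 2
    for k in range(2, n + 1):
        y += 2 * (k - 1)       # y_k = y_{k-1} + 2(k-1)
    return y

def p_recurrent(n):
    p = 2                      # p_1 = 2
    for k in range(1, n):
        p += y_closed(k)       # p_{k+1} = p_k + y_k
    return p

print([(y_recurrent(n), p_recurrent(n)) for n in range(1, 6)])
```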
1.G.70. First, we orient the vertices of the given quadrangle in the counter-clockwise order: ABC'D. After computing the corresponding
determinants as in the previous exercises we see that only the side CB is visible.
1.G.72. 19.
1.G.73. 87.
1.G.79.
i) (3,3), (4,4), (5,5), (6,6), (7,7), (3,6); check that it is an ordering relation.
ii) again $(i,i)$ for $i = 3,\dots,7$ and additionally (3,6), (6,3); check that it is an equivalence relation.
iii) $(i,i)$ for $i = 3,\dots,7$ and also (3,6), (6,3), (4,6), (6,4). Check that it is not an equivalence, since transitivity does not hold.
1.G.95. Three different Hasse diagrams satisfy the given condition. In total $5! + 5! + 5!/4 = 270$.
CHAPTER 2
Elementary linear algebra
Can't you count with scalars yet?
- no worry, let us go straight to matrices...
In the previous chapter we warmed up by considering relatively simple problems which did not require any sophisticated tools. It was enough to use addition and multiplication of scalars. In this and subsequent chapters we shall add more sophisticated thoughts and tools.
First we restrict ourselves to concepts and operations consisting of a finite number of multiplications and additions to a finite number of scalars. This will take us three chapters and only then will we move on to infinitesimal concepts and tools. Typically we deal with finite collections of scalars of a given size. We speak about "linear objects" and "linear algebra". Although it might seem to be a very special tool, we shall see later that even more complicated objects are studied mostly using their "linear approximations".
In this chapter we will work with finite sequences of scalars. Such sequences arise in real-world problems whenever we deal with objects described by several parameters, which we shall call coordinates. Do not try too hard to imagine the space with more than three coordinates. You have to live with the fact that we are able to depict only one, two or three dimensions. However, we will deal with an arbitrary number of dimensions. For example, observing any parameter in a group of 500 students (for instance, their study results), our data will have 500 elements and we would like to work with them. Our goal is to develop tools which will work well even if the number of elements is large.
Do not be afraid of terms like field or ring of scalars $\mathbb{K}$. Simply imagine any specific domain of numbers. Rings of scalars are for instance the integers $\mathbb{Z}$ and all residue classes $\mathbb{Z}_k$. Among fields we have seen only $\mathbb{R}$, $\mathbb{Q}$, $\mathbb{C}$ and the residue classes $\mathbb{Z}_k$ for $k$ prime. $\mathbb{Z}_2$ is very specific among them, because the equation $x = -x$ does not imply $x = 0$ here, whereas in each of the other fields just listed it does.
A. Systems of linear equations and matrix manipulation
We approach vector spaces in a clever way. We begin with something we know - systems of linear equations and find that the vector spaces are hidden behind them.
1. Vectors and matrices
In the first two parts of this chapter, we will work with vectors and matrices in the simple context of finite sequences of scalars. We can imagine working with integers or residue classes as well as real or complex numbers. We hope to illustrate how easily a concise and formal reasoning can lead to strong results valid in a much broader context than just for real numbers.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
2.A.1. A colourful example. A company of painters orders 810 litres of paint, to contain 270 litres each of red, green and blue coloured paint. The provider can satisfy this order by mixing the colours he usually sells (he has enough in his warehouse). He has
• reddish colour - it contains 50 % of red, 25 % of green and 25 % of blue colour;
• greenish colour - it contains 12,5 % of red, 75 % of green and 12,5 % of blue colour;
• bluish colour - it contains 20 % of red, 20 % of green and 60 % of blue colour.
How many litres of each of the colours at the warehouse have to be mixed in order to satisfy the order?
Solution. Denote by
• x - the number of litres of reddish colour to be used;
• y - the number of litres of greenish colour to be used;
• z - the number of litres of bluish colour to be used.
By mixing the colours we want a colour that contains 270 litres of red. Note that reddish contains 50 % red, greenish contains 12,5 % red and bluish 20 % red. Thus the following has to be satisfied:
\[
0{,}5x + 0{,}125y + 0{,}2z = 270.
\]
Similarly, we require (for green and blue colours respectively) that
\[
\begin{aligned}
0{,}25x + 0{,}75y + 0{,}2z &= 270,\\
0{,}25x + 0{,}125y + 0{,}6z &= 270.
\end{aligned}
\]
From the first equation $x = 540 - 0{,}25y - 0{,}4z$. Substitute for $x$ into the second and third equations to obtain two linear equations in two variables, $2{,}75y + 0{,}4z = 540$ and $0{,}25y + 2z = 540$. From the second of these we express $z = 270 - 0{,}125y$, and substituting into the first one we obtain $2{,}7y = 432$, that is, $y = 160$. Therefore $z = 270 - 0{,}125\cdot 160 = 250$ and hence $x = 540 - 0{,}25\cdot 160 - 0{,}4\cdot 250 = 400$.
An alternative approach is to deduce consequences from the given equations by a sequence of adding them or multiplying them by non-zero scalars. This is easily handled in the matrix notation (which we met when solving equations with two variables in the previous chapter already). The first row of the matrix consists of the coefficients of the variables in the first equation, the second of the coefficients in the second equation and the third of the coefficients in the third.
Later, we follow the general terminology where the notion of vectors is related to fields of scalars only.
2.1.1. Vectors over scalars. For now, a vector is for us an ordered n-tuple of scalars from $\mathbb{K}$, where the fixed $n \in \mathbb{N}$ is called dimension.
We can add and multiply scalars. We will be able to add vectors, but multiplying a vector will be possible only by a scalar. This corresponds to the idea we have already seen in the plane R2. There, addition is realized as vector composition (as composition of arrows having their direction and size and compared when emanating from the origin). Multiplication by scalar is realized as stretching the vectors.
A vector $u = (a_1,\dots,a_n)$ is multiplied by a scalar $c$ by multiplying every element of the n-tuple $u$ by $c$. Addition is defined coordinate-wise.
Basic vector operations
\[
\begin{aligned}
u + v &= (a_1,\dots,a_n) + (b_1,\dots,b_n) = (a_1 + b_1,\dots,a_n + b_n),\\
c\cdot u &= c\cdot(a_1,\dots,a_n) = (c\cdot a_1,\dots,c\cdot a_n),\\
cu &= c(a_1,\dots,a_n) = (ca_1,\dots,ca_n).
\end{aligned}
\]
For vector addition and multiplication by scalars we shall use the same symbols as for scalars, that is, respectively, plus and either dot or juxtaposition.
The vector notation convention. We shall not, unlike many other textbooks, use any special notations for vectors and leave it to the reader to pay attention to the context. For scalars, we shall mostly use letters from the beginning of the alphabet, for the vector from the end of the alphabet. The middle part of the alphabet can be used for indices of variables or components and also for summation indices.
In the general theory in the end of this chapter and later, we will work exclusively with fields of scalars when talking about vectors. Now we will work with the more relaxed properties of scalars as listed in 1.1.1.
For vector addition in $\mathbb{K}^n$, the properties (CG1)-(CG4) (see 1.1.1) clearly hold with the zero element being (notice we define the addition coordinate-wise) $0 = (0,\dots,0) \in \mathbb{K}^n$. We are purposely using the same symbol for both the zero vector element and the zero scalar element. Next, let us notice the following basic properties of vectors:
Vector properties
For all vectors $v, w \in \mathbb{K}^n$ and scalars $a, b \in \mathbb{K}$ we have
(V1) $a\cdot(v + w) = a\cdot v + a\cdot w$
(V2) $(a + b)\cdot v = a\cdot v + b\cdot v$
(V3) $a\cdot(b\cdot v) = (a\cdot b)\cdot v$
(V4) $1\cdot v = v$
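The coordinate-wise operations and the properties (V1)-(V4) can be tried out directly. The following is a minimal sketch, assuming vectors are represented as Python tuples and scalars as exact rational numbers; the helper names `add` and `scale` are illustrative.

```python
from fractions import Fraction

def add(u, v):
    """Coordinate-wise sum of two vectors of the same dimension."""
    assert len(u) == len(v)
    return tuple(a + b for a, b in zip(u, v))

def scale(c, u):
    """Multiply every coordinate of u by the scalar c."""
    return tuple(c * a for a in u)

# Sample data over the rational numbers.
v = (Fraction(1, 2), Fraction(-3), Fraction(5))
w = (Fraction(2), Fraction(1, 3), Fraction(0))
a, b = Fraction(3, 4), Fraction(-2)

v1 = scale(a, add(v, w)) == add(scale(a, v), scale(a, w))   # (V1)
v2 = scale(a + b, v) == add(scale(a, v), scale(b, v))       # (V2)
v3 = scale(a, scale(b, v)) == scale(a * b, v)               # (V3)
v4 = scale(Fraction(1), v) == v                             # (V4)
print(v1, v2, v3, v4)
```

A check on sample values is of course not a proof; it only illustrates that each property reduces to the corresponding property of scalars in every coordinate.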
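The matrix of the system is
\[
\begin{pmatrix} 0{,}5 & 0{,}125 & 0{,}2 \\ 0{,}25 & 0{,}75 & 0{,}2 \\ 0{,}25 & 0{,}125 & 0{,}6 \end{pmatrix}.
\]
The extended matrix of the system is obtained from the matrix of the system by inserting the column of the right-hand sides of the individual equations in the system:
\[
\left(\begin{array}{ccc|c} 0{,}5 & 0{,}125 & 0{,}2 & 270 \\ 0{,}25 & 0{,}75 & 0{,}2 & 270 \\ 0{,}25 & 0{,}125 & 0{,}6 & 270 \end{array}\right).
\]
By doing elementary row transformations sequentially (they all correspond to adding rows and multiplication by scalars with the equations, see 2.1.7) we can eliminate the variables in the equations, one by one:
\[
\begin{aligned}
\left(\begin{array}{ccc|c} 0{,}5 & 0{,}125 & 0{,}2 & 270 \\ 0{,}25 & 0{,}75 & 0{,}2 & 270 \\ 0{,}25 & 0{,}125 & 0{,}6 & 270 \end{array}\right)
&\sim \left(\begin{array}{ccc|c} 1 & 0{,}25 & 0{,}4 & 540 \\ 1 & 3 & 0{,}8 & 1080 \\ 1 & 0{,}5 & 2{,}4 & 1080 \end{array}\right)
\sim \left(\begin{array}{ccc|c} 1 & 0{,}25 & 0{,}4 & 540 \\ 0 & 2{,}75 & 0{,}4 & 540 \\ 0 & 0{,}25 & 2 & 540 \end{array}\right) \\
&\sim \left(\begin{array}{ccc|c} 1 & 0{,}25 & 0{,}4 & 540 \\ 0 & 11 & 1{,}6 & 2160 \\ 0 & 1 & 8 & 2160 \end{array}\right)
\sim \left(\begin{array}{ccc|c} 1 & 0{,}25 & 0{,}4 & 540 \\ 0 & 1 & 8 & 2160 \\ 0 & 11 & 1{,}6 & 2160 \end{array}\right)
\sim \left(\begin{array}{ccc|c} 1 & 0{,}25 & 0{,}4 & 540 \\ 0 & 1 & 8 & 2160 \\ 0 & 0 & -86{,}4 & -21600 \end{array}\right).
\end{aligned}
\]
By back substitution, we compute successively
\[
\begin{aligned}
z &= \frac{-21600}{-86{,}4} = 250,\\
y &= 2160 - 8\cdot 250 = 160,\\
x &= 540 - 0{,}4\cdot 250 - 0{,}25\cdot 160 = 400.
\end{aligned}
\]
Thus it is necessary to mix 400 litres of reddish, 160 litres of greenish and 250 litres of bluish colour. □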
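The result can be verified by mixing the computed amounts back together. The sketch below reads $y$ as the amount of the greenish colour and $z$ as the amount of the bluish colour, as the coefficients of the equations indicate; the dictionary `composition` and the function `mix` are illustrative names, and exact fractions stand in for the percentages.

```python
from fractions import Fraction as F

# Share of (red, green, blue) pigment in each colour sold by the provider.
composition = {
    "reddish":  (F(1, 2), F(1, 4), F(1, 4)),
    "greenish": (F(1, 8), F(3, 4), F(1, 8)),
    "bluish":   (F(1, 5), F(1, 5), F(3, 5)),
}

def mix(litres):
    """Return the litres of (red, green, blue) contained in the mixture."""
    totals = [F(0), F(0), F(0)]
    for colour, amount in litres.items():
        for i, share in enumerate(composition[colour]):
            totals[i] += share * amount
    return tuple(totals)

order = {"reddish": 400, "greenish": 160, "bluish": 250}
print(mix(order), sum(order.values()))
```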
2.A.2.   Solve the system of simultaneous linear equations
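\[
\begin{aligned}
x_1 + 2x_2 + 3x_3 &= 2,\\
2x_1 - 3x_2 - x_3 &= -3,\\
-3x_1 + x_2 + 2x_3 &= -3.
\end{aligned}
\]
Solution. We write the system of equations in the form of the extended matrix of the system
\[
\left(\begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 2 & -3 & -1 & -3 \\ -3 & 1 & 2 & -3 \end{array}\right).
\]
Every row of the matrix corresponds to one equation. As in the previous example, equivalent transformations of the equations correspond to the elementary row operations on the matrix and we use them to transform it into the row echelon form
\[
\left(\begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 2 & -3 & -1 & -3 \\ -3 & 1 & 2 & -3 \end{array}\right)
\sim \left(\begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 0 & -7 & -7 & -7 \\ 0 & 7 & 11 & 3 \end{array}\right) \sim \cdots
\]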
The properties (V1)-(V4) of our vectors are easily checked for any specific ring of scalars $\mathbb{K}$, since we need just the corresponding properties of scalars as listed in 1.1.1 and 1.1.5, applied to individual components of the vectors. In this way we shall work with, for instance, $\mathbb{R}^n$, $\mathbb{Q}^n$, $\mathbb{C}^n$, but also with $\mathbb{Z}^n$, $(\mathbb{Z}_k)^n$, $n = 1, 2, 3, \dots$.
2.1.2. Matrices over scalars. Matrices are slightly more complicated objects, useful when working with vectors.
Matrices of type m/n
A matrix of the type $m/n$ over scalars $\mathbb{K}$ is a rectangular schema $A$ with $m$ rows and $n$ columns
\[
A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{pmatrix},
\]
where $a_{ij} \in \mathbb{K}$ for all $1 \le i \le m$, $1 \le j \le n$. For a matrix $A$ with elements $a_{ij}$ we also use the notation $A = (a_{ij})$.
The vector $(a_{i1}, a_{i2}, \dots, a_{in}) \in \mathbb{K}^n$ is called the (i-th) row of the matrix $A$, $i = 1,\dots,m$. The vector $(a_{1j}, a_{2j}, \dots, a_{mj}) \in \mathbb{K}^m$ is called the (j-th) column of the matrix $A$, $j = 1,\dots,n$.
Matrices of the type $1/n$ or $n/1$ are actually just vectors in $\mathbb{K}^n$.
All general matrices can be understood as vectors in $\mathbb{K}^{mn}$, we just consider all the columns.
In particular, matrix addition and matrix multiplication by scalars is defined:
\[
A + B = (a_{ij} + b_{ij}), \qquad a\cdot A = (a\cdot a_{ij}),
\]
where $A = (a_{ij})$, $B = (b_{ij})$, $a \in \mathbb{K}$.
The matrix $-A = (-a_{ij})$ is called the additive inverse to the matrix $A$ and the matrix
\[
0 = \begin{pmatrix} 0 & \dots & 0 \\ \vdots & & \vdots \\ 0 & \dots & 0 \end{pmatrix}
\]
is called the zero matrix. By considering matrices as $mn$-dimensional vectors, we obtain the following:
Proposition. The formulas for $A + B$, $a\cdot A$, $-A$, $0$ define the operations of addition and multiplication by scalars for the set of all matrices of the type $m/n$, which satisfy properties (V1)-(V4).
2.1.3. Matrices and equations. Many mathematical models are based on systems of linear equations. Matrices are useful for the description of such systems. In order to see this, let us introduce the notion of scalar product of two vectors, assigning to the vectors $(a_1,\dots,a_n)$ and $(x_1,\dots,x_n)$ their product
\[
(a_1,\dots,a_n)\cdot(x_1,\dots,x_n) = a_1x_1 + \dots + a_nx_n.
\]
This means that we multiply the corresponding coordinates of the vectors and sum the results.
\[
\cdots \sim \left(\begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 4 & -4 \end{array}\right)
\sim \left(\begin{array}{ccc|c} 1 & 2 & 3 & 2 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{array}\right).
\]
First we subtracted twice the first row from the second row, and to the third row we added three times the first row. Then we added the second row to the third row and multiplied the second row by $-1/7$. Finally we multiplied the third row by $1/4$. Now we restore the system of equations
\[
\begin{aligned}
x_1 + 2x_2 + 3x_3 &= 2,\\
x_2 + x_3 &= 1,\\
x_3 &= -1.
\end{aligned}
\]
We see immediately that $x_3 = -1$. If we substitute $x_3 = -1$ into the equation $x_2 + x_3 = 1$, we obtain $x_2 = 2$. Then by substituting $x_3 = -1$, $x_2 = 2$ into the first equation, we obtain $x_1 = 1$. □
Systems of linear equations can be written in matrix notation. But is it an advantage, when we can solve the systems even without speaking about matrices? Yes it is: we can handle the equations more conceptually, we can easily decide how many solutions a system has, and the notation is much more efficient in computer assisted computations. Thus we shall get familiar with various operations which can be done with matrices. As we have seen in previous examples, equivalent operations with linear equations correspond to elementary row (column) transformations. Further we have seen that transforming a matrix into a row echelon form (a process called Gaussian elimination, see 2.1.7) solves the system very easily. We demonstrate this on some examples, where we will see that a system can have infinitely many solutions or no solution at all.
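The elimination procedure described above is easy to mechanise. The following sketch is one possible variant (Gauss-Jordan elimination with exact rational arithmetic), not necessarily the exact sequence of row operations used in the text; the function name `gauss_jordan` is hypothetical. Applied to the extended matrix of 2.A.2 it produces a matrix whose last column holds the solution.

```python
from fractions import Fraction

def gauss_jordan(extended):
    """Reduce an extended matrix (list of rows) using elementary row operations."""
    m = [[Fraction(x) for x in row] for row in extended]
    rows, cols = len(m), len(m[0])
    pivot_row = 0
    for col in range(cols - 1):          # the last column is the right-hand side
        # find a row at or below pivot_row with a non-zero entry in this column
        pivot = next((r for r in range(pivot_row, rows) if m[r][col] != 0), None)
        if pivot is None:
            continue                      # no pivot here: a free variable
        m[pivot_row], m[pivot] = m[pivot], m[pivot_row]
        p = m[pivot_row][col]
        m[pivot_row] = [x / p for x in m[pivot_row]]   # make the pivot equal to 1
        for r in range(rows):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[pivot_row])]
        pivot_row += 1
        if pivot_row == rows:
            break
    return m

system = [[1, 2, 3, 2], [2, -3, -1, -3], [-3, 1, 2, -3]]
reduced = gauss_jordan(system)
print([row[-1] for row in reduced])
```

Fractions are used instead of floating point numbers so that the zero tests in the pivot search are exact.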
2.A.3.   Solve a system of linear equations
\[
\begin{aligned}
2x_1 - x_2 + 3x_3 &= 0,\\
3x_1 + 16x_2 + 7x_3 &= 0,\\
3x_1 - 5x_2 + 4x_3 &= 0,\\
-7x_1 + 7x_2 - 10x_3 &= 0.
\end{aligned}
\]
Solution. Because the right-hand side of all equations is zero (such a case is called a homogeneous system) we work with the matrix of the system only. We find the solution by transforming the matrix into the row echelon form using elementary row transformations. These correspond to changing the order of equations, multiplying an equation by a non-zero number and addition of multiples of equations. Furthermore, we can always go back and forth between the matrix notation and the original system notation with variables $x_i$. We obtain:
\[
\begin{pmatrix} 2 & -1 & 3 \\ 3 & 16 & 7 \\ 3 & -5 & 4 \\ -7 & 7 & -10 \end{pmatrix}
\sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & -7/2 & -1/2 \\ 0 & 7/2 & 1/2 \end{pmatrix}.
\]
Every system of $m$ linear equations in $n$ variables
\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= b_1\\
a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n &= b_2\\
&\ \,\vdots\\
a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n &= b_m
\end{aligned}
\]
can be seen as a constraint on values of $m$ scalar products with one unknown vector $(x_1,\dots,x_n)$ (called the vector of variables, or vector variable) and the known vectors of coordinates $(a_{i1},\dots,a_{in})$.
The vector of variables can be also seen as a column in a matrix of the type $n/1$, and similarly the values $b_1,\dots,b_m$ can be seen as a vector $u$, and that is again a single column, a matrix of the type $m/1$. Our system of equations can then be formally written as $A\cdot x = u$ as follows:
\[
\begin{pmatrix} a_{11} & \dots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix},
\]
where the left-hand side is interpreted as $m$ scalar products of the individual rows of the matrix (giving rise to a column vector) with the vector variable $x$, whose values are prescribed by the equations. That means that the identity of the i-th coordinates corresponds to the original i-th equation
\[
a_{i1}x_1 + \dots + a_{in}x_n = b_i
\]
and the notation $A\cdot x = u$ gives the original system of equations.
2.1.4. Matrix product. In the plane, that is, for vectors of dimension two, we developed a matrix calculus and noticed that it is effective to work with (see 1.5.4). Now we generalize such a calculus and we develop all the tools we know already from the plane case to deal with higher dimensions $n$.
It is possible to define matrix multiplication only when the dimensions of the rows and columns allow it, that is, when the scalar product is defined for them as before:
Matrix product
For any matrix $A = (a_{ij})$ of the type $m/n$ and any matrix $B = (b_{jk})$ of the type $n/q$ over the ring of scalars $\mathbb{K}$ we define their product $C = A\cdot B = (c_{ik})$ as a matrix of the type $m/q$ with the elements
\[
c_{ik} = \sum_{j=1}^{n} a_{ij}b_{jk}, \quad \text{for arbitrary } 1 \le i \le m,\ 1 \le k \le q.
\]
That is, the element $c_{ik}$ of the product is exactly the scalar product of the i-th row of the matrix on the left and of the k-th column of the matrix on the right. For instance we have
\[
\begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 & 1 \\ -1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 2 & 3 \\ 3 & 1 & 0 \end{pmatrix}.
\]
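The definition translates directly into code. Here is a minimal sketch, assuming matrices are stored as lists of rows; the function name `matmul` is illustrative. It reproduces the small example above.

```python
def matmul(A, B):
    """Product of an m/n matrix A and an n/q matrix B, both given as lists of rows."""
    n = len(B)
    assert all(len(row) == n for row in A), "row length of A must equal the number of rows of B"
    q = len(B[0])
    # c_ik is the scalar product of the i-th row of A and the k-th column of B
    return [
        [sum(A[i][j] * B[j][k] for j in range(n)) for k in range(q)]
        for i in range(len(A))
    ]

A = [[2, 1], [1, -1]]
B = [[2, 1, 1], [-1, 0, 1]]
print(matmul(A, B))
```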
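From there we see that the second, third and fourth equations are multiples of the equation $7x_2 + x_3 = 0$. We continue:
\[
\begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & -7/2 & -1/2 \\ 0 & 7/2 & 1/2 \end{pmatrix}
\sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 35/2 & 5/2 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \sim \cdots
\]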
2.1.5. Square matrices. If there is the same number of rows and columns in the matrix, we speak of a square matrix. The number of rows or columns is then called the dimension of the matrix. The matrix
\[
E = (\delta_{ij}) = \begin{pmatrix} 1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & 1 \end{pmatrix}
\]
\[
\cdots \sim \begin{pmatrix} 2 & -1 & 3 \\ 0 & 7 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.
\]
Considered as equations, the last two are redundant, and we are left with just
\[
\begin{aligned}
2x_1 - x_2 + 3x_3 &= 0,\\
7x_2 + x_3 &= 0.
\end{aligned}
\]
We substitute for the variable $x_3$ a parameter $t \in \mathbb{R}$ and express
\[
x_2 = -\frac17 x_3 = -\frac17 t, \qquad x_1 = \frac12 (x_2 - 3x_3) = -\frac{11}{7} t.
\]
If we now substitute $t = -7s$, we obtain the result in a simple form
\[
(x_1, x_2, x_3) = (11s, s, -7s), \quad s \in \mathbb{R}.
\]
The whole system has infinitely many solutions. □
2.A.4.   Find all solutions of the system of linear equations
\[
\begin{aligned}
3x_1 + 3x_3 - 5x_4 &= -8,\\
x_1 - x_2 + x_3 - x_4 &= -2,\\
-2x_1 - x_2 + 4x_3 - 2x_4 &= 0,\\
2x_1 + x_2 - x_3 - x_4 &= -3.
\end{aligned}
\]
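Solution. The corresponding extended matrix of the system is
\[
\left(\begin{array}{cccc|c} 3 & 0 & 3 & -5 & -8 \\ 1 & -1 & 1 & -1 & -2 \\ -2 & -1 & 4 & -2 & 0 \\ 2 & 1 & -1 & -1 & -3 \end{array}\right).
\]
By changing the order of rows (equations) we obtain
\[
\left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 2 & 1 & -1 & -1 & -3 \\ -2 & -1 & 4 & -2 & 0 \\ 3 & 0 & 3 & -5 & -8 \end{array}\right),
\]
which we transform into the row echelon form: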
is called the unit matrix, or alternatively, the identity matrix. The numbers $\delta_{ij}$ defined in such a way are also called the Kronecker delta. When we restrict ourselves to square matrices over $\mathbb{K}$ of fixed dimension $n$, the matrix product is defined for any two matrices. That is, there is a well defined multiplication operation there. Its properties are similar to those of scalars:
Proposition. On the set of all square matrices of dimension $n$ over an arbitrary ring of scalars $\mathbb{K}$, the multiplication operation is defined with the following properties of rings (see 1.1.5):
(O1) Multiplication is associative.
(O3) The unit matrix $E = (\delta_{ij})$ is the unit element for multiplication.
(O4) Multiplication and addition are distributive.
In general, neither the property (O2) nor (OI) is true. Therefore, the square matrices for $n > 1$ do not form an integral domain, and consequently they cannot be a (commutative or non-commutative) field.
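A concrete pair of matrices illustrates why commutativity and the integral domain property fail already for $n = 2$. The matrices `A` and `B` below are a standard illustrative choice (they do not come from the text), and `matmul` is a small helper implementing the product of square matrices given as lists of rows.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)] for i in range(n)]

A = [[0, 1], [0, 0]]
B = [[1, 0], [0, 0]]
zero = [[0, 0], [0, 0]]

AB = matmul(A, B)   # the zero matrix, although neither A nor B is zero
BA = matmul(B, A)   # equals A, so AB differs from BA
print(AB, BA)
```

Since $A \cdot B = 0$ with both factors non-zero, there are zero divisors, and since $A \cdot B \ne B \cdot A$, multiplication is not commutative.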
Proof. Associativity of multiplication - (O1): Since scalars are associative, distributive and commutative, we can compute for any three matrices $A = (a_{ij})$ of type $m/n$, $B = (b_{jk})$ of type $n/p$ and $C = (c_{kl})$ of type $p/q$:
\[
\begin{aligned}
A\cdot B &= \Big(\sum_j a_{ij}b_{jk}\Big), \qquad B\cdot C = \Big(\sum_k b_{jk}c_{kl}\Big),\\
(A\cdot B)\cdot C &= \Big(\sum_k \Big(\sum_j a_{ij}b_{jk}\Big) c_{kl}\Big) = \Big(\sum_{j,k} a_{ij}b_{jk}c_{kl}\Big),\\
A\cdot(B\cdot C) &= \Big(\sum_j a_{ij}\Big(\sum_k b_{jk}c_{kl}\Big)\Big) = \Big(\sum_{j,k} a_{ij}b_{jk}c_{kl}\Big).
\end{aligned}
\]
Note that while computing, we relied on the fact that it does not matter in which order we perform the sums and products, that is, we were relying on the properties of scalars.
We can easily see that multiplication by a unit matrix has the property of a unit element:
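\[
\begin{aligned}
\left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 2 & 1 & -1 & -1 & -3 \\ -2 & -1 & 4 & -2 & 0 \\ 3 & 0 & 3 & -5 & -8 \end{array}\right)
&\sim \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & -3 & 6 & -4 & -4 \\ 0 & 3 & 0 & -2 & -2 \end{array}\right) \\
&\sim \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 3 & -3 & -3 \end{array}\right)
\sim \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right).
\end{aligned}
\]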
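\[
A\cdot E = \begin{pmatrix} a_{11} & \dots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix} = A
\]
and similarly from the left,
\[
E\cdot A = A.
\]
It remains to prove the distributivity of multiplication and addition. Again using the distributivity of scalars we can
The system has thus infinitely many solutions, because we have three equations in four variables. These three equations have exactly one solution for any choice for the variable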
$x_4 \in \mathbb{R}$. Thus for $x_4$ we substitute the parameter $t \in \mathbb{R}$ and go back from the matrix notation to the system of equations
\[ \begin{aligned} x_1 - x_2 + x_3 - t &= -2, \\ 3x_2 - 3x_3 + t &= 1, \\ 3x_3 - 3t &= -3. \end{aligned} \]
From the last equation we have $x_3 = t - 1$. Substituting for $x_3$ into the second equation gives
\[ 3x_2 - 3t + 3 + t = 1, \quad \text{that is,} \quad x_2 = \tfrac{1}{3}(2t - 2). \]
Finally, using the first equation, we have
\[ x_1 - \tfrac{1}{3}(2t - 2) + t - 1 - t = -2, \quad \text{that is,} \quad x_1 = \tfrac{1}{3}(2t - 5). \]
The set of solutions can be written (for $t = 3s$) in the form
\[ \bigl\{ (x_1, x_2, x_3, x_4) = \bigl(2s - \tfrac{5}{3},\ 2s - \tfrac{2}{3},\ 3s - 1,\ 3s\bigr),\ s \in \mathbb{R} \bigr\}. \]
We return to the extended matrix of the system and transform it further by row transformations so that (still in row echelon form) the first non-zero number of every row (the so-called pivot) equals one and all the other numbers in the column of the pivot are zero. We have
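\[ \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right) \sim \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 1 & -1 & 1/3 & 1/3 \\ 0 & 0 & 1 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right) \sim \]
\[ \sim \left(\begin{array}{cccc|c} 1 & -1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -2/3 & -2/3 \\ 0 & 0 & 1 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right) \sim \left(\begin{array}{cccc|c} 1 & 0 & 0 & -2/3 & -5/3 \\ 0 & 1 & 0 & -2/3 & -2/3 \\ 0 & 0 & 1 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \end{array}\right), \]
because first we have multiplied the second and the third row by $1/3$, then we have added the third row to the second and its $(-1)$-multiple to the first. Finally we have added the second row to the first. From the last matrix we easily obtain the result
\[ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} -5/3 \\ -2/3 \\ -1 \\ 0 \end{pmatrix} + t \begin{pmatrix} 2/3 \\ 2/3 \\ 1 \\ 1 \end{pmatrix}, \qquad t \in \mathbb{R}. \]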
Free variables are those whose columns do not contain any pivot (in our case there is no pivot in the fourth column, that is, the fourth variable is free and we use it as a parameter). □
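easily calculate for matrices $A = (a_{ij})$ of the type $m/n$, $B = (b_{jk})$ of the type $n/p$, $C = (c_{jk})$ of the type $n/p$, $D = (d_{kl})$ of the type $p/q$:
\[ A \cdot (B + C) = \Bigl(\sum_j a_{ij} (b_{jk} + c_{jk})\Bigr) = \Bigl(\sum_j a_{ij} b_{jk}\Bigr) + \Bigl(\sum_j a_{ij} c_{jk}\Bigr) = A \cdot B + A \cdot C, \]
\[ (B + C) \cdot D = \Bigl(\sum_k (b_{jk} + c_{jk}) d_{kl}\Bigr) = \Bigl(\sum_k b_{jk} d_{kl}\Bigr) + \Bigl(\sum_k c_{jk} d_{kl}\Bigr) = B \cdot D + C \cdot D. \]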
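As we have seen in 1.5.4, two matrices of dimension two do not necessarily commute: for example
\[ \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}. \]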
This gives us immediately a counterexample to the validity of (O2) and (OI). For matrices of type 1/1 both axioms clearly hold, because the scalars themselves have them. For matrices of greater dimension the counterexamples can be obtained similarly: simply place the counterexamples for dimension 2 in the upper left corner, and set all remaining entries to zero. (Verify this on your own!) □
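The counterexample can also be checked mechanically. The following is a minimal sketch in Python; the helper name `matmul` is an assumption introduced here for illustration, and it only implements the defining formula of the matrix product for matrices stored as lists of rows.

```python
def matmul(A, B):
    """Matrix product for lists of rows: entry (i, k) is sum_j A[i][j] * B[j][k]."""
    assert len(A[0]) == len(B), "columns of A must match rows of B"
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

X = [[1, 0], [0, 0]]
Y = [[0, 1], [0, 0]]

XY = matmul(X, Y)  # equals Y, a non-zero matrix
YX = matmul(Y, X)  # the zero matrix, although neither factor is zero
print(XY, YX)
```

The two products differ, so commutativity fails, and the second product shows two non-zero matrices whose product is zero.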
In the proof we have actually worked with matrices of more general types, thus we have proved the properties in greater generality:
Associativity and distributivity
Matrix multiplication is associative and distributive, that is,
\[ A \cdot (B \cdot C) = (A \cdot B) \cdot C, \]
\[ A \cdot (B + C) = A \cdot B + A \cdot C, \]
whenever all the given operations are defined. The unit matrix is a unit element for multiplication (both from the right and from the left).
2.1.6. Inverse matrices. With scalars we can do the following: from the equation $a \cdot x = b$ with a fixed invertible $a$ we can express $x = a^{-1} \cdot b$ for any $b$. We would like to be able to do this for matrices too. So we need to solve the following problem: how can we tell that such a matrix exists, and if so, how do we compute it? We say that $B$ is the inverse of $A$ if
\[ A \cdot B = B \cdot A = E. \]
Then we write $B = A^{-1}$. From the definition it is clear that both matrices must be square and of the same dimension $n$. A matrix which has an inverse is called an invertible matrix or a regular square matrix.
2.A.5. Determine the solutions of the system of equations
\[ \begin{aligned} 3x_1 \phantom{{}- x_2} + 3x_3 - 5x_4 &= 8, \\ x_1 - x_2 + x_3 - x_4 &= -2, \\ -2x_1 - x_2 + 4x_3 - 2x_4 &= 0, \\ 2x_1 + x_2 - x_3 - x_4 &= -3. \end{aligned} \]
Solution. Note that the system of equations in this exercise differs from the system of equations in the previous exercise only in the value $8$ (instead of $-8$) on the right-hand side. If we do the same row transformations as in the previous exercise, we obtain
\[ \left(\begin{array}{cccc|c} 3 & 0 & 3 & -5 & 8 \\ 1 & -1 & 1 & -1 & -2 \\ -2 & -1 & 4 & -2 & 0 \\ 2 & 1 & -1 & -1 & -3 \end{array}\right) \sim \cdots \sim \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 3 & -3 & 13 \end{array}\right) \sim \left(\begin{array}{cccc|c} 1 & -1 & 1 & -1 & -2 \\ 0 & 3 & -3 & 1 & 1 \\ 0 & 0 & 3 & -3 & -3 \\ 0 & 0 & 0 & 0 & 16 \end{array}\right), \]
where the last operation was subtracting the third row from the fourth. From the fourth equation $0 = 16$ it follows that the system has no solutions. Let us emphasize that whenever we obtain an equation of the form $0 = a$ for some $a \neq 0$ (that is, a zero row on the left side and a non-zero number after the vertical bar) when doing the row transformations, the system has no solutions. □
You can find more exercises on systems of linear equations on page 127.
Now we are going to manipulate matrices to get more familiar with their properties.
2.A.6. Matrix multiplication. Note that, in order to be able to multiply two matrices, the necessary and sufficient condition is that the first matrix has the same number of columns as the number of rows of the second matrix. The number of rows of the resulting matrix is then given by the number of rows of the first matrix, the number of columns then equals the number of columns of the second matrix.
In the subsequent paragraphs we derive (among other things) that B is actually the inverse of A whenever just one of the above required equations holds. The other is then a consequence.
We easily check that if $A^{-1}$ and $B^{-1}$ exist, then the inverse of the product $A \cdot B$ also exists, and
\[ (1) \qquad (A \cdot B)^{-1} = B^{-1} \cdot A^{-1}. \]
Indeed, because of the associativity of matrix multiplication proved a while ago, we have
\[ (B^{-1} \cdot A^{-1}) \cdot (A \cdot B) = B^{-1} \cdot (A^{-1} \cdot A) \cdot B = E, \]
\[ (A \cdot B) \cdot (B^{-1} \cdot A^{-1}) = A \cdot (B \cdot B^{-1}) \cdot A^{-1} = E. \]
Because we can calculate with matrices similarly as with scalars (they are just a little more complicated), the existence of an inverse matrix can really help us with the solution of systems of linear equations: if we express a system of $n$ equations for $n$ unknowns as a matrix product
\[ A \cdot x = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix} = u \]
and when the inverse of the matrix $A$ exists, then we can multiply from the left by $A^{-1}$ to obtain
\[ A^{-1} \cdot u = A^{-1} \cdot A \cdot x = E \cdot x = x, \]
that is, $A^{-1} \cdot u$ is the desired solution.
On the other hand, expanding the condition $A \cdot A^{-1} = E$ for unknown scalars in the matrix $A^{-1}$ gives us $n$ systems of linear equations with the same matrix on the left and different vectors on the right. Thus we should think about methods for solving systems of linear equations.
2.1.7. Equivalent operations with matrices. Let us gain some practical insight into the relation between systems of equations and their matrices. Clearly, searching for the inverse can be more complicated than finding the direct solution to the system of equations. But note that whenever we have to solve several systems of equations with the same matrix $A$ but with different right sides $u$, then computing $A^{-1}$ can be really beneficial.
From the point of view of solving systems of equations $A \cdot x = u$, it is natural to consider the matrices $A$ and vectors $u$ equivalent whenever they give a system of equations with the same solution set. Let us think about possible operations which would simplify the matrix $A$ so that obtaining the solution is easier.
We begin with simple manipulations of rows of equations which do not influence the solution, and similar modifications of the right-hand side vector. If we are able to change a square matrix into the unit matrix, then the right-hand side vector is a solution of the original system. If some of the rows of the system vanish during the course of manipulations (that is,
ŕ	0	-5\	/
2	7	15	B=[
\2	7	13/	\
iv)
v) (1   3 -3)
vi)   1 2
Remark. Parts i) and ii) in the previous exercise show that multiplication of square matrices is not commutative in general. In part iii) we see that the product of the two given rectangular matrices is defined in only one of the two orders. In parts iv) and v) note that $(A \cdot B)^T = B^T \cdot A^T$.
2.A.7. Let
A ■
Can the matrix A be transformed into B using only elementary row transformations (we say then that such matrices are row equivalent)?
Solution. Both matrices are row equivalent with the three-dimensional identity matrix. It is easy to see that row equivalence on the set of all matrices of given type is indeed an equivalence relation. Thus the matrices A and B are row equivalent. □
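2.A.8. Find a matrix $B$ for which the matrix $C = B \cdot A$ is in row echelon form, where
\[ A = \begin{pmatrix} 3 & -1 & 3 & 2 \\ 5 & -3 & 2 & 3 \\ 1 & -3 & -5 & 0 \\ 7 & -5 & 1 & 4 \end{pmatrix}. \]
Solution. If we multiply the matrix $A$ successively from the left by the elementary matrices (consider which elementary row transformations they correspond to)
\[ E_1 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad E_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -5 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad E_3 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ -3 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \]
\[ E_4 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -7 & 0 & 0 & 1 \end{pmatrix}, \quad E_5 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1/3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad E_6 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \]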
they become zero), then we get some direct information about the solution. Our simple operations are:
Elementary row transformations
• interchanging two rows,
• multiplication of any given row by a non-zero scalar,
• adding another row to any given row.
These operations are called elementary row transformations. It is clear that the corresponding operations at the level of the equations in the system do not change the set of the solutions whenever our ring of coordinates is an integral domain.
Analogously, elementary column transformations of matrices are
• interchanging two columns,
• multiplication of any given column by a non-zero scalar,
• adding another column to any given column.
These do not preserve the solution set, since they change the variables themselves.
We can use elementary row transformations systematically for the successive elimination of variables. This gives an algorithm which is usually called the Gaussian elimination method. Henceforth, we shall assume that our scalars come from an integral domain (e.g. integers are allowed, but not, say, $\mathbb{Z}_4$).
Gaussian elimination of variables
Proposition. Any non-zero matrix over an arbitrary integral domain of scalars $\mathbb{K}$ can be transformed, using finitely many elementary row transformations, into row echelon form:
• For each $j$, if $a_{ik} = 0$ for all columns $k = 1, \dots, j$, then $a_{kj} = 0$ for all $k \ge i$,
• if $a_{(i-1)j}$ is the first non-zero element in the $(i-1)$-st row, then $a_{ij} = 0$.
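Proof. The matrix in row echelon form looks like
\[ \begin{pmatrix} 0 & \cdots & 0 & a_{1j} & \cdots & \cdots & \cdots & a_{1m} \\ 0 & \cdots & 0 & 0 & \cdots & a_{2k} & \cdots & a_{2m} \\ \vdots & & & & & & & \vdots \\ 0 & \cdots & \cdots & \cdots & \cdots & 0 & a_{lp} & \cdots \end{pmatrix}. \]
The matrix can (but does not have to) end with some zero rows. In order to transform an arbitrary matrix, we can use a simple algorithm, which will bring us, row by row, to the resulting echelon form: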
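\[ E_7 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & -4 & 0 & 1 \end{pmatrix}, \quad E_8 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \]
we obtain
\[ B = E_8 E_7 E_6 E_5 E_4 E_3 E_2 E_1 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1/12 & -5/12 & 0 \\ 1 & -2/3 & 1/3 & 0 \\ 0 & -4/3 & -1/3 & 1 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & -3 & -5 & 0 \\ 0 & 1 & 9/4 & 1/4 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \]
□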
Gaussian elimination algorithm
(1) By a possible interchange of rows we can obtain a matrix where the first row has a non-zero element in the first non-zero column. Let that column be column $j$. In other words, $a_{1j} \neq 0$, but $a_{iq} = 0$ for all $i$ and all $q$, $1 \le q < j$.
(2) For each $i = 2, \dots$, multiply the first row by the element $a_{ij}$, multiply the $i$-th row by the element $a_{1j}$ and subtract, to obtain $a_{ij} = 0$ in the $i$-th row.
(3) By repeated application of steps (1) and (2), always for the not-yet-echelon part of the rows and columns of the matrix, we reach the final form of the matrix after a finite number of steps.
This algorithm clearly stops after a finite number of steps and provides the proof of the proposition. □
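The three steps can be sketched in code. The following Python function is only an illustration under stated assumptions: the name `row_echelon` is hypothetical, the scalars are taken to be integers (an integral domain), and step (2) is carried out without division, exactly by multiplying and subtracting rows.

```python
def row_echelon(matrix):
    """Division-free Gaussian elimination over the integers (lists of rows)."""
    rows = [list(r) for r in matrix]
    m, n = len(rows), len(rows[0])
    top = 0  # index of the first row of the not-yet-echelon part
    for col in range(n):
        # Step (1): find a row with a non-zero entry in this column and move it up.
        pivot = next((r for r in range(top, m) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[top], rows[pivot] = rows[pivot], rows[top]
        p = rows[top][col]
        # Step (2): multiply row r by the pivot, the pivot row by rows[r][col], subtract.
        for r in range(top + 1, m):
            q = rows[r][col]
            if q != 0:
                rows[r] = [p * x - q * y for x, y in zip(rows[r], rows[top])]
        # Step (3): continue with the remaining rows and columns.
        top += 1
        if top == m:
            break
    return rows

# Extended matrix of a system with right-hand side (-2, -3, 0, -8).
system = [[1, -1, 1, -1, -2],
          [2, 1, -1, -1, -3],
          [-2, -1, 4, -2, 0],
          [3, 0, 3, -5, -8]]
echelon = row_echelon(system)
print(echelon)
```

Because no division is used, the rows may come out as integer multiples of the rows obtained by hand; the positions of the pivots and the number of non-zero rows are the same.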
The given algorithm is really the usual elimination of variables used in the systems of linear equations.
In a completely analogous manner we define the column echelon form of matrices, and by considering elementary column transformations instead of the row ones, we obtain an algorithm for transforming matrices into the column echelon form.
Remark. Although we could formulate the Gaussian elimination for general scalars from any ring, this does not make much sense in view of solving equations. Clearly, having divisors of zero among the scalars, we might get zeros during the procedure and lose information this way. Think carefully about the differences between the choices $\mathbb{K} = \mathbb{Z}$, $\mathbb{K} = \mathbb{R}$ and possibly $\mathbb{Z}_2$ or $\mathbb{Z}_4$.
2.A.9. Complex numbers as matrices. Consider the set of matrices
\[ C = \left\{ \begin{pmatrix} a & b \\ -b & a \end{pmatrix},\ a, b \in \mathbb{R} \right\}. \]
Note that $C$ is closed under addition and matrix multiplication, and further show that the mapping
\[ f : C \to \mathbb{C}, \qquad \begin{pmatrix} a & b \\ -b & a \end{pmatrix} \mapsto a + bi \]
satisfies $f(M + N) = f(M) + f(N)$ and $f(M \cdot N) = f(M) \cdot f(N)$ (on the left-hand sides of the equations we have addition and multiplication of matrices, on the right-hand sides we have addition and multiplication of complex numbers). Thus the set $C$ along with multiplication and addition can be seen as the field $\mathbb{C}$ of complex numbers. The mapping $f$ is called an isomorphism (of fields). Thus for instance we have
\[ \begin{pmatrix} 3 & 5 \\ -5 & 3 \end{pmatrix} \cdot \begin{pmatrix} 8 & -9 \\ 9 & 8 \end{pmatrix} = \begin{pmatrix} 69 & 13 \\ -13 & 69 \end{pmatrix}, \]
which corresponds to $(3 + 5i) \cdot (8 - 9i) = 69 + 13i$.
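The correspondence between matrices of the form $\begin{pmatrix} a & b \\ -b & a \end{pmatrix}$ and complex numbers $a + bi$ can be tested on concrete numbers. The sketch below uses hypothetical helper names (`as_matrix`, `matmul`) introduced only for this illustration; it multiplies the two matrices and, separately, the two complex numbers.

```python
def as_matrix(a, b):
    """Matrix (a b; -b a) assigned to the complex number a + bi."""
    return [[a, b], [-b, a]]

def matmul(A, B):
    """Matrix product for lists of rows."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

product = matmul(as_matrix(3, 5), as_matrix(8, -9))
z = complex(3, 5) * complex(8, -9)

print(product)  # the matrix assigned to the product of the two complex numbers
print(z)
```

Both computations give the real part 69 and the imaginary part 13, as the multiplicative property of the mapping requires.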
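On the other hand, if we are dealing with fields of scalars, we can always arrive at a row echelon form where the nonzero entries on the "diagonal" are ones. This is done by applying the appropriate scalar multiplication to each individual row. However, this is not possible in general; think for instance of the integers $\mathbb{Z}$.
2.1.8. Matrix of elementary row transformations. Let us now restrict ourselves to fields of scalars $\mathbb{K}$, that is, every nonzero scalar has an inverse.
Note that elementary row or column transformations correspond respectively to multiplication from the left or right by the following matrices (only the differences from the unit matrix are indicated):
(1) Interchanging the $i$-th and $j$-th row (column)
\[ \begin{pmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & 0 & \cdots & 1 & & \\ & & \vdots & \ddots & \vdots & & \\ & & 1 & \cdots & 0 & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{pmatrix}, \]
where the two off-diagonal ones are in the $i$-th row and in the $j$-th row.
2.A.10. Solve the equations for matrices
\[ \begin{pmatrix} 1 & 3 \\ 3 & 8 \end{pmatrix} \cdot X_1 = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad X_2 \cdot \begin{pmatrix} 1 & 3 \\ 3 & 8 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}. \]
Solution. Clearly the unknowns $X_1$ and $X_2$ must be matrices of the type $2 \times 2$ (in order for the products to be defined and for the result to be a matrix of the type $2 \times 2$). Set
\[ X_1 = \begin{pmatrix} a_1 & b_1 \\ c_1 & d_1 \end{pmatrix}, \qquad X_2 = \begin{pmatrix} a_2 & b_2 \\ c_2 & d_2 \end{pmatrix} \]
and multiply out the matrices in the first given equation. We obtain
\[ \begin{pmatrix} a_1 + 3c_1 & b_1 + 3d_1 \\ 3a_1 + 8c_1 & 3b_1 + 8d_1 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \]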
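that is,
\[ \begin{aligned} a_1 + 3c_1 &= 1, \\ b_1 + 3d_1 &= 2, \\ 3a_1 + 8c_1 &= 3, \\ 3b_1 + 8d_1 &= 4. \end{aligned} \]
By adding a $(-3)$-multiple of the first equation to the third equation we obtain $c_1 = 0$ and then $a_1 = 1$. Analogously, by adding a $(-3)$-multiple of the second equation to the fourth equation we obtain $d_1 = 2$ and then $b_1 = -4$. Thus we have
\[ X_1 = \begin{pmatrix} 1 & -4 \\ 0 & 2 \end{pmatrix}. \]
We can find the values $a_2, b_2, c_2, d_2$ by a different approach. If $A$ is a square matrix, we write $A^{-1}$ to denote its inverse, so that $A \cdot A^{-1} = A^{-1} \cdot A = E$ (the unit matrix). It is easy to check that
\[ \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}, \]
which holds for any numbers $a, b, c, d \in \mathbb{K}$ provided $ad - bc \neq 0$. (This is easy to derive; it also directly follows from formula 1 in 2.2.11.) We calculate
\[ \begin{pmatrix} 1 & 3 \\ 3 & 8 \end{pmatrix}^{-1} = \begin{pmatrix} -8 & 3 \\ 3 & -1 \end{pmatrix}. \]
Multiplying the second given equation by this matrix from the right gives
\[ X_2 = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \cdot \begin{pmatrix} -8 & 3 \\ 3 & -1 \end{pmatrix}, \]
and thus
\[ X_2 = \begin{pmatrix} -2 & 1 \\ -12 & 5 \end{pmatrix}. \]
□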
2A.11. Solve the matrix equation
1 í = ß
O
2.A.12. Computing the inverse matrix. Compute the inverse of the matrices
\[ A = \begin{pmatrix} 4 & 3 & 2 \\ 5 & 6 & 3 \\ 3 & 5 & 2 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 1 \\ 3 & 3 & 4 \\ 2 & 2 & 3 \end{pmatrix}. \]
Then determine the matrix $(A^T \cdot B)^{-1}$.
Solution. We find the inverse by the following method: write next to each other the matrix $A$ and the unit matrix. Then use elementary row transformations so that the sub-matrix $A$ changes into the unit matrix. This will change the original unit sub-matrix to $A^{-1}$. We obtain
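(2) Multiplication of the $i$-th row (column) by the scalar $a$:
\[ \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & a & & \\ & & & \ddots & \\ & & & & 1 \end{pmatrix}, \]
where $a$ is in the $i$-th row.
(3) Adding the $j$-th row to the $i$-th row (and analogously for columns):
\[ \begin{pmatrix} 1 & & & & \\ & \ddots & & 1 & \\ & & \ddots & & \\ & & & \ddots & \\ & & & & 1 \end{pmatrix}, \]
where the additional one is in the $i$-th row and $j$-th column.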
This trivial observation is actually very important, since the product of invertible matrices is invertible (recall 2.1.6(1)) and all elementary transformations over a field of scalars are invertible (the definition of the elementary transformation itself ensures that inverse transformations are of the same type and it is easy to determine the corresponding matrix).
Thus, the Gaussian elimination algorithm tells us that for an arbitrary matrix $A$, we can obtain its equivalent row echelon form $A' = P \cdot A$ by multiplying with a suitable invertible matrix $P = P_k \cdots P_1$ from the left (that is, sequential multiplication with $k$ matrices of the elementary row transformations).
If we apply the same elimination procedure for the columns, we can transform any matrix $B$ into its column echelon form $B'$ by multiplying it from the right by a suitable invertible matrix $Q = Q_1 \cdots Q_\ell$. If we start with the matrix $B = A'$ in row echelon form, this procedure eliminates only the remaining non-zero elements off the diagonal of the matrix, and in the end we can transform the remaining elements to be units. Thus we have verified a very important result which we will use many times in the future:
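2.1.9. Theorem. For every matrix $A$ of the type $m/n$ over a field of scalars $\mathbb{K}$, there exist square invertible matrices $P$ and $Q$ of dimensions $m$ and $n$, respectively, such that the matrix $P \cdot A$ is in row echelon form and
\[ P \cdot A \cdot Q = \begin{pmatrix} 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & 0 \end{pmatrix}. \]
The number of ones on the diagonal is independent of the particular choice of $P$ and $Q$.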
Proof. We already have proved everything but the last sentence. We shall see this last claim below in 2.1.11. □
In the first step we subtracted the third row from the first row, in the second step we added a $(-5)$-multiple of the first row to the second row and a $(-3)$-multiple of the first row to the third row, in the third step we subtracted the third row from the second row, in the fourth step we added a $(-2)$-multiple of the second row to the third row, in the fifth step we added a $(-5)$-multiple of the third row to the second row and a $2$-multiple of the third row to the first row, and in the last step we interchanged the second and the third row. We have obtained the result
\[ A^{-1} = \begin{pmatrix} 3 & -4 & 3 \\ 1 & -2 & 2 \\ -7 & 11 & -9 \end{pmatrix}. \]
Note that when calculating the matrix $A^{-1}$ we did not have to cope with fractions thanks to the suitably chosen row transformations. Although we could carry on similarly in the next computation, that is, for $B^{-1}$, we will rather do the more obvious row transformations. We have
2.1.10. Algorithm for computing inverse matrices. In the
previous paragraphs we almost obtained the complete algorithm for computing the inverse matrix. Using the simple modification below, we find either that the inverse does not exist, or we compute the inverse. Keep in mind that we are still working over a field of scalars.
Equivalent row transformations of a square matrix $A$ of dimension $n$ lead to an invertible matrix $P'$ such that $P' \cdot A$ is in row echelon form. If $A$ has an inverse, then there exists also the inverse of $P' \cdot A$. But if the last row of $P' \cdot A$ is zero, then the last row of $P' \cdot A \cdot B$ is also zero for any matrix $B$ of dimension $n$. Thus, the existence of a zero row in the result of (row) Gaussian elimination excludes the existence of $A^{-1}$.
Assume now that $A^{-1}$ exists. As we have just seen, the row echelon form of $A$ will have non-zero rows only. In particular, all diagonal elements of $P' \cdot A$ are non-zero. But now, we can employ row elimination by the elementary row transformations from the bottom-right corner backwards and also transform the diagonal elements to be units. In this way, we obtain the unit matrix $E$. Summarizing, we find another invertible matrix $P''$ such that for $P = P'' \cdot P'$ we have $P \cdot A = E$.
Now observe that we could clearly work with column transformations instead of row transformations and thus, under the assumption of the existence of $A^{-1}$, we would find a matrix $Q$ such that $A \cdot Q = E$. From this we see immediately that
\[ P = P \cdot E = P \cdot (A \cdot Q) = (P \cdot A) \cdot Q = E \cdot Q = Q. \]
That is, we have found the inverse matrix
\[ A^{-1} = P = Q \]
for the matrix $A$. Notice that at the point of finding the matrix $P$ with the property $P \cdot A = E$, we do not have to do any further computation, since we have already obtained the inverse matrix.
In practice, we can work as follows:
Computing the inverse matrix
Write the unit matrix E to the right of the matrix A, producing an augmented matrix (A, E). Transform the augmented matrix using the elementary row transformations to row echelon form. This produces an augmented matrix (PA, PE), where P is invertible, and PA is in row echelon form. By the above, either PA = E, in which case A is invertible and P = PE = A-1, or PA has a row of zeros, in which case we conclude that the inverse matrix for A does not exist.
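The procedure in the box can be sketched as follows. This is an illustrative Python version, not a definitive implementation: the function name `inverse` is hypothetical, exact rational arithmetic from the standard `fractions` module stands in for a field of scalars, and the elimination goes directly to the reduced form so that the right half of the augmented matrix is the inverse.

```python
from fractions import Fraction

def inverse(matrix):
    """Return the inverse of a square matrix (list of rows), or None if it does not exist."""
    n = len(matrix)
    # Augmented matrix (A, E) with exact rational entries.
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None  # elimination would produce a zero row: no inverse
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]  # make the pivot equal to one
        for r in range(n):
            if r != col and aug[r][col] != 0:
                q = aug[r][col]
                aug[r] = [x - q * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]  # the right half is now P = A^{-1}

A = [[4, 3, 2], [5, 6, 3], [3, 5, 2]]
A_inv = inverse(A)
singular = inverse([[1, 2], [2, 4]])
print(A_inv, singular)
```

For the sample matrix all entries of the inverse happen to be integers; for the second matrix, whose rows are proportional, the function reports that no inverse exists.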
2.1.11. Linear dependence and rank. In the previous practical algorithms dealing with matrices we worked all the time with additions and scalar multiplications of rows and columns, seeing them as vectors.
Such operations are called linear combinations. We shall return to such operations in an abstract sense later on in 2.3.1. But it will be useful to understand their core meaning right
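that is,
\[ B^{-1} = \begin{pmatrix} 1 & 2 & -3 \\ -1 & 1 & -1 \\ 0 & -2 & 3 \end{pmatrix}. \]
Using the identity
\[ (A^T \cdot B)^{-1} = B^{-1} \cdot (A^T)^{-1} = B^{-1} \cdot (A^{-1})^T \]
and the knowledge of the inverse matrices computed before, we obtain
\[ (A^T \cdot B)^{-1} = \begin{pmatrix} 1 & 2 & -3 \\ -1 & 1 & -1 \\ 0 & -2 & 3 \end{pmatrix} \cdot \begin{pmatrix} 3 & 1 & -7 \\ -4 & -2 & 11 \\ 3 & 2 & -9 \end{pmatrix} = \begin{pmatrix} -14 & -9 & 42 \\ -10 & -5 & 27 \\ 17 & 10 & -49 \end{pmatrix}. \]
□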
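2.A.13. Compute the inverse of the matrix
\[ A = \begin{pmatrix} 1 & 0 & -2 \\ 2 & -2 & 1 \\ 5 & -5 & 2 \end{pmatrix}. \]
2.A.14. Calculate $A^5$ and $A^{-3}$, if
\[ A = \begin{pmatrix} 2 & -1 & 1 \\ -1 & 2 & -1 \\ 0 & 0 & 1 \end{pmatrix}. \]
2.A.15. Compute the inverse of the matrix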
O
o
8	3	0	0	
5	2	0	0	0
0	0	-1	0	0
0	0	0	1	2
Vo	0	0	3	
o
2A.16. Determine whether there exists an inverse of the matrix
C =
(11 1 l\
11-11
1-1 1 -1
\i -i-i iy
If yes, then compute $C^{-1}$.
now. A linear combination of rows of a matrix $A = (a_{ij})$ of type $m/n$ is understood as an expression of the form
\[ c_1 u_{i_1} + \cdots + c_k u_{i_k}, \]
where $c_i$ are scalars and $u_j = (a_{j1}, \dots, a_{jn})$ are rows of the matrix $A$. Similarly, we can consider linear combinations of columns by replacing the above rows $u_j$ by the columns $u_j = (a_{1j}, \dots, a_{mj})$.
If the zero row can be written as a linear combination of some given rows with at least one non-zero scalar coefficient, we say that these rows are linearly dependent. In the alternative case, that is, when the only possibility of obtaining the zero row is to select all the scalars $c_i$ equal to zero, the rows are called linearly independent.
Analogously, we define linearly dependent and linearly independent columns.
The previous results about the Gaussian elimination can now be interpreted as follows: the number of nonzero "steps" in the row (column) echelon form is always equal to the number of linearly independent rows (columns) of the matrix. Let $E_h$ be the matrix from theorem 2.1.9 with $h$ ones on the diagonal and assume that by two different row transformation procedures into the echelon form we obtain two different values $h' < h$. But then according to our algorithm there are invertible matrices $P$, $P'$, $Q$, and $Q'$ such that
\[ E_h = P \cdot A \cdot Q, \qquad E_{h'} = P' \cdot A \cdot Q'. \]
In particular, $E_h = P \cdot P'^{-1} \cdot E_{h'} \cdot Q'^{-1} \cdot Q$ and so there are invertible matrices $P''$ and $Q''$ such that
\[ E_h = P'' \cdot E_{h'} \cdot Q''. \]
In the product $P'' \cdot E_{h'}$ there will be more zero rows in the bottom part of the echelon matrix than we see in $E_h$ and we must be able to reach $E_h$ using only elementary column transformations. This is clearly not possible, because the zero rows remain zero there.
Therefore the number of ones in the matrix $P \cdot A \cdot Q$ in theorem 2.1.9 is independent of the choice of our elimination procedure and it is always equal to the number of linearly independent rows in $A$, which must be the same as the number of linearly independent columns in $A$. This number is called the rank of the matrix and we denote it by $h(A)$. We have the following theorem:
Theorem. Let A be a matrix of type m/n over a field of scalars K. The matrix A has the same number h(A) of linearly independent rows as linearly independent columns. In particular, the rank is always at most the minimum of the dimensions of the matrix A.
The algorithm for computing the inverse matrix also says that a square matrix A of dimension m has an inverse if and only if its rank equals m.
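The statement that the row rank equals the column rank can be tried out numerically. The sketch below is an illustration only: `rank` and `transpose` are hypothetical helper names, and the rank is computed as the number of pivots found by row elimination with exact fractions.

```python
from fractions import Fraction

def rank(matrix):
    """Number of non-zero rows in a row echelon form of the matrix."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    m, n = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for col in range(n):
        pivot = next((i for i in range(r, m) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, m):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == m:
            break
    return r

def transpose(matrix):
    """Rows become columns and vice versa."""
    return [list(col) for col in zip(*matrix)]

A = [[1, 2, 3], [2, 4, 6], [1, 0, 1]]   # the second row is twice the first
B = [[1, 0, 2, 0], [0, 1, 3, 0]]        # type 2/4, so the rank is at most 2
print(rank(A), rank(transpose(A)), rank(B), rank(transpose(B)))
```

Computing the rank of the transposed matrix amounts to counting independent columns of the original one, so the two numbers agree in each case.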
O
2A.17. Compute A-1, if
(b) A:
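2.A.18. Find the inverse to the $n \times n$ matrix ($n > 1$)
\[ A = \begin{pmatrix} 2-n & 1 & \cdots & 1 & 1 \\ 1 & 2-n & \cdots & 1 & 1 \\ \vdots & & \ddots & & \vdots \\ 1 & 1 & \cdots & 2-n & 1 \\ 1 & 1 & \cdots & 1 & 2-n \end{pmatrix}. \]
Solution. You can try small $n$ ($n = 2, 3, 4$), for which the inverse is easy to compute with the known algorithm, and then guess the general form
\[ A^{-1} = \frac{1}{n-1} \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix}. \]
□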
We have already encountered systems of linear equations at the beginning of the chapter. Now we will deal with them in more detail. We use the inverse matrix to assist in computing the solution to the system of linear equations. Note that we do the same computation as before. To express the variables is the same as to bring the matrix of the system with equivalent transformation to the identity matrix and that is the same as to multiply the matrix of the system with the inverse matrix.
2.A.19. Participants of a trip. There were 45 participants of a two-day bus trip. On the first day, the fee for a watchtower visit was €30 for an adult, €16 for a child and €24 for a senior. The total fee for the first day was €1116. On the second day, the fee for a bus with a palace and botanical garden tour was €40 for an adult, €24 for a child and €34 for a senior. The total fee for the second day was €1 542. How many adults, children and seniors were there among the participants? Solution. Introduce the variables
$x$ for the "number of adults";
$y$ for the "number of children";
2.1.12. Matrices as mappings. Similarly to the way we worked with matrices in the geometry of the plane (see 1.5.7), we can interpret every matrix $A$ of the type $m/n$ as a mapping
\[ A : \mathbb{K}^n \to \mathbb{K}^m, \qquad x \mapsto A \cdot x. \]
By the distributivity of matrix multiplication, it is clear how the linear combinations of vectors are mapped using such mappings:
\[ A \cdot (a\,x + b\,y) = a\,(A \cdot x) + b\,(A \cdot y). \]
Straight from the definition we see, by the associativity of multiplication, that composition of mappings corresponds to matrix multiplication in the given order. Thus invertible matrices of dimension $n$ correspond to bijective mappings $A : \mathbb{K}^n \to \mathbb{K}^n$.
Remark. From this point of view, the theorem 2.1.9 is very interesting. We can see it as follows: the rank of the matrix determines how large the image of the whole $\mathbb{K}^n$ is under this mapping. In fact, if $A = P \cdot E_k \cdot Q$ where the matrix $E_k$ has $k$ ones as in 2.1.9, then the invertible $Q$ first bijectively "shuffles" the $n$-dimensional vectors in $\mathbb{K}^n$, the matrix $E_k$ then "copies" the first $k$ coordinates and completes them with the remaining $m - k$ zeros.
This "$k$-dimensional" image then cannot be enlarged by multiplying with $P$. Multiplying by $P$ can only bijectively reshuffle the coordinates.
2.1.13. Back to linear equations. We shall return to the notions of dimension, linear independence and so on in the third part of this chapter. But we should notice now what our results say about the solutions of the systems of linear equations. If we consider the matrix of the system of equations and add to it the column of the required results, we speak about the extended matrix of the system. The above Gaussian elimination approach corresponds to the sequential variable elimination in the equations and the deletion of the linearly dependent equations (these are simply consequences of other equations).
Thus we have derived complete information about the size of the set of solutions of the system of linear equations, based on the rank of the matrix of the system. If we are left with more non-zero rows in the row echelon form of the extended matrix than in the original matrix of the system, then there cannot be a solution (simply, we cannot obtain the given vector value with the corresponding linear mapping). If the rank of both matrices is the same, then the backwards elimination provides exactly as many free parameters as the difference between the number of variables n and the rank h(A). In particular, there will be exactly one solution if and only if the matrix is invertible.
All this will be stated explicitly in terms of abstract vector spaces in the important Kronecker-Capelli theorem, see 2.3.5.
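The rank criterion described above can be sketched in code. The following Python fragment is only an illustration: `rank` and `count_solutions` are hypothetical names, exact fractions play the role of the field of scalars, and the two right-hand sides tried at the end differ in a single value.

```python
from fractions import Fraction

def rank(matrix):
    """Number of pivots found by row elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    m, n = len(rows), len(rows[0])
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, m) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, m):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == m:
            break
    return r

def count_solutions(coefficients, right_side):
    """Compare the rank of the matrix of the system with the rank of the extended matrix."""
    extended = [row + [b] for row, b in zip(coefficients, right_side)]
    h, h_ext = rank(coefficients), rank(extended)
    if h_ext > h:
        return "no solution"
    free = len(coefficients[0]) - h  # number of variables minus the rank
    return "unique solution" if free == 0 else f"{free} free parameter(s)"

M = [[3, 0, 3, -5], [1, -1, 1, -1], [-2, -1, 4, -2], [2, 1, -1, -1]]
first = count_solutions(M, [-8, -2, 0, -3])
second = count_solutions(M, [8, -2, 0, -3])
print(first, "/", second)
```

With the first right-hand side both ranks equal three, leaving one free parameter; with the second one the extended matrix has rank four, so the system has no solution.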
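$z$ for the "number of seniors".
There were 45 participants, therefore
\[ x + y + z = 45. \]
The fees for the first and second days respectively imply that
\[ 30x + 16y + 24z = 1116, \qquad 40x + 24y + 34z = 1542. \]
We write the system of three linear equations in the matrix notation as
\[ \begin{pmatrix} 1 & 1 & 1 \\ 30 & 16 & 24 \\ 40 & 24 & 34 \end{pmatrix} \cdot \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 45 \\ 1116 \\ 1542 \end{pmatrix}. \]
We compute
\[ \begin{pmatrix} 1 & 1 & 1 \\ 30 & 16 & 24 \\ 40 & 24 & 34 \end{pmatrix}^{-1} = \frac{1}{6} \begin{pmatrix} 16 & 5 & -4 \\ 30 & 3 & -3 \\ -40 & -8 & 7 \end{pmatrix}. \]
Hence the solution is
\[ \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \frac{1}{6} \begin{pmatrix} 16 & 5 & -4 \\ 30 & 3 & -3 \\ -40 & -8 & 7 \end{pmatrix} \cdot \begin{pmatrix} 45 \\ 1116 \\ 1542 \end{pmatrix} = \begin{pmatrix} 22 \\ 12 \\ 11 \end{pmatrix}; \]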
expressed in words, there were 22 adults, 12 children and 11 seniors. □ The latter approach is particularly efficient if we have to solve several systems with the same matrix on the left hand side but different values on the right hand side.
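The arithmetic of this exercise is easy to re-check by machine. The sketch below is illustrative only; the variable names are chosen here and exact fractions are used to avoid rounding. It verifies that the stated matrix really is the inverse and then multiplies it by the right-hand side.

```python
from fractions import Fraction

M = [[1, 1, 1], [30, 16, 24], [40, 24, 34]]
M_inv = [[Fraction(x, 6) for x in row]
         for row in [[16, 5, -4], [30, 3, -3], [-40, -8, 7]]]
u = [45, 1116, 1542]

# Product M * M_inv; it should be the unit matrix.
check = [[sum(M[i][j] * M_inv[j][k] for j in range(3)) for k in range(3)]
         for i in range(3)]

# Solution x = M_inv * u.
solution = [sum(a * b for a, b in zip(row, u)) for row in M_inv]
print(check, solution)
```

Once the inverse is known, each further right-hand side costs only one matrix-vector product, which is the efficiency gain mentioned in the text.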
But what if the matrix of the system is not invertible? Then we cannot use the inverse matrix for solving the system. Such a system cannot have a single solution. As the reader may have noticed above, a system of linear equations either has no solution, has one solution or has infinitely many solutions, depending on one or more free parameters (for instance, it cannot have exactly two solutions). We should have also noticed when dealing with equations with two variables in the previous section, that the space of the solutions is either a vector space (in the case when the right-hand side of the system is zero, we speak of a homogeneous system of linear equations) or an affine space, see 4.1.1 (in the case when the right-hand side of at least one of the equations is non-zero, we speak of a non-homogeneous system of linear equations).
We can recognize all the possibilities from the rank of the matrices, i.e. the number of nonzero rows left in the row-echelon form.
2. Determinants
In the fifth part of the first chapter, we introduced the scalar function det on square matrices of dimension 2 over the real numbers, called the determinant, see 1.5.5. We saw that the determinant assigned a non-zero number to a matrix if and only if the matrix was invertible. We did not say it in exactly this way, but you can check for yourself in previous paragraphs starting with 1.5.4 and formula 1.5.5(1).
We saw also that determinants were useful in another way, see the paragraphs 1.5.10 and 1.5.11. There we showed that the volume of the parallelepiped should depend linearly on each of the two vectors defining it. It was useful to require the change of the sign when changing the order of these vectors. Because determinants (and only determinants) have these properties, up to a constant scalar multiple, we concluded that the determinant determines the volume. Now we will see that we can proceed similarly for every finite dimension.
We work again with arbitrary scalars K and matrices over these scalars. Our results about determinants will thus hold for all commutative rings, notably also for integer matrices or matrices over any residue classes.
2.2.1. Definition of the determinant. Recall that a bijective mapping from a set $X$ to itself is called a permutation of the set $X$, see 1.3.3. If $X = \{1, 2, \dots, n\}$, the permutation can be written by putting the resulting ordering into a table:
\[ \begin{pmatrix} 1 & 2 & \cdots & n \\ \sigma(1) & \sigma(2) & \cdots & \sigma(n) \end{pmatrix}. \]
The element $x \in X$ is called a fixed point of the permutation $\sigma$ if $\sigma(x) = x$. If there exist exactly two distinct elements $x, y \in X$ such that $\sigma(x) = y$ while all other elements $z \in X$ are fixed points, then the permutation $\sigma$ is called a transposition, and we denote it by $(x, y)$. Of course, then $\sigma(y) = x$ holds for such a transposition.
For dimension 2, the formula for a determinant was simple - take all possible products of two elements, one from every column and every row of the matrix, give them a sign such that interchanging two columns leads to the change of the sign of the whole result, and sum all of them (that is, both):
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad \det A = ad - bc.$$
Consider now square matrices $A = (a_{ij})$ of dimension $n$ over K. The formula for the determinant of the matrix $A$ is also composed of all possible products of elements from individual rows and columns, with properly chosen signs.
In dimension 3 we can guess the correct signs easily. The product of the elements on the diagonal should be with positive sign and we want anti-symmetry when interchanging two columns or rows. This gives the so called Sarrus rule:
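2.A.20. Determine the rank of the matrix
$$A = \begin{pmatrix} 1 & -3 & 0 & 1 \\ 1 & -2 & 2 & -4 \\ 1 & -1 & 0 & 1 \\ -2 & -1 & 1 & -2 \end{pmatrix}.$$
Then determine the number of solutions of the system of linear equations
$$\begin{aligned} x_1 + x_2 + x_3 - 2x_4 &= 4, \\ -3x_1 - 2x_2 - x_3 - x_4 &= 5, \\ 2x_2 + x_4 &= 1, \\ x_1 - 4x_2 + x_3 - 2x_4 &= 3. \end{aligned}$$
Determine also all solutions of the system
$$\begin{aligned} x_1 + x_2 + x_3 - 2x_4 &= 0, \\ -3x_1 - 2x_2 - x_3 - x_4 &= 0, \\ 2x_2 + x_4 &= 0, \\ x_1 - 4x_2 + x_3 - 2x_4 &= 0 \end{aligned}$$
and of the system
$$\begin{aligned} x_1 - 3x_2 &= 1, \\ x_1 - 2x_2 + 2x_3 &= -4, \\ x_1 - x_2 &= 1, \\ -2x_1 - x_2 + x_3 &= -2. \end{aligned}$$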
Solution. Transforming the matrix to the row-echelon form, we check that the rank is four. (The rank cannot exceed the number of rows or columns.) The first of the three given systems is given by the extended matrix
$$\left(\begin{array}{cccc|c} 1 & 1 & 1 & -2 & 4 \\ -3 & -2 & -1 & -1 & 5 \\ 0 & 2 & 0 & 1 & 1 \\ 1 & -4 & 1 & -2 & 3 \end{array}\right).$$
But the left-hand side is exactly $A^T$, and thus we can get the column-echelon form in the same way as before. In particular, the columns of the matrix are linearly independent and the rank is maximal, i.e. four again. Therefore there exists a matrix $(A^T)^{-1}$ and the system has a unique solution
$$(x_1, x_2, x_3, x_4)^T = (A^T)^{-1} \cdot (4, 5, 1, 3)^T.$$
The second of the systems has the same left-hand side (given by the matrix $A^T$) as the first. Because the numbers on the right-hand side of the equations in the system do not influence the number of solutions and because every homogeneous system has a zero solution, the only solution of the second system is given by
$$(x_1, x_2, x_3, x_4) = (0, 0, 0, 0).$$
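Sarrus rule:
$$\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}a_{22}a_{33} + a_{13}a_{21}a_{32} + a_{12}a_{23}a_{31} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}.$$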
The general definition can be formulated via a sum over all permutations:
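Definition of determinant
The determinant of the matrix $A$ is a scalar $\det A = |A|$ defined by the relation
$$|A| = \sum_{\sigma \in \Sigma_n} \operatorname{sgn}(\sigma)\, a_{1\sigma(1)} \cdot a_{2\sigma(2)} \cdots a_{n\sigma(n)},$$
where $\Sigma_n$ is the set of all possible permutations over $\{1, \dots, n\}$ and the symbol sgn for a permutation $\sigma$, called the parity of $\sigma$, will be described below. Each of the expressions
$$\operatorname{sgn}(\sigma)\, a_{1\sigma(1)} \cdot a_{2\sigma(2)} \cdots a_{n\sigma(n)}$$
is called a term in the determinant $|A|$.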
2.2.2. Parity of permutation. How should we define the sign of a permutation? We say that a pair of elements $a, b \in X = \{1, \dots, n\}$ forms an inversion in the permutation $\sigma$, if $a < b$ and $\sigma(a) > \sigma(b)$. A permutation $\sigma$ is called even or odd, if it contains an even or odd number of inversions, respectively.
Thus, the parity of the permutation $\sigma$ is $(-1)^{\text{number of inversions}}$ and we denote it by $\operatorname{sgn}(\sigma)$. This amounts to our definition of sign for computing determinants. But we should like to know how to calculate the parity. The following theorem reveals that the Sarrus rule really defines the determinant in dimension 3.
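The definition can be tried out directly on small matrices. The following is only a minimal sketch: the function names `sign` and `det_by_definition` are our own choice, indices start at 0 rather than 1, and the sum runs over all $n!$ permutations, so it is practical for small $n$ only.

```python
from itertools import permutations


def sign(perm):
    """Parity (-1)**(number of inversions) of a permutation given as a tuple."""
    inversions = sum(
        1
        for a in range(len(perm))
        for b in range(a + 1, len(perm))
        if perm[a] > perm[b]
    )
    return -1 if inversions % 2 else 1


def det_by_definition(A):
    """Sum of sgn(sigma) * a[0][sigma(0)] * ... * a[n-1][sigma(n-1)] over all sigma."""
    n = len(A)
    total = 0
    for perm in permutations(range(n)):
        term = sign(perm)
        for row, col in enumerate(perm):
            term *= A[row][col]
        total += term
    return total


# In dimension 2 this reproduces ad - bc.
print(det_by_definition([[1, 2], [2, 1]]))  # 1*1 - 2*2 = -3
```

In dimension 3 the six terms of the sum are exactly the six products of the Sarrus rule.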
Theorem. Over the set $X = \{1, 2, \dots, n\}$ there are exactly $n!$ distinct permutations. These can be ordered in a sequence such that every two consecutive permutations differ in exactly one transposition. Every transposition changes parity.
For any chosen permutation $\sigma$ there is such a sequence starting with $\sigma$.
Proof. For n = 1 or n = 2, the claim is trivial. We prove the theorem by induction on the size n of the set X.
Assume that the claim holds for all sets with $n - 1$ elements and consider a permutation $\sigma(1) = a_1, \dots, \sigma(n) = a_n$. According to the induction assumption, all the permutations that end with $a_n$ can be obtained in a sequence, where every two consecutive permutations differ in one transposition. There are $(n - 1)!$ such permutations. In order to proceed further, we select the last of them, and use the transposition of $\sigma(n) = a_n$ with some element $a_i$ which has not been at the last position yet. Once again, we form a sequence of all permutations that end with $a_i$. After doing this procedure $n$ times, we obtain $n(n - 1)! = n!$ distinct permutations - that
The third system is given by the extended matrix
$$\left(\begin{array}{ccc|c} 1 & -3 & 0 & 1 \\ 1 & -2 & 2 & -4 \\ 1 & -1 & 0 & 1 \\ -2 & -1 & 1 & -2 \end{array}\right),$$
which is the matrix $A$ (only the last column is given after the vertical bar). If we try to simplify the matrix into the row echelon form, we must obtain a row
$$(\, 0 \;\; 0 \;\; 0 \mid a \,), \quad \text{where } a \neq 0.$$
We know that the column on the right-hand side is not a linear combination of the columns on the left-hand side (the rank of the matrix is 4). This system thus has no solution. □
For further examples see 2.H.7.
B. Permutations and determinants
In order to be able to define the key object of the matrix calculus, the determinant, we must deal with permutations (bijections of a finite set) and their parities. We shall use the two-row notation for permutations (see 2.2.1). In the first row we list all elements of the given set, and every column then corresponds to a pair (preimage, image) in the given permutation. Because a permutation is a bijection, the second row is indeed a permutation (ordering) of the first row, in accordance with the definition from combinatorics.
2.B.1. Decompose the permutation
$$\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 3 & 1 & 6 & 7 & 8 & 9 & 5 & 4 & 2 \end{pmatrix}$$
into a product of transpositions.
Solution. We first decompose the permutation into a product of independent cycles. Start with the first element 1 and look on the second row to see what the image of 1 is. It is 3. Now look on the column that starts with 3, and see that the image of 3 is 6, and so on. Continue until we again reach the starting element 1. We obtain the following sequence of elements, which map to each other under the given permutation:
$$1 \mapsto 3 \mapsto 6 \mapsto 9 \mapsto 2 \mapsto 1.$$
The mapping which maps elements in such a manner is called a cycle (see 2.2.3) which we denote by (1,3, 6,9, 2).
Now choose any element not contained in the obtained cycle. With the same procedure as with 1, we obtain the cycle (4,7,5,8). From the method it is clear that the result does not depend on the first obtained cycle. Each element from the set
is, all permutations on n elements. The resulting sequence satisfies the condition.
Note that the last sentence of the theorem does not seem to be useful in practice. But it is a very important part for proving the theorem by induction over the size of X.
It remains to prove the part of the theorem about parities. Consider the ordering
$$(a_1, \dots, a_i, a_{i+1}, \dots, a_n),$$
containing $r$ inversions. Then in the ordering
$$(a_1, \dots, a_{i+1}, a_i, \dots, a_n)$$
there are either $r - 1$ or $r + 1$ inversions. Every transposition $(a_i, a_j)$ is obtainable by doing $(j - i) + (j - i - 1) = 2(j - i) - 1$ transpositions of neighbouring elements. Therefore any transposition changes the parity. Also, we already know that all permutations can be obtained by applying transpositions.
□
We found that applying a transposition changes the parity of a permutation and any ordering of numbers {1,2,... ,n} can be obtained through transposing of neighbouring elements. Therefore we have proven
Corollary. On every finite set $X = \{1, \dots, n\}$ with $n$ elements, $n > 1$, there are exactly $\frac{1}{2} n!$ even permutations, and $\frac{1}{2} n!$ odd permutations.
If we compose two permutations, it means first doing all transpositions forming the first permutation and then all the transpositions forming the second one. Therefore for any two permutations $\sigma, \eta : X \to X$ we have
$$\operatorname{sgn}(\sigma \circ \eta) = \operatorname{sgn}(\sigma) \cdot \operatorname{sgn}(\eta)$$
and also
$$\operatorname{sgn}(\sigma^{-1}) = \operatorname{sgn}(\sigma).$$
2.2.3. Decomposing permutations into cycles. A good tool for practical work with permutations is the cycle decomposition, which is also a good exercise on the concept of equivalence.
Cycles
A permutation $\sigma$ over the set $X = \{1, \dots, n\}$ is called a cycle of length $k$, if we can find elements $a_1, \dots, a_k \in X$, $2 \le k \le n$, such that $\sigma(a_i) = a_{i+1}$, $i = 1, \dots, k - 1$, while $\sigma(a_k) = a_1$, and the other elements in $X$ are fixed-points of $\sigma$. Cycles of length two are transpositions.
Every permutation is a composition of cycles. Cycles of even length have parity $-1$, cycles of odd length have parity $1$.
Proof. The last claim has yet to be proved. Fix a permutation $\sigma$ and define a relation $R$ such that two elements $x, y \in X$ are $R$-related if and only if $\sigma^{\ell}(x) = y$ for some iteration $\ell \in \mathbb{Z}$ of the permutation $\sigma$ (notice $\sigma^{-1}$ means the inverse bijection to $\sigma$). Clearly, it is an equivalence relation
$\{1, 2, \dots, 9\}$ appears in one of the obtained cycles, we can thus write:
$$\sigma = (1,3,6,9,2) \circ (4,7,5,8) = (4,7,5,8) \circ (1,3,6,9,2),$$
since independent cycles commute. For cycles the decomposition into transpositions is simple, we have
$$(1,3,6,9,2) = (1,3) \circ (3,6) \circ (6,9) \circ (9,2) = (1,3)(3,6)(6,9)(9,2).$$
Thus we obtain:
$$\sigma = (1,3)(3,6)(6,9)(9,2)(4,7)(7,5)(5,8).$$
□
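The procedure of this solution is mechanical enough to be sketched in code. The sketch below assumes the permutation is stored as a Python dictionary mapping each element to its image; the helper names `cycles` and `transpositions` are ours, not the book's.

```python
def cycles(perm):
    """Nontrivial independent cycles of a permutation given as {element: image}."""
    seen, result = set(), []
    for start in sorted(perm):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:  # follow start -> image -> image of image -> ...
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        if len(cycle) > 1:  # fixed points are not listed
            result.append(tuple(cycle))
    return result


def transpositions(cycle):
    """Write the cycle (a1, ..., ak) as the product (a1,a2)(a2,a3)...(a_{k-1},a_k)."""
    return [(cycle[i], cycle[i + 1]) for i in range(len(cycle) - 1)]


sigma = dict(zip(range(1, 10), [3, 1, 6, 7, 8, 9, 5, 4, 2]))
for c in cycles(sigma):
    print(c, transpositions(c))
```

For the permutation of this exercise the two cycles found are (1, 3, 6, 9, 2) and (4, 7, 5, 8), giving four and three transpositions respectively.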
Remark. The minimal number of transpositions in the decomposition of a permutation is obtained by carrying out exactly the procedure above: first decompose the permutation into independent cycles, then decompose the cycles canonically into transpositions. Thus the decomposition found is a decomposition into the minimal number of transpositions.
Note also that the operation o is a composition of mappings, thus it is necessary to carry out the composition "backwards", as we are used to in composition of mappings. Applying the given composition of transposition for instance on the element two we can successively write:
[(1,3)(3,6)(6,9)(9,2)](2) =
[(1,3)(3,6)(6,9)]((9,2)(2)) =
[(1,3)(3,6)(6,9)](9) = [(1,3)(3,6)](6) = (1,3)(3) = 1,
thus the mapping indeed maps the element 2 on the element 1
(it is actually just the cycle (1,3,6,9,2) written in a different
way). When writing a composition of permutations, we often
omit the sign "o" and speak of the product of permutations.
When writing the cycle we write only the elements on
which the cycle (that is, the mapping) nontrivially acts (that is,
the element is mapped to some other element). Fixed-points
of the cycle are not listed. Thus it is necessary to know on
which set do we consider the given cycle (mostly it will be
clear from the context). The cycle c = (4,7,5, 8) from the
previous example is thus a mapping (permutation), which, in
the two-row notation, looks like this
$$c = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 1 & 2 & 3 & 7 & 8 & 6 & 5 & 4 & 9 \end{pmatrix}.$$
If the original permutation has some fixed-points they do
not appear in the cycle decomposition.
(check it carefully!). Because $X$ is a finite set, for some $\ell$ it must be that $\sigma^{\ell}(x) = x$. If we pick one equivalence class $\{x, \sigma(x), \dots, \sigma^{\ell-1}(x)\} \subset X$ and define the other elements to be fixed-points, we obtain a cycle. Evidently, the original permutation $\sigma$ is then the composition of all these cycles for the individual equivalence classes and it does not matter in which order we compose the cycles.
For determining the parity we just have to note that cycles of even length can be written as a composition of an odd number of transpositions, therefore their parity is $-1$. Analogously, a cycle of odd length can be obtained using an even number of transpositions and therefore it has parity $1$. □
2.2.4. Expansion of determinant. Our understanding of permutations allows us to find the expansion method of computing determinants. The simple idea is to collect the terms containing an element in a fixed row in the determinant sum and to add these contributions along the row.
Consider a matrix $A = (a_{ij})$ and let us look at all terms in $|A|$ containing the element $a_{11}$. By the very definition, these terms correspond to all permutations $\sigma$ with $\sigma(1) = 1$. Thus, the contribution of all these terms to $|A|$ is $a_{11} A_{11}$, where $A_{11}$ is the determinant of the matrix obtained from $A$ by omitting the first row and the first column.
Similarly, we can take any other fixed element $a_{ij}$ in $A$ and look for the contribution of all terms containing it. Again, we could write $A_{ij}$ for the determinant of the matrix obtained from $A$ by omitting the $i$-th row and the $j$-th column, and the latter contribution must have terms like in $a_{ij} A_{ij}$, but we have to be very careful about the signs. While the actual terms of $|A|$ would be $\operatorname{sgn}(\sigma)\, a_{ij}\, a_{1\sigma(1)} \cdots \widehat{a_{i\sigma(i)}} \cdots a_{n\sigma(n)}$, where the hat denotes the omission of the $i$-th entry and $\sigma(i) = j$, the signatures of the permutations in $A_{ij}$, with the $i$ and $j$ omitted, might be different.
In order to compare it to the previous case $i = 1$, $j = 1$, we can change the initial ordering of the elements in the domain and target of the permutations $\sigma$. Clearly, $i - 1$ changes on the domain and $j - 1$ changes on the target do the job (by "bubbling" the index in question to the first position by consecutive swaps of neighbouring positions).
Thus, the sign correction is $(-1)^{i+j-2}$ and we have to adjust the value of $A_{ij}$ as in the following algorithm, which is the simplest version of the more general Laplace expansion formula, see 2.2.9 below. The readers not sure about the details of our argumentation here may wait for the detailed proof in the more general situation.
Note further that the notation (1, 2,3) gives the same cycle as for instance (2,3,1) or (3,1, 2). But the notation (1,3,2) is a different cycle.
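2.B.2. Determine the parity of the following permutations:
$$\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 3 & 1 & 6 & 7 & 8 & 9 & 5 & 4 & 2 \end{pmatrix}, \qquad \tau = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 2 & 4 & 6 & 1 & 5 & 3 \end{pmatrix}.$$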
Solution. According to our definition (see 2.2.2) we compute the number of inversions of $\sigma$: we go sequentially through the second row in the two-row notation and for every number $k$ there we count the numbers which are smaller than $k$ and are located after $k$ in the second row. It is not hard to see that the number of inversions in a given permutation is exactly the number of pairs "larger before smaller" in the second row. For $\sigma$ we compute (stepping through the second row): after three there are one and two, thus we add 2; after one there is no smaller number and we add 0; after six there are five, four and two, thus we add 3, and similarly for seven, eight and nine; for five we add 2, for four we add 1 and for two nothing. Thus we have 17 inversions in total and the permutation is odd.
But we can compute the parity of $\sigma$ otherwise. The theorem 2.2.2 implies that the parity of a permutation is given by the parity of the number of transpositions in its decomposition (this parity is, unlike the number of transpositions in an arbitrary decomposition, always the same).
The previous exercise gives us $\sigma = (1,3)(3,6)(6,9)(9,2)(4,7)(7,5)(5,8)$. There are seven transpositions in the decomposition, thus the permutation is indeed odd.
Similarly, for $\tau$ we can either decompose it into a product of three transpositions (using the cycle decomposition):
$$\tau = (1,2,4)(3,6) = (1,2)(2,4)(3,6),$$
or count the number of inversions in $\tau$: $1 + 2 + 3 + 0 + 1 = 7$. Either way we find that $\tau$ is an odd permutation.
In general, as soon as the decomposition into cycles is ready, we may just count the lengths of the cycles, since each cycle including $k$ elements is clearly built of $k - 1$ transpositions and thus contributes $(-1)^{k-1}$ to the parity. □
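Both ways of determining the parity can be compared on the two permutations of this exercise. This is only an illustrative sketch: the permutation is assumed to be given by the second row of its two-row notation on $\{1, \dots, n\}$, and the function names are hypothetical.

```python
def inversions(second_row):
    """Number of pairs 'larger before smaller' in the second row."""
    n = len(second_row)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if second_row[i] > second_row[j]
    )


def parity_from_cycles(second_row):
    """Product of (-1)**(k - 1) over all cycles, k being the cycle length."""
    n = len(second_row)
    seen = [False] * (n + 1)
    parity = 1
    for start in range(1, n + 1):
        length, x = 0, start
        while not seen[x]:
            seen[x] = True
            x = second_row[x - 1]
            length += 1
        if length > 0 and length % 2 == 0:  # even cycle = odd number of transpositions
            parity = -parity
    return parity


sigma = [3, 1, 6, 7, 8, 9, 5, 4, 2]
tau = [2, 4, 6, 1, 5, 3]
print(inversions(sigma), parity_from_cycles(sigma))  # 17 inversions, parity -1
print(inversions(tau), parity_from_cycles(tau))      # 7 inversions, parity -1
```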
For the following exercises, recall how to compute determinants of the type 2x2 ($a_{11} \cdot a_{22} - a_{12} \cdot a_{21}$) and 3x3 (Sarrus rule), see 2.2.1.
Expansion of determinant
The algebraic complement $A_{ij}$ of the element $a_{ij}$ in a matrix $A$ is the $(-1)^{i+j}$-multiple of the determinant of the matrix obtained from $A$ by omitting the $i$-th row and the $j$-th column.
Fixing the $i$-th row or the $j$-th column,
$$|A| = \sum_{j=1}^{n} a_{ij} A_{ij}, \qquad |A| = \sum_{i=1}^{n} a_{ij} A_{ij}.$$
The latter formulae correspond to splitting the determinant sum into parts containing the terms with the individual elements in the row or column.
For example, an easy application derives the Sarrus rule from the formula in dimension 2 now.
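The expansion along a row translates into a short recursive procedure. The following is a sketch only, with 0-based indices (the sign $(-1)^{i+j}$ is unaffected by the shift) and with hypothetical names `minor_matrix` and `det_expansion`.

```python
def minor_matrix(A, i, j):
    """Matrix obtained from A by omitting the i-th row and the j-th column."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det_expansion(A, i=0):
    """Determinant by expansion along row i (smaller determinants use row 0)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum(
        (-1) ** (i + j) * A[i][j] * det_expansion(minor_matrix(A, i, j))
        for j in range(n)
    )


A = [[1, 3, 5, 6], [1, 2, 2, 2], [1, 1, 1, 2], [0, 1, 2, 1]]
# Every choice of the row must give the same value.
print([det_expansion(A, i) for i in range(4)])
```

Expanding along a row with many zeros saves work, since the corresponding smaller determinants need not be computed at all; the sketch does not exploit this.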
2.2.5. Simple properties. Knowing the properties of permutations and their parities from the previous paragraphs allows us to derive quickly the basic properties of determinants.
For every matrix $A = (a_{ij})$ of the type $m/n$ over scalars from K we define the transpose of $A$ as the matrix $A^T = (a'_{ij})$ with elements $a'_{ij} = a_{ji}$. The matrix $A^T$ is of the type $n/m$.
A square matrix A with the property A = AT is called symmetric. If A = —AT, then A is called antisymmetric.
Simple properties of determinants
Theorem. Every square matrix $A = (a_{ij})$ satisfies the following conditions:
(1) $|A^T| = |A|$.
(2) If one of the rows contains only zero elements from K, then $|A| = 0$.
(3) If a matrix $B$ was obtained from $A$ by transposing two rows, then $|A| = -|B|$.
(4) If a matrix $B$ was obtained from $A$ by multiplying one row by a scalar $a \in$ K, then $|B| = a\,|A|$.
(5) If all elements of the $k$-th row in $A$ are of the form $a_{kj} = c_{kj} + b_{kj}$ and all remaining rows in the matrices $A$, $B = (b_{ij})$, $C = (c_{ij})$ are identical, then $|A| = |B| + |C|$.
(6) A determinant $|A|$ does not change if we add to any row of $A$ a linear combination of other rows.
Proof. (1) The terms of the determinants $|A|$ and $|A^T|$ are in bijective correspondence, where the term $\operatorname{sgn}(\sigma)\, a_{1\sigma(1)} \cdot a_{2\sigma(2)} \cdots a_{n\sigma(n)}$ corresponds to the following term of $|A^T|$ (notice it does not depend on the order of scalars)
$$\operatorname{sgn}(\sigma)\, a_{\sigma(1)1} \cdot a_{\sigma(2)2} \cdots a_{\sigma(n)n} = \operatorname{sgn}(\sigma)\, a_{1\sigma^{-1}(1)} \cdot a_{2\sigma^{-1}(2)} \cdots a_{n\sigma^{-1}(n)},$$
and we have to ensure that this member has the correct sign. But the parities of $\sigma$ and $\sigma^{-1}$ are the same, and so this is really a term in the determinant $|A^T|$ and the first claim is proved.
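2.B.3. Compute the determinant of the following matrices
$$\begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 2 & 3 \\ 1 & -1 & 2 \\ 3 & 2 & 2 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ -2 & 0 & 1 \end{pmatrix}.$$
Solution. The determinant of the first matrix is $1 \cdot 1 - 2 \cdot 2 = -3$.
As for the second matrix, according to the Sarrus rule we just have to evaluate the expression $1 \cdot (-1) \cdot 2 + 2 \cdot 2 \cdot 3 + 3 \cdot 1 \cdot 2 - 3 \cdot (-1) \cdot 3 - 1 \cdot 2 \cdot 2 - 2 \cdot 1 \cdot 2 = 17$.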
We can also bring the matrix into the row echelon form and then multiply the numbers on the diagonal but we have to remember that a multiplication of a row with a scalar changes the determinant of the matrix by the same multiple. Interchanging two rows changes the sign of the determinant of the matrix.
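$$\begin{vmatrix} 1 & 2 & 3 \\ 1 & -1 & 2 \\ 3 & 2 & 2 \end{vmatrix} = \begin{vmatrix} 1 & 2 & 3 \\ 0 & -3 & -1 \\ 0 & -4 & -7 \end{vmatrix} = -\frac{1}{4} \cdot \frac{1}{3} \begin{vmatrix} 1 & 2 & 3 \\ 0 & 12 & 4 \\ 0 & -12 & -21 \end{vmatrix} = -\frac{1}{12} \begin{vmatrix} 1 & 2 & 3 \\ 0 & 12 & 4 \\ 0 & 0 & -17 \end{vmatrix}.$$
We finish with an upper triangular matrix. The determinant of such a matrix is the product of the numbers on the main diagonal. So the result is $-\frac{1}{12}(1 \cdot 12 \cdot (-17)) = 17$.
We can see that using the Sarrus rule is quicker.
For the third matrix we have $1 \cdot 0 \cdot 1 + 1 \cdot 0 \cdot 1 + 1 \cdot 0 \cdot (-2) - 1 \cdot 0 \cdot (-2) - 1 \cdot 1 \cdot 1 - 1 \cdot 0 \cdot 0 = -1$.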
□
It is important to realize that the Sarrus rule can be used for 3x3 matrices only. For matrices of higher dimension you can either bring the matrix to the row echelon form (where you have to take into account the rules from 2.2.5) or use the Laplace expansion (see 2.2.9).
2.B.4. Compute the determinant of the matrix
$$\begin{pmatrix} 1 & 3 & 5 & 6 \\ 1 & 2 & 2 & 2 \\ 1 & 1 & 1 & 2 \\ 0 & 1 & 2 & 1 \end{pmatrix}.$$
Solution. We compute this in two ways. First, convert the matrix to row echelon form. We can use already known elementary transformations,
(2) This comes straight from the definition of determinant, because all its terms contain exactly one member from every row. Thus, if one of the rows is zero, all terms of the determinant are also zero.
(3) The only change in the terms of $|B|$ compared to $|A|$ is the addition of one transposition in all permutations, therefore all the signs will be reversed.
(4) This follows straight from the definition, because the terms of $|B|$ are just the terms of $|A|$ multiplied by the scalar $a$.
(5) In every term of $|A|$, there is exactly one element from the $k$-th row of the matrix $A$. By the distributive law for multiplication and addition in K, the claim follows directly from the definition of the determinant.
(6) If there are two identical rows in A, then there are always two identical terms among all terms in the determinant, up to the sign. Therefore in this case \A\ = 0. Thus, by (5), we can add any other row to the given row, without changing the value of the determinant. In view of the claims (4) and (5), we can in fact add a scalar multiple of any other row. □
2.2.6. Computational corollaries. By the previous theorem, we can use elementary row transformations to bring any square matrix $A$ into row echelon form, without changing the value of its determinant. We just have to be careful and add only linear combinations of other rows to a given one.
Thus let us look at the distribution of the elements in the individual terms of a determinant \A\ with dimension of A equal to n > 1. There is just one term with all of its elements on the diagonal. In all other terms, there must be elements both above and below the diagonal (if we place one element outside of the diagonal, we block two diagonal entries and we leave only n — 2 diagonal positions for the other n — 1 elements).
Therefore, if the matrix A is in a row echelon form, then every term of \A\ is zero, except the term with exclusively diagonal entries. This proves the following algorithm:
Computing determinants using elimination
If $A$ is in the row echelon form then
$$|A| = a_{11} \cdot a_{22} \cdots a_{nn}.$$
The previous theorem gives an effective method for computing determinants using the Gauss elimination method, see the paragraph 2.1.7.
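A minimal sketch of this method follows. It assumes rational entries (exact arithmetic with Python's `fractions.Fraction`, so divisions are allowed), adds only multiples of other rows to a given row, and tracks the sign changes caused by interchanging rows; the function name `det_elimination` is our own.

```python
from fractions import Fraction


def det_elimination(A):
    """Determinant via Gauss elimination to row echelon form over the rationals."""
    M = [[Fraction(x) for x in row] for row in A]
    n = len(M)
    sign = 1
    for col in range(n):
        # Find a row with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # a zero will appear on the diagonal
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            sign = -sign  # interchanging two rows changes the sign
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            # Adding a multiple of another row does not change the determinant.
            M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    result = Fraction(sign)
    for k in range(n):
        result *= M[k][k]
    return result


print(det_elimination([[1, 3, 5, 6], [1, 2, 2, 2], [1, 1, 1, 2], [0, 1, 2, 1]]))
```

The number of arithmetic operations grows like $n^3$, in contrast to the $n!$ terms of the defining sum.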
Notice that the very same argumentation allows us to stop the elimination having the first $k$ columns in the requested form and finding the determinant of the matrix $B$ of dimension $n - k$ in the right bottom corner of $A$ in another way. The result will then be $|A| = a_{11} \cdot a_{22} \cdots a_{kk} \cdot |B|$.
Let us note a nice corollary of the first claim of the previous theorem about the equality of the determinants of the matrix and its transpose. It ensures that whenever we prove some claim about determinants formulated in terms of rows
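$$\begin{vmatrix} 1 & 3 & 5 & 6 \\ 1 & 2 & 2 & 2 \\ 1 & 1 & 1 & 2 \\ 0 & 1 & 2 & 1 \end{vmatrix} = -\begin{vmatrix} 1 & 1 & 1 & 2 \\ 1 & 2 & 2 & 2 \\ 1 & 3 & 5 & 6 \\ 0 & 1 & 2 & 1 \end{vmatrix} = -\begin{vmatrix} 1 & 1 & 1 & 2 \\ 0 & 1 & 1 & 0 \\ 0 & 2 & 4 & 4 \\ 0 & 1 & 2 & 1 \end{vmatrix} = -\begin{vmatrix} 1 & 1 & 1 & 2 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 2 & 4 \\ 0 & 0 & 1 & 1 \end{vmatrix} = \begin{vmatrix} 1 & 1 & 1 & 2 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 2 & 4 \end{vmatrix} = \begin{vmatrix} 1 & 1 & 1 & 2 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 2 \end{vmatrix} = 2.$$
Note that we have interchanged the rows twice in the course of the computation.
The other way of computing the determinant is by cofactor expansion along the first column (the one with the greatest number (one) of zeroes). Successively we obtain
$$\begin{vmatrix} 1 & 3 & 5 & 6 \\ 1 & 2 & 2 & 2 \\ 1 & 1 & 1 & 2 \\ 0 & 1 & 2 & 1 \end{vmatrix} = 1 \cdot \begin{vmatrix} 2 & 2 & 2 \\ 1 & 1 & 2 \\ 1 & 2 & 1 \end{vmatrix} - 1 \cdot \begin{vmatrix} 3 & 5 & 6 \\ 1 & 1 & 2 \\ 1 & 2 & 1 \end{vmatrix} + 1 \cdot \begin{vmatrix} 3 & 5 & 6 \\ 2 & 2 & 2 \\ 1 & 2 & 1 \end{vmatrix}$$
and, using the Sarrus rule,
$$= -2 - 2 + 6 = 2.$$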
□
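2.B.5. Compute the determinant of the matrix
$$\begin{pmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 2 & 0 & 2 & 0 \\ 0 & 0 & 3 & 0 & 3 \\ 4 & 0 & 0 & 4 & 4 \\ 0 & 0 & 0 & 0 & 5 \end{pmatrix}.$$
Solution. We notice that the last (fifth) row contains four zeros (as does the second column). This is the most we can find in a row or a column of the matrix, thus it will be advantageous to use the Laplace theorem (2.3.10) and compute the determinant via expansion along the fifth row or the second column. We present the expansion along the fifth row:
$$\begin{vmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 2 & 0 & 2 & 0 \\ 0 & 0 & 3 & 0 & 3 \\ 4 & 0 & 0 & 4 & 4 \\ 0 & 0 & 0 & 0 & 5 \end{vmatrix} = 0 \cdot \begin{vmatrix} 0 & 1 & 0 & 1 \\ 2 & 0 & 2 & 0 \\ 0 & 3 & 0 & 3 \\ 0 & 0 & 4 & 4 \end{vmatrix} - 0 \cdot \begin{vmatrix} 1 & 1 & 0 & 1 \\ 0 & 0 & 2 & 0 \\ 0 & 3 & 0 & 3 \\ 4 & 0 & 4 & 4 \end{vmatrix} + 0 \cdot \begin{vmatrix} 1 & 0 & 0 & 1 \\ 0 & 2 & 2 & 0 \\ 0 & 0 & 0 & 3 \\ 4 & 0 & 4 & 4 \end{vmatrix} - 0 \cdot \begin{vmatrix} 1 & 0 & 1 & 1 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 3 \\ 4 & 0 & 0 & 4 \end{vmatrix} + 5 \cdot \begin{vmatrix} 1 & 0 & 1 & 0 \\ 0 & 2 & 0 & 2 \\ 0 & 0 & 3 & 0 \\ 4 & 0 & 0 & 4 \end{vmatrix}$$
$$= 5 \cdot 2 \cdot \begin{vmatrix} 1 & 1 & 0 \\ 0 & 3 & 0 \\ 4 & 0 & 4 \end{vmatrix} = 120,$$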
of the corresponding matrix, we immediately obtain an analogous claim in terms of the columns.
For instance, we can immediately formulate all the claims (2)-(6) for linear combinations of columns.
As a useful (theoretical) illustration of this principle, we shall derive the following formula for direct calculation of solutions of systems of linear equations. For the sake of simplicity, we shall work with a field of scalars now.
Cramer rule
Proposition. Consider the system of $n$ linear equations for $n$ variables with the matrix of the system $A = (a_{ij})$ and the column of values $b = (b_1, \dots, b_n)$. In matrix notation this means we are solving the equation $A \cdot x = b$.
If there exists the inverse $|A|^{-1}$, then the individual components of the unique solution $x = (x_1, \dots, x_n)$ are given as
$$x_i = |A_i|\,|A|^{-1},$$
where the matrices $A_i$ arise from the matrix $A$ of the system by replacing the $i$-th column by the column $b$ of values.
Proof. As we have already seen, working over a field of scalars the inverse of the matrix of the system exists if and only if the system has a unique solution, and this in turn happens if and only if $|A|^{-1}$ exists. If we have such a solution $x$, we can express the column $b$ in the matrix $A_i$ by the corresponding linear combination of the columns of the matrix $A$, that is, by the values $b_k = a_{k1}x_1 + \dots + a_{kn}x_n$. Then, by subtracting the $x_\ell$-multiples of all the other $\ell$-th columns from this $i$-th column in $A_i$, we arrive at just the $x_i$-multiple of the original column of $A$. The number $x_i$ can thus be brought in front of the determinant to obtain the equation $|A_i| = x_i\,|A|$, and thus $|A_i|\,|A|^{-1} = x_i\,|A|\,|A|^{-1} = x_i$, which is our claim. □
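The Cramer rule can be illustrated on small systems with integer coefficients. The sketch below is an assumption-laden illustration rather than a recommended solver: the determinant is computed by expansion along the first row, the results are exact fractions, and the names `det` and `cramer` are ours.

```python
from fractions import Fraction


def det(A):
    """Determinant by expansion along the first row (fine for small matrices)."""
    if len(A) == 1:
        return A[0][0]
    return sum(
        (-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
        for j in range(len(A))
    )


def cramer(A, b):
    """Solve A x = b for a square integer matrix A with non-zero determinant."""
    d = det(A)
    if d == 0:
        raise ValueError("the determinant is zero, the Cramer rule does not apply")
    solution = []
    for i in range(len(A)):
        # A_i arises from A by replacing the i-th column by the column b.
        A_i = [row[:i] + [b[k]] + row[i + 1:] for k, row in enumerate(A)]
        solution.append(Fraction(det(A_i), d))
    return solution


print(cramer([[2, 1], [1, 3]], [3, 5]))  # x1 = 4/5, x2 = 7/5
```

For larger systems the elimination method is far cheaper, since the rule requires $n + 1$ determinants of dimension $n$.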
Notice also that the properties (3)-(5) from the previous theorem say that the determinant, (considered as a mapping which assigns a scalar to n vectors of dimension n), is an antisymmetric mapping linear in every argument, exactly as we required in analogy to the 2-dimensional case.
2.2.7. Further properties of the determinant. Later we will see that, exactly as in dimension 2, the determinant of the matrix equals the (oriented) volume of the parallelepiped determined by the columns of the matrix. We shall also see that considering the mapping $x \mapsto A \cdot x$ given by the square matrix $A$ on $\mathbb{R}^n$ we can understand the determinant of this matrix as expressing the ratio between the volumes of the parallelepipeds given by the vectors $x_1, \dots, x_n$ and their images $A \cdot x_1, \dots, A \cdot x_n$.
Because the composition $x \mapsto A \cdot x \mapsto B \cdot (A \cdot x)$ of mappings corresponds to the matrix multiplication, the Cauchy theorem below is easy to understand:
where we have used the expansion along the second column in the second step and computed the determinant of the 3 x 3 matrix directly using the Sarrus rule.
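Another option is to try to expand the determinant along several rows, exploiting the vanishing of many sub-determinants there. For example, we may use the last two rows. Clearly there are only two non-zero sub-determinants built from these rows. The signs come from the definition of the algebraic complement, see 2.3.10: choosing the rows 4, 5 and the columns 1, 5 gives the sign $(-1)^{4+5+1+5} = -1$ (which does not matter here, because the complementary minor vanishes), while the columns 4, 5 give $(-1)^{4+5+4+5} = +1$. Thus the entire determinant must be
$$-\begin{vmatrix} 4 & 4 \\ 0 & 5 \end{vmatrix} \cdot \begin{vmatrix} 0 & 1 & 0 \\ 2 & 0 & 2 \\ 0 & 3 & 0 \end{vmatrix} + \begin{vmatrix} 4 & 4 \\ 0 & 5 \end{vmatrix} \cdot \begin{vmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{vmatrix} = -20 \cdot 0 + 20 \cdot 6 = 120.$$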
□
2.B.6. Find all the values of $a$ such that
$$D = \begin{vmatrix} a & 1 & 1 & 1 \\ 0 & a & 1 & 1 \\ 0 & 1 & a & 1 \\ 0 & 0 & 0 & -a \end{vmatrix} = 1.$$
For complex $a$ give either its algebraic or polar form.
Solution. We compute the determinant by expanding along the first column of the matrix:
$$D = \begin{vmatrix} a & 1 & 1 & 1 \\ 0 & a & 1 & 1 \\ 0 & 1 & a & 1 \\ 0 & 0 & 0 & -a \end{vmatrix} = a \cdot \begin{vmatrix} a & 1 & 1 \\ 1 & a & 1 \\ 0 & 0 & -a \end{vmatrix}.$$
Expand further using the last row:
$$D = a \cdot (-a) \cdot \begin{vmatrix} a & 1 \\ 1 & a \end{vmatrix} = -a^2 (a^2 - 1).$$
We conclude that $a^4 - a^2 + 1 = 0$. Substituting $t = a^2$ we have $t^2 - t + 1 = 0$ with roots
$$t_1 = \frac{1 + i\sqrt{3}}{2} = \cos(\pi/3) + i \sin(\pi/3), \qquad t_2 = \frac{1 - i\sqrt{3}}{2} = \cos(-\pi/3) + i \sin(-\pi/3),$$
from where we obtain four possible values for the parameter $a$:
$$a_1 = \cos(\pi/6) + i \sin(\pi/6) = \sqrt{3}/2 + i/2, \qquad a_2 = \cos(7\pi/6) + i \sin(7\pi/6) = -\sqrt{3}/2 - i/2,$$
$$a_3 = \cos(-\pi/6) + i \sin(-\pi/6) = \sqrt{3}/2 - i/2, \qquad a_4 = \cos(5\pi/6) + i \sin(5\pi/6) = -\sqrt{3}/2 + i/2.$$
Alternatively, we can multiply by $a^2 + 1$ to obtain
$$a^6 + 1 = (a^2 + 1)(a^4 - a^2 + 1) = 0.$$
The equation $a^6 = -1$ has six (complex) solutions given by $a = \cos\varphi + i \sin\varphi$ where $\varphi = \pi/6 + k\pi/3 = (2k + 1)\pi/6$, $k = 0, 1, 2, 3, 4, 5$. Of these, we must discard the two choices $k = 1$ and $k = 4$, since these choices solve $a^2 + 1 = 0$ and
Cauchy theorem
Theorem. Let $A = (a_{ij})$, $B = (b_{ij})$ be square matrices of dimension $n$ over the ring of scalars K. Then
$$|A \cdot B| = |A| \cdot |B|.$$
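Before the algebraic proof, the statement can at least be checked numerically on examples. The following sketch does this for two integer matrices taken from the exercises; the helpers `det` and `matmul` are our own minimal implementations, and a check on examples is of course no substitute for the proof.

```python
def det(A):
    """Determinant by expansion along the first row (fine for small matrices)."""
    if len(A) == 1:
        return A[0][0]
    return sum(
        (-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
        for j in range(len(A))
    )


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


A = [[1, 2, 3], [1, -1, 2], [3, 2, 2]]    # determinant 17
B = [[1, 1, 1], [1, 0, 0], [-2, 0, 1]]    # determinant -1
print(det(matmul(A, B)), det(A) * det(B))  # both values agree
```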
In the next paragraphs, we derive this theorem in a purely algebraic way, in particular because the previous argumentation based on geometrical intuition could hardly work for arbitrary scalars. The basic tool is the determinant expansion using one or more of the rows or columns which we have seen in simplest case of single rows or columns in 2.2.4.
We will also need a little technical preparation. The reader who is not fond of too much abstraction can skip these paragraphs and note only the statement of the Laplace theorem and its corollaries.
Notice also, the claims (2), (3) and (6) from the theorem 2.2.5 are easily deduced from the Cauchy theorem and the representation of the elementary row transformations as multiplication by suitable matrices (cf. 2.1.8).
2.2.8. Minors of the matrix. When investigating matrices and their properties we often work only with parts of the matrices. Therefore we need some new concepts.
Submatrices and minors
Let $A = (a_{ij})$ be a matrix of the type $m/n$ and let $1 \le i_1 < \dots < i_k \le m$, $1 \le j_1 < \dots < j_l \le n$ be fixed natural numbers. Then the matrix
$$M = \begin{pmatrix} a_{i_1 j_1} & a_{i_1 j_2} & \dots & a_{i_1 j_l} \\ \vdots & & & \vdots \\ a_{i_k j_1} & a_{i_k j_2} & \dots & a_{i_k j_l} \end{pmatrix}$$
of the type $k/l$ is called a submatrix of the matrix $A$ determined by the rows $i_1, \dots, i_k$ and columns $j_1, \dots, j_l$. The remaining $(m - k)$ rows and $(n - l)$ columns determine a matrix $M^*$ of the type $(m - k)/(n - l)$, which is called the complementary submatrix to $M$ in $A$. When $k = l$ we call the determinant $|M|$ the subdeterminant or minor of the order $k$ of the matrix $A$. If $m = n$ and $k = l$, then $M^*$ is also a square matrix and $|M^*|$ is called the minor complement to $|M|$, or the complementary minor of the submatrix $M$ in the matrix $A$. The scalar
$$(-1)^{i_1 + \dots + i_k + j_1 + \dots + j_k} \cdot |M^*|$$
is then called the algebraic complement of the minor $|M|$.
The submatrices formed by the first k rows and columns are called leading principal submatrices, and their determinants are called leading principal minors of the matrix A. If we choose k sequential rows and columns starting with the i-th row, we speak of principal matrices and principal minors.
not $a^4 - a^2 + 1 = 0$. We conclude that $a = \cos\varphi + i \sin\varphi$ where $\varphi = (2k + 1)\pi/6$, $k = 0, 2, 3$, or $5$. □
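2.B.7. Vandermonde determinant. Prove the formula for the Vandermonde determinant, that is, the determinant of the Vandermonde matrix:
$$V_n(x_1, \dots, x_n) = \begin{vmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \\ x_1^2 & x_2^2 & \dots & x_n^2 \\ \vdots & & & \vdots \\ x_1^{n-1} & x_2^{n-1} & \dots & x_n^{n-1} \end{vmatrix} = \prod_{1 \le i < j \le n} (x_j - x_i),$$
where $x_1, \dots, x_n \in \mathbb{R}$ and on the right-hand side of the equation there is the product of all terms $x_j - x_i$ where $j > i$.
Solution. We proceed by induction on $n$. For technical reasons we work with the transposed Vandermonde matrix (it has the same determinant). By subtracting the first row from all other rows and then expanding along the first column we obtain
$$V_n(x_1, x_2, \dots, x_n) = \begin{vmatrix} 1 & x_1 & x_1^2 & \dots & x_1^{n-1} \\ 0 & x_2 - x_1 & x_2^2 - x_1^2 & \dots & x_2^{n-1} - x_1^{n-1} \\ \vdots & & & & \vdots \\ 0 & x_n - x_1 & x_n^2 - x_1^2 & \dots & x_n^{n-1} - x_1^{n-1} \end{vmatrix} = \begin{vmatrix} x_2 - x_1 & x_2^2 - x_1^2 & \dots & x_2^{n-1} - x_1^{n-1} \\ \vdots & & & \vdots \\ x_n - x_1 & x_n^2 - x_1^2 & \dots & x_n^{n-1} - x_1^{n-1} \end{vmatrix}.$$
If we take out $x_{i+1} - x_1$ from the $i$-th row for $i \in \{1, 2, \dots, n - 1\}$, we obtain
$$V_n(x_1, x_2, \dots, x_n) = (x_2 - x_1) \cdots (x_n - x_1) \begin{vmatrix} 1 & x_2 + x_1 & \dots & \sum_{j=0}^{n-2} x_2^{n-j-2} x_1^{j} \\ \vdots & & & \vdots \\ 1 & x_n + x_1 & \dots & \sum_{j=0}^{n-2} x_n^{n-j-2} x_1^{j} \end{vmatrix}.$$
By subtracting from every column (starting with the last and ending with the second) the $x_1$-multiple of the previous column, we obtain
$$\begin{vmatrix} 1 & x_2 + x_1 & \dots & \sum_{j=0}^{n-2} x_2^{n-j-2} x_1^{j} \\ \vdots & & & \vdots \\ 1 & x_n + x_1 & \dots & \sum_{j=0}^{n-2} x_n^{n-j-2} x_1^{j} \end{vmatrix} = \begin{vmatrix} 1 & x_2 & \dots & x_2^{n-2} \\ \vdots & & & \vdots \\ 1 & x_n & \dots & x_n^{n-2} \end{vmatrix}.$$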
In particular, when $k = l = 1$, $m = n$, we call the corresponding algebraic complementary minor the algebraic complement $A_{ij}$ of the element $a_{ij}$ of the matrix $A$, which we met already in 2.2.4.
2.2.9. Laplace determinant expansion. If the principal minor $|M|$ of the matrix $A$ is of the order $k$, then, directly from the definition of the determinant, each of the individual $k!(n - k)!$ terms in the product of $|M|$ with its algebraic complement is a term of $|A|$.
In general, consider a square submatrix $M$, that is, a square matrix given by the rows $i_1 < i_2 < \dots < i_k$ and columns $j_1 < \dots < j_k$. Then using $(i_1 - 1) + \dots + (i_k - k)$ exchanges of neighbouring rows and $(j_1 - 1) + \dots + (j_k - k)$ exchanges of neighbouring columns in $A$ we can transform this submatrix $M$ into a principal submatrix and the complementary matrix gets transformed into its complementary matrix.
The whole matrix $A$ gets transformed into a matrix $B$ satisfying (cf. 2.2.5 and the definition of the determinant)
$$|B| = (-1)^{\alpha} |A|, \quad \text{where } \alpha = \sum_{h=1}^{k} (i_h + j_h) - 2(1 + \dots + k).$$
But $(-1)^{\alpha} = (-1)^{\beta}$ with $\beta = \sum_{h=1}^{k} (i_h + j_h)$. Therefore we have checked:
Proposition. If A is a square matrix of dimension n and \ M is its minor of the order k < n, then the product of any term of\M\ with any term of its algebraic complement is a term in the determinant \A\.
This claim suggests that we could perhaps express the determinant of the matrix by using some products of smaller determinants. We see that $|A|$ contains exactly $n!$ distinct terms, exactly one for each permutation. These terms are mutually distinct as polynomials in the components of a general matrix $A$. If we can show that there are exactly that many mutually distinct expressions from the previous claim, we obtain the determinant $|A|$ as their sum.
It remains to show that the terms of the products $|M| \cdot |M^*|$ contain exactly $n!$ distinct members from $|A|$.
From the chosen $k$ rows we can choose $\binom{n}{k}$ minors $M$ and, using the previous proposition, each of the $k!\,(n-k)!$ terms in the products of $|M|$ with their algebraic complements is a term in $|A|$. But for distinct choices of $M$ we can never obtain the same terms, and the individual terms in $(-1)^{i_1+\dots+i_k+j_1+\dots+j_k} \cdot |M| \cdot |M^*|$ are also mutually distinct. Therefore we have exactly the required number $k!\,(n-k)!\binom{n}{k} = n!$ of terms, and we have proved:
Laplace theorem
Theorem. Let $A = (a_{ij})$ be a square matrix of dimension $n$ over an arbitrary ring of scalars with $k$ rows fixed. Then $|A|$ is the sum of all $\binom{n}{k}$ products $(-1)^{i_1+\dots+i_k+j_1+\dots+j_k} \cdot |M| \cdot |M^*|$ of minors of the order $k$ chosen among the fixed rows with their algebraic complements.
Therefore
\[ V_n(x_1,x_2,\dots,x_n) = (x_2 - x_1)\cdots(x_n - x_1)\, V_{n-1}(x_2,\dots,x_n). \]
The Laplace theorem transforms the computation of $|A|$ into the computation of determinants of lower dimension. This method of computation is called the Laplace expansion along the chosen rows (or columns). For instance, the expansion along the $i$-th row or the $j$-th column is:
Because it is clear that
\[ V_2(x_{n-1}, x_n) = x_n - x_{n-1}, \]
it follows by induction that
\[ V_n(x_1,x_2,\dots,x_n) = \prod_{1\le i<j\le n} (x_j - x_i). \]
Note that the determinant is non-zero whenever the numbers $x_1,\dots,x_n$ are mutually distinct. □
Remark. Another (more beautiful?) proof of the formula can be found in 5.1.5.
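The product formula can also be checked numerically. The following is only a sketch, not part of the proof: the helper names det, vandermonde and product_formula are chosen here for illustration, and the determinant is computed by exact Gaussian elimination over rational numbers.

```python
from fractions import Fraction
from itertools import combinations

def det(m):
    """Determinant by Gaussian elimination over exact rationals."""
    a = [[Fraction(x) for x in row] for row in m]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        # look for a row with a non-zero entry in this column
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # exchanging two rows changes the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def vandermonde(xs):
    """Vandermonde matrix: the row with index i holds the i-th powers."""
    return [[x ** i for x in xs] for i in range(len(xs))]

def product_formula(xs):
    """Product of (x_j - x_i) over all pairs i < j."""
    p = 1
    for i, j in combinations(range(len(xs)), 2):
        p *= xs[j] - xs[i]
    return p

xs = [2, -1, 5, 3]
print(det(vandermonde(xs)), product_formula(xs))  # the two values agree
```

Repeating one of the numbers, for instance xs = [1, 1, 2], makes both sides vanish, in agreement with the note above.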
2.B.8. Find whether or not the matrix
\[
\begin{pmatrix}
3 & 2 & -1 & 2 \\
4 & 1 & 2 & -4 \\
-2 & 2 & 4 & 1 \\
2 & 3 & -4 & 8
\end{pmatrix}
\]
is invertible.
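Solution. The matrix is invertible (that is, there is an inverse matrix) if and only if we can transform it by elementary row transformations into the unit matrix. This is equivalent, for instance, to the property that it has a non-zero determinant. The determinant can be computed using the Laplace theorem (2.2.9), for instance by expanding along the first row:
\[
\begin{vmatrix}
3 & 2 & -1 & 2 \\
4 & 1 & 2 & -4 \\
-2 & 2 & 4 & 1 \\
2 & 3 & -4 & 8
\end{vmatrix}
= 3 \cdot
\begin{vmatrix} 1 & 2 & -4 \\ 2 & 4 & 1 \\ 3 & -4 & 8 \end{vmatrix}
- 2 \cdot
\begin{vmatrix} 4 & 2 & -4 \\ -2 & 4 & 1 \\ 2 & -4 & 8 \end{vmatrix}
+ (-1) \cdot
\begin{vmatrix} 4 & 1 & -4 \\ -2 & 2 & 1 \\ 2 & 3 & 8 \end{vmatrix}
- 2 \cdot
\begin{vmatrix} 4 & 1 & 2 \\ -2 & 2 & 4 \\ 2 & 3 & -4 \end{vmatrix}
\]
\[ = 3 \cdot 90 - 2 \cdot 180 + (-1) \cdot 110 - 2 \cdot (-100) = 0, \]
that is, the given matrix is not invertible.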
□
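The expansion along the first row used in 2.B.8 can be sketched as a short recursive procedure. This is an illustration only (the names minor and det_laplace are chosen here); for larger matrices the recursion is very slow compared with Gaussian elimination.

```python
def minor(m, i, j):
    """The matrix m without row i and column j (indices start at 0)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det_laplace(m):
    """Determinant by the Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det_laplace(minor(m, 0, j))
               for j in range(len(m)))

A = [[3, 2, -1, 2],
     [4, 1, 2, -4],
     [-2, 2, 4, 1],
     [2, 3, -4, 8]]

# the four determinants of order three that appear in the expansion
minors = [det_laplace(minor(A, 0, j)) for j in range(4)]
print(minors)
print(det_laplace(A))  # a zero determinant means A is not invertible
```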
2.B.9. Solve the system from 2.A.2 using the Cramer rule (see 2.2.6).
\[ |A| = \sum_{j=1}^{n} a_{ij} A_{ij} = \sum_{i=1}^{n} a_{ij} A_{ij}, \]
where $A_{ij}$ are the algebraic complements of the elements $a_{ij}$ (that is, of the minors of order one), as deduced in 2.2.4 already.
In practical computations, it is often efficient to combine the Laplace expansion with a direct method of Gaussian elimination.
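2.2.10. Proof of the Cauchy theorem. The theorem is based on a clever but elementary application of the Laplace theorem. We just use the Laplace expansion twice on a particular arrangement of a well chosen matrix. Consider first the following matrix $H$ of dimension $2n$ (we are using the so-called block symbolics, that is, we write the matrix as if composed of the (sub)matrices $A$, $B$, and so on).
\[
H = \begin{pmatrix} A & 0 \\ -E & B \end{pmatrix}
= \begin{pmatrix}
a_{11} & \cdots & a_{1n} & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{n1} & \cdots & a_{nn} & 0 & \cdots & 0 \\
-1 & & 0 & b_{11} & \cdots & b_{1n} \\
 & \ddots & & \vdots & & \vdots \\
0 & & -1 & b_{n1} & \cdots & b_{nn}
\end{pmatrix}
\]
The Laplace expansion along the first $n$ rows gives
\[ |H| = |A| \cdot |B|. \]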
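Now, in sequence, we add linear combinations of the first $n$ columns to the last $n$ columns in order to obtain a matrix with zeros in the bottom right corner. We obtain
\[
K = \begin{pmatrix}
a_{11} & \cdots & a_{1n} & c_{11} & \cdots & c_{1n} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{n1} & \cdots & a_{nn} & c_{n1} & \cdots & c_{nn} \\
-1 & & 0 & 0 & \cdots & 0 \\
 & \ddots & & \vdots & & \vdots \\
0 & & -1 & 0 & \cdots & 0
\end{pmatrix}
\]
The elements of the submatrix in the top right part must satisfy
\[ c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \dots + a_{in} b_{nj}, \]
that is, they are exactly the components of the product $A \cdot B$, and $|K| = |H|$. The expansion along the last $n$ columns gives us
\[ |K| = (-1)^{n} (-1)^{1+\dots+2n} |A \cdot B| = (-1)^{2n(n+1)} \cdot |A \cdot B| = |A \cdot B|. \]
This proves the Cauchy theorem.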
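The construction in the proof can be tried out on a small example. The sketch below uses illustrative names (block_h, matmul and a recursive det) and two arbitrarily chosen matrices of dimension 2; it builds the block matrix H and compares its determinant with the two sides of the Cauchy theorem.

```python
def minor(m, i, j):
    """The matrix m without row i and column j (indices start at 0)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by the Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block_h(a, b):
    """The block matrix H with A and 0 on top, -E and B below."""
    n = len(a)
    top = [a[i] + [0] * n for i in range(n)]
    bottom = [[-1 if j == i else 0 for j in range(n)] + b[i] for i in range(n)]
    return top + bottom

A = [[1, 2], [3, 4]]   # sample data, chosen only for this illustration
B = [[0, 1], [5, -2]]
print(det(block_h(A, B)), det(A) * det(B), det(matmul(A, B)))
```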
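Solution. We just plug in the values to the rule:
\[
x_1 = \frac{\begin{vmatrix} 2 & 2 & 3 \\ -3 & -3 & -1 \\ -3 & 1 & 2 \end{vmatrix}}
{\begin{vmatrix} 1 & 2 & 3 \\ 2 & -3 & -1 \\ -3 & 1 & 2 \end{vmatrix}} = 1, \quad
x_2 = \frac{\begin{vmatrix} 1 & 2 & 3 \\ 2 & -3 & -1 \\ -3 & -3 & 2 \end{vmatrix}}
{\begin{vmatrix} 1 & 2 & 3 \\ 2 & -3 & -1 \\ -3 & 1 & 2 \end{vmatrix}} = 2, \quad
x_3 = \frac{\begin{vmatrix} 1 & 2 & 2 \\ 2 & -3 & -3 \\ -3 & 1 & -3 \end{vmatrix}}
{\begin{vmatrix} 1 & 2 & 3 \\ 2 & -3 & -1 \\ -3 & 1 & 2 \end{vmatrix}} = -1.
\]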
□
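The same computation can be sketched in a few lines of code. The matrix and the right-hand side below are read off the determinants in the solution above (the statement of 2.A.2 itself is not repeated here), and the names det3 and cramer3 are illustrative only.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a matrix of dimension 3 by the rule of Sarrus."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * e * i + b * f * g + c * d * h - c * e * g - b * d * i - a * f * h

def cramer3(A, b):
    """Solve A x = b by the Cramer rule; A must have a non-zero determinant."""
    d = det3(A)
    if d == 0:
        raise ValueError("the Cramer rule needs a non-zero determinant")
    solution = []
    for col in range(3):
        # exchange the column with index col of A for the column b
        Ai = [row[:col] + [b[r]] + row[col + 1:] for r, row in enumerate(A)]
        solution.append(Fraction(det3(Ai), d))
    return solution

A = [[1, 2, 3], [2, -3, -1], [-3, 1, 2]]
b = [2, -3, -3]
print(det3(A))
print(cramer3(A, b))
```

Exact fractions are used so that the quotients of determinants are not rounded.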
2.B.10. Find the algebraically adjoint matrix and the inverse of the matrix
\[
A = \begin{pmatrix}
1 & 0 & 2 & 0 \\
0 & 3 & 0 & 4 \\
5 & 0 & 6 & 0 \\
0 & 7 & 0 & 8
\end{pmatrix}.
\]
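Solution. The adjoint matrix is
\[
A^* = \begin{pmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{23} & A_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{pmatrix}^{T},
\]
where $A_{ij}$ is the algebraic complement of the element $a_{ij}$ of the matrix $A$, that is, the product of the number $(-1)^{i+j}$ and the determinant of the matrix obtained from $A$ by deleting the $i$-th row and the $j$-th column. We have
\[
A_{11} = \begin{vmatrix} 3 & 0 & 4 \\ 0 & 6 & 0 \\ 7 & 0 & 8 \end{vmatrix} = -24, \quad
A_{12} = -\begin{vmatrix} 0 & 0 & 4 \\ 5 & 6 & 0 \\ 0 & 0 & 8 \end{vmatrix} = 0,
\]
\[
A_{13} = \begin{vmatrix} 0 & 3 & 4 \\ 5 & 0 & 0 \\ 0 & 7 & 8 \end{vmatrix} = 20, \quad
A_{14} = -\begin{vmatrix} 0 & 3 & 0 \\ 5 & 0 & 6 \\ 0 & 7 & 0 \end{vmatrix} = 0,
\]
\[
A_{21} = -\begin{vmatrix} 0 & 2 & 0 \\ 0 & 6 & 0 \\ 7 & 0 & 8 \end{vmatrix} = 0, \quad
A_{22} = \begin{vmatrix} 1 & 2 & 0 \\ 5 & 6 & 0 \\ 0 & 0 & 8 \end{vmatrix} = -32,
\]
\[
A_{23} = -\begin{vmatrix} 1 & 0 & 0 \\ 5 & 0 & 0 \\ 0 & 7 & 8 \end{vmatrix} = 0, \quad
A_{24} = \begin{vmatrix} 1 & 0 & 2 \\ 5 & 0 & 6 \\ 0 & 7 & 0 \end{vmatrix} = 28,
\]
\[
A_{31} = \begin{vmatrix} 0 & 2 & 0 \\ 3 & 0 & 4 \\ 7 & 0 & 8 \end{vmatrix} = 8, \quad
A_{32} = -\begin{vmatrix} 1 & 2 & 0 \\ 0 & 0 & 4 \\ 0 & 0 & 8 \end{vmatrix} = 0,
\]
\[
A_{33} = \begin{vmatrix} 1 & 0 & 0 \\ 0 & 3 & 4 \\ 0 & 7 & 8 \end{vmatrix} = -4, \quad
A_{34} = -\begin{vmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \\ 0 & 7 & 0 \end{vmatrix} = 0,
\]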
2.2.11. Determinant and the inverse matrix. Assume first that there is an inverse matrix of the matrix $A$, that is, $A \cdot A^{-1} = E$. Since the unit matrix always satisfies $|E| = 1$, it follows that for every invertible matrix its determinant is an invertible scalar, and by the Cauchy theorem we have $|A^{-1}| = |A|^{-1}$.
But we can say more, combining the Laplace and Cauchy theorems.
Inverse matrix determinant formula
For any square matrix $A = (a_{ij})$ of dimension $n$ we define a matrix $A^* = (a^*_{ij})$, where $a^*_{ij} = A_{ji}$ are the algebraic complements of the elements $a_{ji}$ in $A$. The matrix $A^*$ is called the algebraically adjoint matrix of the matrix $A$.
Theorem. For every square matrix $A$ over a ring of scalars $K$ we have that
(1) $A A^* = A^* A = |A| \cdot E$.
In particular,
(i) $A^{-1}$ exists as a matrix over the ring of scalars $K$ if and only if $|A|^{-1}$ exists in $K$.
(ii) If $A^{-1}$ exists, then $A^{-1} = |A|^{-1} \cdot A^*$.
Proof. As already mentioned, the Cauchy theorem shows that the existence of $A^{-1}$ implies the invertibility of $|A| \in K$.
For an arbitrary square matrix $A$ we can directly compute $A \cdot A^* = (c_{ij})$, where
\[ c_{ij} = \sum_{k=1}^{n} a_{ik} a^*_{kj} = \sum_{k=1}^{n} a_{ik} A_{jk}. \]
If $i = j$, it is exactly the Laplace expansion of $|A|$ along the $i$-th row.
If $i \neq j$, then we may imagine that we expand the determinant along the $j$-th row, but plug in the values of the $i$-th row instead of the $a_{jk}$'s. This is the expansion of the determinant of a matrix in which the $i$-th and $j$-th rows are the same, therefore
\[ c_{ij} = 0. \]
This implies that $A \cdot A^* = |A| \cdot E$, and we have proven one of the equalities (1). In particular, if $|A|^{-1}$ exists, then $A \cdot (|A|^{-1} A^*) = E$.
If $|A|$ is an invertible scalar, we may repeat the previous computation for $A^* \cdot A$, and we obtain $(|A|^{-1} A^*) \cdot A = E$. Therefore our computation really gives the inverse matrix of $A$, as claimed in the theorem. □
Notice that for fields of scalars we have already proved that the right inverse of a matrix is automatically the left inverse and thus the inverse, too. Here we have obtained the same result for all rings of scalars, together with a strong and effective existence condition. On the other hand the exact formula for the inverse has become rather theoretical with little practical value.
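\[
A_{41} = -\begin{vmatrix} 0 & 2 & 0 \\ 3 & 0 & 4 \\ 0 & 6 & 0 \end{vmatrix} = 0, \quad
A_{42} = \begin{vmatrix} 1 & 2 & 0 \\ 0 & 0 & 4 \\ 5 & 6 & 0 \end{vmatrix} = 16,
\]
\[
A_{43} = -\begin{vmatrix} 1 & 0 & 0 \\ 0 & 3 & 4 \\ 5 & 0 & 0 \end{vmatrix} = 0, \quad
A_{44} = \begin{vmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \\ 5 & 0 & 6 \end{vmatrix} = -12.
\]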
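By substitution we obtain
\[
A^* = \begin{pmatrix}
-24 & 0 & 20 & 0 \\
0 & -32 & 0 & 28 \\
8 & 0 & -4 & 0 \\
0 & 16 & 0 & -12
\end{pmatrix}^{T}
= \begin{pmatrix}
-24 & 0 & 8 & 0 \\
0 & -32 & 0 & 16 \\
20 & 0 & -4 & 0 \\
0 & 28 & 0 & -12
\end{pmatrix}.
\]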
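We compute the inverse matrix $A^{-1}$ from the relation $A^{-1} = |A|^{-1} \cdot A^*$. The determinant of the matrix $A$ is (expanding along the first row) equal to
\[
|A| = \begin{vmatrix}
1 & 0 & 2 & 0 \\
0 & 3 & 0 & 4 \\
5 & 0 & 6 & 0 \\
0 & 7 & 0 & 8
\end{vmatrix}
= \begin{vmatrix} 3 & 0 & 4 \\ 0 & 6 & 0 \\ 7 & 0 & 8 \end{vmatrix}
+ 2 \begin{vmatrix} 0 & 3 & 4 \\ 5 & 0 & 0 \\ 0 & 7 & 8 \end{vmatrix}
= 16.
\]
By substitution, we obtain
\[
A^{-1} = \begin{pmatrix}
-3/2 & 0 & 1/2 & 0 \\
0 & -2 & 0 & 1 \\
5/4 & 0 & -1/4 & 0 \\
0 & 7/4 & 0 & -3/4
\end{pmatrix}.
\]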
□
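The result of 2.B.10 can be reproduced by a direct implementation of the formula for the inverse through the algebraically adjoint matrix. This is a sketch with illustrative names (adjugate, inverse); it is meant for checking small examples, not as an efficient method.

```python
from fractions import Fraction

def minor(m, i, j):
    """The matrix m without row i and column j (indices start at 0)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by the Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def adjugate(m):
    """Algebraically adjoint matrix: the entry (i, j) is the complement A_ji."""
    n = len(m)
    return [[(-1) ** (i + j) * det(minor(m, j, i)) for j in range(n)]
            for i in range(n)]

def inverse(m):
    """Inverse matrix as the adjoint matrix divided by the determinant."""
    d = det(m)
    if d == 0:
        raise ValueError("the matrix is not invertible")
    return [[Fraction(x, d) for x in row] for row in adjugate(m)]

A = [[1, 0, 2, 0],
     [0, 3, 0, 4],
     [5, 0, 6, 0],
     [0, 7, 0, 8]]
print(det(A))
print(adjugate(A))
print(inverse(A))
```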
C. Vector spaces, examples
Typical properties of vector spaces (met already in the plane or in the three-dimensional space) can be observed in many other situations. We illustrate this by examples.
2.C.1. Vector space - yes or no? Decide whether the following sets form a vector space over the field of real numbers:
i) The set of solutions of the system
\[
\begin{aligned}
x_1 + x_2 + \dots + x_{98} + x_{99} + x_{100} &= 100 x_1, \\
x_1 + x_2 + \dots + x_{98} + x_{99} &= 99 x_1, \\
x_1 + x_2 + \dots + x_{98} &= 98 x_1, \\
&\;\;\vdots \\
x_1 + x_2 &= 2 x_1.
\end{aligned}
\]
As a direct corollary of this theorem we can once again prove the Cramer rule for solving systems of linear equations, see 2.2.6. Indeed, for the solution of the system $A \cdot x = b$ we just need to read in the equation
\[ x = A^{-1} \cdot b = |A|^{-1} A^* \cdot b \]
the individual components of the expression $A^* \cdot b$ as the Laplace expansions of the determinant of the matrix which arose from $A$ through the exchange of the $i$-th column of $A$ for the column $b$.
3. Vector spaces and linear mappings
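2.3.1. Abstract vector spaces. Let us go back for a while to the systems of $m$ linear equations in $n$ variables from 2.1.3. Further, let us assume that the system is the homogeneous system $A \cdot x = 0$, that is,
\[
\begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}
= \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}.
\]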
By the distributivity of the matrix multiplication it is clear that the sum of two solutions $x = (x_1,\dots,x_n)$ and $y = (y_1,\dots,y_n)$ satisfies
\[ A \cdot (x + y) = A \cdot x + A \cdot y = 0 \]
and thus is also a solution. Similarly, a scalar multiple $a \cdot x$ is also a solution. The set of all solutions of a fixed system of equations is therefore closed under vector addition and scalar multiplication. These are the basic properties of vectors of dimension $n$ in $K^n$, see 2.1.1. Now we have the vectors in the solution space with $n$ coordinates. The "dimension" of this space is given by the difference of the number of variables and the rank of the matrix $A$. Thus we can easily deal with the solution of a system of 1000 equations in 1000 variables and need only one or two free parameters. Then the whole solution space will behave as a plane or a line, as we have already seen in 1.5.3 on page 30, although the vectors themselves are given by so many components.
We go further. Already in paragraph 1.2.1 we have encountered an interesting example of a space of all solutions of a homogeneous linear difference equation of first order. All solutions have been obtained from a single one by scalar multiplication and are also closed under addition and scalar multiples. These "vectors" of solutions are infinite sequences of numbers, although we intuitively expect that the "dimension" of the whole space of solutions should be one. We shall understand such phenomena with the help of a more general definition of vector space and its dimension.
ii) The set of solutions of the equation
\[ x_1 + x_2 + \dots + x_{100} = 0. \]
iii) The set of solutions of the equation
\[ x_1 + 2x_2 + 3x_3 + \dots + 100 x_{100} = 1. \]
iv) The set of all real (or complex) sequences. (A real or complex sequence is a mapping $f\colon \mathbb{N} \to \mathbb{R}$ or $f\colon \mathbb{N} \to \mathbb{C}$. The image of the number $n$ is then called the $n$-th member of the sequence; we usually denote it by a lower index, say $a_n$.)
v) The set of solutions of a homogeneous difference equation.
vi) The set of solutions of a non-homogeneous difference equation.
vii) $\{f\colon \mathbb{R} \to \mathbb{R} \mid f(1) = f(2) = c\}$, where $c \in \mathbb{R}$ is a fixed number.
Solution. We check the properties of a vector space, see 2.3.1. Actually, all we have to do is to check whether the given sets are closed under linear combinations of their elements. Then all the axioms of a vector space are satisfied.
i) Yes. The solutions are exactly the real multiples of the vector $(1,1,\dots,1)$ (with 100 ones). A sum of two multiples of the same vector is again a multiple of that vector. The opposite vector is again a multiple of the vector, and all other axioms are trivially satisfied. By the way, the solution space is thus a vector space of dimension 1, see also 2.3.7.
ii) Yes. It is a space of dimension 99 (this corresponds to the number of free parameters of the solution). In general, the set of all solutions of any system of homogeneous linear equations forms a vector space.
iii) No. For instance, taking twice the solution $x_1 = 1$, $x_i = 0$, $i = 2,\dots,100$, we do not obtain a solution. But the set of solutions forms an affine space (see 4.1.1).
iv) Yes. The set of all real or complex sequences clearly forms a real (complex) vector space. Addition of sequences and multiplication by scalars are defined term-wise, and the individual terms lie in the vector space of all real (complex) numbers.
v) Yes. In order to show that the set of sequences which satisfy a given homogeneous difference equation forms a vector space, it is enough to show that it is closed under addition and multiplication by real numbers (as the set of all real sequences is a vector space, as we know). Consider two sequences $(x_j)_{j=0}^{\infty}$
Vector space definition
A vector space V over a field of scalars K is a set where we define the operations
• addition, which satisfies the axioms (CG1)-(CG4) from the paragraph 1.1.1 on the page 5,
• scalar multiplication, for which the axioms (V1)-(V4) from the paragraph 2.1.1 on the page 72 hold.
Recall our simple notational convention: scalars are usually denoted by letters from the beginning of the alphabet, that is, $a, b, c, \dots$, while for vectors we shall use letters from the end, that is, $u, v, w, x, y, z$. Usually, $x, y, z$ will denote $n$-tuples of scalars. For completeness, the letters from the centre of the alphabet, for instance $k$, $\ell$, will mostly denote indices.
In order to gain some practice in the formal approach, we check some simple properties of vectors. These are trivial for $n$-tuples of scalars, but not so evident for general vectors in our new abstract sense.
2.3.2. Proposition. Let $V$ be a vector space over a field of scalars $K$. Suppose $a, b, a_i \in K$ and $u, v, u_j \in V$. Then
(1) $a \cdot u = 0$ if and only if $a = 0$ or $u = 0$,
(2) $(-1) \cdot u = -u$,
(3) $a \cdot (u - v) = a \cdot u - a \cdot v$,
(4) $(a - b) \cdot u = a \cdot u - b \cdot u$,
(5) $\bigl(\sum_{i=1}^{n} a_i\bigr) \cdot \bigl(\sum_{j=1}^{m} u_j\bigr) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i \cdot u_j$.
Proof. We can expand
\[ (a + 0) \cdot u = a \cdot u + 0 \cdot u = a \cdot u, \]
which, according to the axiom (CG4), implies $0 \cdot u = 0$. Now
\[ u + (-1) \cdot u \overset{(V2)}{=} (1 + (-1)) \cdot u = 0 \cdot u = 0 \]
and thus $-u = (-1) \cdot u$. Further,
\[ a \cdot (u + (-1) \cdot v) \overset{(V2,V3)}{=} a \cdot u + (-a) \cdot v = a \cdot u - a \cdot v, \]
which proves (3). It follows that
\[ (a - b) \cdot u \overset{(V2,V3)}{=} a \cdot u + (-b) \cdot u = a \cdot u - b \cdot u, \]
which proves (4). Property (5) follows using induction with (V2) and (V1).
It remains to prove (1): $a \cdot 0 = a \cdot (u - u) = a \cdot u - a \cdot u = 0$, which along with the first derived proposition in this proof proves one implication. For the other implication, we use an axiom for the field of scalars and the axiom (V4) for vector spaces: if $p \cdot u = 0$ and $p \neq 0$, then $u = 1 \cdot u = (p^{-1} \cdot p) \cdot u = p^{-1} \cdot 0 = 0$. □
2.3.3. Linear (in)dependence. In paragraph 2.1.11 we worked with linear combinations of rows of a matrix. With vectors we work analogously:
and $(y_j)_{j=0}^{\infty}$ satisfying the given equation, that is,
\[
\begin{aligned}
a_n x_{n+k} + a_{n-1} x_{n+k-1} + \dots + a_0 x_k &= 0, \\
a_n y_{n+k} + a_{n-1} y_{n+k-1} + \dots + a_0 y_k &= 0.
\end{aligned}
\]
By adding these equations, we obtain
\[ a_n (x_{n+k} + y_{n+k}) + a_{n-1} (x_{n+k-1} + y_{n+k-1}) + \dots + a_0 (x_k + y_k) = 0, \]
therefore also the sequence $(x_j + y_j)_{j=0}^{\infty}$ satisfies the given equation. Analogously, if the sequence $(x_j)_{j=0}^{\infty}$ satisfies the given equation, then so does $(u x_j)_{j=0}^{\infty}$, where $u \in \mathbb{R}$.
vi) No. The sum of two solutions of a non-homogeneous equation
\[
\begin{aligned}
a_n x_{n+k} + a_{n-1} x_{n+k-1} + \dots + a_0 x_k &= c, \\
a_n y_{n+k} + a_{n-1} y_{n+k-1} + \dots + a_0 y_k &= c, \quad c \in \mathbb{R} \setminus \{0\},
\end{aligned}
\]
satisfies the equation
\[ a_n (x_{n+k} + y_{n+k}) + a_{n-1} (x_{n+k-1} + y_{n+k-1}) + \dots + a_0 (x_k + y_k) = 2c, \]
that is, it does not satisfy the original non-homogeneous equation. But the set of solutions forms an affine space, see 4.1.1.
vii) It is a vector space if and only if $c = 0$. If we take two functions $f$ and $g$ from the given set, then $(f + g)(1) = (f + g)(2) = f(1) + g(1) = 2c$. Thus if $f + g$ is to be a member of the given set, it must be that $(f + g)(1) = c$, therefore $2c = c$, hence $c = 0$. Conversely, for $c = 0$ the set is clearly closed under linear combinations.
□
2.C.2. Find out whether the set
\[ U_1 = \{(x_1, x_2, x_3) \in \mathbb{R}^3;\ |x_1| = |x_2| = |x_3|\} \]
is a subspace of the vector space $\mathbb{R}^3$ and whether the set
\[ U_2 = \{a x^2 + c;\ a, c \in \mathbb{R}\} \]
is a subspace of the space of polynomials of degree at most 2.
Solution. The only property we have to check is whether the given subset is closed under linear combinations of its vectors; then it forms a vector space. The set $U_1$ is not a vector (sub)space. We can see that, for instance,
\[ (1,1,1) + (-1,1,1) = (0,2,2) \notin U_1. \]
Linear combination and independence
An expression of the form $a_1 v_1 + \dots + a_k v_k$ is called a linear combination of vectors $v_1, \dots, v_k \in V$.
A finite sequence of vectors $v_1, \dots, v_k$ is called linearly independent, if the only zero linear combination is the one with all coefficients zero. That is, for any scalars $a_1, \dots, a_k \in K$, $a_1 v_1 + \dots + a_k v_k = 0$ implies $a_1 = a_2 = \dots = a_k = 0$. It is clear that for an independent sequence of vectors, all vectors are mutually distinct and nonzero.
The set of vectors $M \subset V$ in a vector space $V$ over $K$ is called linearly independent, if every finite $k$-tuple of vectors $v_1, \dots, v_k \in M$ is linearly independent.
The set of vectors $M$ is linearly dependent, if it is not linearly independent.
A nonempty subset $M$ of vectors in a vector space over a field of scalars $K$ is dependent if and only if one of its vectors can be expressed as a finite linear combination using other vectors in $M$. This follows directly from the definition. At least one of the coefficients in the corresponding linear combination must be nonzero, and since we are over a field of scalars, we can multiply the whole combination by the inverse of this nonzero coefficient and thus express its corresponding vector as a linear combination of the others.
Every subset of a linearly independent set $M$ is clearly also linearly independent (we require the same conditions on a smaller set of vectors). Similarly, we can see that $M \subset V$ is linearly independent if and only if every finite subset of $M$ is linearly independent.
2.3.4. Generators and subspaces. A subset $M \subset V$ is called a vector subspace if it forms, together with the restricted operations of addition and scalar multiplication, a vector space. That is, we require
\[ \forall a, b \in K,\ \forall v, w \in M, \quad a \cdot v + b \cdot w \in M. \]
We investigate a couple of cases: The space of $m$-tuples of scalars $\mathbb{R}^m$ with coordinate-wise addition and multiplication is a vector space over $\mathbb{R}$, but also a vector space over $\mathbb{Q}$. For instance for $m = 2$, the vectors $(1,0), (0,1) \in \mathbb{R}^2$ are linearly independent, because from
\[ a \cdot (1,0) + b \cdot (0,1) = (0,0) \]
follows $a = b = 0$. Further, the vectors $(1,0), (\sqrt{2},0) \in \mathbb{R}^2$ are linearly dependent over $\mathbb{R}$, because $\sqrt{2} \cdot (1,0) = (\sqrt{2},0)$, but over $\mathbb{Q}$ they are linearly independent! Over $\mathbb{R}$ these two vectors "generate" a one-dimensional subspace, while over $\mathbb{Q}$ the subspace is "larger".
Polynomials with real coefficients and of degree at most $m$ form a vector space $\mathbb{R}_m[x]$. We can consider the polynomials as mappings $f\colon \mathbb{R} \to \mathbb{R}$ and define the addition and scalar multiplication like this: $(f + g)(x) = f(x) + g(x)$, $(a \cdot f)(x) = a \cdot f(x)$.
The set $U_2$ is a subspace (there is a clear identification with $\mathbb{R}^2$), because
\[ (a_1 x^2 + c_1) + (a_2 x^2 + c_2) = (a_1 + a_2) x^2 + (c_1 + c_2), \]
\[ k \cdot (a x^2 + c) = (ka) x^2 + kc \]
for all numbers $a_1, c_1, a_2, c_2, a, c, k \in \mathbb{R}$. □
D. Linear (in)dependence
2.D.1. Determine whether or not the vectors $(1,2,3,1)$, $(1,0,-1,1)$, $(2,1,-1,3)$ and $(0,0,3,2)$ are linearly independent.
Solution. Because
\[
\begin{vmatrix}
1 & 2 & 3 & 1 \\
1 & 0 & -1 & 1 \\
2 & 1 & -1 & 3 \\
0 & 0 & 3 & 2
\end{vmatrix}
= 10 \neq 0,
\]
the given vectors are linearly independent.
□
2.D.2. Given arbitrary linearly independent vectors u, v, w, z in a vector space V, decide whether or not in V the vectors
\[ u - 2v, \quad 3u + w - z, \quad u - 4v + w + 2z, \quad 4v + 8w + 4z \]
are linearly independent.
Solution. The considered vectors are linearly independent if and only if the vectors $(1,-2,0,0)$, $(3,0,1,-1)$, $(1,-4,1,2)$, $(0,4,8,4)$ are linearly independent in $\mathbb{R}^4$. We have
\[
\begin{vmatrix}
1 & -2 & 0 & 0 \\
3 & 0 & 1 & -1 \\
1 & -4 & 1 & 2 \\
0 & 4 & 8 & 4
\end{vmatrix}
= -36 \neq 0,
\]
thus the vectors are linearly independent.
□
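Instead of a determinant, linear independence can also be tested through the rank of the matrix whose rows are the given vectors; this works for any number of vectors, not only for square matrices. The following is a minimal sketch (the names rank and independent are chosen here), using exact rational arithmetic, applied to the vectors from 2.D.1 and 2.D.2.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (given as a list of rows) by Gaussian elimination."""
    a = [[Fraction(x) for x in row] for row in rows]
    r = 0  # number of pivots found so far
    for col in range(len(a[0])):
        pivot = next((i for i in range(r, len(a)) if a[i][col] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, len(a)):
            factor = a[i][col] / a[r][col]
            a[i] = [x - factor * y for x, y in zip(a[i], a[r])]
        r += 1
        if r == len(a):
            break
    return r

def independent(vectors):
    """Vectors are independent iff the rank equals their number."""
    return rank(vectors) == len(vectors)

d1 = [(1, 2, 3, 1), (1, 0, -1, 1), (2, 1, -1, 3), (0, 0, 3, 2)]
d2 = [(1, -2, 0, 0), (3, 0, 1, -1), (1, -4, 1, 2), (0, 4, 8, 4)]
print(independent(d1), independent(d2))
print(independent([(1, 2, 3), (2, 4, 6)]))  # the second vector is twice the first
```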
2.D.3. The vectors
\[ (1,2,1), \quad (-1,1,0), \quad (0,1,1) \]
are linearly independent, and therefore together form a basis of $\mathbb{R}^3$ (for a basis it is important to give an order of the vectors). Every three-dimensional vector is therefore some linear combination of them. What linear combination corresponds to the vector $(1,1,1)$, or equivalently, what are the coordinates of the vector $(1,1,1)$ in the basis formed by the given vectors?
Solution. We seek $a, b, c \in \mathbb{R}$ such that $a(1,2,1) + b(-1,1,0) + c(0,1,1) = (1,1,1)$. The equation must hold
Polynomials of all degrees also form a vector space $\mathbb{R}[x]$ (or $\mathbb{R}_\infty[x]$), and $\mathbb{R}_m[x] \subset \mathbb{R}_n[x]$ is a vector subspace for any $m < n \le \infty$. Further examples of subspaces are given by all even polynomials or all odd polynomials, that is, polynomials satisfying $f(-x) = \pm f(x)$.
In complete analogy with polynomials, we can define a vector space structure on the set of all mappings $\mathbb{R} \to \mathbb{R}$, or of all mappings $M \to V$ of an arbitrary fixed set $M$ into the vector space $V$.
Because the condition in the definition of subspace consists only of universal quantifiers, the intersection of subspaces is still a subspace. We can see this also directly: Let $W_i$, $i \in I$, be vector subspaces in $V$, $a, b \in K$, $u, v \in \bigcap_{i \in I} W_i$. Then $a \cdot u + b \cdot v \in W_i$ for all $i \in I$. Hence $a \cdot u + b \cdot v \in \bigcap_{i \in I} W_i$.
It can be noted that the intersection of all subspaces $W \subset V$ that contain some given set of vectors $M \subset V$ is a subspace. It is called $\operatorname{span} M$.
We say that a set $M$ generates the subspace $\operatorname{span} M$, or that the elements of $M$ are generators of the subspace $\operatorname{span} M$.
We formulate a few simple claims about subspace generation:
Proposition. For every nonempty set $M \subset V$, we have
(1) $\operatorname{span} M = \{a_1 \cdot u_1 + \dots + a_k \cdot u_k;\ k \in \mathbb{N},\ a_j \in K,\ u_j \in M,\ j = 1, \dots, k\}$;
(2) $M = \operatorname{span} M$ if and only if $M$ is a vector subspace;
(3) if $N \subset M$ then $\operatorname{span} N \subset \operatorname{span} M$ is a vector subspace; the subspace $\operatorname{span} \emptyset$ generated by the empty subset is the trivial subspace $\{0\} \subset V$.
Proof. (1) The set of all linear combinations
\[ a_1 u_1 + \dots + a_k u_k \]
on the right-hand side of (1) is clearly a vector subspace and of course it contains $M$. On the other hand, each of the linear combinations must be in $\operatorname{span} M$ and thus the first claim is proved.
Claim (2) follows immediately from claim (1) and from the definition of vector space. Analogously, (1) implies most of the third claim.
Finally, the smallest possible vector subspace is $\{0\}$. Notice that the empty set is contained in every subspace and each of them contains the vector $0$. This proves the last claim. □
Basis and dimension
A subset $M \subset V$ is called a basis of the vector space $V$ if $\operatorname{span} M = V$ and $M$ is linearly independent.
A vector space with a finite basis is called finitely dimensional. The number of elements of the basis is called the dimension of $V$.
If $V$ does not have a finite basis, we say that $V$ is infinitely dimensional. We write $\dim V = k$, $k \in \mathbb{N}$ or $k = \infty$.
In order to be satisfied with such a definition of dimension, we must know that different bases of the same space will
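in every coordinate, so we have a system of three linear equations in three variables:
\[
\begin{aligned}
a - b &= 1, \\
2a + b + c &= 1, \\
a + c &= 1,
\end{aligned}
\]
whose solution gives us $a = \frac{1}{2}$, $b = -\frac{1}{2}$, $c = \frac{1}{2}$, thus we have
\[ (1,1,1) = \tfrac{1}{2} \cdot (1,2,1) - \tfrac{1}{2} \cdot (-1,1,0) + \tfrac{1}{2} \cdot (0,1,1), \]
that is, the coordinates of the vector $(1,1,1)$ in the basis $((1,2,1), (-1,1,0), (0,1,1))$ are $\left(\frac{1}{2}, -\frac{1}{2}, \frac{1}{2}\right)$. □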
2.D.4. Determine all constants $a \in \mathbb{R}$ such that the polynomials $a x^2 + x + 2$, $-2x^2 + a x + 3$ and $x^2 + 2x + a$ are linearly dependent (in the vector space $P_3[x]$ of polynomials of one variable of degree at most three over the real numbers).
Solution. In the basis $x^2, x, 1$ of the polynomials of degree at most two, the coordinates of the given vectors (polynomials) are $(a,1,2)$, $(-2,a,3)$, $(1,2,a)$. The polynomials are linearly dependent if and only if the matrix whose columns are given by the coordinates of the vectors has a rank lower than the number of the vectors. In this case the rank must be two or less. In the case of a square matrix, a rank less than the number of rows means that the determinant is zero. The condition for $a$ thus reads
\[
\begin{vmatrix}
a & -2 & 1 \\
1 & a & 2 \\
2 & 3 & a
\end{vmatrix}
= a^3 - 6a - 5 = 0,
\]
that is, $a$ is a root of the polynomial $a^3 - 6a - 5 = (a + 1)(a^2 - a - 5)$, thus there are 3 such constants $a_1 = -1$, $a_2 = \frac{1 + \sqrt{21}}{2}$, $a_3 = \frac{1 - \sqrt{21}}{2}$. □
2.D.5. Consider the complex numbers $\mathbb{C}$ as a real vector space. Determine the coordinates of the number $2 + i$ in the basis given by the roots of the polynomial $x^2 + x + 1$.
Solution. Because the roots of the given polynomial are $-\frac{1}{2} + i\frac{\sqrt{3}}{2}$ and $-\frac{1}{2} - i\frac{\sqrt{3}}{2}$, we have to determine the coordinates $(a, b)$ of the vector $2 + i$ in the basis $\left(-\frac{1}{2} + i\frac{\sqrt{3}}{2},\ -\frac{1}{2} - i\frac{\sqrt{3}}{2}\right)$. These real numbers $a, b$ are uniquely determined by the condition
\[ a \cdot \left(-\tfrac{1}{2} + i\tfrac{\sqrt{3}}{2}\right) + b \cdot \left(-\tfrac{1}{2} - i\tfrac{\sqrt{3}}{2}\right) = 2 + i. \]
By equating separately the real and the imaginary parts of the equation, we obtain a system of two linear equations in two
always have the same number of elements. We shall show this below. But we note immediately, that the trivial subspace is generated by the empty set, which is an "empty" basis. Thus it has dimension zero.
The linearly independent vectors
\[ e_i = (0, \dots, 1, \dots, 0) \in K^n, \quad i = 1, \dots, n \]
(all zeros, but one value 1 at the $i$-th position) are the most useful example of a basis in the vector space $K^n$. We call it the standard basis of $K^n$.
2.3.5. Linear equations again. It is a good time now to recall the properties of systems of linear equations in terms of abstract vector spaces and their bases. As we have already noted in the introduction to this section (cf. 2.3.1), the set of all solutions of the homogeneous system
\[ A \cdot x = 0 \]
is a vector space. If A is a matrix with m rows and n columns, and the rank of the matrix is k, then using the row echelon transformation (see 2.1.7) to solve the system, we find that the dimension of the space of all solutions is exactly n — k.
Indeed, the left-hand side of the equation can be understood as the linear combination of the columns of $A$ with coefficients given by $x$, and the rank $k$ of the matrix provides the number of linearly independent columns in $A$, thus the dimension of the subspace of all possible linear combinations of the given form. Therefore, after transforming the system into row echelon form, exactly $m - k$ zero rows remain. In the next step, we are left with exactly $n - k$ free parameters. By setting one of them to have value one, while all others are zero, we obtain exactly $n - k$ linearly independent solutions. Then all solutions are given by all the linear combinations of these $n - k$ solutions. Every such $(n - k)$-tuple of solutions is called a fundamental system of solutions of the given homogeneous system of equations. We have proved:
Proposition. The set of all solutions of the homogeneous system of equations
\[ A \cdot x = 0 \]
for $n$ variables with the matrix $A$ of rank $k$ is a vector subspace in $K^n$ of dimension $n - k$. Every basis of this space forms a fundamental system of solutions of the given homogeneous system.
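The construction of a fundamental system described above (one free parameter set to one, the others to zero) can be sketched as follows. The function name null_space_basis and the sample matrix are chosen here for illustration; the matrix is first brought to the reduced row echelon form over exact rationals.

```python
from fractions import Fraction

def null_space_basis(A):
    """A fundamental system of solutions of A x = 0 (one vector per free variable)."""
    a = [[Fraction(x) for x in row] for row in A]
    m, n = len(a), len(a[0])
    pivots = []  # pivot column of each nonzero row of the reduced form
    r = 0
    for col in range(n):
        p = next((i for i in range(r, m) if a[i][col] != 0), None)
        if p is None:
            continue
        a[r], a[p] = a[p], a[r]
        lead = a[r][col]
        a[r] = [x / lead for x in a[r]]  # make the pivot equal to one
        for i in range(m):
            if i != r and a[i][col] != 0:
                f = a[i][col]
                a[i] = [x - f * y for x, y in zip(a[i], a[r])]
        pivots.append(col)
        r += 1
        if r == m:
            break
    free = [c for c in range(n) if c not in pivots]
    basis = []
    for fc in free:
        x = [Fraction(0)] * n
        x[fc] = Fraction(1)  # this free parameter is one, the others zero
        for row, pc in enumerate(pivots):
            x[pc] = -a[row][fc]
        basis.append(x)
    return basis

A = [[1, 1, 1, 1],
     [1, 2, 3, 4],
     [2, 3, 4, 5]]   # a sample matrix of rank 2 with n = 4 columns
basis = null_space_basis(A)
print(len(basis))    # n - k solutions
print(basis)
```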
Next, consider the general system of equations
\[ A \cdot x = b. \]
Notice that the columns of the matrix $A$ are actually images of the vectors of the standard basis in $K^n$ under the mapping assigning the vector $A \cdot x$ to each vector $x$. If there should be a solution, $b$ must be in the image under this mapping and thus it must be a linear combination of the columns in $A$.
If we extend the matrix A by the column b, the number of linearly independent columns and thus also rows might increase (but does not have to). If this number increases, then b is not in the image and the system of equations does not have
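variables:
\[
\begin{aligned}
-\tfrac{1}{2} a - \tfrac{1}{2} b &= 2, \\
\tfrac{\sqrt{3}}{2} a - \tfrac{\sqrt{3}}{2} b &= 1.
\end{aligned}
\]
The solution gives us $a = -2 + \frac{\sqrt{3}}{3}$, $b = -2 - \frac{\sqrt{3}}{3}$, therefore the coordinates are $\left(-2 + \frac{1}{\sqrt{3}},\ -2 - \frac{1}{\sqrt{3}}\right)$.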
□
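A quick numerical check of 2.D.5 can be sketched as follows. The function name coordinates is chosen here; it solves the two real equations (real and imaginary parts) by the Cramer rule in floating point arithmetic, so the comparison with the exact values is only approximate.

```python
import math

def coordinates(z, e1, e2):
    """Real numbers a, b with a*e1 + b*e2 = z, found by the Cramer rule."""
    d = e1.real * e2.imag - e2.real * e1.imag
    if d == 0:
        raise ValueError("e1 and e2 do not form a basis of C over R")
    a = (z.real * e2.imag - e2.real * z.imag) / d
    b = (e1.real * z.imag - z.real * e1.imag) / d
    return a, b

# the roots of x^2 + x + 1, in the order used in the solution
r1 = complex(-0.5, math.sqrt(3) / 2)
r2 = complex(-0.5, -math.sqrt(3) / 2)

a, b = coordinates(2 + 1j, r1, r2)
print(a, b)
print(a * r1 + b * r2)  # approximately 2 + i
```

Exchanging r1 and r2 exchanges the two coordinates, which is the ambiguity discussed in the remark that follows.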
2.D.6. Remark. As a perceptive reader may have spotted, the problem statement is not unambiguous - we are not given the order of the roots of the polynomial, thus we do not have the order of the basis vectors. The result is thus given up to the permutation of the coordinates.
We add a remark about rationalising the denominator, that is, removing the square roots from the denominator. The authors do not have a firm opinion on whether this should always be done or not (does $\frac{\sqrt{3}}{3}$ look better than $\frac{1}{\sqrt{3}}$?). In some cases rationalising is undesirable: from the fraction $\frac{6}{\sqrt{35}}$ we can immediately spot that its value is a little greater than 1 (because $\sqrt{35}$ is just a little smaller than 6), while for the rationalised fraction we cannot spot anything. But in general the convention is to rationalise.
2.D.7. Consider the complex numbers $\mathbb{C}$ as a real vector space. Determine the coordinates of the number $2 + i$ in the basis given by the roots of the polynomial $x^2 - x + 1$. ○
2.D.8. For what values of the parameters $a, b, c \in \mathbb{R}$ are the vectors $(1,1,a,1)$, $(1,b,1,1)$, $(c,1,1,1)$ linearly dependent? ○
2.D.9. Let a vector space $V$ be given along with a basis formed by the vectors $u, v, w, z$. Determine whether or not the vectors
\[ u - 3v + z, \quad v - 5w - z, \quad 3w - 7z, \quad u - w + z \]
are linearly independent. ○
2.D.10. Complete the vectors $1 - x^2 + x^3$, $1 + x^2 + x^3$, $1 - x - x^3$ to a basis of the space of polynomials of degree at most 3. ○
2.D.11. Do the matrices
\[
\begin{pmatrix} 1 & 0 \\ 1 & -2 \end{pmatrix}, \quad
\begin{pmatrix} 1 & 4 \\ 0 & -1 \end{pmatrix}, \quad
\begin{pmatrix} -5 & 0 \\ 3 & 0 \end{pmatrix}, \quad
\begin{pmatrix} 1 & -2 \\ 0 & 3 \end{pmatrix}
\]
form a basis of the vector space of square two-dimensional matrices?
a solution. If on the other hand the number of linearly independent rows does not change after adding the column b to the matrix A, it means that b must be a linear combination of the columns of A. Coefficients of such combinations are then exactly the solutions of our system.
Consider now two fixed solutions x and y of our system and some solution z of the homogeneous system with the same matrix. Then clearly
\[ A \cdot (x - y) = b - b = 0, \qquad A \cdot (x + z) = b + 0 = b. \]
Thus we can summarise in the form of the so-called Kronecker-Capelli theorem¹:
Kronecker-Capelli Theorem
Theorem. The solution of a non-homogeneous system of linear equations $A \cdot x = b$ exists if and only if adding the column $b$ to the matrix $A$ does not increase the number of linearly independent rows. In such a case the space of all solutions is given by all sums of one fixed particular solution of the system and all solutions of the homogeneous system that has the same matrix.
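The solvability criterion can be tested mechanically by comparing the rank of the matrix with the rank of the extended matrix. The sketch below uses illustrative names (rank, solvable) and a small sample system chosen only for this purpose.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (given as a list of rows) by Gaussian elimination."""
    a = [[Fraction(x) for x in row] for row in rows]
    r = 0  # number of pivots found so far
    for col in range(len(a[0])):
        pivot = next((i for i in range(r, len(a)) if a[i][col] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, len(a)):
            factor = a[i][col] / a[r][col]
            a[i] = [x - factor * y for x, y in zip(a[i], a[r])]
        r += 1
        if r == len(a):
            break
    return r

def solvable(A, b):
    """A x = b has a solution iff adding the column b does not raise the rank."""
    extended = [list(row) + [bi] for row, bi in zip(A, b)]
    return rank(A) == rank(extended)

A = [[1, 1], [2, 2]]        # sample matrix of rank 1
print(solvable(A, [1, 2]))  # b is a multiple of the columns of A
print(solvable(A, [1, 3]))  # b is not in the image
```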
2.3.6. Sums of subspaces. Since we now have some intuition about generators and the subspaces generated by them, we should understand the possibilities of how some subspaces can generate the whole space $V$.
Sum of subspaces
Let $V_i$, $i \in I$, be subspaces of $V$. Then the subspace generated by their union, that is, $\operatorname{span} \bigcup_{i \in I} V_i$, is called the sum of subspaces $V_i$. We denote it as $W = \sum_{i \in I} V_i$. Notably, for a finite number of subspaces $V_1, \dots, V_k \subset V$ we write
\[ W = V_1 + \dots + V_k = \operatorname{span}(V_1 \cup V_2 \cup \dots \cup V_k). \]
We see that every element in the considered sum $W$ can be expressed as a linear combination of vectors from the subspaces $V_i$. Because vector addition is commutative, we can aggregate summands that belong to the same subspace, and for a finite sum of $k$ subspaces we obtain
\[ V_1 + V_2 + \dots + V_k = \{v_1 + \dots + v_k;\ v_i \in V_i,\ i = 1, \dots, k\}. \]
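The sum $W = V_1 + \dots + V_k \subset V$ is called the direct sum of subspaces if each $V_i$ has trivial intersection with the sum of the remaining subspaces, that is, $V_i \cap \sum_{j \neq i} V_j = \{0\}$ for all $i$ (for two subspaces this just means $V_1 \cap V_2 = \{0\}$; for more than two subspaces, trivial pairwise intersections alone would not suffice). We show that in such a case,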
¹ A common formulation of this fact is "a system has a solution if and only if the rank of its matrix equals the rank of its extended matrix". Leopold Kronecker was a very influential German mathematician, who dealt with algebraic equations in general and in particular pushed forward number theory in the middle of the 19th century. Alfredo Capelli, an Italian, worked on algebraic identities. This theorem is equally often called by different names, e.g. the Rouché-Frobenius theorem or the Rouché-Capelli theorem. This is a very common feature in mathematics.
CHAPTER 2. ELEMENTARY LINEAR ALGEBRA
Solution. The four given matrices, viewed as vectors in the space of $2 \times 2$ matrices, are linearly independent. This follows from the fact that the matrix
$$\begin{pmatrix} 1 & 1 & -5 & 1 \\ 0 & 4 & 0 & -2 \\ 1 & 0 & 3 & 0 \\ -2 & -1 & 0 & 3 \end{pmatrix}$$
is invertible (which is, by the way, equivalent to any of the following claims: its rank equals its dimension; it can be transformed into the unit matrix by elementary row transformations; it has an inverse matrix; it has a non-zero determinant (equal to 116); it stands for a system of homogeneous linear equations with only the zero solution; every non-homogeneous linear system with left-hand side given by this matrix has a unique solution; the range of the linear mapping given by this matrix is a vector space of dimension 4, and this mapping is injective). □
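The value of the determinant quoted in the solution can be checked mechanically. The sketch below is an illustration only: the function name `det` is hypothetical, the matrix entries are those of the solution as read from this text, and exact fractions are assumed in order to avoid rounding.

```python
from fractions import Fraction


def det(rows):
    """Determinant of a square matrix by Gaussian elimination with row swaps."""
    m = [[Fraction(x) for x in row] for row in rows]
    n = len(m)
    result = Fraction(1)
    for c in range(n):
        pivot = next((i for i in range(c, n) if m[i][c] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: the matrix is singular
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result  # a row swap changes the sign of the determinant
        result *= m[c][c]
        for i in range(c + 1, n):
            factor = m[i][c] / m[c][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[c])]
    return result


M = [
    [1, 1, -5, 1],
    [0, 4, 0, -2],
    [1, 0, 3, 0],
    [-2, -1, 0, 3],
]
print(det(M))  # 116, non-zero, so the matrix is invertible
```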
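2.D.12. In the vector space $\mathbb{R}^4$ we are given three-dimensional subspaces
$$U = \operatorname{span}\{u_1, u_2, u_3\}, \qquad V = \operatorname{span}\{v_1, v_2, v_3\},$$
while
$$u_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \end{pmatrix}, \quad u_2 = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix}, \quad u_3 = \begin{pmatrix} 1 \\ 0 \\ 1 \\ 1 \end{pmatrix}, \quad v_1 = \begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \end{pmatrix},$$
$v_2 = (1, -1, 1, -1)^T$, $v_3 = (1, -1, -1, 1)^T$. Determine the dimension and find a basis of the subspace $U \cap V$.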
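Solution. The subspace $U \cap V$ contains exactly the vectors that can be obtained as a linear combination of the vectors $u_i$ and also as a linear combination of the vectors $v_i$. Thus we search for numbers $x_1, x_2, x_3, y_1, y_2, y_3 \in \mathbb{R}$ such that the following holds:
$$x_1 \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} + x_3 \begin{pmatrix} 1 \\ 0 \\ 1 \\ 1 \end{pmatrix} = y_1 \begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \end{pmatrix} + y_2 \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix} + y_3 \begin{pmatrix} 1 \\ -1 \\ -1 \\ 1 \end{pmatrix},$$
that is, we are looking for a solution of the system
$$\begin{aligned} x_1 + x_2 + x_3 &= y_1 + y_2 + y_3, \\ x_1 + x_2 &= y_1 - y_2 - y_3, \\ x_1 + x_3 &= -y_1 + y_2 - y_3, \\ x_2 + x_3 &= -y_1 - y_2 + y_3. \end{aligned}$$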
every vector $w \in W$ can be written in a unique way as the sum
$$w = v_1 + \dots + v_k,$$
where $v_i \in V_i$. Indeed, if we could simultaneously write $w$ as $w = v_1' + \dots + v_k'$, then
$$0 = w - w = (v_1 - v_1') + \dots + (v_k - v_k').$$
If $v_i - v_i'$ is the first nonzero term of the right-hand side, then this vector from $V_i$ can be expressed using vectors from the other subspaces. This is a contradiction to the assumption that $V_i$ has zero intersection with all the other subspaces. The only possibility is then that all the vectors on the right-hand side are zero and thus the expression of $w$ is unique. For direct sums of subspaces we write
$$W = V_1 \oplus \dots \oplus V_k = \bigoplus_{i=1}^{k} V_i.$$
2.3.7. Basis. Now we have everything prepared for understanding minimal sets of generators as we understood them in the plane $\mathbb{R}^2$, and for proving the promised independence of the number of basis elements of any choices. A basis of a $k$-dimensional space will usually be denoted as a $k$-tuple $v = (v_1, \dots, v_k)$ of basis vectors. This is just a matter of convention: with finite-dimensional vector spaces we shall always consider the bases along with a given order of the elements, even if (strictly speaking) we have not defined them that way.
Clearly, if $(v_1, \dots, v_n)$ is a basis of $V$, then the whole space $V$ is the direct sum of the one-dimensional subspaces
$$V = \operatorname{span}\{v_1\} \oplus \dots \oplus \operatorname{span}\{v_n\}.$$
The derived uniqueness of the decomposition of any vector $w \in V$ into its components in the direct sum immediately gives a unique decomposition
$$w = x_1 v_1 + \dots + x_n v_n.$$
This allows us, after choosing a basis, to see the abstract vectors again as $n$-tuples of scalars. We shall return to this idea in paragraph 2.3.11, when we finish the discussion of the existence of bases and sums of subspaces in the general case.
2.3.8. Theorem. From any finite set of generators of a vector space V we can choose a basis. Every basis of a finitely dimensional space V has the same number of elements.
Proof. The first claim is easily proved using induction on the number of generators $k$.
Only the zero subspace does not need a generator and thus we are able to choose an empty basis. On the other hand, we are not able to choose the zero vector (the generators would then be linearly dependent) and there is nothing else in the subspace.
In order to make the inductive step more natural, we deal with the case $k = 1$ first. We have $V = \operatorname{span}\{v\}$ and $v \neq 0$, because $\{v\}$ is a linearly independent set of vectors. Then $\{v\}$ is also a basis of the vector space $V$ and any other vector
Using matrix notation of this homogeneous system (and preserving the order of the variables) we have
$$\begin{pmatrix} 1 & 1 & 1 & -1 & -1 & -1 \\ 1 & 1 & 0 & -1 & 1 & 1 \\ 1 & 0 & 1 & 1 & -1 & 1 \\ 0 & 1 & 1 & 1 & 1 & -1 \end{pmatrix} \sim \dots \sim \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 2 \\ 0 & 1 & 0 & 0 & 2 & 0 \\ 0 & 0 & 1 & 0 & -2 & -2 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}.$$
We obtain a solution
$$x_1 = -2t, \quad x_2 = -2s, \quad x_3 = 2s + 2t, \quad y_1 = -s - t, \quad y_2 = s, \quad y_3 = t, \qquad t, s \in \mathbb{R}.$$
We obtain a general vector of the intersection by substituting
$$\begin{pmatrix} x_1 + x_2 + x_3 \\ x_1 + x_2 \\ x_1 + x_3 \\ x_2 + x_3 \end{pmatrix} = \begin{pmatrix} 0 \\ -2t - 2s \\ 2s \\ 2t \end{pmatrix}.$$
We see that
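$$\dim U \cap V = 2, \qquad U \cap V = \operatorname{span}\left\{ \begin{pmatrix} 0 \\ -1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ -1 \\ 0 \\ 1 \end{pmatrix} \right\}.$$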
□
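The result of 2.D.12 can be cross-checked without solving the system, using the dimension formula $\dim U + \dim V = \dim(U + V) + \dim(U \cap V)$ from 2.3.9. The following is a minimal sketch under stated assumptions: the helpers `rank` and `in_span` are hypothetical names, and the two candidate basis vectors are the ones found in the solution.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        if r == len(m):
            break
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def in_span(vectors, w):
    """w lies in the span of the vectors iff adding it does not raise the rank."""
    return rank(vectors + [w]) == rank(vectors)


U = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1]]
V = [[1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]

# dim(U + V) is the rank of all six generators taken together.
dim_sum = rank(U + V)
dim_intersection = rank(U) + rank(V) - dim_sum
print(dim_intersection)  # 2

# Both vectors found in the solution lie in U as well as in V.
candidates = [[0, -1, 1, 0], [0, -1, 0, 1]]
print(all(in_span(U, w) and in_span(V, w) for w in candidates))  # True
```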
2.D.13. Let $U$ and $V$ be the two subspaces of $\mathbb{R}^3$ generated by the vectors
(1, 1, -3), (1, 2, 2)   and   (1, 1, -1), (1, 2, 1), (1, 3, 3),
respectively. Determine the intersection of these two subspaces.
Solution. According to the definition of intersection, the vectors in the intersection are in both, the span of the vectors
is a multiple of v, so all bases of V must contain exactly one vector, which can be chosen from any set of generators.
Assume that the claim holds for $k = n$ and consider $V = \operatorname{span}\{v_1, \dots, v_{n+1}\}$. If $v_1, \dots, v_{n+1}$ are linearly independent, then they form a basis. If they are linearly dependent, there exists $i$ such that
$$v_i = a_1 v_1 + \dots + a_{i-1} v_{i-1} + a_{i+1} v_{i+1} + \dots + a_{n+1} v_{n+1}.$$
Then $V = \operatorname{span}\{v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_{n+1}\}$ and we can choose a basis, using the inductive assumption.
It remains to show that bases always have the same number of elements. Consider a basis $v = (v_1, \dots, v_n)$ of the space $V$ and for an arbitrary nonzero vector $u$, consider
$$u = a_1 v_1 + \dots + a_n v_n \in V$$
with $a_i \neq 0$ for some $i$. Then
$$v_i = \frac{1}{a_i}\bigl(u - (a_1 v_1 + \dots + a_{i-1} v_{i-1} + a_{i+1} v_{i+1} + \dots + a_n v_n)\bigr)$$
and therefore also $\operatorname{span}\{u, v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_n\} = V$.
We show that this is again a basis. For if adding $u$ to the linearly independent vectors $v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_n$ leads to a set of linearly dependent vectors, then $u$ is a linear combination of them and
$$V = \operatorname{span}\{v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_n\},$$
which implies a basis of $n - 1$ vectors chosen from $v$. This is not possible, since $v_i$ would then be a linear combination of the remaining vectors of the basis $v$.
Thus we have proved that for any nonzero vector $u \in V$ there exists $i$, $1 \le i \le n$, such that $(u, v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_n)$ is again a basis of $V$.
Similarly, instead of one vector $u$, we can consider a linearly independent set $u_1, \dots, u_k$. We will sequentially add $u_1, u_2, \dots$, always exchanging for some $v_i$ using our previous approach. We have to ensure that there always is such a $v_i$ to be replaced (that is, that the vectors $u_i$ will not subsequently replace each other).
Assume thus that we have already placed $u_1, \dots, u_\ell$ instead of some $v_j$'s. Then the vector $u_{\ell+1}$ can be expressed as a linear combination of the latter vectors $u_i$ and the remaining $v_j$'s. As we have seen, $u_{\ell+1}$ may replace any vector with a non-zero coefficient in this linear combination. If only the coefficients at $u_1, \dots, u_\ell$ were nonzero, then it would mean that the vectors $u_1, \dots, u_{\ell+1}$ were linearly dependent, which is a contradiction.
Summarizing, for every $k \le n$ we can arrive after $k$ steps at a basis in which $k$ vectors from the original basis were exchanged for the new $u_i$'s. If $k > n$, then in the $n$-th step we would obtain a basis consisting only of new vectors $u_i$, which means that the original set could not be linearly independent.
In particular, it is not possible for two bases to have a different number of elements. □
In fact, we have proved a much stronger claim, the Steinitz exchange lemma:
(1, 1, -3), (1, 2, 2), as well as in the span of the vectors (1, 1, -1), (1, 2, 1), (1, 3, 3). It helps to consider the geometry first. Firstly, $U$ is spanned by two linearly independent vectors, so $U$ is a plane in $\mathbb{R}^3$. Next, $V$ is spanned by three vectors. But these are linearly dependent since
$$\begin{vmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ -1 & 1 & 3 \end{vmatrix} = \begin{vmatrix} 1 & 1 & -1 \\ 1 & 2 & 1 \\ 1 & 3 & 3 \end{vmatrix} = 0.$$
So $V$ is also a plane.
If the vector $(x_1, x_2, x_3)$ lies in $U$, then $(x_1, x_2, x_3) = \lambda(1, 1, -3) + \mu(1, 2, 2)$ for some scalars $\lambda$, $\mu$. Similarly, if $(x_1, x_2, x_3)$ lies in $V$, then $(x_1, x_2, x_3) = \alpha(1, 1, -1) + \beta(1, 2, 1) + \gamma(1, 3, 3)$ for scalars $\alpha$, $\beta$, $\gamma$. When written in full, this is a set of six equations in eight unknowns. Solving these is possible but can be quite cumbersome. Some simplification is obtained as follows.
The first three equations, which describe $U$, are
$$\begin{aligned} x_1 &= \lambda + \mu, \\ x_2 &= \lambda + 2\mu, \\ x_3 &= -3\lambda + 2\mu. \end{aligned}$$
If we solve these three equations for the two "unknowns" $\lambda$ and $\mu$ (which in any case we do not want), or alternatively if we eliminate $\lambda$ and $\mu$ from these equations, we obtain the single equation $8x_1 - 5x_2 + x_3 = 0$ to replace the first three. The second set of three equations, which describe $V$, is
$$\begin{aligned} x_1 &= \alpha + \beta + \gamma, \\ x_2 &= \alpha + 2\beta + 3\gamma, \\ x_3 &= -\alpha + \beta + 3\gamma. \end{aligned}$$
If we solve these three equations for the three "unknowns" $\alpha$, $\beta$ and $\gamma$ (which in any case we do not want), or alternatively if we eliminate $\alpha$, $\beta$ and $\gamma$ from these equations, we obtain the single equation $3x_1 - 2x_2 + x_3 = 0$ to describe $V$. Introducing the parameter $t$, it is straightforward to write the solution of these two equations as the line $(x_1, x_2, x_3) = t(3, 5, 1)$.
□
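In $\mathbb{R}^3$ the two plane equations and the direction of their common line can also be obtained with the cross product. This is a different technique from the elimination used in the solution and works only in dimension three; the helper name `cross` is hypothetical, and the sketch is meant purely as a check of the numbers above.

```python
def cross(a, b):
    """Cross product of two vectors in R^3."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


# A normal vector of a plane through the origin is the cross product of two
# linearly independent vectors spanning the plane.
n_U = cross((1, 1, -3), (1, 2, 2))
n_V = cross((1, 1, -1), (1, 2, 1))
print(n_U)  # (8, -5, 1), i.e. the equation 8*x1 - 5*x2 + x3 = 0
print(n_V)  # (3, -2, 1), i.e. the equation 3*x1 - 2*x2 + x3 = 0

# The intersection line is perpendicular to both normals.
direction = cross(n_U, n_V)
print(direction)  # (-3, -5, -1), a multiple of (3, 5, 1)
```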
Now we move to unions of vector spaces. There is a simple algorithm for choosing a maximal linearly independent set of vectors out of a given set of vectors. Write the given vectors as the columns of a matrix. Then transform the matrix by elementary row transformations into row echelon form. The vectors which correspond to the columns where the "stairs" begin
Steinitz Exchange Lemma
For every finite basis $v$ of a vector space $V$ and every set of linearly independent vectors $u_i$, $i = 1, \dots, k$, in $V$ we can find a subset of the basis vectors $v_i$ which will complete the set of the $u_i$'s into a new basis.
2.3.9. Corollaries of the Steinitz lemma. Because of the
possibility of freely choosing and replacing basis vectors we can immediately derive nice (and intuitively expectable) properties of bases of vector spaces:
Proposition. (1) Every two bases of a finite dimensional vector space have the same number of elements, that is, our definition of dimension is basis-independent.
(2) If V has a finite basis, then every linearly independent set can be extended to a basis.
(3) A basis of a finite dimensional vector space is a maximal linearly independent set of vectors.
(4) The bases of a vector space are the minimal sets of generators.
A little more complicated, but now easy to deal with, is the situation of dimensions of subspaces and their sums:
Corollary. Let $W, W_1, W_2 \subset V$ be subspaces of a space $V$ of finite dimension. Then
(1) $\dim W \le \dim V$,
(2) $V = W$ if and only if $\dim V = \dim W$,
(3) $\dim W_1 + \dim W_2 = \dim(W_1 + W_2) + \dim(W_1 \cap W_2)$.
Proof. It remains to prove only the last claim. This is evident if the dimension of one of the spaces is zero. Assume $\dim W_1 = r \ge 1$, $\dim W_2 = s \ge 1$ and let $(w_1, \dots, w_t)$ be a basis of $W_1 \cap W_2$ (or the empty set, if the intersection is trivial). According to the Steinitz exchange lemma this basis of the intersection can be extended to a basis $(w_1, \dots, w_t, u_{t+1}, \dots, u_r)$ for $W_1$ and to a basis $(w_1, \dots, w_t, v_{t+1}, \dots, v_s)$ for $W_2$. The vectors
$$w_1, \dots, w_t, u_{t+1}, \dots, u_r, v_{t+1}, \dots, v_s$$
clearly generate $W_1 + W_2$. We show that they are linearly independent. Let
$$a_1 w_1 + \dots + a_t w_t + b_{t+1} u_{t+1} + \dots + b_r u_r + c_{t+1} v_{t+1} + \dots + c_s v_s = 0.$$
Then necessarily
$$-(c_{t+1} \cdot v_{t+1} + \dots + c_s \cdot v_s) = a_1 \cdot w_1 + \dots + a_t \cdot w_t + b_{t+1} \cdot u_{t+1} + \dots + b_r \cdot u_r$$
must belong to $W_2 \cap W_1$. This implies that
$$b_{t+1} = \dots = b_r = 0,$$
since this is the way we have defined our bases. Then also
$$a_1 \cdot w_1 + \dots + a_t \cdot w_t + c_{t+1} \cdot v_{t+1} + \dots + c_s \cdot v_s = 0$$
form a maximal linearly independent set. To justify this, just think of the system of linear equations expressing that a linear combination of the given vectors is zero. The matrix of the system is exactly the one described. If you put into the system only the vectors corresponding to the columns where the stairs begin, you get a system which can be transformed into row echelon form with non-zero numbers on the diagonal, and thus the only solution of the system is the zero one, that is, the vectors are linearly independent. Similarly, the system formed by these vectors together with any of the vectors which correspond to the "no stair" columns has a lower rank than the number of variables (the coefficients of the linear combination), and thus according to 2.3.5 it has a nontrivial (non-zero) solution.
2.D.14. Determine the vector subspace (of the space $\mathbb{R}^4$) generated by the vectors $u_1 = (-1, 3, -2, 1)$, $u_2 = (2, -1, -1, 2)$, $u_3 = (-4, 7, -3, 0)$, $u_4 = (1, 5, -5, 4)$, by choosing a maximal set of linearly independent vectors $u_i$ (that is, by choosing a basis).
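Solution. Write the vectors $u_i$ into the columns of a matrix and transform it using elementary row transformations. This way we obtain
$$\begin{pmatrix} -1 & 2 & -4 & 1 \\ 3 & -1 & 7 & 5 \\ -2 & -1 & -3 & -5 \\ 1 & 2 & 0 & 4 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 0 & 4 \\ -1 & 2 & -4 & 1 \\ 3 & -1 & 7 & 5 \\ -2 & -1 & -3 & -5 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 0 & 4 \\ 0 & 4 & -4 & 5 \\ 0 & -7 & 7 & -7 \\ 0 & 3 & -3 & 3 \end{pmatrix} \sim$$
$$\sim \begin{pmatrix} 1 & 2 & 0 & 4 \\ 0 & 1 & -1 & 5/4 \\ 0 & 1 & -1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} 1 & 2 & 0 & 4 \\ 0 & 1 & -1 & 5/4 \\ 0 & 0 & 0 & -1/4 \\ 0 & 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} \textcircled{1} & 0 & 2 & 0 \\ 0 & \textcircled{1} & -1 & 0 \\ 0 & 0 & 0 & \textcircled{1} \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$
And (according to the algorithm) it follows that the vectors corresponding to the columns with circled elements, namely the vectors $u_1$, $u_2$ and $u_4$, form a maximal linearly independent set. □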
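The algorithm used in 2.D.14 (vectors as columns, row reduction, read off the columns where the stairs begin) can be sketched as follows. The function name `pivot_columns` is hypothetical and the code is only an illustration on the data of this exercise; columns are numbered from 0, so the indices 0, 1, 3 stand for $u_1$, $u_2$, $u_4$.

```python
from fractions import Fraction


def pivot_columns(rows):
    """Indices of the columns in which a new 'stair' of the echelon form begins."""
    m = [[Fraction(x) for x in row] for row in rows]
    pivots = []
    r = 0  # index of the row where the next stair would begin
    for c in range(len(m[0])):
        if r == len(m):
            break
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no stair in this column: the vector depends on the chosen ones
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    return pivots


vectors = [(-1, 3, -2, 1), (2, -1, -1, 2), (-4, 7, -3, 0), (1, 5, -5, 4)]
# zip(*vectors) transposes the list, so that the given vectors become columns.
matrix = [list(row) for row in zip(*vectors)]
print(pivot_columns(matrix))  # [0, 1, 3]
```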
Remark. Note that a maximal set of linearly independent vectors is not unique. Only the number of vectors in it is unique (it is the dimension of the vector space generated by the given vectors). For example, from the vectors (1, 0), (0, 1), (1, 1) you can pick any two to form a maximal linearly independent set, whereas from the vectors (1, 0), (2, 0), (0, 1) you can pick either of the first two together with the third. This fact is also reflected in
and because the corresponding vectors form a basis of $W_2$, all the coefficients are zero.
The claim (3) now follows by directly counting the generators. □
2.3.10. Examples. (1) $\mathbb{K}^n$ has (as a vector space over $\mathbb{K}$) dimension $n$. The $n$-tuple of vectors
$$((1, 0, \dots, 0), (0, 1, \dots, 0), \dots, (0, \dots, 0, 1))$$
is clearly a basis, called the standard basis of $\mathbb{K}^n$.
Note that in the case of a finite field of scalars, say $\mathbb{Z}_k$ with $k$ prime, the whole space $\mathbb{K}^n$ has only a finite number $k^n$ of elements.
(2) $\mathbb{C}$ as a vector space over $\mathbb{R}$ has dimension 2. A basis is for instance the pair of numbers 1 and $i$, or any other two complex numbers which are not real multiples of each other, e.g. $1 + i$ and $1 - i$.
(3) $\mathbb{K}_m[x]$, that is, the space of all polynomials with coefficients in $\mathbb{K}$ of degree at most $m$, has dimension $m + 1$. A basis is for instance the sequence $1, x, x^2, \dots, x^m$.
The vector space of all polynomials $\mathbb{K}[x]$ has dimension $\infty$, but we can still find a basis (although infinite in size): $1, x, x^2, \dots$
(4) The vector space $\mathbb{R}$ over $\mathbb{Q}$ has dimension $\infty$. It does not have a countable basis.
(5) The vector space of all mappings $f : \mathbb{R} \to \mathbb{R}$ also has dimension $\infty$. It does not have any countable basis.
2.3.11. Vector coordinates. If we fix a basis $(v_1, \dots, v_n)$ of a finite dimensional space $V$, then every vector $w \in V$ can be expressed as a linear combination $w = a_1 v_1 + \dots + a_n v_n$ in a unique way. Indeed, assume that we can do it in two ways:
$$w = a_1 v_1 + \dots + a_n v_n = b_1 v_1 + \dots + b_n v_n.$$
Then
$$0 = (a_1 - b_1) \cdot v_1 + \dots + (a_n - b_n) \cdot v_n$$
and thus $a_i = b_i$ for all $i = 1, \dots, n$, because the vectors $v_i$ are linearly independent. We have reached the concept of coordinates:
Coordinates of vectors
Definition. The coefficients of the unique linear combination expressing the given vector $w \in V$ in the chosen basis $v = (v_1, \dots, v_n)$ are called the coordinates of the vector $w$ in this basis.
Whenever we speak about the coordinates $(a_1, \dots, a_n)$ of a vector $w$, which we express as a sequence, we must have a fixed ordering of the basis vectors $v = (v_1, \dots, v_n)$. Although we have defined a basis as a minimal set of generators, in reality we work with bases as with sequences (that is, with ordered sets).
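For vectors in $\mathbb{K}^n$, finding coordinates in a given basis amounts to solving a linear system whose columns are the basis vectors. The sketch below is an illustration under stated assumptions: the function name `coordinates` is hypothetical, the scalars are rational numbers, and the input is assumed to be a genuine basis (otherwise the pivot search fails).

```python
from fractions import Fraction


def coordinates(basis, w):
    """Solve a_1 v_1 + ... + a_n v_n = w for (a_1, ..., a_n) by Gauss-Jordan elimination."""
    n = len(basis)
    # Extended matrix: the basis vectors are the columns, w is the last column.
    m = [[Fraction(basis[j][i]) for j in range(n)] + [Fraction(w[i])] for i in range(n)]
    for c in range(n):
        # Raises StopIteration if the given vectors do not form a basis.
        pivot = next(i for i in range(c, n) if m[i][c] != 0)
        m[c], m[pivot] = m[pivot], m[c]
        m[c] = [x / m[c][c] for x in m[c]]
        for i in range(n):
            if i != c:
                m[i] = [a - m[i][c] * b for a, b in zip(m[i], m[c])]
    return [row[n] for row in m]


basis = [(1, 1), (1, -1)]
a = coordinates(basis, (3, 1))
print(a == [2, 1])  # True, since (3, 1) = 2 * (1, 1) + 1 * (1, -1)
```

Changing the order of the basis vectors permutes the coordinates accordingly, which is why the ordering has to be fixed.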
the algorithm, because the given vectors can be put into the columns of the matrix in any order, and different orders may select different vectors.
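2.D.15. Find a basis of the subspace
$$U = \operatorname{span}\left\{ \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \end{pmatrix}, \begin{pmatrix} -1 & 0 \\ 1 & 2 \\ 3 & 4 \end{pmatrix}, \begin{pmatrix} -2 & -1 \\ 0 & 1 \\ 2 & 3 \end{pmatrix} \right\}$$
of the vector space of real $3 \times 2$ matrices. Extend this basis to a basis of the whole space.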
Solution. Recall that a basis of a subspace is a set of linearly independent vectors which generate given subspace. By writing the entries of the matrices in a row, we can consider the matrices as vectors in R6. In this way, the four given matrices can be identified with the rows of the matrix
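$$\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 0 & 1 & 2 & 3 & 4 & 5 \\ -1 & 0 & 1 & 2 & 3 & 4 \\ -2 & -1 & 0 & 1 & 2 & 3 \end{pmatrix}.$$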
It is easy to show that this matrix has rank 2, and hence that the subspace U is generated just by the first two matrices, which consequently form a basis for U. In fact, it follows easily that
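$$\begin{pmatrix} -1 & 0 \\ 1 & 2 \\ 3 & 4 \end{pmatrix} = -1 \cdot \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix} + 2 \cdot \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \end{pmatrix}, \qquad \begin{pmatrix} -2 & -1 \\ 0 & 1 \\ 2 & 3 \end{pmatrix} = -2 \cdot \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix} + 3 \cdot \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \end{pmatrix}.$$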
There are many options for extending this basis to be a basis for the whole space. One option is to choose the first two of the given matrices together with the last four (actually, any four would do) of the six linearly independent matrices
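$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}.$$
Linear independence of the six chosen matrices is established by computing
$$\begin{vmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 0 & 1 & 2 & 3 & 4 & 5 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{vmatrix} = 1 \neq 0.$$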
Clearly the dimension is 6, so spanning is automatic, and hence we have a basis. □
Assigning coordinates to vectors
The mapping assigning to a vector $w = a_1 v_1 + \dots + a_n v_n$ its coordinates in the basis $v$ will be denoted by the same symbol $v : V \to \mathbb{K}^n$. It has the following properties:
(1) $v(u + w) = v(u) + v(w)$; $\forall u, w \in V$,
(2) $v(a \cdot u) = a \cdot v(u)$; $\forall a \in \mathbb{K}$, $\forall u \in V$.
Note that the operations on the two sides of these equations are not identical. Quite the opposite; they are operations on different vector spaces!
Sometimes it is really useful to understand vectors as mappings from a fixed set of independent generators to coordinates (without having the generators ordered). In this way, we may think about a basis $M$ of an infinite dimensional vector space $V$. Even though the set $M$ will be infinite, there can be only a finite number of non-zero values for any mapping representing a vector. The vector space of all polynomials $\mathbb{K}_\infty[x]$ with the basis $M = \{1, x, x^2, \dots\}$ is a good example.
2.3.12. Linear mappings. The above properties of the assignments of coordinates are typical for what we have called linear mappings in the geometry of the plane $\mathbb{R}^2$. For any vector space (of finite or infinite dimension) we define "linearity" of a mapping between spaces in a similar way to the case of the plane $\mathbb{R}^2$.
Linear mappings
Let $V$ and $W$ be vector spaces over the same field of scalars $\mathbb{K}$. The mapping $f : V \to W$ is called a linear mapping, or homomorphism, if the following holds:
(1) $f(u + v) = f(u) + f(v)$, $\forall u, v \in V$,
(2) $f(a \cdot u) = a \cdot f(u)$, $\forall a \in \mathbb{K}$, $\forall u \in V$.
We have seen such mappings already in the case of matrix multiplication:
$$x \mapsto A \cdot x$$
with a fixed matrix $A$ of the type $m/n$ over $\mathbb{K}$.
The image of a linear mapping, $\operatorname{Im} f = f(V) \subset W$, is always a vector subspace, since for any set of vectors $u_i$, the linear combination of images is the image of the linear combination of the vectors $u_i$ with the same coefficients.
Analogously, the set of all vectors $\operatorname{Ker} f = f^{-1}(\{0\}) \subset V$ is a subspace, since the linear combination of zero images will always be a zero vector. The subspace $\operatorname{Ker} f$ is called the kernel of the linear mapping $f$.
A linear mapping which is a bijection is called an isomorphism.
Analogously to the abstract definition of vector spaces, it is again necessary to prove seemingly trivial claims that follow from the axioms:
E. Linear mappings
How can we describe simple mappings analytically? For example, how can we describe a rotation, an axial symmetry, a mirror symmetry, or a projection of a three-dimensional space onto a two-dimensional one, in the plane or in space? How can we describe the scaling of a diagram? What do they have in common? These are all linear mappings. This means that they preserve a certain structure of the space or a subspace. What structure? The structure of a vector space. Every point in the plane is described by two coordinates, every point in the 3-dimensional space is described by three coordinates. If we fix the origin, then it makes sense to say that a point is in some direction twice as far from the origin as some other point. We also know where we arrive if we translate or shift by some amount in a given direction and then by some other amount in another direction. These properties can be formalized: we speak of vectors in the plane or in space, and we consider their multiplication by scalars and their addition. Linear mappings have the property that the image of a sum of vectors is the sum of the images of the vectors. The image of a multiple of a vector is the same multiple of the image of the vector. These properties are shared among the mappings stated at the beginning of this paragraph. Such a mapping is then uniquely determined by its behaviour on the vectors of a basis. (In the plane, a basis consists of two vectors not on the same line. In space a basis consists of three vectors not all in the same plane.)
How can we write down some linear mapping $f$ on a vector space $V$? For simplicity, we start with the plane $\mathbb{R}^2$. Assume that the image of the point (vector) $(1, 0)$ is $(a, b)$ and the image of the point (vector) $(0, 1)$ is $(c, d)$. This uniquely determines the image of an arbitrary point with coordinates $(u, v)$: $f((u, v)) = f(u(1, 0) + v(0, 1)) = u f(1, 0) + v f(0, 1) = (ua, ub) + (vc, vd) = (au + cv, bu + dv)$. This can be written down more efficiently as follows:
$$\begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} au + cv \\ bu + dv \end{pmatrix}.$$
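A linear mapping is thus a mapping uniquely determined (in a fixed basis) by a matrix. Furthermore, when we have another linear mapping $g$ given by the matrix $\begin{pmatrix} e & g \\ f & h \end{pmatrix}$, then we can easily compute (an interested reader can fill in the details) that the composition $f \circ g$ (first $g$, then $f$) is given by the matrix
$$\begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} e & g \\ f & h \end{pmatrix} = \begin{pmatrix} ae + cf & ag + ch \\ be + df & bg + dh \end{pmatrix}.$$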
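The correspondence between composing mappings and multiplying matrices can be tested on numbers: the mapping that is applied first stands on the right in the product. The following sketch uses hypothetical helper names `matmul` and `apply` and arbitrarily chosen sample matrices; it only illustrates the rule and proves nothing.

```python
def matmul(X, Y):
    """Product of two 2 x 2 matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def apply(M, p):
    """Image of the point p = (u, v) under the linear mapping with the matrix M."""
    return tuple(M[i][0] * p[0] + M[i][1] * p[1] for i in range(2))


F = [[1, 2], [3, 4]]  # sample matrix of a mapping f
G = [[0, 1], [1, 1]]  # sample matrix of a mapping g
p = (5, 7)

# Applying g first and then f agrees with applying the single matrix F * G.
print(apply(F, apply(G, p)))   # (31, 69)
print(apply(matmul(F, G), p))  # (31, 69)

# The order matters: F * G and G * F are different matrices here.
print(matmul(F, G) == matmul(G, F))  # False
```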
Proposition. Let $f : V \to W$ be a linear mapping between two vector spaces over the same field of scalars $\mathbb{K}$. The following is true for all vectors $u, u_1, \dots, u_k \in V$ and scalars $a_1, \dots, a_k \in \mathbb{K}$:
(1) $f(0) = 0$,
(2) $f(-u) = -f(u)$,
(3) $f(a_1 \cdot u_1 + \dots + a_k \cdot u_k) = a_1 \cdot f(u_1) + \dots + a_k \cdot f(u_k)$,
(4) for every vector subspace $V_1 \subset V$, its image $f(V_1)$ is a vector subspace in $W$,
(5) for every vector subspace $W_1 \subset W$, the set $f^{-1}(W_1) = \{v \in V;\ f(v) \in W_1\}$ is a vector subspace in $V$.
Proof. We rely on the axioms, definitions and already proved results (in case you are not sure what has been used, look it up!):
$$f(0) = f(u - u) = f((1 - 1) \cdot u) = 0 \cdot f(u) = 0,$$
$$f(-u) = f((-1) \cdot u) = (-1) \cdot f(u) = -f(u).$$
Property (3) is derived easily from the definition for two summands, using induction on the number of summands.
Next, (3) implies $\operatorname{span} f(V_1) = f(V_1)$, thus it is a vector subspace. On the other hand, if $f(u) \in W_1$ and $f(v) \in W_1$ then for any scalars $a$, $b$ we arrive at $f(a \cdot u + b \cdot v) = a \cdot f(u) + b \cdot f(v) \in W_1$. □
2.3.13. Proposition (Simple corollaries). (1) The composition $g \circ f : V \to Z$ of two linear mappings $f : V \to W$ and $g : W \to Z$ is again a linear mapping.
(2) The linear mapping $f : V \to W$ is an isomorphism if and only if $\operatorname{Im} f = W$ and $\operatorname{Ker} f = \{0\} \subset V$. The inverse mapping of an isomorphism is again an isomorphism.
(3) For any two subspaces $V_1, V_2 \subset V$ and a linear mapping $f : V \to W$,
$$f(V_1 + V_2) = f(V_1) + f(V_2), \qquad f(V_1 \cap V_2) \subset f(V_1) \cap f(V_2).$$
(4) The "coordinate assignment" mapping $u : V \to \mathbb{K}^n$ given by an arbitrarily chosen basis $u = (u_1, \dots, u_n)$ of a vector space $V$ is an isomorphism.
(5) Two finitely dimensional vector spaces are isomorphic if and only if they have the same dimension.
(6) The composition of two isomorphisms is an isomorphism.
Proof. Proving the first claim is a very easy exercise left to the reader. In order to verify (2), notice that $f$ is surjective if and only if $\operatorname{Im} f = W$. If $\operatorname{Ker} f = \{0\}$ then $f(u) = f(v)$ ensures $f(u - v) = 0$, that is, $u = v$. In this case $f$ is injective. Conversely, if $f$ is injective, then only the zero vector is mapped to $0 = f(0)$, so $\operatorname{Ker} f = \{0\}$. Finally, if $f$ is a linear bijection, then the vector $w$ is the preimage of a linear combination $au + bv$, that is $w = f^{-1}(au + bv)$, if and only if
$$f(w) = au + bv = f(a \cdot f^{-1}(u) + b \cdot f^{-1}(v)).$$
This leads us to the definition of matrix multiplication in exactly this way. That is, an application of a mapping on a vector is given by the matrix multiplication of the matrix of the mapping with the given vector, and that the mapping of a composition is given by the product of the corresponding matrices. This works analogously in the spaces of higher dimension. Further, this again shows what has already been proven in (2.1.5), namely, that matrix multiplication is associative but not commutative, just as with mapping composition. That is another motivation to study vector spaces.
Recall that already in the first chapter we worked with the matrices of some linear mappings in the plane R2, notably with the rotation around a point and with axial symmetry (see 1.5.8 and 1.5.9).
We try now to write down matrices of linear mappings from R3 to R3. What does the matrix of a rotation in three dimensions look like? We begin with some special (easier for description) rotations about coordinate axes:
2.E.1. Matrix of rotation about coordinate axes in $\mathbb{R}^3$.
We write down the matrices of rotations by the angle $\varphi$ about the (oriented) axes $x$, $y$ and $z$ in $\mathbb{R}^3$.
Solution. When rotating a particular point about the given axis (say $x$), the corresponding coordinate ($x$) does not change. The remaining two coordinates are then given by the rotation in the plane which we already know (a matrix of the type $2 \times 2$).
Thus we obtain the following matrices. Rotation about the axis $z$:
$$\begin{pmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
rotation about the axis $x$:
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{pmatrix},$$
rotation about the axis $y$:
$$\begin{pmatrix} \cos\varphi & 0 & \sin\varphi \\ 0 & 1 & 0 \\ -\sin\varphi & 0 & \cos\varphi \end{pmatrix}.$$
Note the sign of $\varphi$ in the matrix for the rotation about $y$. We want, as with any other rotation, the rotation about the $y$ axis to be in the positive sense, that is, when we look in the direction opposite to the direction of the $y$ axis, the world turns anticlockwise. The signs in the matrices depend on the orientation of our coordinate system. Usually, in the 3-dimensional
Thus we also get $w = a f^{-1}(u) + b f^{-1}(v)$ and therefore the inverse of a linear bijection is again a linear bijection.
The third property is obvious from the definition, but try finding an example showing that the inclusion in the second relation can indeed be strict.
The remaining claims all follow immediately from the definition. □
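2.3.14. Coordinates again. Consider any two vector spaces $V$ and $W$ over $\mathbb{K}$ with $\dim V = n$, $\dim W = m$ and consider some linear mapping $f : V \to W$. For every choice of bases $u = (u_1, \dots, u_n)$ on $V$ and $v = (v_1, \dots, v_m)$ on $W$ there are the following linear mappings as shown in the diagram:
$$\begin{array}{ccc} V & \xrightarrow{\ f\ } & W \\ {\scriptstyle u}\downarrow & & \downarrow{\scriptstyle v} \\ \mathbb{K}^n & \xrightarrow{\ f_{u,v}\ } & \mathbb{K}^m \end{array}$$
The bottom arrow $f_{u,v}$ is defined by the remaining three, i.e. by the composition of linear mappings
$$f_{u,v} = v \circ f \circ u^{-1}.$$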
Matrix of a linear mapping
Every linear mapping is uniquely determined by its values on an arbitrary set of generators, in particular, on the vectors of a basis $u$. Denote
$$\begin{aligned} f(u_1) &= a_{11} \cdot v_1 + a_{21} \cdot v_2 + \dots + a_{m1} v_m \\ f(u_2) &= a_{12} \cdot v_1 + a_{22} \cdot v_2 + \dots + a_{m2} v_m \\ &\ \ \vdots \\ f(u_n) &= a_{1n} \cdot v_1 + a_{2n} \cdot v_2 + \dots + a_{mn} v_m, \end{aligned}$$
that is, the scalars $a_{ij}$ form a matrix $A$, where the columns are the coordinates of the values $f(u_j)$ of the mapping $f$ on the basis vectors, expressed in the basis $v$ on the target space $W$.
The matrix $A = (a_{ij})$ is called the matrix of the mapping $f$ in the bases $u$, $v$.
For a general vector $u = x_1 u_1 + \dots + x_n u_n \in V$ we calculate (recall that vector addition is commutative and distributive with respect to scalar multiplication)
$$\begin{aligned} f(u) &= x_1 f(u_1) + \dots + x_n f(u_n) \\ &= x_1 (a_{11} v_1 + \dots + a_{m1} v_m) + \dots + x_n (a_{1n} v_1 + \dots + a_{mn} v_m) \\ &= (x_1 a_{11} + \dots + x_n a_{1n}) v_1 + \dots + (x_1 a_{m1} + \dots + x_n a_{mn}) v_m. \end{aligned}$$
Using matrix multiplication we can now very easily and clearly write down the values of the mapping $f_{u,v}(w)$ defined uniquely by the previous diagram. Recall that vectors in $\mathbb{K}^\ell$ are understood as columns, that is, matrices of the type $\ell/1$:
$$f_{u,v}(u(w)) = v(f(w)) = A \cdot u(w).$$
On the other hand, if we have fixed bases on $V$ and $W$, then every choice of a matrix $A$ of the type $m/n$ gives a unique linear mapping $\mathbb{K}^n \to \mathbb{K}^m$ and thus also a mapping
space the "dextrorotary" (right-handed) coordinate system is chosen: if we place our hand on the $x$ axis such that the fingers point in the direction of the axis and such that we can rotate the $x$ axis in the $xy$ plane so that it coincides with the $y$ axis and they point in the same direction, then the thumb should point in the direction of the $z$ axis. In such a system, this is a rotation in the negative sense in the plane $xz$ (that is, the axis $z$ turns in the direction towards $x$). Think about the positive and negative sense of rotations about all three axes. The sign is also consistent with the cycle $x$ to $y$ to $z$ to $x$ to $y$ etc., or 1 to 2 to 3 to 1 etc. □
Knowledge of matrices allows us to write the matrix of rotation about any oriented axis. Let us start with a specific example:
2.E.2. Find the matrix of the rotation in the positive sense by the angle $\pi/3$ about the line passing through the origin with the oriented directional vector $(1, 1, 0)$, with respect to the standard basis of $\mathbb{R}^3$.
Solution. The given rotation is easily obtained by composing these three mappings:
• rotation through the angle $\pi/4$ in the negative sense about the axis $z$ (the axis of the rotation goes over on the $x$ axis);
• rotation through the angle $\pi/3$ in the positive sense about the $x$ axis;
• rotation through the angle $\pi/4$ in the positive sense about the $z$ axis (the $x$ axis goes over on the axis of the rotation).
The matrix of the resulting rotation is the product of the matrices corresponding to the given three mappings, while the order of the matrices is given by the order of application of the mappings - the first mapping applied is in the product the rightmost one. Thus we obtain the desired matrix
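$$\begin{pmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & -\frac{\sqrt{3}}{2} \\ 0 & \frac{\sqrt{3}}{2} & \frac{1}{2} \end{pmatrix} \begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0 \\ -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{3}{4} & \frac{1}{4} & \frac{\sqrt{6}}{4} \\ \frac{1}{4} & \frac{3}{4} & -\frac{\sqrt{6}}{4} \\ -\frac{\sqrt{6}}{4} & \frac{\sqrt{6}}{4} & \frac{1}{2} \end{pmatrix}.$$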
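The product of the three rotation matrices can be verified numerically. The sketch below is an illustration with hypothetical helper names (`rot_z`, `rot_x`, `matmul`); it uses floating point numbers from Python's standard `math` module, so the comparison with the exact matrix is made only up to a small tolerance.

```python
import math


def matmul(X, Y):
    """Product of two 3 x 3 matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rot_z(phi):
    """Rotation by the angle phi in the positive sense about the z axis."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def rot_x(phi):
    """Rotation by the angle phi in the positive sense about the x axis."""
    c, s = math.cos(phi), math.sin(phi)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


# The mapping applied first is the rightmost factor of the product.
R = matmul(rot_z(math.pi / 4), matmul(rot_x(math.pi / 3), rot_z(-math.pi / 4)))

q = math.sqrt(6) / 4
expected = [[0.75, 0.25, q], [0.25, 0.75, -q], [-q, q, 0.5]]
close = all(abs(R[i][j] - expected[i][j]) < 1e-12 for i in range(3) for j in range(3))
print(close)  # True

# The axis of the rotation is fixed: R maps (1, 1, 0) to itself.
image = [sum(R[i][j] * a for j, a in enumerate((1, 1, 0))) for i in range(3)]
print([round(t, 12) for t in image])  # [1.0, 1.0, 0.0]
```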
Note that the resulting rotation could be also obtained for instance by taking the composition of the three following mappings:
• rotation through the angle $\pi/4$ in the positive sense about the axis $z$ (the axis of rotation goes over on the axis $y$);
$f : V \to W$. We have found the bijective correspondence between matrices of the fixed types (determined by the dimensions of $V$ and $W$) and linear mappings $V \to W$.
2.3.15. Coordinate transition matrix. If we choose $V = W$ to be the same space, but with two different bases $u$, $v$, and consider the identity mapping for $f$, then the approach from the previous paragraph expresses the vectors of the basis $u$ in coordinates with respect to the basis $v$. Let the resulting matrix be $T$.
Thus, we are applying the concept of the matrix of a linear mapping to the special case of the identity mapping $\mathrm{id}_V$:
$$\begin{array}{ccc} V & \xrightarrow{\ \mathrm{id}_V\ } & V \\ {\scriptstyle u}\downarrow & & \downarrow{\scriptstyle v} \\ \mathbb{K}^n & \xrightarrow{\ T = (\mathrm{id}_V)_{u,v}\ } & \mathbb{K}^n \end{array}$$
The resulting matrix $T$ is called the coordinate transition matrix for changing the basis from $u$ to the basis $v$.
The fact that the matrix T of the identity mapping yields exactly the transformation of coordinates between the two bases is easily seen.
Consider the expression of a vector $u$ in the basis $u$,
$$u = x_1 u_1 + \dots + x_n u_n,$$
and replace the vectors $u_i$ by their expressions as linear combinations of the vectors $v_i$ of the basis $v$. Collecting the terms properly, we obtain the coordinate expression $\bar{x} = (\bar{x}_1, \dots, \bar{x}_n)$ of the same vector $u$ in the basis $v$. It is enough just to reorder the summands and express the individual scalars at the vectors of the basis. But this is exactly what we do when forming the matrix for the identity mapping, thus $\bar{x} = T \cdot x$.
We have arrived at the following instruction for building the coordinate transition matrix:
Calculating the matrix for changing the basis
Proposition. The matrix T for the transition from the basis $\underline{u}$ to the basis $\underline{v}$ is obtained by taking the coordinates of the vectors of the basis $\underline{u}$ expressed in the basis $\underline{v}$ and writing them as the columns of the matrix T. The new coordinates $\bar{x}$ in terms of the new basis $\underline{v}$ are then $\bar{x} = T \cdot x$, where $x$ is the coordinate vector in the original basis $\underline{u}$.
Because the inverse mapping to the identity mapping is again the identity mapping, the coordinate transition matrix is always invertible and its inverse $T^{-1}$ is the coordinate transition matrix in the opposite direction, that is, from the basis $\underline{v}$ to the basis $\underline{u}$ (just have a look at the diagram above and invert all the arrows).
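The recipe in the proposition can be tried out numerically. The following is only a minimal sketch, assuming two hypothetical bases of the plane; the helper names `solve`, `transition_matrix` and `apply` are illustrative and do not come from the text. It builds T column by column and then converts coordinates.

```python
from fractions import Fraction

def solve(columns, target):
    """Solve sum_i c_i * columns[i] = target by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(target)
    # row r of the augmented matrix: r-th components of all basis vectors, then target[r]
    rows = [[Fraction(columns[c][r]) for c in range(n)] + [Fraction(target[r])]
            for r in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [entry / rows[col][col] for entry in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[r][n] for r in range(n)]

def transition_matrix(basis_u, basis_v):
    """Columns of T are the coordinates of the vectors of basis_u in basis_v."""
    cols = [solve(basis_v, u) for u in basis_u]
    n = len(cols)
    return [[cols[c][r] for c in range(n)] for r in range(n)]

def apply(matrix, x):
    """Multiply a matrix by a coordinate column."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

# hypothetical bases of R^2, chosen only for illustration
u = [(1, 1), (1, -1)]
v = [(2, 0), (0, 1)]
T = transition_matrix(u, v)
x = [3, 2]             # coordinates in u: the vector 3*(1,1) + 2*(1,-1) = (5, 1)
x_bar = apply(T, x)    # coordinates of the same vector in v
```

Here `x_bar` describes the same vector (5, 1) as `x`, only with respect to the basis `v`.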
2.3.16. More coordinates. Next, we are interested in the matrix of a composition of linear mappings. Thus, consider another vector space Z over $\mathbb{K}$ of dimension k with basis $\underline{w}$, a linear mapping $g : W \to Z$, and denote the corresponding matrix by $g_{\underline{v},\underline{w}}$.
• rotation through the angle $\pi/3$ in the positive sense about the axis y;
• rotation through the angle $\pi/4$ in the negative sense about the axis z (the axis y goes over to the axis of rotation).
Analogously we obtain
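\[
\begin{pmatrix} \frac{\sqrt2}{2} & \frac{\sqrt2}{2} & 0 \\ -\frac{\sqrt2}{2} & \frac{\sqrt2}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} \frac12 & 0 & \frac{\sqrt3}{2} \\ 0 & 1 & 0 \\ -\frac{\sqrt3}{2} & 0 & \frac12 \end{pmatrix}
\cdot
\begin{pmatrix} \frac{\sqrt2}{2} & -\frac{\sqrt2}{2} & 0 \\ \frac{\sqrt2}{2} & \frac{\sqrt2}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} \frac34 & \frac14 & \frac{\sqrt6}{4} \\ \frac14 & \frac34 & -\frac{\sqrt6}{4} \\ -\frac{\sqrt6}{4} & \frac{\sqrt6}{4} & \frac12 \end{pmatrix}.
\]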
□
2.E.3. Matrix of general rotation in $\mathbb{R}^3$. Derive the matrix of a general rotation in $\mathbb{R}^3$.
Solution. We can do the same things as in the previous example with general values. Consider an arbitrary unit vector $(x, y, z)$. Rotation in the positive sense by the angle $\varphi$ about this vector can be written down as a composition of the following rotations whose matrices we already know:
i) rotation $\mathcal{R}_1$ in the negative sense about the z axis through the angle with cosine equal to $x/\sqrt{x^2+y^2} = x/\sqrt{1-z^2}$, that is, with sine $y/\sqrt{1-z^2}$, under which the line with the directional vector $(x, y, z)$ goes over to the line with the directional vector $(\sqrt{1-z^2}, 0, z)$. The matrix of this rotation is
\[
R_1 = \begin{pmatrix} x/\sqrt{1-z^2} & y/\sqrt{1-z^2} & 0 \\ -y/\sqrt{1-z^2} & x/\sqrt{1-z^2} & 0 \\ 0 & 0 & 1 \end{pmatrix},
\]
ii) rotation $\mathcal{R}_2$ in the positive sense about the y axis through the angle with cosine $\sqrt{1-z^2}$, that is, with sine $z$, under which the line with the directional vector $(\sqrt{1-z^2}, 0, z)$ goes over to the line with the directional vector $(1, 0, 0)$. The matrix of this rotation is
\[
R_2 = \begin{pmatrix} \sqrt{1-z^2} & 0 & z \\ 0 & 1 & 0 \\ -z & 0 & \sqrt{1-z^2} \end{pmatrix},
\]
iii) rotation $\mathcal{R}_3$ in the positive sense about the x axis through the angle $\varphi$ with the matrix
\[
R_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{pmatrix},
\]
iv) rotation $\mathcal{R}_2^{-1}$ with the matrix $R_2^{-1}$,
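\[
\begin{array}{ccccc}
V & \xrightarrow{\ f\ } & W & \xrightarrow{\ g\ } & Z \\
\downarrow{\scriptstyle \underline{u}} & & \downarrow{\scriptstyle \underline{v}} & & \downarrow{\scriptstyle \underline{w}} \\
\mathbb{K}^n & \xrightarrow{\ f_{\underline{u},\underline{v}}\ } & \mathbb{K}^m & \xrightarrow{\ g_{\underline{v},\underline{w}}\ } & \mathbb{K}^k
\end{array}
\]
The composition $g \circ f$ on the upper row corresponds to the matrix of the mapping $\mathbb{K}^n \to \mathbb{K}^k$ on the bottom and we calculate directly (we write A for the matrix of $f$ and B for the matrix of $g$ in the chosen bases):
\[
g_{\underline{v},\underline{w}} \circ f_{\underline{u},\underline{v}}(x) = \underline{w} \circ g \circ \underline{v}^{-1} \circ \underline{v} \circ f \circ \underline{u}^{-1}(x) = B \cdot (A \cdot x) = (B \cdot A) \cdot x = (g \circ f)_{\underline{u},\underline{w}}(x)
\]
for every $x \in \mathbb{K}^n$. By the associativity of matrix multiplication, the composition of mappings corresponds to multiplication of the corresponding matrices. Note that the isomorphisms correspond exactly to invertible matrices and that the matrix of the inverse mapping is the inverse matrix.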
The same approach shows how the matrix of a linear mapping changes, if we change the coordinates on both the domain and the codomain:
\[
\mathbb{K}^n \xrightarrow{\ T\ } \mathbb{K}^n \xrightarrow{\ A\ } \mathbb{K}^m \xrightarrow{\ S^{-1}\ } \mathbb{K}^m
\]
where T is the coordinate transition matrix from $\underline{u}'$ to $\underline{u}$ and S is the coordinate transition matrix from $\underline{v}'$ to $\underline{v}$. If A is the original matrix of the mapping, then the matrix of the mapping in the new bases is given by $A' = S^{-1} A T$.
In the special case of a linear mapping $f : V \to V$, that is, when the domain and the codomain are the same space V, we usually express $f$ in terms of a single basis $\underline{u}$ of the space V. Then the change from the old basis to the new basis $\underline{u}'$ with the coordinate transition matrix T leads to the new matrix $A' = T^{-1} A T$.
2.3.17. Linear forms. A simple but very important case of linear mappings on an arbitrary vector space V over the scalars $\mathbb{K}$ appears with the codomain being the scalars themselves, i.e. mappings $f : V \to \mathbb{K}$. We call them linear forms.
If we are given coordinates on V, the assignment of a single i-th coordinate to the vectors is an example of a linear form. More precisely, for every choice of basis $\underline{v} = (v_1, \dots, v_n)$, there are the linear forms $v_i^* : V \to \mathbb{K}$ such that $v_i^*(v_j) = \delta_{ij}$, that is, $v_i^*(v_j) = 1$ when $i = j$, and $v_i^*(v_j) = 0$ when $i \neq j$.
The vector space of all linear forms on V is denoted by $V^*$ and we call it the dual space of the vector space V. Let us now assume that the vector space V has finite dimension n. The basis of $V^*$, $\underline{v}^* = (v_1^*, \dots, v_n^*)$, composed of assignments of individual coordinates as above, is called the dual basis to $\underline{v}$. Clearly this is a basis of the space $V^*$, because these forms are evidently linearly independent (prove
v) rotation $\mathcal{R}_1^{-1}$ with the matrix $R_1^{-1}$.
The matrix of the composition of these mappings, that is, the matrix we are looking for, is given by the product of the matrices of these rotations in the reverse order:
\[
R_1^{-1} \cdot R_2^{-1} \cdot R_3 \cdot R_2 \cdot R_1 =
\begin{pmatrix}
1 - t + tx^2 & txy - zs & txz + ys \\
yxt + zs & 1 - t + ty^2 & tyz - xs \\
zxt - ys & tzy + xs & 1 - t + tz^2
\end{pmatrix},
\]
where $t = 1 - \cos\varphi$ and $s = \sin\varphi$.
□
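The resulting formula is easy to evaluate numerically. The following is a minimal sketch, assuming a unit axis vector and floating-point arithmetic; the function names `rotation_matrix` and `apply` are chosen here for illustration only.

```python
import math

def rotation_matrix(axis, phi):
    """Matrix of the rotation by phi in the positive sense about the unit vector axis."""
    x, y, z = axis
    t, s = 1 - math.cos(phi), math.sin(phi)
    return [
        [1 - t + t * x * x, t * x * y - z * s, t * x * z + y * s],
        [t * y * x + z * s, 1 - t + t * y * y, t * y * z - x * s],
        [t * z * x - y * s, t * z * y + x * s, 1 - t + t * z * z],
    ]

def apply(matrix, v):
    """Multiply a matrix by a column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

# rotation by 90 degrees about the z axis should send (1, 0, 0) to (0, 1, 0)
R = rotation_matrix((0.0, 0.0, 1.0), math.pi / 2)
image = apply(R, [1.0, 0.0, 0.0])

# the axis of rotation itself must stay fixed
n = [1 / math.sqrt(3)] * 3
fixed = apply(rotation_matrix(n, 1.0), n)
```

A quick sanity check for any such matrix is that the axis is mapped to itself, as `fixed` illustrates for the axis in the direction (1, 1, 1).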
We are now familiar with matrices of linear mappings. But what happens with the matrix of a linear mapping if we change the basis of the vector space? (We can imagine it, for example, as a change of the coordinate system of the observer.) First we have to understand what happens with the coordinates of vectors. The key to all this is the transition matrix (see 2.3.15). We will further write e for the standard basis, that is, the vectors ((1,0,0), (0,1,0), (0,0,1)). (These vectors could be any three linearly independent vectors in a vector space; by naming them as we did, we identify the vector space with $\mathbb{R}^3$.)
2.E.4. A vector has coordinates (1,2,3) in the standard basis e.    What are its coordinates in the basis
u= ((1,1,0), (1,-1,2), (3,1,5))?
Solution. We first write the transition matrix T from the basis u to the standard basis. We just write the coordinates of the vectors which form the basis u into the columns:
\[
T = \begin{pmatrix} 1 & 1 & 3 \\ 1 & -1 & 1 \\ 0 & 2 & 5 \end{pmatrix}.
\]
To express the sought coordinates, however, we need the transition matrix from the standard basis to u. This is just $T^{-1}$ (see 2.3.15 if you have not done so yet). We already know how to compute the inverse matrix (see 2.1.10).
1
-1
Finally the sought coordinates are
T-\l,2,3f = (\,-l,\f.
□
Similarly we work with the matrix of a linear mapping.
it!) and if $\alpha \in V^*$ is an arbitrary form, then for every vector $u = x_1 v_1 + \dots + x_n v_n$
\[
\alpha(u) = x_1 \alpha(v_1) + \dots + x_n \alpha(v_n) = \alpha(v_1) v_1^*(u) + \dots + \alpha(v_n) v_n^*(u)
\]
and thus the linear form $\alpha$ is a linear combination of the forms $v_1^*, \dots, v_n^*$.
Taking into account the standard basis {1} on the one-dimensional space of scalars $\mathbb{K}$, any choice of a basis $\underline{v}$ on V identifies the linear forms $\alpha$ with matrices of the type 1/n, that is, with rows $y$. The components of these rows are the coordinates of the general linear forms $\alpha$ in the dual basis $\underline{v}^*$. Evaluating such a form on a vector is then given by multiplying the corresponding row vector $y$ with the column of the coordinates $x$ of the vector $u \in V$ in the basis $\underline{v}$:
\[
\alpha(u) = y \cdot x = y_1 x_1 + \dots + y_n x_n.
\]
Thus we can see that for every finite-dimensional space V, the dual space $V^*$ is isomorphic to the space V. The choice of the dual basis provides such an isomorphism.
In this context we meet again the scalar product of a row of n scalars with a column of n scalars. We have worked with it already in the paragraph 2.1.3 on the page 73.
The situation is different for infinite-dimensional spaces. For instance, the simplest example, the space of all polynomials $\mathbb{K}[x]$ in one variable, is a vector space with a countable basis with elements $v_i = x^i$. As before, we can define linearly independent forms $v_i^*$. Every formal infinite sum $\sum_{i=0}^{\infty} a_i v_i^*$ is now a well-defined linear form on $\mathbb{K}[x]$, because it will be evaluated only for a finite linear combination of the basis polynomials $x^i$, $i = 0, 1, 2, \dots$.
The countable set of all v* is thus not a basis. Actually, it can be proved that this dual space cannot have a countable basis.
2.3.18. The length of vectors and scalar product. When dealing with the geometry of the plane $\mathbb{R}^2$ in the first chapter we also needed the concept of the length of vectors and their angles, see 1.5.7. For defining these concepts we used the scalar product of two vectors $u = (x, y)$ and $v = (x', y')$ in the form $u \cdot v = xx' + yy'$.
Indeed, the expression for the length of $v = (x, y)$ is given by
\[ \|v\| = \sqrt{x^2 + y^2} = \sqrt{v \cdot v}, \]
while the (oriented) angle $\varphi$ of two vectors $u = (x, y)$ and $v = (x', y')$ is in the planar geometry given by the formula
\[ \cos\varphi = \frac{xx' + yy'}{\sqrt{x^2 + y^2}\,\sqrt{x'^2 + y'^2}}. \]
Note that this scalar product is linear in each of its arguments, and we denote it by $u \cdot v$ or by $\langle u, v \rangle$. The scalar product defined in such a way is symmetric in its arguments and of
2.E.5. We are given a linear mapping $\mathbb{R}^3 \to \mathbb{R}^3$ in the standard basis as the following matrix:
\2    0 0/
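Write down the matrix of this mapping in the basis
\[ (f_1, f_2, f_3) = ((1,1,0), (-1,1,1), (2,0,1)). \]
Solution. Again the transition matrix T for changing the basis from the basis $f = (f_1, f_2, f_3)$ to the standard basis e can be obtained by writing down the coordinates of the vectors $f_1, f_2, f_3$ in the standard basis as the columns of the matrix T. Thus we have
\[
T = \begin{pmatrix} 1 & -1 & 2 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}.
\]
The transition matrix for changing the basis from the standard basis to the basis $f$ is then the inverse of T:
\[
T^{-1} = \begin{pmatrix} \frac14 & \frac34 & -\frac12 \\ -\frac14 & \frac14 & \frac12 \\ \frac14 & -\frac14 & \frac12 \end{pmatrix}.
\]
The matrix of the mapping in the basis $f$ is then given by the product $T^{-1} A T$, where A denotes the given matrix (see 2.2.11).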
□
2.E.6. Consider the vector space of polynomials of one variable of degree at most 2 with real coefficients. In this space, consider the basis 1, x, x2. Write down the matrix of the derivative mapping in this basis and also in the basis
\[ f = (1 + x^2,\ x,\ x + x^2). \]
Solution. First we have to determine the matrix of the derivative mapping (let us denote the mapping by $d$ and its matrix by $D$). We chose the basis $(1, x, x^2)$ as the standard basis e, so we have the coordinates $1 \sim (1,0,0)$, $x \sim (0,1,0)$ and $x^2 \sim (0,0,1)$. We look at the images of the basis vectors: $d(1) = 0 \sim (0,0,0)$, $d(x) = 1 \sim (1,0,0)$ and $d(x^2) = 2x \sim (0,2,0)$. Now we write the images as columns into the matrix D:
\[
D = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix}.
\]
course $\|v\| = 0$ if and only if $v = 0$. We also see immediately that two vectors in the Euclidean plane are perpendicular whenever their scalar product is zero.
Now we shall mimic this approach for higher dimensions. First, observe that the angle between two vectors is always a two-dimensional concept (we want the angle to be the same in the two-dimensional space containing the two vectors u and v). In the subsequent paragraphs, we shall consider only finitely dimensional vector spaces over real scalars R.
Scalar product and orthogonality
A scalar product on a vector space V over real numbers is a mapping $\langle\ ,\ \rangle : V \times V \to \mathbb{R}$ which is symmetric in its arguments, linear in each of them, and such that $\langle v, v \rangle \geq 0$ and $\|v\|^2 = \langle v, v \rangle = 0$ if and only if $v = 0$.
The number $\|v\| = \sqrt{\langle v, v \rangle}$ is called the length of the vector v.
Vectors $v$ and $w \in V$ are called orthogonal or perpendicular whenever $\langle v, w \rangle = 0$. We also write $v \perp w$. The vector v is called normalised whenever $\|v\| = 1$.
The basis of the space V composed exclusively of mutually orthogonal vectors is called an orthogonal basis. If the vectors in such a basis are all normalised, we call the basis orthonormal.
A scalar product is very often denoted by the common dot, that is, (u,v) = u ■ v. Thus, it is then necessary to recognize from the context whether the dot means a product of two vectors (the result is a scalar) or something different (e.g. we often denote the product of matrices and product of scalars in the same way).
Because the scalar product is linear in each of its arguments, it is completely determined by its values on pairs of basis vectors. Indeed, choose a basis $\underline{u} = (u_1, \dots, u_n)$ of the space V and denote
\[ s_{ij} = \langle u_i, u_j \rangle. \]
Then from the symmetry of the scalar product we know $s_{ij} = s_{ji}$ and from the linearity of the product in each of its arguments we get
\[
\Big\langle \sum_i x_i u_i, \sum_j y_j u_j \Big\rangle = \sum_{i,j} x_i y_j \langle u_i, u_j \rangle = \sum_{i,j} s_{ij} x_i y_j.
\]
If the basis is orthonormal, the matrix $S = (s_{ij})$ is the unit matrix. This proves the following useful claim:
Scalar product in coordinates
Proposition. For every orthonormal basis, the scalar product is given by the coordinate expression
\[ \langle x, y \rangle = y^T \cdot x. \]
For each basis of the space V there is the symmetric matrix S such that the coordinate expression of the scalar product is
\[ \langle x, y \rangle = y^T \cdot S \cdot x. \]
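To see the coordinate expression with the matrix S at work, here is a minimal sketch. It assumes the standard scalar product on $\mathbb{R}^2$ and a hypothetical non-orthonormal basis; the helper names are illustrative only.

```python
def dot(a, b):
    """Standard scalar product of two coordinate tuples."""
    return sum(p * q for p, q in zip(a, b))

def gram_matrix(basis):
    """S = (s_ij) with s_ij = <u_i, u_j>, here for the standard scalar product."""
    return [[dot(ui, uj) for uj in basis] for ui in basis]

def scalar_product_in_coordinates(S, x, y):
    """Evaluate y^T * S * x for coordinate columns x and y."""
    n = len(S)
    return sum(y[i] * S[i][j] * x[j] for i in range(n) for j in range(n))

u = [(1, 1), (0, 2)]      # hypothetical non-orthonormal basis of R^2
S = gram_matrix(u)
x, y = (1, 2), (3, -1)    # coordinates of two vectors in the basis u
value = scalar_product_in_coordinates(S, x, y)

# the same two vectors written in the standard basis: (1, 5) and (3, 1)
v = tuple(x[0] * u[0][i] + x[1] * u[1][i] for i in range(2))
w = tuple(y[0] * u[0][i] + y[1] * u[1][i] for i in range(2))
```

Both `value` and `dot(v, w)` compute the same scalar product, once through the matrix S and once directly.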
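Now we write the coordinates of the basis vectors of the basis $f$ into the columns:
\[
T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\]
to get the transition matrix from $f$ to e. As in the previous example we get the matrix of $d$ in the basis $f$ as
\[
T^{-1} D T = \begin{pmatrix} 0 & 1 & 1 \\ 2 & 1 & 3 \\ 0 & -1 & -1 \end{pmatrix},
\]
where we had to compute
\[
T^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & -1 \\ -1 & 0 & 1 \end{pmatrix}.
\]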
□
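The change of basis for the derivative mapping can be checked mechanically. The following is a minimal sketch using exact fractions; the helpers `matmul` and `inverse` are written here for illustration and are not part of the text.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination with exact fractions."""
    n = len(M)
    rows = [[Fraction(v) for v in M[r]] + [Fraction(int(r == c)) for c in range(n)]
            for r in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[n:] for row in rows]

D = [[0, 1, 0], [0, 0, 2], [0, 0, 0]]     # derivative in the basis (1, x, x^2)
T = [[1, 0, 0], [0, 1, 1], [1, 0, 1]]     # columns: 1 + x^2, x, x + x^2
D_f = matmul(matmul(inverse(T), D), T)    # matrix of the derivative in the basis f
```

The columns of `D_f` can be read directly: for instance the second column says that the derivative of x, the constant 1, equals $(1 + x^2) + x - (x + x^2)$.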
2.E.7. In the standard basis in $\mathbb{R}^3$, determine the matrix of the rotation through the angle 90° in the positive sense about the line $(t, t, t)$, $t \in \mathbb{R}$, oriented in the direction of the vector (1,1,1). Further, find the matrix of this rotation in the basis
\[ g = ((1,1,0), (1,0,-1), (0,1,1)). \]
Solution. We can easily determine the matrix of the given rotation in a suitable basis, that is, in a basis given by the directional vector of the line and by two mutually perpendicular vectors in the plane $x + y + z = 0$, that is, in the plane of vectors perpendicular to the vector (1,1,1). We note that the matrix of the rotation in the positive sense through 90° in an orthonormal basis in $\mathbb{R}^2$ is $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$. In an orthogonal basis with vectors of lengths $k$, $l$ respectively, it is
\[
\begin{pmatrix} 0 & -l/k \\ k/l & 0 \end{pmatrix}.
\]
If we choose the perpendicular vectors $(1,-1,0)$ and $(1,1,-2)$ in the plane $x + y + z = 0$, with lengths $\sqrt2$ and $\sqrt6$, then in the basis $f = ((1,1,1), (1,-1,0), (1,1,-2))$ the rotation we are looking for has the matrix
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -\sqrt3 \\ 0 & 1/\sqrt3 & 0 \end{pmatrix}.
\]
In order to obtain the matrix of the rotation in the standard basis, it is enough to change the basis. The transition matrix T for changing the basis from the basis $f$ to the standard basis is obtained by writing the coordinates (in the standard basis) of the vectors of the basis $f$ as the columns of the matrix T:
\[
T = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 0 & -2 \end{pmatrix}.
\]
Finally, for the desired matrix R, we have
Notice, that with symmetric matrix S it is just a matter of convention in which order we insert the vectors: the formula
\[ x^T \cdot S \cdot y = (x^T \cdot S \cdot y)^T = y^T \cdot S^T \cdot x = y^T \cdot S \cdot x \]
produces the same value. However, we shall later consider the second argument as a linear form, thus it seems to be more convenient to use the expression yT ■ S ■ x.
2.3.19. Orthogonal complements and projections. For every fixed subspace $W \subset V$ in a space with scalar product, we define its orthogonal complement as
\[ W^\perp = \{u \in V;\ u \perp v \text{ for all } v \in W\}. \]
It follows directly from the definition that $W^\perp$ is a vector subspace. If $W \subset V$ has a basis $(u_1, \dots, u_k)$ then the description of $W^\perp$ is given as k homogeneous equations for n variables. Thus $W^\perp$ will have dimension at least $n - k$. Also $u \in W \cap W^\perp$ means that $\langle u, u \rangle = 0$, and thus also $u = 0$ by the definition of scalar product. Clearly then, V is the direct sum
\[ V = W \oplus W^\perp. \]
A linear mapping $f : V \to V$ on any vector space is called a projection, if we have
\[ f \circ f = f. \]
In such a case, we can write, for every vector $v \in V$,
\[ v = f(v) + (v - f(v)) \in \operatorname{Im}(f) + \operatorname{Ker}(f) = V \]
and if $v \in \operatorname{Im}(f)$ and $f(v) = 0$, then also $v = 0$. Thus the above sum of the subspaces is direct. We say that $f$ is a projection to the subspace $W = \operatorname{Im}(f)$ along the subspace $U = \operatorname{Ker}(f)$. In words, the projection can be described naturally as follows: we decompose the given vector into a component in W and a component in U, and forget the second one.
If V has a scalar product, we say that the projection is orthogonal if the kernel is orthogonal to the image.
Every subspace $W \neq V$ thus defines an orthogonal projection to W. It is the projection to W along $W^\perp$, given by the unique decomposition of every vector $u$ into components $u_W \in W$ and $u_{W^\perp} \in W^\perp$, that is, the linear mapping which maps $u_W + u_{W^\perp}$ to $u_W$.
2.3.20. Existence of orthonormal bases. It is easy to see that on every finite dimensional real vector space there exist scalar products. Just choose any basis and declare it to be orthonormal, that is, define the scalar product so that the basis vectors have unit length and are mutually orthogonal. Immediately we have a scalar product. In this basis the scalar products of vectors are computed as in the formula in the Theorem 2.3.18.
More often we are given a scalar product on a vector space V, and we want to find an appropriate orthonormal basis for it. We present an algorithm using suitable orthogonal projections in order to transform any basis into an orthogonal one. It is called the Gram-Schmidt orthogonalization process.
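\[
R = T \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -\sqrt3 \\ 0 & 1/\sqrt3 & 0 \end{pmatrix} \cdot T^{-1} =
\begin{pmatrix}
\frac13 & \frac13 - \frac{\sqrt3}{3} & \frac13 + \frac{\sqrt3}{3} \\
\frac13 + \frac{\sqrt3}{3} & \frac13 & \frac13 - \frac{\sqrt3}{3} \\
\frac13 - \frac{\sqrt3}{3} & \frac13 + \frac{\sqrt3}{3} & \frac13
\end{pmatrix}.
\]
This result can be checked by substituting into the matrix of general rotation (2.E.3). By normalizing the vector (1,1,1) we obtain the vector $(x, y, z) = (1/\sqrt3, 1/\sqrt3, 1/\sqrt3)$, $\cos\varphi = 0$, $\sin\varphi = 1$. □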
2.E.8. Matrix of general rotation revisited. We derive the matrix of the (general) rotation from (2.E.3) through the angle $\varphi$ in the positive sense about the unit vector $(x, y, z)$ in a different way, analogously to the previous exercise. In the basis $f = ((x, y, z), (-y, x, 0), (-zx, -zy, 1 - z^2))$, that is, in the positively oriented orthogonal basis composed of the directional vector of the axis of rotation and of two mutually perpendicular vectors of size $\sqrt{1 - z^2}$ lying in the plane perpendicular to the axis of rotation, the matrix corresponding to the rotation is
\[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{pmatrix}.
\]
The matrix for changing the basis from $f$ to the standard basis is then
\[
T = \begin{pmatrix} x & -y & -zx \\ y & x & -zy \\ z & 0 & 1 - z^2 \end{pmatrix}
\quad\text{with the inverse matrix}\quad
T^{-1} = \begin{pmatrix} x & y & z \\ -\frac{y}{1 - z^2} & \frac{x}{1 - z^2} & 0 \\ -\frac{zx}{1 - z^2} & -\frac{zy}{1 - z^2} & 1 \end{pmatrix}.
\]
Finally, for the matrix R of the rotation we obtain
\[
R = T \cdot A \cdot T^{-1} =
\begin{pmatrix}
1 - t + tx^2 & txy - zs & txz + ys \\
yxt + zs & 1 - t + ty^2 & tyz - xs \\
zxt - ys & tzy + xs & 1 - t + tz^2
\end{pmatrix},
\]
where again $t = 1 - \cos\varphi$ and $s = \sin\varphi$, and we get the same matrix as before.
When multiplying and simplifying, we must repeatedly use the assumption $x^2 + y^2 + z^2 = 1$.
Through a more detailed analysis of properties of various types of linear mapping we now obtain a deeper understanding of tools we are given by vector spaces for linear modeling of processes and systems.
The point of this procedure is to transform a given sequence of independent generators v1,..., vk of a finite dimensional space V into an orthogonal set of independent generators of V.
Gram-Schmidt orthogonalization
Proposition. Let $(u_1, \dots, u_k)$ be a linearly independent k-tuple of vectors of a space V with scalar product. Then there exists an orthogonal system of vectors $(v_1, \dots, v_k)$ such that $v_i \in \operatorname{span}\{u_1, \dots, u_i\}$, and $\operatorname{span}\{u_1, \dots, u_i\} = \operatorname{span}\{v_1, \dots, v_i\}$, for all $i = 1, \dots, k$. We obtain it by the following procedure:
• The independence of the vectors $u_i$ ensures that $u_1 \neq 0$; we choose $v_1 = u_1$.
• If we have already constructed the vectors $v_1, \dots, v_\ell$ with the required properties and if $\ell < k$, we choose $v_{\ell+1} = u_{\ell+1} + a_1 v_1 + \dots + a_\ell v_\ell$, where $a_i = -\langle u_{\ell+1}, v_i \rangle / \|v_i\|^2$ for $i = 1, \dots, \ell$.
Proof. We begin with the first (nonzero) vector $v_1$ and calculate the orthogonal projection $v_2$ of $u_2$ to
\[ \operatorname{span}\{v_1\}^\perp \subset \operatorname{span}\{v_1, u_2\}. \]
The result is nonzero if and only if $u_2$ is independent of $v_1$. All other steps are similar:
In step $\ell$, $\ell > 1$, we seek the vector $v_{\ell+1} = u_{\ell+1} + a_1 v_1 + \dots + a_\ell v_\ell$ satisfying $\langle v_{\ell+1}, v_i \rangle = 0$ for all $i = 1, \dots, \ell$. This implies
\[
0 = \langle u_{\ell+1} + a_1 v_1 + \dots + a_\ell v_\ell, v_i \rangle = \langle u_{\ell+1}, v_i \rangle + a_i \langle v_i, v_i \rangle
\]
and we can see that the vectors with the desired properties are determined uniquely up to a scalar multiple. □
Whenever we have an orthogonal basis of a vector space V, we just have to normalise the vectors in order to obtain an orthonormal basis. Thus, starting the Gram-Schmidt orthogonalization with any basis of V, we have proven:
Corollary. On every finite dimensional real vector space with scalar product there exists an orthonormal basis.
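The procedure from the proposition translates almost literally into code. The sketch below assumes the standard scalar product on $\mathbb{R}^n$ and exact rational arithmetic; the function name `gram_schmidt` and the sample vectors are chosen here for illustration.

```python
from fractions import Fraction

def dot(a, b):
    """Standard scalar product of two coordinate sequences."""
    return sum(p * q for p, q in zip(a, b))

def gram_schmidt(vectors):
    """Orthogonalise linearly independent vectors (standard scalar product assumed)."""
    orthogonal = []
    for u in vectors:
        v = [Fraction(c) for c in u]
        for w in orthogonal:
            a = -Fraction(dot(u, w)) / dot(w, w)   # a_i = -<u, v_i> / ||v_i||^2
            v = [vc + a * wc for vc, wc in zip(v, w)]
        orthogonal.append(v)
    return orthogonal

# sample basis of the hyperplane x1 + x2 + x3 + x4 = 0 in R^4
basis = [(-1, 1, 0, 0), (-1, 0, 1, 0), (-1, 0, 0, 1)]
v1, v2, v3 = gram_schmidt(basis)
```

The output is orthogonal but not normalised; dividing each vector by its length would give an orthonormal basis, at the cost of leaving exact rational arithmetic.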
In an orthonormal basis, the coordinates and orthogonal projections are very easy to calculate. Indeed, suppose we have an orthonormal basis $(e_1, \dots, e_n)$ for a space V. Then every vector $v = x_1 e_1 + \dots + x_n e_n$ satisfies
\[ \langle e_i, v \rangle = \langle e_i, x_1 e_1 + \dots + x_n e_n \rangle = x_i \]
and so we can always express
\[ (1)\qquad v = \langle e_1, v \rangle e_1 + \dots + \langle e_n, v \rangle e_n. \]
If we are given a subspace $W \subset V$ and its orthonormal basis $(e_1, \dots, e_k)$, then we can extend it to an orthonormal basis $(e_1, \dots, e_n)$ for V. The orthogonal projection of a general vector $v \in V$ to W is then given by the expression
\[ v \mapsto \langle e_1, v \rangle e_1 + \dots + \langle e_k, v \rangle e_k. \]
2.E.9. Consider complex numbers as a real vector space and choose 1 and i for its basis. Determine in this basis the matrix of the following linear mappings:
a) conjugation,
b) multiplication by the number (2 + i).
Determine the matrix of these mappings in the basis $f = ((1 - i), (1 + i))$.
Solution. In order to determine the matrix of a linear mapping in some basis, it is enough to determine the images of the basis vectors.
a) For conjugation we have $1 \mapsto 1$, $i \mapsto -i$, written in coordinates $(1,0) \mapsto (1,0)$ and $(0,1) \mapsto (0,-1)$. By writing the images into the columns we obtain the matrix
\[ \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \]
In the basis $f$ the conjugation interchanges the basis vectors, that is, $(1,0) \mapsto (0,1)$ and $(0,1) \mapsto (1,0)$, and the matrix of conjugation in this basis is
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \]
b) For the basis $(1, i)$ we obtain $1 \mapsto 2 + i$, $i \mapsto 2i - 1$, that is, $(1,0) \mapsto (2,1)$, $(0,1) \mapsto (-1,2)$. Thus the matrix of multiplication by the number $2 + i$ in the basis $(1, i)$ is
\[ \begin{pmatrix} 2 & -1 \\ 1 & 2 \end{pmatrix}. \]
We determine the matrix in the basis $f$. Multiplication by $(2 + i)$ gives us $(1 - i) \mapsto (1 - i)(2 + i) = 3 - i$ and $(1 + i) \mapsto (1 + i)(2 + i) = 1 + 3i$. The coordinates $(a, b)_f$ of the vector $3 - i$ in the basis $f$ are given, as we know, by the equation $a \cdot (1 - i) + b \cdot (1 + i) = 3 - i$, that is, $(3 - i)_f = (2, 1)$. Analogously $(1 + 3i)_f = (-1, 2)$. Altogether, we obtain the matrix
\[ \begin{pmatrix} 2 & -1 \\ 1 & 2 \end{pmatrix}. \]
Think about the following: why is the matrix of multiplication by $2 + i$ the same in both bases? Would the two matrices in these bases be the same for multiplication by any complex number? □
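The four matrices of this exercise can be recomputed with Python's built-in complex numbers. This is only a sketch; the helper names `coordinates`, `matrix_of` and `times` are illustrative and the coordinates are obtained by Cramer's rule for a 2 by 2 real system.

```python
def coordinates(w, basis):
    """Real coordinates (a, b) of the complex number w in the real basis (b1, b2) of C."""
    b1, b2 = basis
    det = b1.real * b2.imag - b1.imag * b2.real
    a = (w.real * b2.imag - w.imag * b2.real) / det
    b = (b1.real * w.imag - b1.imag * w.real) / det
    return (a, b)

def matrix_of(mapping, basis):
    """Matrix whose columns are the coordinates of the images of the basis vectors."""
    cols = [coordinates(mapping(v), basis) for v in basis]
    return [[cols[0][0], cols[1][0]], [cols[0][1], cols[1][1]]]

def conjugate(z):
    return z.conjugate()

def times(z):
    """Multiplication by the number 2 + i."""
    return z * (2 + 1j)

e = (1 + 0j, 1j)        # the basis (1, i)
f = (1 - 1j, 1 + 1j)    # the basis (1 - i, 1 + i)

conj_e, conj_f = matrix_of(conjugate, e), matrix_of(conjugate, f)
mult_e, mult_f = matrix_of(times, e), matrix_of(times, f)
```

Replacing `2 + 1j` in `times` by another complex number is a quick way to experiment with the question posed at the end of the exercise.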
2.E.10. Determine the matrix A which, in the standard basis of the space $\mathbb{R}^3$, gives the orthogonal projection onto the vector subspace generated by the vectors $u_1 = (-1, 1, 0)$ and $u_2 = (-1, 0, 1)$.
Solution. Note first that the given subspace is a plane containing the origin with the normal vector $u_3 = (1, 1, 1)$. The ordered triple (1,1,1) is clearly a solution to the system
\[ -x_1 + x_2 = 0, \qquad -x_1 + x_3 = 0, \]
that is, the vector $u_3$ is perpendicular to the vectors $u_1$, $u_2$.
Under the given projection the vectors $u_1$ and $u_2$ must map to themselves and the vector $u_3$ to the zero vector. In
In particular, we need only consider an orthonormal basis of the subspace W in order to write the orthogonal projection to W explicitly.
Note that in general the projection $f$ to the subspace W along U and the projection $g$ to U along W are related by the equality $g = \mathrm{id}_V - f$. Thus, when dealing with orthogonal projections to a given subspace W, it is always more efficient to calculate the orthonormal basis of whichever of the spaces W and $W^\perp$ has the smaller dimension.
Note also that the existence of an orthonormal basis guarantees that for every real space V of dimension n with a scalar product, there exists a linear mapping which is an isomorphism between V and the space $\mathbb{R}^n$ with the standard scalar product (i.e. respecting the scalar products as well). We saw already in Theorem 2.3.18 that the desired isomorphism is exactly the coordinate assignment. In words, in every orthonormal basis the scalar product is computed by the same formula as the standard scalar product in $\mathbb{R}^n$.
The constant coefficient is the determinant \A\. We shall see later that this coefficient describes how much the linear mapping scales the volumes.
We shall return to the questions of the length of a vector and to projections in the following chapter in a more general context.
2.3.21. Angle between two vectors. As we have already noted, the angle between two linearly independent vectors in the space must be the same as when we consider them in the two-dimensional subspace they generate. Basically, this is the reason why the notion of angle is independent of the dimension of the original space. If we choose an orthogonal basis such that its first two vectors generate the same subspace as the two given vectors u and v (whose angle we are measuring), we can simply take the definition from the planar geometry. Independently of the choice of coordinates we can formulate the definition as follows:
Angle between two vectors
The angle $\varphi$ between two vectors v and w in a vector space with a scalar product is given by the relation
\[ \cos\varphi = \frac{\langle v, w \rangle}{\|v\| \, \|w\|}. \]
The angle defined in this way does not depend on the order of the vectors v, w and it is chosen in the interval $0 \leq \varphi \leq \pi$.
We shall return to scalar products and angles between vectors in further chapters.
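As a small illustration of the definition, the following sketch computes the angle for the standard scalar product on $\mathbb{R}^n$. The function name `angle` is chosen here for illustration, and the clamping step is only a guard against floating-point rounding.

```python
import math

def dot(a, b):
    """Standard scalar product of two coordinate tuples."""
    return sum(p * q for p, q in zip(a, b))

def angle(v, w):
    """Angle in [0, pi] between two nonzero vectors, for the standard scalar product."""
    cosine = dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))
    return math.acos(max(-1.0, min(1.0, cosine)))   # clamp against rounding errors

phi = angle((1, 0, 0), (1, 1, 0))      # two vectors in the plane z = 0
right = angle((1, -1, 0), (1, 1, -2))  # perpendicular vectors
```

The value depends only on the two-dimensional subspace spanned by the vectors, so adding further zero coordinates to both vectors does not change it.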
2.3.22. Multilinear forms. The scalar product was given as a mapping from the product of two copies of a vector space V into the space of scalars, which was linear in each of its arguments. Similarly, we will work with mappings from the product of k copies of a vector space V into the scalars, which are linear in each of its k arguments. We speak of k-linear forms.
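the basis composed of $u_1, u_2, u_3$ (in this order), the matrix of this projection is thus
\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \]
Using the transition matrix
\[
T = \begin{pmatrix} -1 & -1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}
\]
for changing the basis from the basis $(u_1, u_2, u_3)$ to the standard basis, and the transition matrix
\[
T^{-1} = \begin{pmatrix} -\frac13 & \frac23 & -\frac13 \\ -\frac13 & -\frac13 & \frac23 \\ \frac13 & \frac13 & \frac13 \end{pmatrix}
\]
from the standard basis to the basis $(u_1, u_2, u_3)$, we obtain
\[
A = \begin{pmatrix} -1 & -1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}
\cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\cdot \begin{pmatrix} -\frac13 & \frac23 & -\frac13 \\ -\frac13 & -\frac13 & \frac23 \\ \frac13 & \frac13 & \frac13 \end{pmatrix}
= \begin{pmatrix} \frac23 & -\frac13 & -\frac13 \\ -\frac13 & \frac23 & -\frac13 \\ -\frac13 & -\frac13 & \frac23 \end{pmatrix}.
\]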
□
F. Inner products and linear maps
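2.F.1. Write down the matrix of the orthogonal projection onto the plane passing through the origin and perpendicular to the vector (1,1,1).
Solution. The image of an arbitrary point (vector) $x = (x_1, x_2, x_3) \in \mathbb{R}^3$ under the considered mapping can be obtained by subtracting from the given vector its orthogonal projection onto the direction normal to the considered plane, that is, onto the direction (1,1,1). This projection $p$ is given (see (1)) by
\[
p = \frac{\langle x, (1,1,1) \rangle}{\|(1,1,1)\|^2} (1,1,1) = \Big( \frac{x_1 + x_2 + x_3}{3}, \frac{x_1 + x_2 + x_3}{3}, \frac{x_1 + x_2 + x_3}{3} \Big).
\]
The resulting mapping is thus
\[
x \mapsto x - p = \Big( \frac{2x_1}{3} - \frac{x_2 + x_3}{3},\ \frac{2x_2}{3} - \frac{x_1 + x_3}{3},\ \frac{2x_3}{3} - \frac{x_1 + x_2}{3} \Big)
= \begin{pmatrix} \frac23 & -\frac13 & -\frac13 \\ -\frac13 & \frac23 & -\frac13 \\ -\frac13 & -\frac13 & \frac23 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.
\]
We have (correctly) obtained the same matrix as in the exercise 2.E.10. □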
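The subtraction of the normal component can be written as the single matrix $E - n n^T / \langle n, n \rangle$. The sketch below evaluates this expression exactly; the function names are illustrative, not taken from the text.

```python
from fractions import Fraction

def projection_onto_plane(normal):
    """Matrix E - n n^T / <n, n> of the orthogonal projection onto the plane with normal n."""
    n = [Fraction(c) for c in normal]
    norm_sq = sum(c * c for c in n)
    size = len(n)
    return [[Fraction(int(i == j)) - n[i] * n[j] / norm_sq for j in range(size)]
            for i in range(size)]

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B))]
            for i in range(len(A))]

P = projection_onto_plane((1, 1, 1))
P_squared = matmul(P, P)   # a projection must satisfy P * P = P
```

Checking that `P_squared` equals `P` confirms the defining property of a projection.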
Most often we will meet bilinear forms, that is, the case $\alpha : V \times V \to \mathbb{K}$, where for any four vectors u, v, w, z and scalars a, b, c and d we have
\[
\alpha(au + bv, cw + dz) = ac\,\alpha(u, w) + ad\,\alpha(u, z) + bc\,\alpha(v, w) + bd\,\alpha(v, z).
\]
If additionally we always have
\[ \alpha(u, w) = \alpha(w, u), \]
then we speak of a symmetric bilinear form. If interchanging the arguments leads to a change of sign, we speak of an antisymmetric bilinear form.
Already in planar geometry we have defined the determinant as a bilinear antisymmetric form $\alpha$, that is, $\alpha(u, w) = -\alpha(w, u)$. In general, due to the theorem 2.2.5, we know that the determinant in dimension n can be seen as an n-linear antisymmetric form.
As with linear mappings it is clear that every k-linear form is completely determined by its values on all k-tuples of basis elements in a fixed basis. In analogy to linear mappings we can see these values as k-dimensional analogues of matrices. We show this by an example with k = 2, where it will correspond to matrices as we have defined them.
Matrix of a bilinear form
If we choose a basis $\underline{u}$ on V and define for a given bilinear form $\alpha$ the scalars $a_{ij} = \alpha(u_i, u_j)$, then we obtain for vectors v, w with coordinates x and y (as columns of coordinates)
\[
\alpha(v, w) = \sum_{i,j=1}^{n} a_{ij} x_i y_j = x^T \cdot A \cdot y,
\]
where A is the matrix $A = (a_{ij})$.
Directly from the definition of the matrix of a bilinear form we see that the form is symmetric or antisymmetric if and only if the corresponding matrix has this property.
Every bilinear form $\alpha$ on a vector space V defines a mapping $V \to V^*$, $v \mapsto \alpha(v, \ )$. That is, by placing a fixed vector in the first argument we obtain a linear form which is the image of this vector. If we choose a fixed basis on a finite-dimensional space V and the dual basis on $V^*$, then we have the mapping
\[ x \mapsto (y \mapsto x^T \cdot A \cdot y). \]
All this is a matter of convention. Also we may fix the second vector and get a linear form again.
4. Properties of linear mappings
In order to exploit vector spaces and linear mappings in modelling real processes and systems in other sciences, we need a more detailed analysis of properties of diverse types of linear mappings.
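2.F.2. In $\mathbb{R}^3$ write down the matrix of the mirror symmetry with respect to the plane containing the origin and having (1,1,1) as its normal vector.
Solution. As in 2.F.1 we get the image of an arbitrary vector $x = (x_1, x_2, x_3) \in \mathbb{R}^3$ with the help of the orthogonal projection $p$ onto the direction (1,1,1). Unlike in the previous example, we need to subtract the projection twice (see image). Thus we get the matrix:
\[
x - 2p = \Big( \frac{x_1}{3} - \frac{2(x_2 + x_3)}{3},\ \frac{x_2}{3} - \frac{2(x_1 + x_3)}{3},\ \frac{x_3}{3} - \frac{2(x_1 + x_2)}{3} \Big)
= \begin{pmatrix} \frac13 & -\frac23 & -\frac23 \\ -\frac23 & \frac13 & -\frac23 \\ -\frac23 & -\frac23 & \frac13 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.
\]
Second solution. The normed normal vector of the mirror plane is $n = \frac{1}{\sqrt3}(1,1,1)$. We can express the mirror image of $v$ under the mirror symmetry $Z$ as follows: $Z(v) = v - 2\langle v, n \rangle n = v - 2n \cdot (n^T \cdot v) = v - 2(n \cdot n^T) \cdot v = (E - 2n \cdot n^T) v$ (where we have used $\langle v, n \rangle = n^T \cdot v$ for the standard scalar product and the associativity of the matrix multiplication). We get the same matrix:
\[
E - 2n \cdot n^T = \begin{pmatrix} \frac13 & -\frac23 & -\frac23 \\ -\frac23 & \frac13 & -\frac23 \\ -\frac23 & -\frac23 & \frac13 \end{pmatrix}.
\]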
□
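The formula $E - 2 n n^T$ from the second solution is easy to evaluate for any normal vector. The sketch below divides by $\langle n, n \rangle$ so that the normal does not have to be normed beforehand; this normalisation inside the function, and the function names, are choices made here for illustration.

```python
from fractions import Fraction

def reflection_matrix(normal):
    """Matrix E - 2 n n^T / <n, n> of the mirror symmetry in the plane with normal n."""
    n = [Fraction(c) for c in normal]
    norm_sq = sum(c * c for c in n)
    size = len(n)
    return [[Fraction(int(i == j)) - 2 * n[i] * n[j] / norm_sq for j in range(size)]
            for i in range(size)]

def apply(matrix, v):
    """Multiply a matrix by a column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

Z = reflection_matrix((1, 1, 1))
mirrored = apply(reflection_matrix((0, 0, 1)), [4, 3, 5])   # reflection in the plane z = 0
twice = apply(Z, apply(Z, [1, 2, 3]))                       # reflecting twice gives the original
```

Applying the reflection twice returns every vector to itself, which is a convenient check of the computed matrix.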
2.F.3. Consider R3, with the standard coordinate system. In the plane z = 0 there is a mirror and at the point [4,3,5] there is a candle. The observer at the point [1,2,3] is not aware of the mirror, but sees in it the reflection of the candle. Where does he think the candle is?
Solution. Independently of our position, we see the mirror image of the scene in the mirror (that is why it is called a mirror image). The mirror image is given by reflecting the scene (space) in the plane of the mirror, the plane $z = 0$. The reflection with respect to this plane changes the sign of the z-coordinate. That is, the observer sees the candle at the point [4,3,-5]. □
By using the inner product we can determine the (angular) deflection of vectors:
2.4.1.   We begin with four examples in the lowest dimension of interest. With the standard basis of the plane R2 and with the standard scalar 7-y/ product we consider the following matri-/// ■    ces of mappings / : R2 —> R2:
0 1 0 0
a 0 0 b
0 -1
1 0
The matrix A describes the orthogonal projection along the subspace
W={(0,a); a G R} C R2
to the subspace
V = {(a,0); a G R} C R2,
that is, the projection to the x-axis along the y-axis. Evidently for this / : R2 —> R2 we have / o / = / and thus the restriction /1 v of the given mapping on its codomain is the identity mapping. The kernel of / is exactly the subspace W.
The matrix $B$ has the property $B^2 = 0$, therefore the same holds for the corresponding mapping $f$. We can envision this as the differentiation of polynomials $\mathbb{R}_1[x]$ of degree at most one in the basis $(1, x)$ (we shall come to differentiation in chapter five, see 5.1.6).
The matrix $C$ gives a mapping $f$ which rescales the first vector of the basis $a$-times, and the second one $b$-times. Therefore the whole plane divides into two subspaces, which are preserved under the mapping and where it is only a homothety, that is, scaling by a scalar multiple (the first case was a special case with $a = 1$, $b = 0$). For instance the choice $a = 1$, $b = -1$ corresponds to axial symmetry (mirror symmetry) under the $x$-axis, which is the same as complex conjugation $x + iy \mapsto x - iy$ on the two-dimensional real space $\mathbb{R}^2 \simeq \mathbb{C}$ in the basis $(1, i)$. This is a linear mapping of the two-dimensional real vector space $\mathbb{C}$, but not of the one-dimensional complex space $\mathbb{C}$.
The matrix $D$ is the matrix of rotation by 90 degrees (the angle $\pi/2$) centered at the origin in the standard basis. We can see at first glance that none of the one-dimensional subspaces is preserved under this mapping.
Such a rotation is a bijection of the plane onto itself, therefore we can surely find distinct bases in the domain and codomain, where its matrix will be the unit matrix E. We simply take any basis of the domain and its image in the codomain. But we are not able to do this with the same basis for both the domain and the codomain.
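Consider the matrix $D$ as a matrix of the mapping $g : \mathbb{C}^2 \to \mathbb{C}^2$ with the standard basis of the complex vector space $\mathbb{C}^2$. Then we can find vectors $u = (i, 1)$, $v = (-i, 1)$, for which we have
\[
g(u) = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ 1 \end{pmatrix} = \begin{pmatrix} -1 \\ i \end{pmatrix} = i \cdot u, \qquad
g(v) = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} -i \\ 1 \end{pmatrix} = \begin{pmatrix} -1 \\ -i \end{pmatrix} = -i \cdot v.
\]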
2.F.4. Determine the deflection of the roots of the polynomial $x^2 - i$ considered as vectors in the complex plane.
Solution. The roots of the given polynomial are the square roots of $i$. The arguments of the two square roots of any complex number differ, according to the de Moivre theorem, by $\pi$. Their deflection is thus always $\pi$. □
2.F.5. Determine the cosine of the deflection of the lines $p$, $q$ in $\mathbb{R}^3$ given by the equations
\[
p:\ \begin{aligned} -2x + y + z &= 1, \\ x + 3y - 4z &= 5, \end{aligned}
\qquad
q:\ \begin{aligned} x - y &= -2, \\ z &= 6. \end{aligned}
\]
○
2.F.6. Using the Gram-Schmidt orthogonalisation, obtain an orthogonal basis of the subspace
\[ U = \{(x_1, x_2, x_3, x_4)^T \in \mathbb{R}^4;\ x_1 + x_2 + x_3 + x_4 = 0\} \]
of the space $\mathbb{R}^4$.
Solution. The set of solutions of the given homogeneous linear equation is clearly a vector space with the basis
\[ u_1 = (-1, 1, 0, 0)^T, \quad u_2 = (-1, 0, 1, 0)^T, \quad u_3 = (-1, 0, 0, 1)^T. \]
Denote by $v_1, v_2, v_3$ the vectors of the orthogonal basis obtained using the Gram-Schmidt orthogonalisation process.
First set $v_1 = u_1$. Then let
\[ v_2 = u_2 - \frac{u_2 \cdot v_1}{|v_1|^2}\, v_1 = u_2 - \tfrac{1}{2} v_1 = \left(-\tfrac{1}{2}, -\tfrac{1}{2}, 1, 0\right)^T, \]
that is, choose a multiple $v_2 = (-1, -1, 2, 0)^T$. Then let
\[ v_3 = u_3 - \frac{u_3 \cdot v_1}{|v_1|^2}\, v_1 - \frac{u_3 \cdot v_2}{|v_2|^2}\, v_2 = u_3 - \tfrac{1}{2} v_1 - \tfrac{1}{6} v_2 = \left(-\tfrac{1}{3}, -\tfrac{1}{3}, -\tfrac{1}{3}, 1\right)^T. \]
Altogether we have
\[ v_1 = (-1, 1, 0, 0)^T, \quad v_2 = (-1, -1, 2, 0)^T, \quad v_3 = (-1, -1, -1, 3)^T. \]
Due to the simplicity of the exercise we can immediately give an orthogonal basis of the vectors
\[ (1,-1,0,0)^T, \quad (0,0,1,-1)^T, \quad (1,1,-1,-1)^T \]
That means that in the basis $(u, v)$ on $\mathbb{C}^2$, the mapping $g$ has the matrix
\[ K = \begin{pmatrix} i & 0 \\ 0 & -i \end{pmatrix}. \]
Notice that by extending the scalars to $\mathbb{C}$, we arrive at an analogy to the matrix $C$ with diagonal elements $a = \cos(\tfrac{1}{2}\pi) + i \sin(\tfrac{1}{2}\pi)$ and its complex conjugate $\bar{a}$. In other words, the argument of the number $a$ in polar form provides the angle of the rotation.
This is easy to understand, if we denote the real and imaginary part of the vector $u$ as follows
\[ x_u + i y_u = \operatorname{Re} u + i \operatorname{Im} u. \]
The vector $v$ is the complex conjugate of $u$. We are interested in the restriction of the mapping $g$ to the real vector subspace
$V = \mathbb{R}^2 \cap \operatorname{span}_{\mathbb{C}}\{u, \bar{u}\} \subset \mathbb{C}^2$. Evidently,
\[ V = \operatorname{span}_{\mathbb{R}}\{u + \bar{u},\ i(u - \bar{u})\} = \operatorname{span}_{\mathbb{R}}\{x_u, -y_u\} \]
is the whole plane $\mathbb{R}^2$. The restriction of $g$ to this plane is exactly the original mapping given by the matrix $D$ (notice this matrix is real, thus it preserves this real subspace). It is immediately seen that this is the rotation through the angle $\tfrac{1}{2}\pi$ in the positive sense with respect to the chosen basis $x_u$, $-y_u$. Work it out by yourself with a direct calculation. Note also why exchanging the order of the vectors $u$ and $v$ leads to the same result, although in a different real basis!
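The direct calculation suggested above can also be checked by machine. The following Python sketch is only an illustration (the helper `apply` and all variable names are chosen here and do not come from the text); it verifies that $u = (i, 1)$ and $v = (-i, 1)$ are eigenvectors of $D$ with eigenvalues $i$ and $-i$, and that in the real basis $(x_u, -y_u)$ the mapping sends the first basis vector to the second and the second to minus the first, which is the rotation by $\pi/2$.

```python
# Rotation by 90 degrees in the standard basis, used over the complex numbers.
D = [[0, -1],
     [1, 0]]

def apply(matrix, vector):
    """Multiply a 2x2 matrix by a vector given as a tuple."""
    return tuple(sum(row[j] * vector[j] for j in range(2)) for row in matrix)

u = (1j, 1)    # claimed eigenvector for the eigenvalue i
v = (-1j, 1)   # its complex conjugate, claimed eigenvector for -i

Du = apply(D, u)
Dv = apply(D, v)

# Real basis (x_u, -y_u) built from the real and imaginary parts of u.
x_u = tuple(complex(z).real for z in u)         # (0.0, 1.0)
minus_y_u = tuple(-complex(z).imag for z in u)  # (-1.0, -0.0)

print(Du, Dv)
print(apply(D, x_u), apply(D, minus_y_u))
```

In the basis $(x_u, -y_u)$ the two images give the columns $(0, 1)$ and $(-1, 0)$, so the matrix is again $D$.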
2.4.2. Eigenvalues and eigenvectors of mappings. A key to the description of mappings in the previous examples was the answer to the question "what are the vectors satisfying the equation $f(u) = a \cdot u$ for some suitable scalars $a$?".
We consider this question for any linear mapping $f : V \to V$ on a vector space of dimension $n$ over scalars $\mathbb{K}$. If we imagine such an equality written in coordinates, i.e. using the matrix $A$ of the mapping in some basis, we obtain a system of linear equations
\[ A \cdot x - a \cdot x = (A - a \cdot E) \cdot x = 0 \]
with an unknown parameter $a$. We know already that such a system of equations has only the solution $x = 0$ if the matrix $A - aE$ is invertible. Thus we want to find such values $a \in \mathbb{K}$ for which $A - aE$ is not invertible, and for that, the necessary and sufficient condition reads (see Theorem 2.2.11)
\[ \det(A - a \cdot E) = 0. \tag{1} \]
If we consider $\lambda = a$ as a variable in the previous scalar equation, we are actually looking for the roots of a polynomial of degree $n$. As we have seen in the case of the matrix $D$, the roots may exist in an extension of our field of scalars, if they are not in $\mathbb{K}$.
or
\[ (-1,1,1,-1)^T, \quad (1,-1,1,-1)^T, \quad (-1,-1,1,1)^T. \]
□
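The Gram-Schmidt computation of 2.F.6 can be reproduced in exact arithmetic. The sketch below is an illustration only: the function names `dot` and `gram_schmidt` are introduced here, and the input vectors are the basis $u_1, u_2, u_3$ of $U$ as read from the solution. No normalisation or rescaling is done, so the output vectors are the unscaled ones, for instance $(-\tfrac12, -\tfrac12, 1, 0)$ rather than its multiple $(-1,-1,2,0)$.

```python
from fractions import Fraction

def dot(a, b):
    """Standard scalar product of two coordinate vectors."""
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Return an orthogonal (not normalised) basis of span(vectors)."""
    basis = []
    for u in vectors:
        w = [Fraction(x) for x in u]
        for v in basis:
            # Subtract the projection of u onto the already constructed v.
            coeff = dot(u, v) / dot(v, v)
            w = [wi - coeff * vi for wi, vi in zip(w, v)]
        basis.append(w)
    return basis

u1, u2, u3 = (-1, 1, 0, 0), (-1, 0, 1, 0), (-1, 0, 0, 1)
v1, v2, v3 = gram_schmidt([u1, u2, u3])
print(v1, v2, v3)
```

Using `Fraction` keeps the coefficients $\tfrac12$ and $\tfrac16$ exact, so orthogonality can be tested with equality rather than with a tolerance.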
2.F.7. Write down a basis of the real vector space of the matrices 3x3 over R with zero trace. (The trace of a matrix is the sum of the elements on the diagonal). Write the coordinates of the matrix
in this basis.
2.F.8. Find the orthogonal complement $U^\perp$ of the subspace
\[ U = \{(x_1, x_2, x_3, x_4);\ x_1 = x_3,\ x_2 = x_3 + 6x_4\} \subset \mathbb{R}^4. \]
Solution. The orthogonal complement $U^\perp$ consists of just those vectors that are perpendicular to every solution of the system
\[
\begin{aligned}
x_1 - x_3 &= 0, \\
x_2 - x_3 - 6x_4 &= 0.
\end{aligned}
\]
A vector is a solution of this system if and only if it is perpendicular to both vectors $(1, 0, -1, 0)$, $(0, 1, -1, -6)$. Thus we have
\[ U^\perp = \{a \cdot (1, 0, -1, 0) + b \cdot (0, 1, -1, -6);\ a, b \in \mathbb{R}\}. \]
□
2.F.9. Find an orthonormal basis of the subspace $V \subset \mathbb{R}^4$, where $V = \{(x_1, x_2, x_3, x_4) \in \mathbb{R}^4 \mid x_1 + 2x_2 + x_3 = 0\}$.
Solution. The fourth coordinate does not appear in the restriction for the subspace, thus it seems reasonable to select $(0,0,0,1)$ as one of the vectors of the orthonormal basis and reduce the problem to the subspace $\mathbb{R}^3$. If we set the second coordinate equal to zero, then the investigated space contains the vectors with opposite first and third coordinates, notably the unit vector $\left(\tfrac{1}{\sqrt{2}}, 0, -\tfrac{1}{\sqrt{2}}, 0\right)$. This vector is perpendicular to any vector whose first coordinate equals the third coordinate. In order to get into the investigated subspace, we choose the second coordinate equal to the negative of half the sum of the first and the third coordinate, and then normalise. Thus we choose the vector $\left(\tfrac{1}{\sqrt{3}}, -\tfrac{1}{\sqrt{3}}, \tfrac{1}{\sqrt{3}}, 0\right)$ and we are finished. □
Eigenvalues and eigenvectors
Scalars $\lambda \in \mathbb{K}$ satisfying the equation $f(u) = \lambda \cdot u$ for some nonzero vector $u \in V$ are called the eigenvalues of the mapping $f$. The corresponding nonzero vectors $u$ are called the eigenvectors of the mapping $f$.
If $u$, $v$ are eigenvectors associated with the same eigenvalue $\lambda$, then for every linear combination of $u$ and $v$,
\[ f(au + bv) = a f(u) + b f(v) = \lambda (au + bv). \]
Therefore the eigenvectors associated with the same eigenvalue $\lambda$, together with the zero vector, form a nontrivial vector subspace $V_\lambda \subset V$. We call it the eigenspace associated with $\lambda$. For instance, if $\lambda = 0$ is an eigenvalue, the kernel $\operatorname{Ker} f$ is the eigenspace $V_0$.
We have seen how to compute the eigenvalues in coordinates. The independence of the eigenvalues from the choice of coordinates is clear from their definition. But let us look explicitly at what happens if we change the basis. As a direct corollary of the transformation properties from paragraph 2.3.16 and the Cauchy theorem 2.2.7 for the determinant of a product, the matrix $A'$ in the new coordinates will be $A' = P^{-1} A P$ with an invertible matrix $P$. Thus
\[
\begin{aligned}
|P^{-1} A P - \lambda E| &= |P^{-1} A P - P^{-1} \lambda E P| \\
&= |P^{-1} (A - \lambda E) P| \\
&= |P^{-1}|\, |A - \lambda E|\, |P| \\
&= |A - \lambda E|,
\end{aligned}
\]
because the scalar multiplication is commutative and we know that $|P^{-1}| = |P|^{-1}$.
For these reasons we use the same terminology for matrices and mappings:
Characteristic polynomials
For a matrix $A$ of dimension $n$ over $\mathbb{K}$ we call the polynomial $|A - \lambda E| \in \mathbb{K}_n[\lambda]$ the characteristic polynomial of the matrix $A$.
Roots of this polynomial are the eigenvalues of the matrix $A$. If $A$ is the matrix of the mapping $f : V \to V$ in a certain basis, then $|A - \lambda E|$ is also called the characteristic polynomial of the mapping $f$.
Because the characteristic polynomial of a linear mapping $f : V \to V$ is independent of the choice of the basis of $V$, the coefficients of individual powers of the variable $\lambda$ are scalars expressing some properties of $f$. In particular, they too cannot depend on the choice of the basis. Suppose $\dim V = n$ and $A = (a_{ij})$ is the matrix of the mapping in some basis. Then
\[ |A - \lambda \cdot E| = (-1)^n \lambda^n + (-1)^{n-1} (a_{11} + \dots + a_{nn}) \lambda^{n-1} + \dots + |A| \lambda^0. \]
The coefficient at the highest power says whether the dimension of the space $V$ is even or odd.
G. Eigenvalues and eigenvectors
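2.G.1. Find the eigenvalues and the associated subspaces of eigenvectors of the matrix
\[ A = \begin{pmatrix} -1 & 1 & 0 \\ -1 & 3 & 0 \\ 2 & -2 & 2 \end{pmatrix}. \]
Solution. First we find the characteristic polynomial of the matrix:
\[
|A - \lambda E| = \begin{vmatrix} -1 - \lambda & 1 & 0 \\ -1 & 3 - \lambda & 0 \\ 2 & -2 & 2 - \lambda \end{vmatrix} = -\lambda^3 + 4\lambda^2 - 2\lambda - 4.
\]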
This polynomial has roots $2$, $1 + \sqrt{3}$, $1 - \sqrt{3}$, which are then the eigenvalues of the matrix. Their algebraic multiplicity is one (they are simple roots of the polynomial), thus each has only one associated eigenvector (up to a non-zero multiple). Otherwise stated, the geometric multiplicity of each eigenvalue is one (see 3.4.10).
We determine the eigenvector associated with the eigenvalue 2. It is a solution of the homogeneous linear system with the matrix $A - 2E$:
\[
\begin{aligned}
-3x_1 + x_2 &= 0, \\
-x_1 + x_2 &= 0, \\
2x_1 - 2x_2 &= 0.
\end{aligned}
\]
The system has the solution $x_1 = x_2 = 0$, $x_3 \in \mathbb{R}$ arbitrary. So the eigenvector associated with the value 2 is the vector $(0,0,1)$ (or any non-zero multiple of it).
Similarly we determine the remaining two eigenvectors as solutions of the system $[A - (1 + \sqrt{3})E]x = 0$. The solution of the system
\[
\begin{aligned}
(-2 - \sqrt{3})x_1 + x_2 &= 0, \\
-x_1 + (2 - \sqrt{3})x_2 &= 0, \\
2x_1 - 2x_2 + (1 - \sqrt{3})x_3 &= 0
\end{aligned}
\]
is the space $\{(2 - \sqrt{3}, 1, -2)\,t;\ t \in \mathbb{R}\}$.
That is the space of eigenvectors associated with the eigenvalue $1 + \sqrt{3}$.
Similarly we obtain that the space of eigenvectors associated with the eigenvalue $1 - \sqrt{3}$ is $\{(2 + \sqrt{3}, 1, -2)\,t;\ t \in \mathbb{R}\}$. □
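A quick floating-point check of the three eigenpairs of 2.G.1 is sketched below. It is illustrative only: the matrix is taken as it can be read off from the determinant $|A - \lambda E|$ in the solution, the helper names `apply` and `is_eigenpair` are introduced here, and the tolerance `1e-12` is an arbitrary choice.

```python
import math

# Matrix of 2.G.1, read off from the determinant |A - lambda E| in the solution.
A = [[-1, 1, 0],
     [-1, 3, 0],
     [2, -2, 2]]

def apply(matrix, vector):
    """Multiply a matrix by a coordinate vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def is_eigenpair(matrix, value, vector, tol=1e-12):
    """Check A*v == value*v componentwise, up to a small tolerance."""
    image = apply(matrix, vector)
    return all(abs(y - value * x) <= tol for x, y in zip(vector, image))

s = math.sqrt(3)
pairs = [
    (2, [0, 0, 1]),
    (1 + s, [2 - s, 1, -2]),
    (1 - s, [2 + s, 1, -2]),
]
checks = [is_eigenpair(A, lam, vec) for lam, vec in pairs]
print(checks)
```

Note that the third coordinate of both irrational eigenvectors is $-2$; with $+2$ in place of $-2$ the third equation of the system is not satisfied.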
2.G.2. Determine the eigenvalues and eigenvectors of the matrix
The most interesting coefficient is the sum of the diagonal elements of the matrix. We have just proved that it does not depend on the choice of the basis and we call it the trace of the matrix A and denote it by Tr A. The trace of the mapping f is defined as a trace of the matrix in an arbitrary basis.
In fact, this is not so surprising once we notice that the trace is actually the linear approximation of the determinant in the neighbourhood of the unit matrix in the direction $A$. We shall deal with such concepts in Chapter 8 only. But since the determinant is a polynomial, we may see easily that the only terms in $\det(E + tA)$ which are linear in the real parameter $t$ are just $t$ times the trace. We shall see the relation to the matrix exponential later in Chapter 8.
The coefficient at $\lambda^0$ is the determinant $|A|$ and we shall see later that it describes the rescaling of volumes by the mapping.
2.4.3. Basis of eigenvectors. We discuss a few important properties of eigenspaces now.
Theorem. Eigenvectors of linear mappings $f : V \to V$ associated to different eigenvalues are linearly independent.
Proof. Let $a_1, \dots, a_k$ be distinct eigenvalues of the mapping $f$ and $u_1, \dots, u_k$ eigenvectors with these eigenvalues. The proof is by induction on the number of linearly independent vectors among the chosen ones.
Assume that $u_1, \dots, u_\ell$ are linearly independent and $u_{\ell+1} = \sum_{i=1}^{\ell} c_i u_i$ is their linear combination. We can start with $\ell = 1$, because the eigenvectors are nonzero. But then $f(u_{\ell+1}) = a_{\ell+1} \cdot u_{\ell+1} = \sum_{i=1}^{\ell} a_{\ell+1} \cdot c_i \cdot u_i$, that is,
\[ f(u_{\ell+1}) = \sum_{i=1}^{\ell} a_{\ell+1} \cdot c_i \cdot u_i = \sum_{i=1}^{\ell} c_i \cdot f(u_i) = \sum_{i=1}^{\ell} c_i \cdot a_i \cdot u_i. \]
By subtracting the second and the fourth expression in the equalities we obtain $0 = \sum_{i=1}^{\ell} (a_{\ell+1} - a_i) \cdot c_i \cdot u_i$. All the differences between the eigenvalues are nonzero and at least one coefficient $c_i$ is nonzero. This is a contradiction with the assumed linear independence of $u_1, \dots, u_\ell$, therefore also the vector $u_{\ell+1}$ must be linearly independent of the others. □
The latter theorem can be seen as a decomposition of a linear mapping $f$ into a sum of much simpler mappings. If there are $n = \dim V$ distinct eigenvalues $\lambda_i$, we obtain the entire $V$ as a direct sum of one-dimensional eigenspaces $V_{\lambda_i}$. Each of them then describes a projection on this invariant one-dimensional subspace, where the mapping is given just as multiplication by the eigenvalue $\lambda_i$.
Furthermore, this decomposition can be easily calculated:
\[ A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ 1 & 2 & 1 \end{pmatrix}. \]
Describe the geometric interpretation of this mapping and write down its matrix in the basis:
\[ e_1 = (1,-1,1), \quad e_2 = (1,2,0), \quad e_3 = (0,1,1). \]
Solution. The characteristic polynomial of the matrix $A$ is
\[
\begin{vmatrix} 1 - \lambda & 1 & 0 \\ 1 & 2 - \lambda & 1 \\ 1 & 2 & 1 - \lambda \end{vmatrix} = -\lambda^3 + 4\lambda^2 - 2\lambda = -\lambda(\lambda^2 - 4\lambda + 2).
\]
The roots of this polynomial are the eigenvalues, thus the eigenvalues are $0$, $2 + \sqrt{2}$, $2 - \sqrt{2}$. We compute the eigenvectors associated with the particular eigenvalues:
• $0$: We solve the system
\[ \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ 1 & 2 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = 0. \]
Its solutions form a one-dimensional vector space of eigenvectors: $\operatorname{span}\{(1, -1, 1)\}$.
• $2 + \sqrt{2}$: We solve the system
\[ \begin{pmatrix} -(1 + \sqrt{2}) & 1 & 0 \\ 1 & -\sqrt{2} & 1 \\ 1 & 2 & -(1 + \sqrt{2}) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = 0. \]
The solutions form a one-dimensional space $\operatorname{span}\{(1, 1 + \sqrt{2}, 1 + \sqrt{2})\}$.
• $2 - \sqrt{2}$: We solve the system
\[ \begin{pmatrix} \sqrt{2} - 1 & 1 & 0 \\ 1 & \sqrt{2} & 1 \\ 1 & 2 & \sqrt{2} - 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = 0. \]
Its solutions form a space of eigenvectors $\operatorname{span}\{(1, 1 - \sqrt{2}, 1 - \sqrt{2})\}$.
Hence the given matrix has eigenvalues $0$, $2 + \sqrt{2}$ and $2 - \sqrt{2}$, with the associated one-dimensional spaces of eigenvectors $\operatorname{span}\{(1, -1, 1)\}$, $\operatorname{span}\{(1, 1 + \sqrt{2}, 1 + \sqrt{2})\}$ and $\operatorname{span}\{(1, 1 - \sqrt{2}, 1 - \sqrt{2})\}$ respectively.
The mapping can thus be interpreted as a projection along the vector $(1,-1,1)$ into the plane given by the vectors $(1, 1 + \sqrt{2}, 1 + \sqrt{2})$ and $(1, 1 - \sqrt{2}, 1 - \sqrt{2})$, composed with the linear mapping given by "stretching" by the factors corresponding to the eigenvalues in the directions of the associated eigenvectors.
Basis of eigenvectors
Corollary. If there exist $n$ mutually distinct roots $\lambda_i$ of the characteristic polynomial of the mapping $f : V \to V$ on the $n$-dimensional space $V$, then there exists a decomposition of $V$ into a direct sum of eigenspaces each of dimension one. This means that there exists a basis for $V$ consisting only of eigenvectors and in this basis the matrix for $f$ is the diagonal matrix with the eigenvalues on the diagonal. This basis is uniquely determined up to the order of the elements and scale of the vectors.
The corresponding basis (expressed in the coordinates in an arbitrary basis of $V$) is obtained by solving $n$ systems of homogeneous linear equations of $n$ variables with matrices $(A - \lambda_i \cdot E)$, where $A$ is the matrix of $f$ in a chosen basis.
2.4.4. Invariant subspaces. We have seen that every eigenvector $v$ of the mapping $f : V \to V$ generates a subspace $\operatorname{span}\{v\} \subset V$, which is preserved by the mapping $f$.
More generally, we say that a vector subspace $W \subset V$ is an invariant subspace for a linear mapping $f$, if $f(W) \subset W$.
If $V$ is a finite dimensional vector space and we choose some basis $(u_1, \dots, u_k)$ of a subspace $W$, we can always extend it to a basis $(u_1, \dots, u_k, u_{k+1}, \dots, u_n)$ of the whole space $V$. If $W$ is invariant, then for every such basis the mapping will have a matrix $A$ of the form
\[ A = \begin{pmatrix} B & C \\ 0 & D \end{pmatrix} \tag{1} \]
where $B$ is a square matrix of dimension $k$, $D$ is a square matrix of dimension $n - k$ and $C$ is a matrix of the type $k/(n - k)$. On the other hand, if for some basis $(u_1, \dots, u_n)$ the matrix of the mapping $f$ is of the form (1), then $W = \operatorname{span}\{u_1, \dots, u_k\}$ is invariant under the mapping $f$.
By the same arguments, the mapping with the matrix $A$ as in (1) leaves the subspace $\operatorname{span}\{u_{k+1}, \dots, u_n\}$ invariant, if and only if the submatrix $C$ is zero.
From this point of view the eigenspaces of the mapping are special cases of invariant subspaces. Our next task is to find some conditions under which there are invariant complements of invariant subspaces.
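2.4.5. We illustrate some typical properties of mappings on the spaces $\mathbb{R}^3$ and $\mathbb{R}^2$ in terms of eigenvalues and eigenvectors.
(1) Consider the mapping $f$ given in the standard basis by the matrix
\[ A = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}. \]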
Now we express it in the given basis. For this we need the matrix $T$ for changing the basis from the standard basis to the new basis. This can be obtained by writing the coordinates of the vectors of the original basis under the new basis into the columns of the matrix $T$. But we shall do it in a different way: we first obtain the matrix for changing the basis from the new one to the original one, that is, the matrix $T^{-1}$. We just write the coordinates of the vectors of the new basis into the columns:
\[ T^{-1} = \begin{pmatrix} 1 & 1 & 0 \\ -1 & 2 & 1 \\ 1 & 0 & 1 \end{pmatrix}. \]
Then
\[ T = \left(T^{-1}\right)^{-1} = \frac{1}{4} \begin{pmatrix} 2 & -1 & 1 \\ 2 & 1 & -1 \\ -2 & 1 & 3 \end{pmatrix}, \]
and for the matrix $B$ of the mapping under the new basis we have (see 2.3.16)
\[ B = T A T^{-1} = \frac{1}{2} \begin{pmatrix} 0 & 3 & 1 \\ 0 & 3 & 1 \\ 0 & 7 & 5 \end{pmatrix}. \]
□
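The change of basis in 2.G.2 is easy to get wrong by hand, so a short exact computation may be useful. The sketch below assumes the matrix $A$ and the basis $e_1 = (1,-1,1)$, $e_2 = (1,2,0)$, $e_3 = (0,1,1)$ as stated in the exercise; the helpers `matmul` and `inverse` are written here for illustration and are not part of the text.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination over the rationals."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

A = [[1, 1, 0], [1, 2, 1], [1, 2, 1]]
new_basis = [(1, -1, 1), (1, 2, 0), (0, 1, 1)]

# Columns of T_inv are the new basis vectors in standard coordinates.
T_inv = [[e[i] for e in new_basis] for i in range(3)]
T = inverse(T_inv)
B = matmul(matmul(T, A), T_inv)
print(B)
```

The first column of `B` is zero because $e_1$ is an eigenvector for the eigenvalue $0$, and the trace of `B` equals the trace of $A$, as it must.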
You can find more exercises on computing with eigenvalues and eigenvectors on the page 136.
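In the case of a $3 \times 3$ matrix, you can use this special formula to find its characteristic polynomial:
2.G.3. For any $n \times n$ matrix $A$ its characteristic polynomial $|A - \lambda E|$ is of degree $n$, that is, it is of the form
\[ |A - \lambda E| = c_n \lambda^n + c_{n-1} \lambda^{n-1} + \dots + c_1 \lambda + c_0, \qquad c_n \neq 0, \]
while we have
\[ c_n = (-1)^n, \qquad c_{n-1} = (-1)^{n-1} \operatorname{tr} A, \qquad c_0 = |A|. \]
If the matrix $A$ is three-dimensional, we obtain
\[ |A - \lambda E| = -\lambda^3 + (\operatorname{tr} A)\, \lambda^2 + c_1 \lambda + |A|. \]
By choosing $\lambda = 1$ we obtain
\[ |A - E| = -1 + \operatorname{tr} A + c_1 + |A|. \]
From there we obtain
\[ |A - \lambda E| = -\lambda^3 + (\operatorname{tr} A)\, \lambda^2 + \left(|A - E| + 1 - \operatorname{tr} A - |A|\right) \lambda + |A|. \]
Use this expression for determining the characteristic polynomial and the eigenvalues of the matrix
\[ A = \begin{pmatrix} 32 & -67 & 47 \\ 7 & -14 & 13 \\ -7 & 15 & -6 \end{pmatrix}. \]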
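The formula of 2.G.3 translates directly into a few lines of code. The following sketch is one possible way to apply it, not a prescribed method; the function names `det3`, `char_poly_3x3` and `evaluate` are introduced here for illustration. It computes the coefficients of $|A - \lambda E|$ for the matrix above from $\operatorname{tr} A$, $|A|$ and $|A - E|$ only.

```python
def det3(M):
    """Determinant of a 3x3 matrix by expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def char_poly_3x3(A):
    """Coefficients (c3, c2, c1, c0) of |A - lambda E| via the formula of 2.G.3."""
    trace = A[0][0] + A[1][1] + A[2][2]
    det_A = det3(A)
    A_minus_E = [[A[r][c] - (1 if r == c else 0) for c in range(3)] for r in range(3)]
    c1 = det3(A_minus_E) + 1 - trace - det_A
    return (-1, trace, c1, det_A)

def evaluate(coeffs, lam):
    """Evaluate the polynomial with the given coefficients by Horner's scheme."""
    value = 0
    for c in coeffs:
        value = value * lam + c
    return value

A = [[32, -67, 47],
     [7, -14, 13],
     [-7, 15, -6]]

coeffs = char_poly_3x3(A)
integer_roots = [lam for lam in range(-10, 11) if evaluate(coeffs, lam) == 0]
print(coeffs, integer_roots)
```

The search over the integers from $-10$ to $10$ is just a convenience for this particular matrix, whose eigenvalues happen to be small integers; in general the roots have to be found by other means.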
We compute
\[
|A - \lambda E| = \begin{vmatrix} -\lambda & 0 & 1 \\ 0 & 1 - \lambda & 0 \\ 1 & 0 & -\lambda \end{vmatrix} = -\lambda^3 + \lambda^2 + \lambda - 1,
\]
with roots $\lambda_1 = 1$, $\lambda_2 = 1$, $\lambda_3 = -1$. The eigenvectors with eigenvalue $\lambda = 1$ can be computed:
\[
\begin{pmatrix} -1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & -1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\]
with the basis of the space of solutions, that is, of all eigenvectors with this eigenvalue
\[ u_1 = (0,1,0), \quad u_2 = (1,0,1). \]
Similarly for $\lambda = -1$ we obtain the third independent eigenvector
\[
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \ \Rightarrow\ u_3 = (-1,0,1).
\]
Under the basis $u_1, u_2, u_3$ (note that $u_3$ must be linearly independent of the remaining two because of the previous theorem and $u_1$, $u_2$ were obtained as two independent solutions) $f$ has the diagonal matrix
\[ A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}. \]
The whole space $\mathbb{R}^3$ is a direct sum of eigenspaces, $\mathbb{R}^3 = V_1 \oplus V_2$, with $\dim V_1 = 2$ and $\dim V_2 = 1$. This decomposition is uniquely determined and says much about the geometric properties of the mapping $f$. The eigenspace $V_1$ is furthermore a direct sum of one-dimensional eigenspaces, which can be selected in other ways (thus such a decomposition has no further geometrical meaning).
(2) Consider the linear mapping $f : \mathbb{R}_2[x] \to \mathbb{R}_2[x]$ defined by polynomial differentiation, that is, $f(1) = 0$, $f(x) = 1$, $f(x^2) = 2x$. The mapping $f$ thus has in the usual basis $(1, x, x^2)$ the matrix
\[ A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix}. \]
The characteristic polynomial is $|A - \lambda \cdot E| = -\lambda^3$, thus it has only one eigenvalue, $\lambda = 0$. We compute the eigenvectors:
\[
\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix} \sim \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}.
\]
The space of the eigenvectors is thus one-dimensional, generated by the constant polynomial 1.
The striking property of this mapping is that there is no basis for which the matrix would be diagonal. There is the "chain" of vectors mapping the three independent generators as follows: $\tfrac{1}{2}x^2 \mapsto x \mapsto 1 \mapsto 0$. It builds a sequence of subspaces without invariant complements.
2.G.4. Find the orthogonal complement of the vector space spanned by the vectors $(2,1,3)$, $(3,16,7)$, $(3,5,4)$, $(-7,7,-10)$.
Solution. In fact the task consists of solving the system 2.A.3, which we have done already. □
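2.G.5. Pauli matrices. In physics, the state of a particle with spin $\tfrac{1}{2}$ is described with Pauli matrices. They are the $2 \times 2$ matrices over complex numbers:
\[
\sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
\sigma_2 = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad
\sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.
\]
For square matrices we define their commutator (denoted by square brackets) as $[\sigma_1, \sigma_2] := \sigma_1 \sigma_2 - \sigma_2 \sigma_1$.
Show that $[\sigma_1, \sigma_2] = 2i\sigma_3$ and similarly $[\sigma_1, \sigma_3] = -2i\sigma_2$ and $[\sigma_2, \sigma_3] = 2i\sigma_1$. Furthermore, show that $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 1$ and that the eigenvalues of the matrices $\sigma_1, \sigma_2, \sigma_3$ are $\pm 1$.
Show that for matrices describing the state of the particle with spin 1, namely
\[
\frac{1}{\sqrt{2}} \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
\frac{1}{\sqrt{2}} \begin{pmatrix} 0 & -i & 0 \\ i & 0 & -i \\ 0 & i & 0 \end{pmatrix}, \quad
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix},
\]
the commutation relations are the same as in the case of the Pauli matrices (up to the scalar factor 2, that is, they agree with those of the matrices $\tfrac{1}{2}\sigma_j$).
Furthermore, it can be shown that under the notation
\[ 1 := \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad I := i\sigma_3, \quad J := i\sigma_2, \quad K := i\sigma_1, \]
the vector space with basis $(1, I, J, K)$ forms an algebra of quaternions (an algebra is a vector space with a binary bilinear operation of multiplication, in this case the multiplication is given by matrix multiplication). In order for the vector space to be an algebra of quaternions it is necessary and sufficient to show the following properties: $I^2 = J^2 = K^2 = -1$ and $IJ = -JI = K$, $JK = -KJ = I$ and $KI = -IK = J$.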
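The relations for the Pauli matrices and for the quaternion units can be verified by a direct computation. The sketch below does this with Python's built-in complex numbers; it is meant as a check of the hand computation, not as a replacement for it, and the helper names `matmul`, `scale` and `commutator` are introduced here.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def scale(c, X):
    """Multiply every entry of the matrix X by the scalar c."""
    return [[c * x for x in row] for row in X]

def commutator(X, Y):
    """[X, Y] = XY - YX."""
    XY, YX = matmul(X, Y), matmul(Y, X)
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(XY, YX)]

s1 = [[0, 1], [1, 0]]
s2 = [[0, -1j], [1j, 0]]
s3 = [[1, 0], [0, -1]]
one = [[1, 0], [0, 1]]

# Quaternion units built from the Pauli matrices.
I, J, K = scale(1j, s3), scale(1j, s2), scale(1j, s1)

print(commutator(s1, s2) == scale(2j, s3))
print(matmul(I, J) == K, matmul(J, K) == I, matmul(K, I) == J)
```

Note the sign in the second commutator: the cyclic form is $[\sigma_3, \sigma_1] = 2i\sigma_2$, so $[\sigma_1, \sigma_3] = -2i\sigma_2$.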
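2.G.6. Can the matrix
\[ B = \begin{pmatrix} 5 & 6 \\ 6 & 5 \end{pmatrix} \]
be expressed in the form of the product $B = P^{-1} \cdot D \cdot P$ for some diagonal matrix $D$ and invertible matrix $P$? If possible, give an example of such matrices $D$, $P$, and find out how many such pairs there are.
Solution. The matrix $B$ has two distinct eigenvalues, and thus such an expression exists. For instance it holds that
\[
\begin{pmatrix} 5 & 6 \\ 6 & 5 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} \sqrt{2} & -\sqrt{2} \\ \sqrt{2} & \sqrt{2} \end{pmatrix} \begin{pmatrix} 11 & 0 \\ 0 & -1 \end{pmatrix} \frac{1}{2} \begin{pmatrix} \sqrt{2} & \sqrt{2} \\ -\sqrt{2} & \sqrt{2} \end{pmatrix}.
\]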
2.4.6. Orthogonal mappings. We consider the special case of mappings $f : V \to W$ between spaces with scalar products which preserve lengths for all vectors $u \in V$.
Orthogonal mappings
A linear mapping $f : V \to W$ between spaces with scalar product is called an orthogonal mapping, if for all $u \in V$
\[ \langle f(u), f(u) \rangle = \langle u, u \rangle. \]
The linearity of $f$ and the symmetry of the scalar product imply that for all pairs of vectors the following equality holds:
\[ \langle f(u + v), f(u + v) \rangle = \langle f(u), f(u) \rangle + \langle f(v), f(v) \rangle + 2 \langle f(u), f(v) \rangle. \]
Therefore all orthogonal mappings satisfy also the seemingly stronger condition for all vectors $u, v \in V$:
\[ \langle f(u), f(v) \rangle = \langle u, v \rangle, \]
i.e. the mapping $f$ leaves the scalar product invariant if and only if it leaves invariant the length of the vectors. (Notice that this is true for all fields of scalars where $1 + 1 \neq 0$, but it does not hold true for $\mathbb{Z}_2$.)
In the initial discussion about the geometry in the plane we proved in Theorem 1.5.10 that a linear mapping $\mathbb{R}^2 \to \mathbb{R}^2$ preserves lengths of the vectors if and only if its matrix in the standard basis (which is orthonormal with respect to the standard scalar product) satisfies $A^T \cdot A = E$, that is, $A^{-1} = A^T$.
In general, orthogonal mappings $f : V \to W$ must always be injective, because the condition $\langle f(u), f(u) \rangle = 0$ implies $\langle u, u \rangle = 0$ and thus $u = 0$. In such a case, the dimension of the range is always at least as large as the dimension of the domain of $f$. But then the dimensions of $V$ and $\operatorname{Im} f$ are equal and $f : V \to \operatorname{Im} f$ is a bijection. If $\operatorname{Im} f \neq W$, we extend the orthonormal basis of the image of $f$ to an orthonormal basis of the range space and the matrix of the mapping then contains a square regular submatrix $A$ along with zero rows so that it has the required number of rows. Without loss of generality we can assume that $W = V$.
Our condition for the matrix of an orthogonal mapping in any orthonormal basis requires that for all vectors $x$ and $y$ in the space $\mathbb{K}^n$:
\[ (A \cdot x)^T \cdot (A \cdot y) = x^T \cdot (A^T \cdot A) \cdot y = x^T \cdot y. \]
The special choice of the standard basis vectors for $x$ and $y$ yields directly $A^T \cdot A = E$, that is, the same result as for dimension two. Thus we have proved the following theorem:
Matrix of orthogonal mappings
Theorem. Let $V$ be a real vector space with scalar product and let $f : V \to V$ be a linear mapping. Then $f$ is orthogonal if and only if in some orthonormal basis (and then consequently in all of them) its matrix $A$ satisfies $A^T = A^{-1}$.
There exist exactly two diagonal matrices $D$:
\[ \begin{pmatrix} 11 & 0 \\ 0 & -1 \end{pmatrix}, \quad \begin{pmatrix} -1 & 0 \\ 0 & 11 \end{pmatrix}, \]
but the columns of the matrix $P^{-1}$ can be substituted with their arbitrary non-zero scalar multiples, thus there are infinitely many pairs $D$, $P$. □
As we have already seen in 2.G.2, based on the eigenvalues and eigenvectors of a given $3 \times 3$ matrix, we can often interpret geometrically the mapping it induces in $\mathbb{R}^3$. In particular, we can do so in the following situations:
If the matrix has 0 as an eigenvalue and 1 as an eigenvalue with geometric multiplicity 2, then it is a projection in the direction of the eigenvector associated with the eigenvalue 0 on the plane given by the eigenspace of the eigenvalue 1. If the eigenvector associated with 0 is perpendicular to that plane, then the mapping is an orthogonal projection.
If the matrix has eigenvalue $-1$ with the eigenvector perpendicular to the plane of the eigenvectors associated with the eigenvalue 1, then it is a mirror symmetry through the plane of the eigenvectors associated with 1.
If the matrix has eigenvalue 1 with an eigenvector perpendicular to the plane of the eigenvectors associated with the eigenvalue $-1$, then it is an axial symmetry (in space) through the axis given by the eigenvector associated with 1.
2.G.7. Determine what linear mapping is given by the matrix
Solution. The matrix has a double eigenvalue $-1$, its associated eigenspace is $\operatorname{span}\{(2,0,1), (1,1,0)\}$. Further, the matrix has 0 as an eigenvalue, with eigenvector $(1, 4, -3)$. The mapping given by this matrix under the standard basis is then an axial symmetry through the line given by the last vector composed with the projection on the plane perpendicular to the last vector, that is, given by the equation $x + 4y - 3z = 0$.
□
2.G.8. Theorem 2.4.7 gives us tools for recognising a matrix of a rotation in $\mathbb{R}^3$. It is orthogonal (the rows are orthogonal to each other, and equivalently the same holds for the columns). It has three distinct eigenvalues with absolute value 1. One of them is the number 1 (its associated eigenvector gives the axis of the rotation). The argument of the remaining two, which are necessarily complex conjugates, gives the angle of the rotation
Proof. Indeed, if / preserves lengths, it must have the claimed property in every orthonormal basis. On the other hand, the previous calculations show that this property for the matrix in one such basis ensures length preservation. □
Square matrices which satisfy the equality AT = A-1 are called orthogonal matrices.
The shape of the coordinate transition matrices between orthonormal bases is a direct corollary of the above theorem. Each such matrix must provide a mapping $\mathbb{K}^n \to \mathbb{K}^n$ which preserves lengths and thus satisfies the condition $S^{-1} = S^T$. When changing from one orthonormal basis to another one, the matrix of any linear mapping changes according to the relation
\[ A' = S^T A S. \]
2.4.7. Decomposition of an orthogonal mapping. We take a more detailed look at eigenvectors and eigenvalues of orthogonal mappings on a real vector space $V$ with scalar product.
Consider a fixed orthogonal mapping $f : V \to V$ with the matrix $A$ in some orthonormal basis. We continue as with the matrix $D$ of rotation in 2.4.1.
We think first about invariant subspaces of orthogonal mappings and their orthogonal complements. Namely, given any subspace $W \subset V$ invariant with respect to an orthogonal mapping $f : V \to V$, then for all $v \in W^\perp$ and $w \in W$ we immediately see
\[ \langle f(v), w \rangle = \langle f(v), f \circ f^{-1}(w) \rangle = \langle v, f^{-1}(w) \rangle = 0 \]
since $f^{-1}(w) \in W$, too. But this means that also $f(W^\perp) \subset W^\perp$ and we have proved a simple but very important proposition:
Proposition. The orthogonal complement of a subspace invariant with respect to an orthogonal mapping is also invariant.
If all eigenvalues of an orthogonal mapping are real, this claim ensures that there always exists a basis of $V$ composed of eigenvectors. Indeed, the restriction of $f$ to the orthogonal complement of an invariant subspace is again an orthogonal mapping, therefore we can add one eigenvector to the basis after another, until we obtain the whole decomposition of $V$. However, mostly the eigenvalues of orthogonal mappings are not real. We need to deviate into complex vector spaces. We formulate the result right away:
in the positive sense in the plane given by the basis $u_\lambda + \bar{u}_\lambda$, $i(u_\lambda - \bar{u}_\lambda)$.
2.G.9. Determine what linear mapping is given by the matrix
\[
\begin{pmatrix}
\frac{3}{5} & \frac{16}{25} & -\frac{12}{25} \\
-\frac{16}{25} & \frac{93}{125} & \frac{24}{125} \\
\frac{12}{25} & \frac{24}{125} & \frac{107}{125}
\end{pmatrix}.
\]
Solution. First we notice that the matrix is orthogonal (the rows are mutually orthogonal unit vectors, and equivalently the same holds for the columns). The matrix has the following eigenvalues and corresponding eigenvectors:
\[
1,\ v_1 = \left(0, 1, \tfrac{4}{3}\right); \qquad
\tfrac{3}{5} + \tfrac{4}{5}i,\ v_2 = \left(1, \tfrac{4}{5}i, -\tfrac{3}{5}i\right); \qquad
\tfrac{3}{5} - \tfrac{4}{5}i,\ v_3 = \left(1, -\tfrac{4}{5}i, \tfrac{3}{5}i\right).
\]
All three eigenvalues have absolute value one, which together with the observation of orthogonality tells us that the matrix is a matrix of rotation. Its axis is given by the eigenvector corresponding to the eigenvalue 1, that is, the vector $\left(0, 1, \tfrac{4}{3}\right)$. The plane of rotation is the real plane in $\mathbb{R}^3$ which is given by the intersection of $\mathbb{R}^3$ with the two-dimensional complex subspace of $\mathbb{C}^3$ generated by the remaining eigenvectors. It is the plane $\operatorname{span}\{(1,0,0), (0,-4,3)\}$ (the first generator is a real multiple of $v_2 + v_3$, the other one is a real multiple of $i(v_2 - v_3)$, see 2.4.7). We can determine the rotation angle in this plane. It is a rotation by the angle $\arccos\left(\tfrac{3}{5}\right) \doteq 0.295\pi$, which is the argument of the eigenvalue $\tfrac{3}{5} + \tfrac{4}{5}i$ (or minus that number, if we choose the other eigenvalue).
It remains to determine the direction of the rotation. First, recall that the meaning of the direction of the rotation changes when we change the orientation of the axis (it has no meaning to speak of the direction of the rotation if we do not have an orientation of the axis). Using the ideas from the proof of Theorem 2.4.7, we see that the given matrix acts by rotating by $\arccos\left(\tfrac{3}{5}\right)$ in the positive sense in the plane given by the basis $\left(\left(0, \tfrac{4}{5}, -\tfrac{3}{5}\right), (1,0,0)\right)$. The first vector of the basis is the imaginary part of the eigenvector associated with the eigenvalue $\tfrac{3}{5} + \tfrac{4}{5}i$, the second is then the (common) real part of the eigenvectors associated with the complex eigenvalues. The order of the vectors in the basis is important (by changing their order the meaning of the direction changes). The axis of rotation is perpendicular to the plane. If we orient it using the right-hand rule (the perpendicular direction is obtained by taking the vector product of the vectors in the basis), then the direction of the rotation agrees with the direction of rotation in the plane with the given basis. In our case we obtain by the vector product $\left(0, \tfrac{4}{5}, -\tfrac{3}{5}\right) \times (1, 0, 0) = \left(0, -\tfrac{3}{5}, -\tfrac{4}{5}\right)$. It is
Orthogonal mapping decomposition
Theorem. Let f : V —> V be an orthogonal mapping on a real vector space V with scalar product. Then all the (in general complex) roots of the characteristic polynomial f have length one. There exists the decomposition of V into one-dimensional eigenspaces corresponding to the real eigenvalues A = ±1 and two-dimensional subspaces Px \ with A £ C\R, where f acts by the rotation by the angle equal to the argument of the complex number A in the positive sense. All these subspaces are mutually orthogonal.
Proof. Without loss of generality we can work with the space V = Rm with the standard scalar product. The mapping is thus given by an orthogonal matrix A which can be equally well seen as the matrix of a (complex) linear mapping on the complex space Cm (which just happens to have all of its coefficients real).
There exist exactly m (complex) roots of the characteristic polynomial of A, counting their algebraic multiplicities (see the fundamental theorem of algebra, 12.2.8). Furthermore, because the characteristic polynomial of the mapping has only real coefficients, the roots are either real or there are a pair of roots which are complex conjugates A and A. The associated eigenvectors in Cm for such pairs of complex conjugates are actually solutions of two systems of linear homogeneous equations which are also complex conjugate to each other - the corresponding matrices of the systems have real components, except for the eigenvalues A. Therefore the solutions of this systems are also complex conjugates (check this!).
Next, we exploit the fact that for every invariant subspace its orthogonal complement is also invariant. First we find the eigenspaces $V_{\pm 1}$ associated with the real eigenvalues, and restrict the mapping to the orthogonal complement of their sum. Without loss of generality we can thus assume that our orthogonal mapping has no real eigenvalues and that $\dim V = 2n > 0$.
Now choose an eigenvalue $\lambda$ and let $u_\lambda$ be an eigenvector in $\mathbb{C}^{2n}$ associated to the eigenvalue $\lambda = \alpha + i\beta$, $\beta \neq 0$. Analogously to the case of rotation in the plane discussed in paragraph 2.4.1 in terms of the matrix $D$, we are interested in the real part of the sum of two one-dimensional (complex) subspaces $W = \operatorname{span}\{u_\lambda\} \oplus \operatorname{span}\{\bar u_\lambda\}$, where $\bar u_\lambda$ is the eigenvector associated to the conjugated eigenvalue $\bar\lambda$.
Now we want the intersection of the 2-dimensional complex subspace $W$ with the real subspace $\mathbb{R}^{2n} \subset \mathbb{C}^{2n}$, which is clearly generated (over $\mathbb{R}$) by the vectors $u_\lambda + \bar u_\lambda$ and $i(u_\lambda - \bar u_\lambda)$. We call this real 2-dimensional subspace $P_{\lambda,\bar\lambda} \subset \mathbb{R}^{2n}$ and notice that this subspace is generated by the basis given by the real and imaginary part of $u_\lambda$:
\[ x_\lambda = \operatorname{Re} u_\lambda, \qquad -y_\lambda = -\operatorname{Im} u_\lambda. \]
thus a rotation through arccos(|) in the positive sense about the vector (0, —1, —1), that is, a rotation through arccos(|) in the negative sense about the vector (0,1,1). □
2.G.10. Determine what linear mapping is given by the matrix
\[
\begin{pmatrix}
-\frac15 & \frac35 & -\frac15 \\
-\frac85 & \frac95 & \frac25 \\
\frac85 & -\frac45 & \frac35
\end{pmatrix}.
\]
Solution. By the already known method we find that the matrix has the following eigenvalues and corresponding eigenvectors: $1$, $(1,2,0)$; $\frac35 + \frac45 i$, $(1, 1+i, -1-i)$; $\frac35 - \frac45 i$, $(1, 1-i, -1+i)$. Though all three eigenvalues have absolute value 1, the eigenvectors are not orthogonal to each other, thus the matrix is not orthogonal. Consequently it is not a matrix of rotation. Nevertheless, it is a linear mapping which is "close" to a rotation. It is a rotation in the plane given by the two complex eigenvectors; this plane is not orthogonal to the vector $(1,2,0)$, but it is preserved by the map. It remains to determine the direction of the rotation. First, we should recall that the meaning of the direction of the rotation changes when we change the orientation of the axis (it has no meaning to speak of the direction of the rotation if we do not have an orientation of the axis).
Using the same ideas as in the previous example, we see that the given matrix acts by rotating by $\arccos(\frac35)$ in the positive sense in the plane given by the basis $((0,1,-1), (1,1,-1))$. The first vector of the basis is the imaginary part of the eigenvector associated with the eigenvalue $\frac35 + \frac45 i$, the second is then the (common) real part of the eigenvectors associated with the complex eigenvalues. The order of the vectors in the basis is important (by changing their order the meaning of the direction changes). The "axis" of rotation is not perpendicular to the plane, but we can still orient the plane using the right-hand rule (the perpendicular direction is obtained by taking the vector product of the vectors in the basis); then the direction of the rotation agrees with the direction of rotation in the plane with the given basis. In our case we obtain by the vector product $(0,1,-1) \times (1,1,-1) = (0,-1,-1)$. It is thus a rotation through $\arccos(\frac35)$ in the positive sense about the vector $(0,-1,-1)$, that is, a rotation through $\arccos(\frac35)$ in the negative sense about the vector $(0,1,1)$. □
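The claims made about the matrix of 2.G.10 can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the entries of the matrix are read with the common denominator 5; the helper names `matvec` and `dot` are ours and do not come from the text. It checks that $(1,2,0)$ is fixed, that the real part $x$ and the imaginary part $y$ of the complex eigenvector satisfy $Ax = \alpha x - \beta y$ and $Ay = \beta x + \alpha y$ with $\alpha = 3/5$, $\beta = 4/5$, and that the columns of the matrix are not orthonormal.

```python
from fractions import Fraction as F

# Matrix of exercise 2.G.10 (all entries have denominator 5).
A = [[F(-1, 5), F(3, 5), F(-1, 5)],
     [F(-8, 5), F(9, 5), F(2, 5)],
     [F(8, 5), F(-4, 5), F(3, 5)]]

def matvec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def dot(u, v):
    """Standard scalar product of two real vectors."""
    return sum(a * b for a, b in zip(u, v))

alpha, beta = F(3, 5), F(4, 5)   # the eigenvalue alpha + i*beta
x = [1, 1, -1]                   # real part of the eigenvector (1, 1+i, -1-i)
y = [0, 1, -1]                   # imaginary part of the same eigenvector

fixed = matvec(A, [1, 2, 0])
Ax, Ay = matvec(A, x), matvec(A, y)

# A(x + iy) = (alpha + i*beta)(x + iy), split into real and imaginary parts.
real_ok = Ax == [alpha * a - beta * b for a, b in zip(x, y)]
imag_ok = Ay == [beta * a + alpha * b for a, b in zip(x, y)]

# The columns of an orthogonal matrix would have to be orthonormal.
cols = [list(c) for c in zip(*A)]
orthogonal = all(dot(cols[i], cols[j]) == (1 if i == j else 0)
                 for i in range(3) for j in range(3))

print(fixed == [1, 2, 0], real_ok, imag_ok, orthogonal)
```

The first three checks succeed while the orthogonality test fails, which is exactly the situation described in the solution: the eigenvalues lie on the unit circle, but the mapping is not orthogonal.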
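Because $A \cdot (u_\lambda + \bar u_\lambda) = \lambda u_\lambda + \bar\lambda \bar u_\lambda$, and similarly for the second basis vector, $P_{\lambda,\bar\lambda}$ is clearly an invariant subspace with respect to multiplication by the matrix $A$. Writing $u_\lambda = x_\lambda + i y_\lambda$ and comparing the real and imaginary parts of $A u_\lambda = (\alpha + i\beta) u_\lambda$, we obtain
\[ A \cdot x_\lambda = \alpha x_\lambda + \beta(-y_\lambda), \qquad A \cdot (-y_\lambda) = -\beta x_\lambda + \alpha(-y_\lambda). \]
Because our mapping preserves lengths, the absolute value of the eigenvalue $\lambda$ must equal one, so $\alpha = \cos\varphi$ and $\beta = \sin\varphi$, where $\varphi$ is the argument of $\lambda$. But that means that the restriction of our mapping to $P_{\lambda,\bar\lambda}$, expressed in the basis $x_\lambda$, $-y_\lambda$, is the rotation by the argument of the eigenvalue $\lambda$. Note that the choice of the eigenvalue $\bar\lambda$ instead of $\lambda$ leads to the same subspace with the same rotation; we would just have expressed it in the basis $x_\lambda$, $y_\lambda$, that is, the same rotation will in these coordinates go by the same angle, but with the opposite sign, as expected.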
The proof of the whole theorem is completed by restricting the mapping to the orthogonal complement and finding another 2-dimensional subspace, until we get the required decomposition. □
We return to the ideas in this proof once again in chapter three, where we study complex extensions of the Euclidean vector spaces, see 3.4.4.
Remark. The previous theorem is very powerful in dimension three. Here at least one eigenvalue must be real ±1, since three is odd. But then the associated eigenspace is an axis of the rotation of the three-dimensional space through the angle given by the argument of the other eigenvalues. Try to think how to detect in which direction the space is rotated. Note also that the eigenvalue —1 means an additional reflection through the plane perpendicular to the axis of the rotation.
We shall return to the discussion of such properties of matrices and linear mappings in more detail at the end of the next chapter, after illustrating the power of the matrix calculus in several practical applications. We close this section with a general and quite widely used definition:
2.G.11. Without any written computation determine the spectrum of the linear mapping $f : \mathbb{R}^3 \to \mathbb{R}^3$ given by
\[ (x_1, x_2, x_3) \mapsto (x_1 + x_3,\ x_2,\ x_1 + x_3). \]
○
2.G.12. Find the dimension of the eigenspaces of the eigenvalues $\lambda_i$ of the matrix
\[
\begin{pmatrix}
4 & 0 & 0 & 0 \\
1 & 4 & 0 & 0 \\
5 & 2 & 3 & 0 \\
0 & 4 & 0 & 3
\end{pmatrix}.
\]
Spectrum of linear mapping
2.4.8. Definition. The spectrum of a linear mapping $f : V \to V$, or the spectrum of a square matrix $A$, is the sequence of roots of the characteristic polynomial of $f$ or $A$, respectively, along with their multiplicities. The algebraic multiplicity of an eigenvalue means the multiplicity of the root of the characteristic polynomial, while the geometric multiplicity of the eigenvalue is the dimension of the associated subspace of eigenvectors.
The spectral diameter of a linear mapping (or matrix) is the greatest of the absolute values of the eigenvalues.
In this terminology, our results about orthogonal mappings can be formulated as follows: the spectrum of an orthogonal mapping is always a subset of the unit circle in the complex plane. Thus only the values ±1 may appear in the real part of the spectrum and their algebraic and geometric multiplicities are always the same. Complex values of the spectrum then correspond to rotations in suitable two-dimensional sub-spaces which are mutually perpendicular.
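The two kinds of multiplicity can be told apart by a short computation. The sketch below is only an illustration on a hypothetical triangular matrix (it is not taken from the text): the algebraic multiplicities are read off the diagonal, and the geometric multiplicity of $\lambda$ is computed as $n - \operatorname{rank}(A - \lambda E)$. The function names `rank` and `geometric_multiplicity` are ours.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def geometric_multiplicity(A, lam):
    """Dimension of the kernel of A - lam*E, that is, n - rank(A - lam*E)."""
    n = len(A)
    shifted = [[A[i][j] - (lam if i == j else 0) for j in range(n)] for i in range(n)]
    return n - rank(shifted)

# Hypothetical upper triangular example: the characteristic polynomial is
# (2 - t)^2 (3 - t), so the spectrum can be read off the diagonal.
A = [[2, 1, 0],
     [0, 2, 0],
     [0, 0, 3]]
diagonal = [A[i][i] for i in range(3)]
algebraic = {lam: diagonal.count(lam) for lam in set(diagonal)}
geometric = {lam: geometric_multiplicity(A, lam) for lam in algebraic}
spectral_diameter = max(abs(lam) for lam in diagonal)

print(algebraic, geometric, spectral_diameter)
```

For this matrix the eigenvalue 2 has algebraic multiplicity two but geometric multiplicity one, so the two notions genuinely differ; for orthogonal mappings the theorem above excludes this behaviour.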
H. Additional exercises for the whole chapter
2.H.1. Kirchhoff's Circuit Laws. We consider an application of Linear Algebra to analysis of electric circuits, using Ohm's law and Kirchhoff's voltage and current laws.
Consider an electric circuit as in the figure and write down the values of the currents there if you know the values
\[ V_1 = 20, \quad V_2 = 120, \quad V_3 = 50, \quad R_1 = 10, \quad R_2 = 30, \quad R_3 = 4, \quad R_4 = 5, \quad R_5 = 10. \]
Notice that the quantities $I_j$ denote the electric currents, while $R_j$ are resistances, and $V_j$ are voltages.
Solution. There are two closed loops, namely ABEF and EBCD, and two branching vertices B and E of degree no less than 3. On every segment of the circuit bounded by branching points, the electric current is constant. Denote it by $I_1$ on the segment EFAB, $I_2$ on EB, and $I_3$ on BCDE.
Applying Kirchhoff's current law to the branching points B and E we obtain $I_1 + I_2 = I_3$ and $I_3 - I_1 = I_2$, which are, of course, the same equation. In case there are many branching vertices, we write all the equations of Kirchhoff's current law into the system, at least one of them being redundant.
Choose the counter-clockwise orientations of the loops ABEF and EBCD. Applying Kirchhoff's voltage law and Ohm's law to the loop ABEF we obtain the equation
\[ V_1 + I_1 R_3 - I_2 R_5 + V_3 + I_1 R_1 + I_1 R_4 = 0. \]
Similarly, the loop EBCD implies
\[ -V_2 + I_3 R_2 - V_3 + R_5 I_2 = 0. \]
By combining all equations, we obtain the system
\[
\begin{aligned}
I_1 + I_2 - I_3 &= 0, \\
(R_3 + R_1 + R_4) I_1 - R_5 I_2 &= -V_1 - V_3, \\
R_5 I_2 + R_2 I_3 &= V_2 + V_3.
\end{aligned}
\]
Substituting the prescribed values we obtain the linear system
\[
\begin{aligned}
I_1 + I_2 - I_3 &= 0, \\
19 I_1 - 10 I_2 &= -70, \\
10 I_2 + 30 I_3 &= 170.
\end{aligned}
\]
This has the solution $I_1 = -\frac{55}{53} \approx -1.038$, $I_2 = \frac{533}{106} \approx 5.028$, $I_3 = \frac{423}{106} \approx 3.991$. □
2.H.2. The general case. In general, the method for electrical circuit analysis can be formulated along the following steps:
i) Identify all branching vertices of the circuit, i.e. vertices of degree no less than 3;
ii) Identify all closed loops of the circuit;
iii) Introduce variables $I_k$, denoting oriented currents on each segment of the circuit between two branching vertices;
iv) Write down Kirchhoff's current conservation law for each branching vertex. The total incoming current equals the total outgoing current;
v) Choose an orientation on every closed loop of the circuit and write down Kirchhoff's voltage conservation law according to the chosen orientation. If you pass a voltage source of voltage $V_j$ going from the short bar to the long bar, its contribution is $V_j$. It is $-V_j$ if you go from the long bar to the short one. If you go in the positive direction of a current $I$ and pass a resistor with resistance $R_j$, the contribution is $-R_j I$, and it is $R_j I$ if the orientation of the loop is opposite to the direction of the current $I$. The total voltage change along each closed loop must be zero.
vi) Compose the system of linear equations collecting all equations, representing Kirchhoff's current and voltage laws and solve it with respect to the variables, representing currents. Notice that some equations may be redundant, however, the solution should be unique.
To illustrate this general approach, consider the circuit example in the diagram.
Solution.
i) The set of branching vertices is {B, C, F, G, H}.
ii) The set of closed loops is {ABHG, FHBC, GHF, CDEF}.
iii) Let $I_1$ be the current on the segment GAB, $I_2$ on the segment GH, $I_3$ on the segment HB, $I_4$ on the segment BC, $I_5$ on the segment FC, $I_6$ on the segment FH, $I_7$ on GF, and $I_8$ on CDEF.
iv) Write Kirchhoff's current conservation laws for the branching vertices:
• vertex B:      $I_1 + I_3 = I_4$
• vertex C:      $I_4 + I_5 = I_8$
• vertex F:      $I_8 = I_5 + I_6 - I_7$
• vertex G:      $-I_7 = I_1 + I_2$
• vertex H:      $I_2 + I_6 = I_3$
v) Write Kirchhoff's voltage conservation for each of the closed loops traversed counter-clockwise:
• loop ABHG:       $-R_1 I_2 + V_3 + R_2 I_1 - V_2 = 0$
• loop FHBC:       $V_4 + R_3 I_4 - V_3 = 0$
• loop GHF:       $R_1 I_2 - V_1 = 0$
• loop CDEF:       $R_4 I_8 - V_4 = 0$
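Set the parameters: $R_1 = 4$, $R_2 = 7$, $R_3 = 9$, $R_4 = 12$, $V_1 = 10$, $V_2 = 20$, $V_3 = 60$, $V_4 = 120$, to obtain the system
\[
\begin{aligned}
I_1 + I_3 - I_4 &= 0, \\
I_4 + I_5 - I_8 &= 0, \\
I_5 + I_6 - I_7 - I_8 &= 0, \\
I_1 + I_2 + I_7 &= 0, \\
I_2 - I_3 + I_6 &= 0, \\
7 I_1 - 4 I_2 &= -40, \\
9 I_4 &= -60, \\
4 I_2 &= 10, \\
12 I_8 &= 120
\end{aligned}
\]
with the solution $I_1 = -\frac{30}{7}$, $I_2 = \frac{5}{2}$, $I_3 = -\frac{50}{21}$, $I_4 = -\frac{20}{3}$, $I_5 = \frac{50}{3}$, $I_6 = -\frac{205}{42}$, $I_7 = \frac{25}{14}$, $I_8 = 10$.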
□
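Step vi) of the general method is a plain exercise in elimination, and it can be delegated to a few lines of code. The following is a minimal sketch of Gauss-Jordan elimination in exact rational arithmetic, applied to the nine equations in eight unknowns of this example as we read them from the lists iv) and v) with the parameter values above. The function name `solve` and the layout of the rows are our own choices, not part of the text.

```python
from fractions import Fraction

def solve(aug, n):
    """Gauss-Jordan elimination on an augmented matrix with n unknowns.

    Returns the unique solution as a list of Fractions and raises
    ValueError if the system is inconsistent or the solution is not unique.
    """
    rows = [[Fraction(x) for x in row] for row in aug]
    r = 0
    for c in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            raise ValueError("solution is not unique")
        rows[r], rows[pivot] = rows[pivot], rows[r]
        p = rows[r][c]
        rows[r] = [a / p for a in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    # Redundant equations must reduce to 0 = 0.
    if any(row[n] != 0 for row in rows[r:]):
        raise ValueError("system is inconsistent")
    return [rows[i][n] for i in range(n)]

# Columns: I1, ..., I8, right-hand side.
circuit = [
    [1, 0, 1, -1, 0, 0, 0, 0, 0],     # vertex B
    [0, 0, 0, 1, 1, 0, 0, -1, 0],     # vertex C
    [0, 0, 0, 0, 1, 1, -1, -1, 0],    # vertex F
    [1, 1, 0, 0, 0, 0, 1, 0, 0],      # vertex G
    [0, 1, -1, 0, 0, 1, 0, 0, 0],     # vertex H
    [7, -4, 0, 0, 0, 0, 0, 0, -40],   # loop ABHG
    [0, 0, 0, 9, 0, 0, 0, 0, -60],    # loop FHBC
    [0, 4, 0, 0, 0, 0, 0, 0, 10],     # loop GHF
    [0, 0, 0, 0, 0, 0, 0, 12, 120],   # loop CDEF
]
currents = solve(circuit, 8)
print([str(c) for c in currents])
```

One of the five current equations is redundant, so the nine equations have rank eight and the elimination ends with a single zero row, in agreement with the remark in step vi).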
2.H.3.   Solve the system of equations
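\[
\begin{aligned}
x_1 + x_2 + x_3 + x_4 - 2x_5 &= 3, \\
2x_2 + 2x_3 + 2x_4 - 4x_5 &= 5, \\
-x_1 - x_2 - x_3 + x_4 + 2x_5 &= 0, \\
-2x_1 + 3x_2 + 3x_3 - 6x_5 &= 2.
\end{aligned}
\]
Solution. The extended matrix of the system is
\[
\left(\begin{array}{ccccc|c}
1 & 1 & 1 & 1 & -2 & 3 \\
0 & 2 & 2 & 2 & -4 & 5 \\
-1 & -1 & -1 & 1 & 2 & 0 \\
-2 & 3 & 3 & 0 & -6 & 2
\end{array}\right).
\]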
Adding the first row to the third, adding its 2-multiple to the fourth, and adding the (—5/2)-multiple of the second to the fourth we obtain
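\[
\left(\begin{array}{ccccc|c}
1 & 1 & 1 & 1 & -2 & 3 \\
0 & 2 & 2 & 2 & -4 & 5 \\
0 & 0 & 0 & 2 & 0 & 3 \\
0 & 5 & 5 & 2 & -10 & 8
\end{array}\right)
\sim
\left(\begin{array}{ccccc|c}
1 & 1 & 1 & 1 & -2 & 3 \\
0 & 2 & 2 & 2 & -4 & 5 \\
0 & 0 & 0 & 2 & 0 & 3 \\
0 & 0 & 0 & -3 & 0 & -9/2
\end{array}\right).
\]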
The last row is clearly a multiple of the previous one, and thus we can omit it. The pivots are located in the first, second and fourth columns. Thus the free variables are $x_3$ and $x_5$, which we substitute by the real parameters $t$ and $s$. Thus we consider the system
\[
\begin{aligned}
x_1 + x_2 + t + x_4 - 2s &= 3, \\
2x_2 + 2t + 2x_4 - 4s &= 5, \\
2x_4 &= 3.
\end{aligned}
\]
We see that $x_4 = 3/2$. The second equation gives
\[ 2x_2 + 2t + 3 - 4s = 5, \quad\text{that is,}\quad x_2 = 1 - t + 2s. \]
From the first we have
\[ x_1 + 1 - t + 2s + t + 3/2 - 2s = 3, \quad\text{i.e.}\quad x_1 = 1/2. \]
Altogether,
\[ (x_1, x_2, x_3, x_4, x_5) = (1/2,\ 1 - t + 2s,\ t,\ 3/2,\ s), \quad t, s \in \mathbb{R}. \]
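Alternatively, we can consider the extended matrix and transform it using the row transformations into the row echelon form. We arrange it so that the first non-zero number in every row is 1, and the remaining numbers in the column containing this 1 are 0. We omit the fourth equation, which is a combination of the first three. Sequentially, multiplying the second and the third row by the number 1/2, subtracting the third row from the second and from the first, and subtracting the second row from the first, we obtain
\[
\left(\begin{array}{ccccc|c}
1 & 1 & 1 & 1 & -2 & 3 \\
0 & 2 & 2 & 2 & -4 & 5 \\
0 & 0 & 0 & 2 & 0 & 3
\end{array}\right)
\sim
\left(\begin{array}{ccccc|c}
1 & 1 & 1 & 1 & -2 & 3 \\
0 & 1 & 1 & 1 & -2 & 5/2 \\
0 & 0 & 0 & 1 & 0 & 3/2
\end{array}\right)
\sim
\left(\begin{array}{ccccc|c}
1 & 1 & 1 & 0 & -2 & 3/2 \\
0 & 1 & 1 & 0 & -2 & 1 \\
0 & 0 & 0 & 1 & 0 & 3/2
\end{array}\right)
\sim
\left(\begin{array}{ccccc|c}
1 & 0 & 0 & 0 & 0 & 1/2 \\
0 & 1 & 1 & 0 & -2 & 1 \\
0 & 0 & 0 & 1 & 0 & 3/2
\end{array}\right).
\]
If we choose again $x_3 = t$, $x_5 = s$ ($t, s \in \mathbb{R}$), we obtain the general solution (2.H.3) as above.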
□
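A parametrized solution such as the one found in 2.H.3 is easy to spot-check by substituting it back into all four equations. The sketch below does this for a handful of rational parameter values; the names `solution` and `residuals` and the chosen sample values are ours. Such a check does not replace the elimination, it only guards against arithmetic slips.

```python
from fractions import Fraction
from itertools import product

# Coefficients and right-hand sides of the four equations of 2.H.3.
A = [[1, 1, 1, 1, -2],
     [0, 2, 2, 2, -4],
     [-1, -1, -1, 1, 2],
     [-2, 3, 3, 0, -6]]
b = [3, 5, 0, 2]

def solution(t, s):
    """General solution (x1, ..., x5) with the free variables x3 = t, x5 = s."""
    return [Fraction(1, 2), 1 - t + 2 * s, t, Fraction(3, 2), s]

def residuals(x):
    """Left-hand side minus right-hand side for every equation."""
    return [sum(a * xi for a, xi in zip(row, x)) - rhs for row, rhs in zip(A, b)]

values = [Fraction(-2), Fraction(0), Fraction(1, 3), Fraction(5)]
all_zero = all(residuals(solution(t, s)) == [0, 0, 0, 0]
               for t, s in product(values, repeat=2))
print(all_zero)
```

Note that the fourth equation is included in the check even though it was omitted during the elimination, so the test also confirms that it is indeed a combination of the first three.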
2.H.4.   Find the solution of the system of linear equations given by the extended matrix
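\[
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
2 & 1 & 1 & 0 & 4 \\
0 & 5 & -4 & 3 & 1 \\
5 & 3 & 3 & -3 & 5
\end{array}\right).
\]
Solution. We transform the given extended matrix into the row echelon form. We first copy the first and the third row, replace the second row by the sum of its 3-multiple and the (−2)-multiple of the first row, and into the last row we write the sum of the 5-multiple of the first and of the (−3)-multiple of the last row. By this we obtain
\[
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
2 & 1 & 1 & 0 & 4 \\
0 & 5 & -4 & 3 & 1 \\
5 & 3 & 3 & -3 & 5
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 5 & -4 & 3 & 1 \\
0 & 6 & 1 & 14 & 0
\end{array}\right).
\]
Copying the first two rows and adding a 5-multiple of the second row to the 3-multiple of the third and its 2-multiple to the fourth gives
\[
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 5 & -4 & 3 & 1 \\
0 & 6 & 1 & 14 & 0
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & -17 & -1 & 33 \\
0 & 0 & -1 & 10 & 12
\end{array}\right).
\]
Copying the first, second and fourth row, and adding the fourth to the third, yields
\[
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & -17 & -1 & 33 \\
0 & 0 & -1 & 10 & 12
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & -18 & 9 & 45 \\
0 & 0 & -1 & 10 & 12
\end{array}\right).
\]
With three more row transformations, we arrive at
\[
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & -18 & 9 & 45 \\
0 & 0 & -1 & 10 & 12
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & 2 & -1 & -5 \\
0 & 0 & 1 & -10 & -12
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & 1 & -10 & -12 \\
0 & 0 & 2 & -1 & -5
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & 1 & -10 & -12 \\
0 & 0 & 0 & 19 & 19
\end{array}\right).
\]
The system has exactly 1 solution. We determine it by backwards elimination
\[
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 1 & 3 \\
0 & -3 & -1 & -2 & 6 \\
0 & 0 & 1 & -10 & -12 \\
0 & 0 & 0 & 1 & 1
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 2 & 0 & 2 \\
0 & -3 & -1 & 0 & 8 \\
0 & 0 & 1 & 0 & -2 \\
0 & 0 & 0 & 1 & 1
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
3 & 3 & 0 & 0 & 6 \\
0 & -3 & 0 & 0 & 6 \\
0 & 0 & 1 & 0 & -2 \\
0 & 0 & 0 & 1 & 1
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
1 & 1 & 0 & 0 & 2 \\
0 & 1 & 0 & 0 & -2 \\
0 & 0 & 1 & 0 & -2 \\
0 & 0 & 0 & 1 & 1
\end{array}\right)
\sim
\left(\begin{array}{cccc|c}
1 & 0 & 0 & 0 & 4 \\
0 & 1 & 0 & 0 & -2 \\
0 & 0 & 1 & 0 & -2 \\
0 & 0 & 0 & 1 & 1
\end{array}\right).
\]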
The solution is
x1 = 4,   x2 =-2,   x3 =-2,   x4 = 1.
□
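The backwards elimination used in 2.H.4 can equally be carried out as a back-substitution on the row echelon form. The following sketch, with the hypothetical helper `back_substitute`, starts from the echelon form with last row $(0, 0, 0, 19 \mid 19)$ and then double-checks the result against the original extended matrix.

```python
from fractions import Fraction

def back_substitute(U):
    """Solve an upper triangular augmented system [U | c] with non-zero diagonal."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(Fraction(U[i][j]) * x[j] for j in range(i + 1, n))
        x[i] = (Fraction(U[i][n]) - known) / U[i][i]
    return x

# Row echelon form reached in 2.H.4.
echelon = [[3, 3, 2, 1, 3],
           [0, -3, -1, -2, 6],
           [0, 0, 1, -10, -12],
           [0, 0, 0, 19, 19]]
x = back_substitute(echelon)

# Original extended matrix, used only to double-check the result.
original = [[3, 3, 2, 1, 3],
            [2, 1, 1, 0, 4],
            [0, 5, -4, 3, 1],
            [5, 3, 3, -3, 5]]
check = all(sum(a * xi for a, xi in zip(row[:4], x)) == row[4] for row in original)
print([str(v) for v in x], check)
```

Working from the last row upwards mirrors the order in which the unknowns $x_4$, $x_3$, $x_2$, $x_1$ become known in the matrix computation above.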
2.H.5.   Find all the solutions of the homogeneous system
\[ x + y = 2z + v, \quad z + 4u + v = 0, \quad -3u = 0, \quad z = -v \]
of four linear equations with 5 variables $x, y, z, u, v$.
Solution. We rewrite the system into a matrix such that in the first column there are coefficients of x, in the second there are coefficients of y, and so on. We put all the variables in equations to the left side. By this, we obtain the matrix
\[
\begin{pmatrix}
1 & 1 & -2 & 0 & -1 \\
0 & 0 & 1 & 4 & 1 \\
0 & 0 & 0 & -3 & 0 \\
0 & 0 & 1 & 0 & 1
\end{pmatrix}.
\]
We add (4/3)-multiple of the third row to the second and subtract then the second row from the fourth to obtain
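\[
\begin{pmatrix}
1 & 1 & -2 & 0 & -1 \\
0 & 0 & 1 & 4 & 1 \\
0 & 0 & 0 & -3 & 0 \\
0 & 0 & 1 & 0 & 1
\end{pmatrix}
\sim
\begin{pmatrix}
1 & 1 & -2 & 0 & -1 \\
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & -3 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}.
\]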
We multiply the third row by the number —1/3 and add the 2-multiple of the second row to the first, which gives
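\[
\begin{pmatrix}
1 & 1 & -2 & 0 & -1 \\
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & -3 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
\sim
\begin{pmatrix}
1 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}.
\]
From the last matrix, we get immediately (reading from bottom to top) $u = 0$, $z + v = 0$, $x + y + v = 0$. Letting $v = s$ and $y = t$, the complete solution is
\[ (x, y, z, u, v) = (-t - s,\ t,\ -s,\ 0,\ s), \quad t, s \in \mathbb{R}, \]
which can be rewritten as
\[
\begin{pmatrix} x \\ y \\ z \\ u \\ v \end{pmatrix}
= t \begin{pmatrix} -1 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}
+ s \begin{pmatrix} -1 \\ 0 \\ -1 \\ 0 \\ 1 \end{pmatrix},
\quad t, s \in \mathbb{R}.
\]
Notice that the free variables $y$ and $v$ correspond to the second and the fifth column of the matrix. These are the columns which do not contain a leading 1 in any of their entries, and the two vectors on the right-hand side form a basis for the solutions. □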
2.H.6.   Determine the number of solutions for the systems
(a)
\[
\begin{aligned}
12x_1 + \sqrt{5}\,x_2 + 11x_3 &= -9, \\
x_1 - 5x_3 &= -9, \\
x_1 + 2x_3 &= -7;
\end{aligned}
\]
(b)
\[
\begin{aligned}
4x_1 + 2x_2 - 12x_3 &= 0, \\
5x_1 + 2x_2 &= 0, \\
-2x_1 - x_2 + 6x_3 &= 4;
\end{aligned}
\]
(c)
\[
\begin{aligned}
4x_1 + 2x_2 - 12x_3 &= 0, \\
5x_1 + 2x_2 - x_3 &= 1, \\
-2x_1 - x_2 + 6x_3 &= 0.
\end{aligned}
\]
Solution. The vectors $(1,0,-5)$, $(1,0,2)$ are clearly linearly independent (they are not multiples of each other) and the vector $(12, \sqrt{5}, 11)$ cannot be their linear combination (its second coordinate is non-zero). Therefore the matrix whose rows are these three linearly independent vectors (from the left side) is invertible. Thus the system for case (a) has exactly one solution.
For cases (b) and (c), it is enough to note that
(4,2,-12) = -2(-2,-1,6).
In case (b) adding the first equation to the third multiplied by two gives 0 = 8, hence there is no solution for the system. In case (c) the third equation is a multiple of the first, so the system has infinitely many distinct solutions. □
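The three arguments of this solution can be replayed numerically. The sketch below is ours and rests on one assumption: the third equation of (b) is read as $-2x_1 - x_2 + 6x_3 = 4$, which is what the identity $(4,2,-12) = -2(-2,-1,6)$ requires. It computes the determinant of the matrix of case (a) by cofactor expansion (in floating point, because of $\sqrt{5}$) and forms the combination "first equation plus twice the third" for cases (b) and (c).

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# (a): a non-zero determinant means exactly one solution.
det_a = det3([[12, math.sqrt(5), 11],
              [1, 0, -5],
              [1, 0, 2]])

def first_plus_twice_third(first, third):
    """Coefficients and right-hand side of (first equation) + 2 * (third equation)."""
    return [p + 2 * q for p, q in zip(first, third)]

# Each row lists the three coefficients followed by the right-hand side.
combo_b = first_plus_twice_third([4, 2, -12, 0], [-2, -1, 6, 4])
combo_c = first_plus_twice_third([4, 2, -12, 0], [-2, -1, 6, 0])

print(det_a)     # a non-zero number, namely -7 * sqrt(5)
print(combo_b)   # [0, 0, 0, 8]: the contradiction 0 = 8
print(combo_c)   # [0, 0, 0, 0]: the third equation is redundant
```

In case (c) the redundancy alone leaves two independent equations for three unknowns, which is why the set of solutions is infinite.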
2.H.7.   Find a linear system whose set of solutions is exactly
\[ \{(t + 1,\ 2t,\ 3t,\ 4t);\ t \in \mathbb{R}\}. \]
Solution. Such a system is for instance
\[ 2x_1 - x_2 = 2, \quad 2x_2 - x_4 = 0, \quad 4x_3 - 3x_4 = 0. \]
These equations are satisfied for every $t \in \mathbb{R}$. The vectors
\[ (2,-1,0,0), \quad (0,2,0,-1), \quad (0,0,4,-3) \]
giving the left-hand sides of the equations are linearly independent (so the set of solutions is described by a single parameter). □
2.H.8. Solve the system of homogeneous linear equations given by the matrix
/0   V2   V3   V&      0 \
2    2^-2 -y/5
0    2    V5   2V3   -V3 ' \3    3 -3      0 /
o
2.H.9. Determine all solutions of the system
X2		+	X4 =	1,
2x2	- 3x3	+	4x4 =	-2,
	23	+	X4 =	2,
	- 23		=	1.
O
2.H.10. Solve
3x	-	5y	+	2u	+	Az =	2,
5x	+	7y	-	Au	-	6z =	3,
7x	-	Ay	+		+	3z =	4,
x	+	6y	-	2u	-	5z =	2
o
2.H.11. Determine whether or not the system of linear equations
\[
\begin{aligned}
3x_1 + 3x_2 + x_3 &= 1, \\
2x_1 + 3x_2 - x_3 &= 8, \\
2x_1 - 3x_2 + x_3 &= 4, \\
3x_1 - 2x_2 + x_3 &= 6
\end{aligned}
\]
of three variables $x_1, x_2, x_3$ has a solution. ○
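2.H.12. Determine the number of solutions of the system of 5 linear equations
\[ A^T \cdot x = (1,2,3,4,5)^T, \]
where
\[
x = (x_1, x_2, x_3)^T \quad\text{and}\quad
A = \begin{pmatrix}
3 & 1 & 7 & 5 & 0 \\
0 & 0 & 0 & 0 & 1 \\
2 & 1 & 4 & 3 & 0
\end{pmatrix}.
\]
Repeat the question for the system
\[ A^T \cdot x = (1,1,1,1,1)^T. \]
○
2.H.13. Depending on the parameter $a \in \mathbb{R}$, determine the solution of the system of linear equations
\[
\begin{aligned}
a x_1 + 4x_2 + 2x_3 &= 0, \\
2x_1 + 3x_2 - x_3 &= 0.
\end{aligned}
\]
○
2.H.14. Depending on the parameter $a \in \mathbb{R}$, determine the number of solutions of the system
\[
\begin{pmatrix}
4 & 1 & 4 & a \\
2 & 3 & 6 & 8 \\
3 & 2 & 5 & 4 \\
6 & -1 & 2 & -8
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} 2 \\ 5 \\ 3 \\ -3 \end{pmatrix}.
\]
○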
2.H.15. Decide whether or not there is a system of homogeneous linear equations of three variables whose set of solutions is exactly
(a) {(0,0,0)};
(b) {(0,1,0), (0,0,0), (1,1,0)};
(c) $\{(x, 1, 0);\ x \in \mathbb{R}\}$;
(d) {(x,y,2y); x,y G R}.
O
2.H.16. Solve the system of linear equations, depending on the real parameters $a$, $b$.
\[
\begin{aligned}
x + 2y + bz &= a, \\
x - y + 2z &= 1, \\
3x - y &= 1.
\end{aligned}
\]
○
2.H.17. Using the inverse matrix, compute the solution of the system
\[
\begin{aligned}
x_1 + x_2 + x_3 + x_4 &= 2, \\
x_1 + x_2 - x_3 - x_4 &= 3, \\
x_1 - x_2 + x_3 - x_4 &= 3, \\
x_1 - x_2 - x_3 + x_4 &= 5.
\end{aligned}
\]
○
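2.H.18.   For what values of parameters $a, b \in \mathbb{R}$ has the system of linear equations
\[
\begin{aligned}
x_1 - a x_2 - 2x_3 &= b, \\
x_1 + (1 - a)x_2 &= b - 3, \\
x_1 + (1 - a)x_2 + a x_3 &= 2b - 1
\end{aligned}
\]
(a) exactly one solution;
(b) no solution;
(c) at least 2 solutions? (i.e. infinitely many solutions)
Solution. We rewrite it, as usual, in the extended matrix, and transform:
\[
\left(\begin{array}{ccc|c}
1 & -a & -2 & b \\
1 & 1-a & 0 & b-3 \\
1 & 1-a & a & 2b-1
\end{array}\right)
\sim
\left(\begin{array}{ccc|c}
1 & -a & -2 & b \\
0 & 1 & 2 & -3 \\
0 & 1 & a+2 & b-1
\end{array}\right)
\sim
\left(\begin{array}{ccc|c}
1 & -a & -2 & b \\
0 & 1 & 2 & -3 \\
0 & 0 & a & b+2
\end{array}\right).
\]
At the first step we subtract the first row from the second and the third; and at the second step we subtract the second from the third. We see that the system has a unique solution (determined by backward elimination) if and only if $a \neq 0$. If $a = 0$ and $b = -2$, we have a zero row in the extended matrix. Choosing $x_3 \in \mathbb{R}$ as a parameter then gives infinitely many distinct solutions. For $a = 0$ and $b \neq -2$ the last equation $0 = b + 2$ cannot be satisfied and the system has no solution. Note that for $a = 0$, $b = -2$ the solutions are
\[ (x_1, x_2, x_3) = (-2 + 2t,\ -3 - 2t,\ t), \quad t \in \mathbb{R}, \]
and for $a \neq 0$ the unique solution is the triple
\[ \left( \frac{-3a^2 - ab - 4a + 2b + 4}{a},\ -\frac{2b + 3a + 4}{a},\ \frac{b + 2}{a} \right). \]
□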
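Closed formulas depending on parameters are error-prone, so it is worth substituting them back. The sketch below checks the unique solution of 2.H.18 for a few sample values with $a \neq 0$, and the one-parameter family for $a = 0$, $b = -2$. The function names and the sample values are our own choices, and a finite number of samples is of course only a sanity check, not a proof.

```python
from fractions import Fraction

def residuals(a, b, x):
    """Left-hand side minus right-hand side of the three equations of 2.H.18."""
    x1, x2, x3 = x
    return [x1 - a * x2 - 2 * x3 - b,
            x1 + (1 - a) * x2 - (b - 3),
            x1 + (1 - a) * x2 + a * x3 - (2 * b - 1)]

def unique_solution(a, b):
    """Closed form of the solution, valid for a != 0."""
    return [(-3 * a * a - a * b - 4 * a + 2 * b + 4) / a,
            -(2 * b + 3 * a + 4) / a,
            (b + 2) / a]

# Case a != 0: the closed form satisfies all three equations.
generic_ok = all(residuals(a, b, unique_solution(a, b)) == [0, 0, 0]
                 for a in (Fraction(1), Fraction(-3), Fraction(2, 7))
                 for b in (Fraction(0), Fraction(-2), Fraction(5, 3)))

# Case a = 0, b = -2: the one-parameter family (-2 + 2t, -3 - 2t, t).
family_ok = all(residuals(0, -2, [-2 + 2 * t, -3 - 2 * t, t]) == [0, 0, 0]
                for t in (Fraction(-1), Fraction(0), Fraction(4, 3)))

print(generic_ok, family_ok)
```

Using `Fraction` rather than floating point keeps the comparison with zero exact, which matters because the formulas divide by the parameter $a$.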
2.H.19.
'■   -   - (Xl\
x2   ,   b = yx3J
Find real numbers b1,b2,b3 such that the system of linear equations A ■ x = b has:
(a) infinitely many solutions;
(b) unique solution;
(c) no solution;
(d) exactly four solutions.
Solution. It is enough to choose $b_1 = b_2 + b_3$ in case (a) and $b_1 \neq b_2 + b_3$ in case (c). Since these two cases cover all possibilities for $b_1, b_2, b_3$, variant (d) cannot occur. Variant (b) cannot occur, since the matrix $A$ is not invertible. □
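2.H.20. Factor the following permutations into a product of transpositions:
i) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 7 & 6 & 5 & 4 & 3 & 2 & 1 \end{pmatrix}$,
ii) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 6 & 4 & 1 & 2 & 5 & 8 & 3 & 7 \end{pmatrix}$,
iii) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ 4 & 6 & 1 & 10 & 2 & 5 & 9 & 8 & 3 & 7 \end{pmatrix}$.
2.H.21. Determine the parity of the given permutations:
i) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 7 & 5 & 6 & 4 & 1 & 2 & 3 \end{pmatrix}$,
ii) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 6 & 7 & 1 & 2 & 3 & 8 & 4 & 5 \end{pmatrix}$,
iii) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ 9 & 7 & 1 & 10 & 2 & 5 & 4 & 8 & 3 & 6 \end{pmatrix}$.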
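Both tasks, counting inversions and factoring into transpositions, are mechanical and can be sketched in a few lines. In the code below a permutation of $1, \dots, n$ is given by the bottom row of its two-row notation; the function names are ours. The factorization writes every cycle $(c_1\, c_2 \dots c_k)$ as $(c_1, c_2)(c_2, c_3) \cdots (c_{k-1}, c_k)$, under the assumption that products are composed from right to left (the rightmost transposition acts first).

```python
def inversions(perm):
    """Number of pairs i < j with perm[i] > perm[j] (perm is the bottom row)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

def cycles(perm):
    """Disjoint cycles of length at least 2 of a permutation of 1..n."""
    seen, result = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        cycle, k = [], start
        while k not in seen:
            seen.add(k)
            cycle.append(k)
            k = perm[k - 1]
        if len(cycle) > 1:
            result.append(cycle)
    return result

def transpositions(perm):
    """Factor each cycle (c1 c2 ... ck) as (c1,c2)(c2,c3)...(c(k-1),ck)."""
    return [(c[i], c[i + 1]) for c in cycles(perm) for i in range(len(c) - 1)]

def parity(perm):
    """Parity of the permutation, determined by the number of inversions."""
    return "even" if inversions(perm) % 2 == 0 else "odd"

p = [7, 5, 6, 4, 1, 2, 3]           # bottom row of 2.H.21 i)
q = [6, 4, 1, 2, 5, 8, 3, 7]        # bottom row of 2.H.20 ii)
print(inversions(p), parity(p))
print(transpositions(q), parity(q))
```

The number of transpositions in any factorization has the same parity as the number of inversions, so the two functions provide a cross-check of each other.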
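2.H.22. Find the algebraically adjoint matrix $F^*$ for
\[ F = \begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}, \quad \alpha, \beta, \gamma, \delta \in \mathbb{R}. \]
○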
2.H.23. Calculate the algebraically adjoint matrix for the matrices
(a)
/3 -2 0
0 2 2 1
1 -2 -3 -2 \0 1 2 1 /
where i denotes the imaginary unit.
(b)
1 + 2 22
3-2i 6
O
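2.H.24. Is the set $V = \{(1, x);\ x \in \mathbb{R}\}$ with operations
\[ \oplus : V \times V \to V, \quad (1,y) \oplus (1,z) = (1, z + y) \quad\text{for all } z, y \in \mathbb{R}, \]
\[ \odot : \mathbb{R} \times V \to V, \quad z \odot (1,y) = (1, y \cdot z) \quad\text{for all } z, y \in \mathbb{R}, \]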
a vector space?
O
2.H.25. Express the vector $(5,1,11)$ as a linear combination of the vectors $(3,2,2)$, $(2,3,1)$, $(1,1,3)$, that is, find numbers $p, q, r \in \mathbb{R}$, for which
\[ (5,1,11) = p\,(3,2,2) + q\,(2,3,1) + r\,(1,1,3). \]
O
2.H.26. In R3, determine the matrix of rotation through the angle 120° in the positive sense about the vector (1,0,1) O
2.H.27. In the vector space R3, determine the matrix of the orthogonal projection onto the plane x + y — 2z = 0. O
2.H.28. In the vector space R3, determine the matrix of the orthogonal projection on the plane 2x — y + 2z = 0. O
2.H.29. Determine whether the subspaces $U = \langle (2,1,2,2) \rangle$ and $V = \langle (-1,0,-1,2), (-1,0,1,0), (0,0,1,-1) \rangle$ of the space $\mathbb{R}^4$ are orthogonal. If they are, is $\mathbb{R}^4 = U \oplus V$, that is, is $U^\perp = V$?
2.H.30. Let $p$ be a given line:
\[ p: [1,1] + (4,1)t, \quad t \in \mathbb{R}. \]
Determine the parametric expression of all lines $q$ that pass through the origin and make an angle of 60° with the line $p$. ○
2.H.31. Depending on the parameter $t \in \mathbb{R}$, determine the dimension of the subspace $U$ of the vector space $\mathbb{R}^3$, if $U$ is generated by the vectors
(a) $u_1 = (1,1,1)$, $u_2 = (1,t,1)$, $u_3 = (2,2,t)$;
(b) $u_1 = (t,t,t)$, $u_2 = (-4t,-4t,4t)$, $u_3 = (-2,-2,-2)$.
2.H.32. Construct an orthogonal basis of the subspace
((1,1,1,1), (1,1,1,-1), (-1,1,1,1)) of the space R4.
2.H.33. In the space R4, find an orthogonal basis of the subspace of all linear combinations of the vectors (1, 0,1,0),
(0,1,0,-7), (4,-2,4,14).
Find an orthogonal basis of the subspace generated by the vectors (1, 2,2, —1),    (1,1, —5, 3),    (3,2,8, —7).
2.H.34. For what values of the parameters $a, b \in \mathbb{R}$ are the vectors
\[ (1,1,2,0,0), \quad (1,-1,0,1,a), \quad (1,b,2,3,-2) \]
in the space $\mathbb{R}^5$ pairwise orthogonal?
2.H.35. In the space R5, consider the subspace generated by the vectors
(1,1,-1,-1,0), (1, —1, —1,0, —1), (1,1,0,1,1), (—1,0, —1,1,1). Find a basis for its orthogonal complement.
2.H.36. Describe the orthogonal complement of the subspace V of the space R4, if V is generated by the vectors (—1,2,0,1), (3,1,-2, 4), (-4,1,2,-4), (2, 3,-2, 5).
2.H.37. In the space $\mathbb{R}^5$, determine the orthogonal complement $W^\perp$ of the subspace $W$, if
(a) $W = \{(r + s + t,\ -r + t,\ r + s,\ -t,\ s + t);\ r, s, t \in \mathbb{R}\}$;
(b) $W$ is the set of the solutions of the system of equations $x_1 - x_3 = 0$, $x_1 - x_2 + x_3 - x_4 + x_5 = 0$.
2.H.38. In the space $\mathbb{R}^4$, let
(1,-2,2,1), (1,3,2,1)
be given vectors. Extend these two vectors into an orthogonal basis of the whole R4. (You can do this in any way you wish, for instance by using the Gram-Schmidt orthogonalization process.)
2.H.39. Define an inner product on the vector space of the matrices from the previous exercise. Compute the norm of the matrix from the previous exercise, induced by the product you have defined. O
2.H.40. Find a basis for the vector space of all antisymmetric real square matrices of the type $4 \times 4$. Consider the standard inner product in this basis and using this inner product, express the size of the matrix
\[
\begin{pmatrix}
0 & 3 & 1 & 0 \\
-3 & 0 & 1 & 2 \\
-1 & -1 & 0 & 2 \\
0 & -2 & -2 & 0
\end{pmatrix}.
\]
2.H.41.   Find the eigenvalues and the associated eigenspaces of eigenvectors of the matrix:
Solution. The characteristic polynomial of the matrix is A3 — 6A2 + 12A — 8, which is (A — 2)3. The number 2 is thus an eigenvalue with algebraic multiplicity three. Its geometric multiplicity is either one, two or three. We determine the vectors associated to this eigenvalue as the solutions of the system
—Xi      +X2      = 0,
(A - 2E)x = -x1    +x2    = 0, 2xi    -2x2   = 0.
Its solutions form the two-dimensional space ((1, —1, 0), (0,0,1)}. Thus the eigenvalue 2 has algebraic multiplicity 3 and geometric multiplicity 2.
□
2.H.42. Determine the eigenvalues of the matrix
\[
\begin{pmatrix}
-13 & 5 & 4 & 2 \\
0 & -1 & 0 & 0 \\
-30 & 12 & 9 & 5 \\
-12 & 6 & 4 & 1
\end{pmatrix}.
\]
○
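2.H.43. Given that the numbers $1$, $-1$ are eigenvalues of the matrix
\[
A = \begin{pmatrix}
-11 & 5 & 4 & 1 \\
-3 & 0 & 1 & 0 \\
-21 & 11 & 8 & 2 \\
-9 & 5 & 3 & 1
\end{pmatrix},
\]
find all solutions of the characteristic equation $|A - \lambda E| = 0$. Hint: if you denote all the roots of the polynomial $|A - \lambda E|$ by $\lambda_1, \lambda_2, \lambda_3, \lambda_4$, then
\[ |A| = \lambda_1 \cdot \lambda_2 \cdot \lambda_3 \cdot \lambda_4, \quad\text{and}\quad \operatorname{tr} A = \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4. \]
○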
2.H.44. Find a four-dimensional matrix with eigenvalues $\lambda_1 = 6$ and $\lambda_2 = 7$ such that the multiplicity of $\lambda_2$ as a root of the characteristic polynomial is three, and that
(a) the dimension of the subspace of eigenvectors of $\lambda_2$ is 3;
(b) the dimension of the subspace of eigenvectors of $\lambda_2$ is 2;
(c) the dimension of the subspace of eigenvectors of $\lambda_2$ is 1.
O
2.H.45. Find the eigenvalues and the eigenvectors of the matrix:
/-l -i
5 5
< 3 ,3 3
2.H.46. Determine the characteristic polynomial $|A - \lambda E|$, eigenvalues and eigenvectors of the matrix
\[
\begin{pmatrix}
4 & -1 & 6 \\
2 & 1 & 6 \\
2 & -1 & 8
\end{pmatrix}.
\]
○
Solutions to the exercises
2 A.ll. There is only one such matrix X, and it is
-32 -8
1	10	-4
1	12	-5
0	5	-2
A =
A =
'14   13 -13> 13   14 13 v 0    0     27 y
2.A./5.
2A.16. CT
í2	-3	0	0	
-5	8	0	0	0
0	0	-1	0	0
0	0	0	-5	2
\o	0	0	3	-V
0
1   -1 0
\1 -1 -1
2A.17. In the first case we have 1
-1 0
a-1 =
in the second
2D.7. (2 + ^,2-^3).
2D.8. The vectors are dependent whenever at least one of the conditions
a = b = 1,       a = c = 1,       b = c = 1
is satisfied.
2D.9. Vectors are linearly independent.
2D.10. It suffices to add for instance the polynomial x.
2.F.5. cos
_ V2
2.G.3. We have $|A - \lambda E| = -\lambda^3 + 12\lambda^2 - 47\lambda + 60$, $\lambda_1 = 3$, $\lambda_2 = 4$, $\lambda_3 = 5$.
2.G.11. The solution is the sequence 0,1, 2.
2.G.12. The dimension is 1 for $\lambda_1 = 4$ and 2 for $\lambda_2 = 3$.
2.H.8. The solutions are all scalar multiples of the vector
\[ (1 + \sqrt{3},\ -\sqrt{3},\ 0,\ 1,\ 0). \]
2.H.9. XI = 1 + t,    X2 = f,    X3 = t, : 2.H.10. The system has no solution. 2.H.11. The system has a solution, because
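\[
3 \begin{pmatrix} 3 \\ 2 \\ 2 \\ 3 \end{pmatrix}
- \begin{pmatrix} 3 \\ 3 \\ -3 \\ -2 \end{pmatrix}
- 5 \begin{pmatrix} 1 \\ -1 \\ 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 1 \\ 8 \\ 4 \\ 6 \end{pmatrix}.
\]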
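2.H.12. The system of linear equations
\[
\begin{aligned}
3x_1 + 2x_3 &= 1, \\
x_1 + x_3 &= 2, \\
7x_1 + 4x_3 &= 3, \\
5x_1 + 3x_3 &= 4, \\
x_2 &= 5
\end{aligned}
\]
has no solution, while the system
\[
\begin{aligned}
3x_1 + 2x_3 &= 1, \\
x_1 + x_3 &= 1, \\
7x_1 + 4x_3 &= 1, \\
5x_1 + 3x_3 &= 1, \\
x_2 &= 1
\end{aligned}
\]
has a unique solution $x_1 = -1$, $x_2 = 1$, $x_3 = 2$.
2.H.13. The set of all solutions is given by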
{(-10t, (a + 4)t, (3a - 8)t) ; t e R}.
2.H.14. For $a = 0$, the system has no solution. For $a \neq 0$ the system has infinitely many solutions.
2.H.15. The correct answers are "yes", "no", "no" and "yes" respectively.
2.H.16. i) If $b \neq -7$, then $x = z = (2 + a)/(b + 7)$, $y = (3a - b - 1)/(b + 7)$. ii) If $b = -7$ and $a \neq -2$, then there is no solution. iii) If $a = -2$ and $b = -7$ then the solution is $x = z = t$, $y = 3t - 1$, for any $t$.
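2.H.17. The inverse of the matrix of the system is
\[
A^{-1} = \frac14
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix}.
\]
We can then easily obtain
\[ x_1 = \frac{13}{4}, \quad x_2 = -\frac34, \quad x_3 = -\frac34, \quad x_4 = \frac14. \]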
2.H.20. i) (1,7)(2,6)(5,3), ii) (1,6)(6,8)(8,7)(7,3)(2,4), iii) (1,4)(4,10)(10,7)(7,9)(9,3)(2,6)(6,5).
2.H.21. i) 17 inversions, odd, ii) 12 inversions, even, iii) 25 inversions, odd.
2.H.22. From the knowledge of the inverse matrix $F^{-1}$ we obtain
\[ F^* = (\alpha\delta - \beta\gamma)\, F^{-1} = \begin{pmatrix} \delta & -\beta \\ -\gamma & \alpha \end{pmatrix} \]
for any $\alpha, \beta, \gamma, \delta \in \mathbb{R}$.
2.H.23. The matrices are
(a)
f1
0
-4\ -1 6
(b)
6
-3 + 2i
-2i 1+i
\ 2      1-6 -10/
2.H.24. It is easy to check that it is a vector space. The first coordinate does not affect the results of the operations; it is just the vector space $(\mathbb{R}, +, \cdot)$ written in a different way.
2.H.25. There is a unique solution
\[ p = 2, \quad q = -2, \quad r = 3. \]
2.H.26.
\[
\begin{pmatrix}
1/4 & -\sqrt{6}/4 & 3/4 \\
\sqrt{6}/4 & -1/2 & -\sqrt{6}/4 \\
3/4 & \sqrt{6}/4 & 1/4
\end{pmatrix}.
\]
2.H.27.
\[
\begin{pmatrix}
5/6 & -1/6 & 1/3 \\
-1/6 & 5/6 & 1/3 \\
1/3 & 1/3 & 1/3
\end{pmatrix}.
\]
2.H.28.
\[
\begin{pmatrix}
5/9 & 2/9 & -4/9 \\
2/9 & 8/9 & 2/9 \\
-4/9 & 2/9 & 5/9
\end{pmatrix}.
\]
2.H.29. The vector that determines the subspace $U$ is perpendicular to each of the three vectors that generate $V$. The subspaces are thus orthogonal. But it is not true that $\mathbb{R}^4 = U \oplus V$. The subspace $V$ is only two-dimensional, because
(-1,0, -1, 2) = (-1,0,1,0) - 2 (0,0,1, -1).
2.H.30.
qi : (2 -
V3
2.H.31. In the first case we have $\dim U = 2$ for $t \in \{1, 2\}$, otherwise we have $\dim U = 3$. In the second case we have $\dim U = 2$ for $t \neq 0$ and $\dim U = 1$ for $t = 0$.
2.H.32. Using the Gram-Schmidt orthogonalization process we can obtain the result
((1,1,1,1), (1,1,1,-3), (-2,1,1,0)).
2.H.33. We have for instance the orthogonal bases
((1,0,1,0), (0,1,0,-7))
for the first part, and
((1, 2, 2, -1), (2, 3, -3, 2), (2, -1, -1, -2)).
for the second part.
2.H.34. The solution is $a = 9/2$, $b = -5$, because
\[ 1 + b + 4 + 0 + 0 = 0, \qquad 1 - b + 0 + 3 - 2a = 0. \]
2.H.35. The basis must contain a single vector. It is
\[ (3, -7, 1, -5, 9) \]
(or any non-zero scalar multiple thereof).
2.H.36. The orthogonal complement $V^\perp$ is the set of all scalar multiples of the vector $(4, 2, 7, 0)$.
2.H.37.
(a) $W^\perp = \langle (1,0,-1,1,0), (1,3,2,1,-3) \rangle$;
(b) $W^\perp = \langle (1,0,-1,0,0), (1,-1,1,-1,1) \rangle$.
2.H.38. There are infinitely many possible extensions, of course. A very simple one is
(1,-2,2,1),    (1,3,2,1),    (1,0,0,-1), (1,0,-1,1).
2.H.39. For instance, one can use the inner product that follows from the isomorphism of the space of all real 3x3 matrices with the space R9. If we use the product from R9, we obtain an inner product that assigns to two matrices the sum of products of two corresponding elements. For the given matrix we obtain
\[ \sqrt{1^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 1^2 + (-2)^2 + (-3)^2} = \sqrt{23}. \]
2.H.40.
2.H.42. The matrix has only one eigenvalue, namely $-1$, since the characteristic polynomial is $(\lambda + 1)^4$.
2.H.43. The root $-1$ of the polynomial $|A - \lambda E|$ has multiplicity three.
2.H.44. Possible examples are
\[
\text{(a)}\ \begin{pmatrix}
6 & 0 & 0 & 0 \\
0 & 7 & 0 & 0 \\
0 & 0 & 7 & 0 \\
0 & 0 & 0 & 7
\end{pmatrix},
\qquad
\text{(b)}\ \begin{pmatrix}
6 & 0 & 0 & 0 \\
0 & 7 & 1 & 0 \\
0 & 0 & 7 & 0 \\
0 & 0 & 0 & 7
\end{pmatrix},
\qquad
\text{(c)}\ \begin{pmatrix}
6 & 0 & 0 & 0 \\
0 & 7 & 1 & 0 \\
0 & 0 & 7 & 1 \\
0 & 0 & 0 & 7
\end{pmatrix}.
\]
2.H.45. There is a triple eigenvalue $-1$. The corresponding eigenspace is $\langle (1,0,0), (0,2,1) \rangle$.
2.H.46. The characteristic polynomial is $-(\lambda - 2)^2 (\lambda - 9)$, that is, the eigenvalues are 2 and 9 with associated eigenvectors
\[ (1,2,0),\ (-3,0,1) \quad\text{and}\quad (1,1,1), \]
respectively.
CHAPTER 3
Linear models and matrix calculus
Where are matrices useful? Basically almost everywhere.
A. Linear optimization
Let us start with an example of a very simple problem:
3.A.1. A company manufactures bolts and nuts. Nuts and bolts are moulded: moulding a box of bolts takes one minute, moulding a box of nuts takes 2 minutes. Preparing the box itself takes one minute for bolts and 4 minutes for nuts. The company has at its disposal two hours for moulding and three hours for box preparation. Demand says that it is necessary to manufacture at least 90 more boxes of bolts than boxes of nuts. For technical reasons it is not possible to manufacture more than 110 boxes of bolts. The profit from one box of bolts is $4 and the profit from one box of nuts is $6. The company has no trouble with selling. How many boxes of nuts and bolts should be manufactured in order to have maximal profit?
Solution. Write the given data into a table:
	Bolts (1 box)	Nuts (1 box)	Capacity
Moulding	1 min/box	2 min/box	2 hours
Box preparation	1 min/box	4 min/box	3 hours
Profit	$4/box	$6/box	
We have already developed a useful package of tools and it is time to show some applications of matrix calculus. It might seem that the assumption of linearity of relations between quantities is too restrictive. But this is often not so. In real problems, linear relations may appear directly. A problem may be solved as a result of an iteration of many linear steps. If this is not the case, we may still use this approach at least to approximate real non-linear processes.
We should also like to compute with matrices (and linear mappings) as easily as we can compute with scalars. In order to do that, we prepare the necessary tools in the second part of this chapter. We also present a useful application of matrix decompositions to the pseudoinverse matrices, which are needed for numerical mastery of matrix calculus.
We try to illustrate all the phenomena with rather easy problems. Still, some parts of this chapter are perhaps difficult on a first reading. This concerns in particular the very first part, which provides some glimpses of linear optimization (linear programming), the third part devoted to iterated processes (the Frobenius-Perron theory) and some more advanced parts of the matrix calculus at the end (the Jordan canonical form, decompositions, and pseudo-inverses of matrices). The reader should feel free to move forward if getting lost.
1. Linear optimization
The simplest linear processes are given by linear mappings $\varphi : V \to W$ on vector spaces. As we can surely imagine, the vector $v \in V$ can represent the state of some system we are observing, while $\varphi(v)$ gives the result after some process is realized.
If we want to reach a given result $b \in W$ of such a process, we solve the problem
$\varphi(x) = b$
for some unknown vector $x$ and a known vector $b$.
In fixed coordinates we then have the matrix $A$ of the mapping $\varphi$ and the coordinate expression of the vector $b$. We have mastered such problems in the previous chapter. Now we draw more interesting conclusions in the setup of linear optimization models (also called linear programming).
Denote by $x_1$ the number of manufactured boxes of bolts and by $x_2$ the number of manufactured boxes of nuts. From the restrictions on the moulding time and on the box preparation time, together with the remaining requirements, we obtain the following restrictive conditions:
$x_1 + 2x_2 \le 120,$
$x_1 + 4x_2 \le 180,$
$x_1 \ge x_2 + 90,$
$x_1 \le 110.$
The objective function (the function that gives the profit for a given number of manufactured nuts and bolts) is $4x_1 + 6x_2$. The previous system of inequalities defines a region in $\mathbb{R}^2$. Optimisation of the profit means finding the point (or points) of this region in which the objective function attains its maximum value, that is, finding the largest $k$ such that the line $4x_1 + 6x_2 = k$ has a non-empty intersection with the given region. Graphically, we can find the solution for example by placing the line $p$ given by the equation $4x_1 + 6x_2 = 0$ into the plane and moving it "upwards" as long as it has some intersection with the region. It is clear that the last intersection is either a point or a segment parallel to $p$ forming a part of the border of the region. Thus we obtain (see the figure) the point $x_1 = 110$ and $x_2 = 5$. The maximum possible profit is thus $4 \cdot 110 + 6 \cdot 5 = 470$ dollars.
				
				
				
[Figure: the region of admissible points in the $(x_1, x_2)$-plane, with the bounding line $x_1 = 110$ marked.]
□
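The graphical argument can also be checked numerically. The following is only a minimal sketch, not the method used in the text: it assumes that the maximum is attained at a vertex of the region, so it simply intersects all pairs of boundary lines, keeps the admissible intersection points and compares the profit there. The helper names `vertices` and `profit` are ours, and the non-negativity conditions are added explicitly.

```python
from fractions import Fraction
from itertools import combinations

# Constraints of 3.A.1, each written as a1*x1 + a2*x2 <= rhs.
constraints = [
    (1, 2, 120),    # moulding time
    (1, 4, 180),    # box preparation time
    (-1, 1, -90),   # x1 >= x2 + 90
    (1, 0, 110),    # x1 <= 110
    (-1, 0, 0),     # x1 >= 0 (added assumption: no negative amounts)
    (0, -1, 0),     # x2 >= 0
]

def profit(point):
    """Objective function 4*x1 + 6*x2."""
    return 4 * point[0] + 6 * point[1]

def vertices(cons):
    """Intersect every pair of boundary lines and keep admissible points."""
    points = set()
    for (a1, a2, r), (b1, b2, s) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines
        # Cramer's rule for the 2x2 system
        x1 = Fraction(r * b2 - a2 * s, det)
        x2 = Fraction(a1 * s - r * b1, det)
        if all(c1 * x1 + c2 * x2 <= t for c1, c2, t in cons):
            points.add((x1, x2))
    return points

best = max(vertices(constraints), key=profit)
print(best, profit(best))
```

For this data the admissible vertices are (90, 0), (110, 0), (110, 5) and (100, 10), and the largest profit, 470, occurs at (110, 5), in agreement with the solution above. Enumerating all pairs of constraints is only sensible for such tiny problems.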
3.A.2. Minimisation of costs for feeding. A stable in Nisovice u Volyne buys fodder for winter: hay and oats. The
3.1.1. Linear optimization. In the practical column, the previous chapter started with a painting problem, and we shall continue here in a similar way. Imagine that our very specialized painter in a black&white world is willing to paint facades of either small family houses or of large public buildings, and that he (of course) uses only black and white colours. He can arbitrarily choose proportions between $x_1$ units of area for the small houses and $x_2$ units for the large buildings. Assume that his maximal workload in a given interval is $L$ units of area, and that his net income (that is, after subtracting the costs) is $c_1$ per unit of area for small houses and $c_2$ per unit of area for large buildings. Furthermore, he has only $W$ kg of white colour and $B$ kg of black colour at his disposal. Finally, a unit of area for small houses requires $w_1$ kg of white colour and $b_1$ kg of black colour. For large buildings the corresponding values are $w_2$ and $b_2$.
If we write all this information as inequalities, we obtain the conditions
(1) $x_1 + x_2 \le L,$
(2) $w_1 x_1 + w_2 x_2 \le W,$
(3) $b_1 x_1 + b_2 x_2 \le B.$
The total net income of the painter, which is the following linear form $h$,
$h(x_1, x_2) = c_1 x_1 + c_2 x_2,$
is to be maximized.
Each of the given inequalities clearly determines a half-plane in the plane of the variables $(x_1, x_2)$, bounded by the line given by the corresponding equality, and we must also assume that both $x_1$ and $x_2$ are non-negative real numbers (because the painter cannot paint negative areas). Thus we have constraints for the values $(x_1, x_2)$: either the constraints are unsatisfiable, or they allow points inside a polygon with at most five vertices. See the diagram.
[Figure: the polygon of admissible points with lines of constant value of $h$; the optimal line of constant value passes through one of the vertices.]
How to solve such a problem? We seek the maximum value of a linear form h over subsets M of a vector space
nutritional values of the fodder and required daily portions for one foal are given in the table:
g/kg	Hay	Oats	Requirements
Dry basis	841	860	> 6300 g
Digestible nitrogen stuff	53	123	> 1150 g
Starch	0.348	0.868	< 5.35 g
Calcium	6	1.6	> 30 g
Phosphate	2.8	3.5	< 44 g
Sodium	0.2	1.4	~ 7 g
Cost (€/kg)	1.80	1.60	
Every foal must obtain in its daily meal at least 2 kg of oats. The average cost (counting the payment for the transportation) is €1.80 per 1kg of hay and €1.60 per 1kg of oats. Compose a daily diet for one foal which has minimum costs. O
The previous three examples could be solved by drawing the diagram and checking all the vertices on the boundary of the polygonal area $M \subset \mathbb{R}^2$. Moreover, we know that the maximum will be at one of the extremes in the direction of the normal to the defining line of the linear cost function $h$.
But the principle works in higher dimensions as well. If there is a function $f : \mathbb{R}^n \to \mathbb{R}$, $f(x_1, \dots, x_n) = c_0 + c_1 x_1 + \dots + c_n x_n$ (we call it the objective function), its values at the points $A = (a_1, \dots, a_n)$ and $B = A + u = (a_1 + u_1, \dots, a_n + u_n)$ differ by $f(B) - f(A) = c_1 u_1 + \dots + c_n u_n$, which is the scalar product of the vectors $(c_1, \dots, c_n)$ and $(u_1, \dots, u_n)$. The relation between the scalar product and the cosine of the angle between vectors ensures that each level set of the function $f$ is a hyperplane in $\mathbb{R}^n$ with normal $(c_1, \dots, c_n)$. Such a hyperplane splits the space $\mathbb{R}^n$ into two half-spaces. Clearly the function grows when moving into one of those half-spaces and declines in the other one. This is essentially the same principle as we saw when discussing the visibility of segments in dimension 2 (we checked whether the observer is to the left or to the right of the oriented segment, cf. 1.5.12).
This observation leads to an algorithm for finding the extremal values of the linear objective function / on the set M of admissible points defined by linear inequalities.
We shall deal with the standard problem of linear programming. That is, we want to maximize the linear function $h = c_1 x_1 + \dots + c_n x_n$ on the set $M$ given by $Ax \le b$ and $x \ge 0$ (here the inequality between vectors means the inequality between all their individual components). As explained in 3.1.6, we may add slack variables $x_s$, one for each inequality.
which are defined by linear inequalities. In the plane, M is given by the intersection of half planes.
Next, note that every linear form $h : V \to \mathbb{R}$ over a real vector space (that is, an arbitrary linear scalar function) is monotone in every chosen direction. More precisely, if we choose a fixed starting vector $u \in V$ and a "directional" vector $v \in V$, then the composition of our form $h$ with this parametrization yields
$t \mapsto h(u + t v) = h(u) + t\, h(v).$
This expression is indeed either increasing, or decreasing, or constant (depending on whether $h(v)$ is positive, negative or zero), as a function of $t$.
Thus, if the set $M$ is bounded as in our picture above, we easily find the solution by testing the value of $h$ at the vertices of the boundary polygon. In general, we must expect that problems similar to the one with the painter are either unsatisfiable (if the set given by the constraints is empty), or the profit is unbounded (if the constraints allow for unbounded directions in the space and the form $h$ is non-zero in some of the unbounded directions), or they attain a maximal solution in at least one of the "vertices" of the set $M$. Normally the maximum is attained at a single point of $M$, but sometimes it is attained on a part of the boundary of the set $M$.
Try to choose explicit values for the parameters $w_1$, $w_2$, $b_1$, $b_2$, $c_1$, $c_2$, draw the above picture for these parameters and find the explicit solution to the problem (if it exists)!
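One possible choice, purely for illustration (all the numbers below are our own assumption, they are not taken from the text): let $L = 10$, $W = 16$, $B = 16$, $w_1 = 1$, $w_2 = 2$, $b_1 = 2$, $b_2 = 1$, $c_1 = 3$, $c_2 = 2$. The constraints then read

$$x_1 + x_2 \le 10, \qquad x_1 + 2x_2 \le 16, \qquad 2x_1 + x_2 \le 16, \qquad x_1, x_2 \ge 0,$$

and the admissible set is a polygon with the five vertices $(0,0)$, $(8,0)$, $(6,4)$, $(4,6)$, $(0,8)$. Evaluating $h(x_1, x_2) = 3x_1 + 2x_2$ at these vertices gives

$$h(0,0) = 0, \quad h(8,0) = 24, \quad h(6,4) = 26, \quad h(4,6) = 24, \quad h(0,8) = 16,$$

so for this particular choice the maximal net income is $26$, attained at the single vertex $(6, 4)$.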
3.1.2. Terminology. In general we speak of a linear programming problem whenever we seek either the maximum or minimum value of a linear form $h$ over $\mathbb{R}^n$ on a set bounded by a system of linear inequalities which we call linear constraints. The vector on the right side is then called the vector of constraints. The linear form $h$ is also called the objective function.¹ In real practice we meet hundreds or thousands of constraints for dozens of variables.
The standard maximization problem is defined by seeking a maximum of the objective function while the restrictive inequalities are $\le$ and the variables are non-negative. On the other hand, the standard minimization problem is defined by seeking a minimum of the objective function while the restrictive inequalities are $\ge$ and the variables are non-negative. It is easy to see that every general linear programming problem can be transformed into a standard one of either type. Aside from sign changes, we can work with a decomposition of the variables that have no sign restriction into a difference of two non-negative ones. Without loss of generality we will work only with the standard maximization problem.
¹ Leonid Kantorovich and Tjalling Koopmans shared the 1975 Nobel prize in economics for their formulations and solutions of economical and logistics problems in a similar way during the Second World War. But it was George B. Dantzig who independently developed the general linear programming formulation in the period 1946-49, motivated by planning problems in the US Air Force. Among others, he invented the simplex method algorithm.
Thus we may restrict ourselves to the problem of maximizing $h$ on the space of solutions of a system of linear equations, with the additional condition that all the coordinates must be non-negative.
If there are more general inequalities, we can always bring them into our form by multiplying them by $-1$, and minimization of the value of $h$ corresponds to maximization of $-h$.
As explained in more detail in 3.1.1 and 3.1.6, we add the first row of coefficients of $h$ (with minus signs) and use the simplex tableau:
$$\begin{array}{ccc|c} -c_1 & \cdots & -c_n & 0 \\ a_{11} & \cdots & a_{1n} & b_1 \\ \vdots & & \vdots & \vdots \\ a_{m1} & \cdots & a_{mn} & b_m \end{array}$$
We start the algorithm if we find $m$ columns (here $m$ is the number of equations in the problem) such that Gauss elimination for these columns leads to a unit submatrix in $A$ and positive values at the positions of all $b_i$. The coordinates corresponding to these columns and the values 1 in them are called the basic coordinates. We restrict ourselves to the cases where all $b_i$ are non-negative in the original problem; then we choose all the slack variables as the basic ones and the initialization of the algorithm is done.
Next we move in the following iterated steps (compare the more theoretical explanation in 3.1.6):
We choose the first column from the left having a negative value in the first row. In this column (let it be the $j$-th column), we pick the positive entry $a_{ij}$ in $A$ which provides the minimal ratio $b_i / a_{ij}$ (we call this entry the pivot). Finally we eliminate the entire chosen column with the help of the chosen $a_{ij}$. This means that by elementary row transformations we achieve that the $j$-th column contains the value 1 in the row of the pivot, with all other values vanishing.
We explain the procedure by an example:
3.A.3. Minimize the function $-3x - y - 2z$ under the conditions $x, y, z \ge 0$ and
$x - y + z \ge -4,$
$2x + z \le 3,$
$x + y + 3z \le 8.$
Solution. First we multiply the objective function and the first inequality by $-1$. We get the equivalent task of maximizing
3.1.3. Formulation using linear equations. Finding an optimum is not always as simple as in the previous 2-dimensional case. The problem can contain many variables and constraints and even deciding whether the set M of the feasible points is non-empty can be a problem.
We do not have the ambition to go into detailed theory here. But we mention at least some ideas which show that the solution can be always found, and then we build an effective algorithm solving the problem in the next paragraphs.
We begin by comparison with systems of linear equations, because we understand those well. We write the inequalities (1)-(3) in 3.1.1 in the general form:
$A \cdot x \le b,$
where $x$ is now an $n$-dimensional vector, $b$ is an $m$-dimensional vector and $A$ is the corresponding matrix. By an inequality between vectors we mean the individual inequalities between all coordinates. We want to maximize the product $c \cdot x$ for a given row vector $c$ of coefficients of the linear form $h$ over the feasible values of $x$. If we add new auxiliary variables $x_s$, one for every inequality, and add another variable $z$ for the value of the linear form $h$, we can rewrite the whole system as a system of linear equations
(1) $\begin{pmatrix} 1 & -c & 0 \\ 0 & A & E \end{pmatrix} \cdot \begin{pmatrix} z \\ x \\ x_s \end{pmatrix} = \begin{pmatrix} z - c \cdot x \\ A \cdot x + x_s \end{pmatrix} = \begin{pmatrix} 0 \\ b \end{pmatrix},$
where the matrix is composed of blocks with $1 + n + m$ columns and $1 + m$ rows, with corresponding individual components of the vectors. We call the new variables $x_s$ the slack variables. Moreover, we require non-negativity for all coordinates of $x$ and $x_s$. If the given system of equations has a solution, we seek values for the variables $z$, $x$ and $x_s$ such that all $x$ and $x_s$ are non-negative and $z$ is maximized. In paragraph 4.1.11 on page 240 we will discuss this situation from the viewpoint of affine geometry. Now we just notice that being on the boundary of the set $M$ of feasible points of the problem is equivalent to having some of the slack variables vanishing. Our algorithm will try to move from one such position to another while increasing $h$. But we shall need some conceptual preparation first.
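Specifically, in our problem of the black&white painter from 3.1.1, the system of linear equations looks like this:
$$\begin{pmatrix} 1 & -c_1 & -c_2 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & w_1 & w_2 & 0 & 1 & 0 \\ 0 & b_1 & b_2 & 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} z \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} = \begin{pmatrix} 0 \\ L \\ W \\ B \end{pmatrix}.$$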
3.1.4. Duality of linear programming. Consider the real matrix $A$ with $m$ rows and $n$ columns, the vector of constraints $b$ and the row vector $c$ giving the objective function. From this data we can consider two problems of linear programming for $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$.
the function $3x + y + 2z$ under the conditions
$-x + y - z \le 4,$
$2x + z \le 3,$
$x + y + 3z \le 8.$
Introducing the non-negative slack variables $u$, $v$, $w$, we obtain the tableau with the objective function $3x + y + 2z + 0 \cdot u + 0 \cdot v + 0 \cdot w$:
-3	-1	-2	0	0	0	0
-1	1	-1	[1]	0	0	4
(2)	0	1	0	[1]	0	3
1	1	3	0	0	[1]	8
Since the right-hand column is non-negative, setting u = 4, v = 3, w = 8, x = y = z = 0 provides an admissible solution to the system which corresponds to the choice of the basic variables u, v, w, and the algorithm may begin.
The first column already has a negative entry in the first row, so we choose this one. We have circled the pivot, i.e. the value two there (we compare the ratios of the entries of the last column and those of the chosen column, i.e. $\frac{3}{2}$ and $\frac{8}{1}$, and take the minimal one, since we need to keep the last column non-negative during the elimination). Next we eliminate the first column with the help of the pivot (we multiply the third row by $\frac{1}{2}$ and subtract its suitable multiples from the other rows, not forgetting the first row of the tableau, so that only zero entries remain there):
0	-1	-1/2	0	3/2	0	9/2
0	(1)	-1/2	[1]	1/2	0	11/2
[1]	0	1/2	0	1/2	0	3/2
0	1	5/2	0	-1/2	[1]	13/2
Now the basic variables are x = 3/2, u = 11/2, w = 13/2, which reflects the fact that we moved as much from the former slack variable v to the new basic variable x as possible. This increased the value of the objective function, which we may read in the right top corner of the tableau.
Next, we choose the pivot from the second column and the above rule yields the first row in $A$ ($\frac{11}{2} < \frac{13}{2}$). We have already circled the 1 in the tableau above. We eliminate:
0	0	-1	1	2	0	10
0	[1]	-1/2	1	1/2	0	11/2
[1]	0	1/2	0	1/2	0	3/2
0	0	(3)	-1	-1	[1]	1
Dual problems of linear programming
Maximization problem: Maximize $c \cdot x$ under the conditions $A \cdot x \le b$ and $x \ge 0$.
Minimization problem: Minimize $y^T \cdot b$ under the conditions $y^T \cdot A \ge c$ and $y \ge 0$.
We say that these two problems are dual problems of linear programming. Before deriving further properties of linear programming we need some terminology.
We say that the problem is solvable if there is an admissible vector $x$ (or admissible vector $y^T$) which satisfies all constraints. A solvable maximization (minimization) problem is bounded if the objective function is bounded from above (below) over the set of admissible vectors.
Lemma (Weak duality theorem). If $x \in \mathbb{R}^n$ is an admissible vector for the standard maximization problem, and if $y \in \mathbb{R}^m$ is an admissible vector for the dual minimization problem, then
$c \cdot x \le y^T \cdot b.$
Proof. It is a simple observation. Since $x \ge 0$ and $c \le y^T \cdot A$, it follows that $c \cdot x \le y^T \cdot A \cdot x$. But also $y \ge 0$ and $A \cdot x \le b$, hence
$c \cdot x \le y^T \cdot A \cdot x \le y^T \cdot b,$
which is what we wanted to prove. □
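The chain of inequalities in the proof can be watched on concrete numbers. The sketch below is only an illustration: the matrix, the constraints and the objective are taken from problem 3.A.3 in the practical column, while the two admissible vectors `x` and `y` are our own arbitrary choices (they are not optimal), and the helper name `dot` is ours.

```python
# Data of the maximization problem from 3.A.3 (after the sign change).
A = [[-1, 1, -1],
     [2, 0, 1],
     [1, 1, 3]]
b = [4, 3, 8]
c = [3, 1, 2]

def dot(u, v):
    """Scalar product of two vectors given as lists."""
    return sum(p * q for p, q in zip(u, v))

# Arbitrarily chosen admissible vectors (an assumption for this example).
x = [1, 1, 1]   # should satisfy A x <= b, x >= 0
y = [1, 2, 1]   # should satisfy y^T A >= c, y >= 0

Ax = [dot(row, x) for row in A]
yA = [dot(y, col) for col in zip(*A)]   # zip(*A) iterates over columns

primal_ok = all(v >= 0 for v in x) and all(l <= r for l, r in zip(Ax, b))
dual_ok = all(v >= 0 for v in y) and all(l >= r for l, r in zip(yA, c))

# The three quantities compared in the proof: c.x <= y^T A x <= y^T b
low, middle, high = dot(c, x), dot(yA, x), dot(y, b)
print(primal_ok, dual_ok, low, middle, high)
```

For these particular vectors the three values are 6, 10 and 18, so every admissible value of the maximization problem stays below every admissible value of the dual one, as the lemma claims.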
We see immediately that if both dual problems are solvable, then they must be bounded. Even more interesting is the following corollary, which is directly implied by the inequality in the previous proof.
Corollary. If there exist admissible vectors $x$ and $y$ of dual linear problems such that the objective functions satisfy $c \cdot x = y^T \cdot b$, then both are optimal solutions for the corresponding problems.
3.1.5. Theorem (Strong duality theorem). If a standard problem of linear programming is solvable and bounded, then its dual is also bounded and solvable. There exists an optimal solution for each of the problems, and the optimal values of the corresponding objective functions are equal.
Proof. As already shown in the previous corollary, once it is established that the values of the objective functions of the dual problems are equal, we have the required optimal solutions to both problems. It remains to prove the other implication, i.e. the existence of an optimal solution under the assumptions of the theorem, as well as the fact that the objective functions share their values in such a case. This will be verified by delivering an efficient algorithm in the next paragraph. □
We notice yet another corollary of the just formulated duality theorem:
Corollary (Equilibrium theorem). Consider two admissible vectors x and y for the standard maximization problem and its dual problem as defined in 3.1.4. Then both vectors are
Again, we have shifted the new basic variable from u to y and the objective function increased. The next pivot will be the circled number 3 in the third column and fourth row:
0	0	0	2/3	5/3	1/3	31/3
0	[1]	0	5/6	1/3	1/6	17/3
[1]	0	0	1/6	2/3	-1/6	4/3
0	0	[1]	-1/3	-1/3	1/3	1/3
This is the resulting tableau, where the basic variables are $x = \frac{4}{3}$, $y = \frac{17}{3}$, $z = \frac{1}{3}$ and their values are read from the last column. Notice that all the original variables are among the basic ones and their values are non-zero. This is not always the case, see the example 3.A.1 above and its explanation via this algorithm in 3.1.6. The maximal value $\frac{31}{3}$ of the objective function is now in the top right corner.
As mentioned in the theoretical explanation, the final tableau also provides the solution of the dual problem, i.e. the minimization of $4u + 3v + 8w$ under the conditions
$-u + 2v + w \ge 3,$
$u + w \ge 1,$
$-u + v + 3w \ge 2.$
According to the strong duality theorem (see 3.1.5), the minimal value is again $\frac{31}{3}$, while the corresponding values of the variables $u$, $v$ and $w$ are read off the first row in the corresponding columns: $u = \frac{2}{3}$, $v = \frac{5}{3}$, $w = \frac{1}{3}$.
You may check directly that the numbers $c_4$, $c_5$, $c_6$ in the first row of the tableau and the value $h$ in the top right corner satisfy $4c_4 + 3c_5 + 8c_6 = h$. Indeed, the numbers tell how many times the appropriate row (the one with original value 1) has been added. Thus we obtain the right linear combination for $h$. □
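The tableau steps used in this solution can be sketched in a few lines of code. This is a minimal sketch under stated assumptions, not a robust solver: it assumes a standard maximization problem with all right-hand sides non-negative (so that the slack variables can serve as the initial basic variables), it uses exact fractions, and it does not guard against cycling in degenerate cases. The function name `simplex_max` is ours.

```python
from fractions import Fraction

def simplex_max(c, A, b):
    """Maximize c.x subject to A x <= b, x >= 0, assuming all b_i >= 0."""
    m, n = len(A), len(c)
    # First row: (-c | 0 ... 0 | 0); further rows: (A | unit matrix | b).
    T = [[Fraction(-v) for v in c] + [Fraction(0)] * (m + 1)]
    for i in range(m):
        unit = [Fraction(int(i == k)) for k in range(m)]
        T.append([Fraction(v) for v in A[i]] + unit + [Fraction(b[i])])
    basis = [n + i for i in range(m)]   # slack variables are basic at start
    while True:
        # first column from the left with a negative entry in the first row
        j = next((k for k in range(n + m) if T[0][k] < 0), None)
        if j is None:
            break                        # no negative entry left: optimal
        rows = [i for i in range(1, m + 1) if T[i][j] > 0]
        if not rows:
            raise ValueError("the objective function is unbounded")
        # pivot row: minimal ratio b_i / a_ij among positive entries
        i = min(rows, key=lambda r: T[r][-1] / T[r][j])
        pivot = T[i][j]
        T[i] = [v / pivot for v in T[i]]
        for r in range(m + 1):
            if r != i and T[r][j] != 0:
                factor = T[r][j]
                T[r] = [a - factor * p for a, p in zip(T[r], T[i])]
        basis[i - 1] = j
    x = [Fraction(0)] * n
    for i, var in enumerate(basis):
        if var < n:
            x[var] = T[i + 1][-1]
    dual = T[0][n:n + m]   # first-row entries in the slack columns
    return T[0][-1], x, dual

# Problem 3.A.3 in its maximization form.
value, x, dual = simplex_max(
    [3, 1, 2],
    [[-1, 1, -1], [2, 0, 1], [1, 1, 3]],
    [4, 3, 8],
)
print(value, x, dual)
```

On the data of 3.A.3 the sketch performs the same three pivots as above and returns the value 31/3 with $x = 4/3$, $y = 17/3$, $z = 1/3$, together with the dual values $2/3$, $5/3$, $1/3$ read from the first row.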
3.A.4. Some game theory. Imagine a game played by two players: a billionaire and fate. The billionaire would like to invest into gold, silver, diamonds or stocks of an important IT software company. The wins and losses of such investments are well known for the last four years (for simplicity, we consider only the last four years and write them into the matrix $A = (a_{ij})$):
	gold	silver	diamonds	software
2001	2%	1%	4%	3%
2002	3%	-1%	-2%	6%
2003	1%	2%	3%	-4%
2004	-2%	1%	2%	3%
optimal if and only if $y_i = 0$ for all coordinates with index $i$ for which $\sum_{j=1}^{n} a_{ij} x_j < b_i$, and simultaneously $x_j = 0$ for all coordinates with index $j$ such that $\sum_{i=1}^{m} y_i a_{ij} > c_j$.
Proof. Suppose both relations regarding the zeros among $x_j$ and $y_i$ are true. Since the summands with strict inequality have zero coefficients, we have
$$\sum_{i=1}^{m} y_i b_i = \sum_{i=1}^{m} y_i \sum_{j=1}^{n} a_{ij} x_j = \sum_{i=1}^{m} \sum_{j=1}^{n} y_i a_{ij} x_j$$
and for the same reason
$$\sum_{i=1}^{m} \sum_{j=1}^{n} y_i a_{ij} x_j = \sum_{j=1}^{n} c_j x_j.$$
This shows one implication, by the duality theorem.
Suppose now that both $x$ and $y$ are optimal vectors. Then
$$\sum_{i=1}^{m} y_i b_i \ge \sum_{i=1}^{m} \sum_{j=1}^{n} y_i a_{ij} x_j \ge \sum_{j=1}^{n} c_j x_j.$$
But the left- and right-hand sides are equal, and hence there is equality everywhere. If we rewrite the first equality as
$$\sum_{i=1}^{m} y_i \Bigl( b_i - \sum_{j=1}^{n} a_{ij} x_j \Bigr) = 0,$$
then we see that it is a sum of non-negative numbers which equals zero, so it can be satisfied only if the relation from the statement holds. From the second equality we similarly derive the second part and the proof is finished. □
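The two conditions of the equilibrium theorem are easy to test mechanically. The sketch below uses the data of problem 3.A.1 written as $A x \le b$ (the condition $x_1 \ge x_2 + 90$ becomes $-x_1 + x_2 \le -90$), the optimum $x = (110, 5)$ found there, and the candidate dual vector $y = (3, 0, 0, 1)$. The latter is our assumption for this example, and the code checks that it is indeed admissible. The helper name `dot` is ours.

```python
# Problem 3.A.1 in the form: maximize c.x under A x <= b, x >= 0.
A = [[1, 2],
     [1, 4],
     [-1, 1],
     [1, 0]]
b = [120, 180, -90, 110]
c = [4, 6]

x = [110, 5]        # optimum of the maximization problem
y = [3, 0, 0, 1]    # candidate optimum of the dual problem (assumption)

def dot(u, v):
    """Scalar product of two vectors given as lists."""
    return sum(p * q for p, q in zip(u, v))

# Slack in each primal constraint: b_i - sum_j a_ij x_j  (must be >= 0)
row_slack = [bi - dot(row, x) for row, bi in zip(A, b)]
# Surplus in each dual constraint: sum_i y_i a_ij - c_j  (must be >= 0)
col_surplus = [dot(y, col) - cj for col, cj in zip(zip(*A), c)]

admissible = (min(row_slack) >= 0 and min(col_surplus) >= 0
              and min(x) >= 0 and min(y) >= 0)
# Equilibrium conditions: y_i = 0 wherever the i-th constraint is slack,
# and x_j = 0 wherever the j-th dual constraint is strict.
equilibrium = (all(yi * s == 0 for yi, s in zip(y, row_slack))
               and all(xj * s == 0 for xj, s in zip(x, col_surplus)))

print(row_slack, col_surplus, admissible, equilibrium, dot(c, x), dot(y, b))
```

Here the second and third constraints have slack 50 and 15 and the corresponding $y_2$, $y_3$ vanish, while both dual constraints hold with equality; both objective functions give 470.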
The duality theorem and equilibrium theorem are useful when solving linear programming problems, because they show us relations between zeros among the additional variables and the fulfillment of the constraints. As usual, it is good to know that the problem is solvable in principle and to have some theory related to that, but we still need some clever ideas to make it all into an efficient algorithmic procedure. The next paragraph will provide some insight to this.
3.1.6. The algorithm. As already explained, the linear programming problem of maximizing the linear objective function $h = c \cdot x$ under the conditions $A x \le b$ can be turned into solving the system of equations (1) in 3.1.3, where we added the slack variables $x_s$. If all entries in $b$ are non-negative, then the choice of $x_s = b$ and $x = 0$ provides an admissible solution of the system with the value of the objective function $h = 0$. This is the choice of the origin $x = 0$ as one of the vertices of the distinguished region $M$ of the admissible points. We can understand this as choosing the variables $x_s$ as the basic variables, whose values are given by the right-hand sides of the equations, while all the other variables are set to zero.
In the general case (allowing for negative entries in b), we shall see in 4.1.11 that we always can find an admissible vertex. That is, the choice of the basic variables in the above
The billionaire would like to invest for one year only. How should he split his investment in order to ensure the maximal win independently of the development on the stock market? We assume that the next year will be some (unknown) probabilistic mix of the previous four. In terms of our game, fate will play some stochastic vector $(x_1, x_2, x_3, x_4)$ fixing the behaviour of the market (as a probabilistic mixture of the previous years), while the billionaire will play another stochastic vector $(y_1, y_2, y_3, y_4)$ describing the split of his investment. The win of the billionaire is $\sum_{i,j=1}^{4} x_i y_j a_{ij}$.
Solution. The task is to find the stochastic vector $(y_1, y_2, y_3, y_4)$ which maximizes the minimum of all values $\sum_{i,j=1}^{4} x_i y_j a_{ij}$ for the fixed matrix $A$ and any stochastic vector $(x_1, x_2, x_3, x_4)$.
A very observant reader could imagine that this task is equivalent to the problem of maximizing $z_1 + z_2 + z_3 + z_4$ under the conditions $A^T z \le (1, \dots, 1)^T$, $z \ge 0$ (the requested stochastic vector $y$ is then obtained by normalizing the vector $z$, and the requested optimal value is the inverse of the optimal value obtained).¹
Thus, we have to solve a linear programming problem. We introduce the slack variables $w_1, w_2, w_3, w_4$, and transform the problem to the standard form
$\max \{ z_1 + z_2 + z_3 + z_4 \mid (A^T \mid E_4) \cdot (z, w)^T = (1, 1, 1, 1)^T \}.$
We work with the table:
-1	-1	-1	-1	0	0	0	0	0
2	3	1	-2	[1]	0	0	0	1
1	-1	2	1	0	[1]	0	0	1
(4)	-2	3	2	0	0	[1]	0	1
3	6	-4	3	0	0	0	[1]	1

0	-3/2	-1/4	-1/2	0	0	1/4	0	1/4
0	4	-1/2	-3	[1]	0	-1/2	0	1/2
0	-1/2	5/4	1/2	0	[1]	-1/4	0	3/4
[1]	-1/2	3/4	1/2	0	0	1/4	0	1/4
0	(15/2)	-25/4	3/2	0	0	-3/4	[1]	1/4
sense, describing an admissible solution. Next, we shall assume that we already have such a vertex.
The idea of the algorithm is to perform equivalent row transformations of the entire system in such a way that we move to other vertices of the region $M$ and the function $h$ increases. In order to move to more interesting vertices in $M$, we must bring some of the slack variables to zero while the appropriate column of the unit matrix moves to one of the columns corresponding to the variables $x$. A simple check reveals that in order to do this, we must choose one of the negative entries in the first line of the matrix 3.1.3(1), pick this column (say the $j$-th one) and choose a line in such a way that, when using Gaussian elimination to push the other entries in this particular column to zero, the right-hand sides of the equations remain non-negative. The latter condition means that we have to choose the index $i$ such that $b_i / a_{ij}$ is minimal. This entry in the matrix is called the pivot for the next step of the elimination. Of course, the non-positive coefficients $a_{ij}$ are not taken into consideration, since they would not lead to any increase of the objective function. When there are no more negative entries in the first row, we are finished, and the claim is that the optimal value of $h$ appears in the top right-hand corner of the matrix.
The reader should think through all the above claims in detail and check whether the algorithm must terminate. But the most striking point is the following: the slack-variable parts of the matrix are closely linked to the dual linear programming problem, and there is an invariant of the entire procedure. Writing $(-c, c_s, h)$ for the current first line of the matrix and $(x, x_s)$ for the current values of the variables, we obtain $c \cdot x = c_s \cdot b = h$ at each step (check this!). In particular, at the moment of termination of the above algorithm, the coefficients $y = c_s$ in the first row represent admissible values of the dual problem (while the values $c$ stand for the slack variables in the dual problem), and the top right-hand corner provides the value of the corresponding objective function $y \cdot b$. Since the two objective functions are equal, we know that the algorithm provides the optimal solution. Great! (But check all the details.)
We show how all this works for the simple problem from 3.A.1. In practice, the very first column of the matrix in question does not change during the procedure at all, so we can omit it completely. Thus we deal with the matrix:
-4	-6	0	0	0	0	0
1	2	1	0	0	0	120
1	4	0	1	0	0	180
-1	1	0	0	1	0	-90
1	0	0	0	0	1	110.
¹ The observation comes from the proof of the von Neumann Minimax theorem, 1928. The theorem claims that any probabilistic extension of a matrix game enjoys an equilibrium state.
We cannot find an admissible solution by fixing xs as the basic variables here, since there are negative values in b. We try to initiate the above algorithm by changing the sign in the last but one row and performing the Gaussian elimination for the
0	0	-3/2	-1/5	0	0	1/10	1/5	3/10
0	0	(17/6)	-19/5	[1]	0	-1/10	-8/15	11/30
0	0	5/6	3/5	0	[1]	-3/10	1/15	23/30
[1]	0	1/3	3/5	0	0	1/5	1/15	4/15
0	[1]	-5/6	1/5	0	0	-1/10	2/15	1/30

0	0	0	-188/85	9/17	0	4/85	-7/85	42/85
0	0	[1]	-114/85	6/17	0	-3/85	-16/85	11/85
0	0	0	146/85	-5/17	[1]	-23/85	19/85	56/85
[1]	0	0	(89/85)	-2/17	0	18/85	11/85	19/85
0	[1]	0	-78/85	5/17	0	-11/85	-2/85	12/85

188/89	0	0	0	25/89	0	44/89	17/89	86/89
114/89	0	[1]	0	18/89	0	21/89	-2/89	37/89
-146/89	0	0	0	-9/89	[1]	-55/89	1/89	26/89
85/89	0	0	[1]	-10/89	0	18/89	11/89	19/89
78/89	[1]	0	0	17/89	0	5/89	8/89	30/89
The last table is already the optimal one, since there are no negative values in the first row. We can read off the optimal solution: $z_2 = \frac{30}{89}$, $z_3 = \frac{37}{89}$, $z_4 = \frac{19}{89}$, $z_1 = 0$. The optimal value (upper right corner) is $z_1 + z_2 + z_3 + z_4 = \frac{86}{89}$. After rescaling to a stochastic vector (multiplying by $\frac{89}{86}$) we get the solution of the original problem: $y_1 = 0$, $y_2 = \frac{30}{86}$, $y_3 = \frac{37}{86}$, $y_4 = \frac{19}{86}$, with the optimal value $\frac{89}{86}$.
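The numbers read from the final tableau can be double-checked without repeating the elimination. The sketch below is only a sanity check under our own reading of the tableau: `z` is taken from the last column and `v` from the first-row entries in the slack columns. It verifies that `z` satisfies $A^T z \le (1, \dots, 1)^T$, that `v` satisfies the dual constraints $A v \ge (1, \dots, 1)^T$, and that both sums coincide, which by the corollary of the weak duality theorem certifies optimality of the value $\frac{86}{89}$.

```python
from fractions import Fraction as F

# Rows: years 2001-2004; columns: gold, silver, diamonds, software.
A = [[2, 1, 4, 3],
     [3, -1, -2, 6],
     [1, 2, 3, -4],
     [-2, 1, 2, 3]]

# Values read from the final tableau (our reading of the garbled source).
z = [F(0), F(30, 89), F(37, 89), F(19, 89)]    # last column, basic variables
v = [F(25, 89), F(0), F(44, 89), F(17, 89)]    # first row, slack columns

# Primal constraints A^T z <= 1: one value per investment (column of A).
primal_lhs = [sum(A[i][j] * z[i] for i in range(4)) for j in range(4)]
# Dual constraints A v >= 1: one value per year (row of A).
dual_lhs = [sum(A[i][j] * v[j] for j in range(4)) for i in range(4)]

primal_ok = all(t >= 0 for t in z) and all(t <= 1 for t in primal_lhs)
dual_ok = all(t >= 0 for t in v) and all(t >= 1 for t in dual_lhs)

print(primal_lhs, dual_lhs, sum(z), sum(v))
```

Both vectors turn out to be admissible and both sums equal $\frac{86}{89}$, so the value in the upper right corner of the last table is indeed optimal.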
B. Difference equations
Distinct linear dependences can be an excellent tool for describing various models of growth. We begin with a very popular population model that uses a linear difference equation of second order:
3.B.1. Fibonacci sequence. In the beginning of spring, a
□
very first column aiming to have only the 1 in the last but one row there. We obtain:
0	-10	0	0	-4	0	360
0	(3)	[1]	0	1	0	30
0	5	0	[1]	1	0	90
[1]	-1	0	0	-1	0	90
0	1	0	0	1	[1]	20
We choose the boxed entries for the basic variables; this represents the values $x_1 = 90$, $x_2 = 0$, $x_3 = 30$, $x_4 = 90$, $x_5 = 0$, $x_6 = 20$, and $h = 360 = 4 \cdot 90 = -4 \cdot (-90)$, which is an admissible solution. We have also circled the pivot for the next step, i.e. the element in the second column which we want to replace with 1 and then eliminate the rest of the column (remember this is the one yielding the smallest ratio with the last right-hand column entry among the positive elements: $30/3 = 10$, which is less than $90/5 = 18$ and $20/1 = 20$). This leads to the next admissible vertex in our region $M$ and, of course, the value of $h$ will increase:
0	0	10/3	0	-2/3	0	460
0	[1]	1/3	0	1/3	0	10
0	0	-5/3	[1]	-2/3	0	40
[1]	0	1/3	0	-2/3	0	100
0	0	-1/3	0	(2/3)	[1]	10
with xi = 100, x2 = 10, x3 = x5 = 0, x4 = 40, x6 = 10, and h = 460 = 4 ■ 100 + 6 ■ 10 = f ■ 120 - § ■ (-90). We still have one of the entries in the first line negative. We circled the next pivot leading to
0	0	9 3	0	0	1	470
0	0	1 2	0	0	1 2	5
0	0	-2	0	0	1	50
0	0	0	0	0	1	110
0	0	1 2	0	0	3 2	15
with the final values $x_1 = 110$, $x_2 = 5$, $x_3 = 0$, $x_4 = 50$, $x_5 = 15$, $x_6 = 0$, and
$$h = 470 = 4 \cdot 110 + 6 \cdot 5 = 3 \cdot 120 + 1 \cdot 110.$$
Let us recall why we can be sure that this is the optimal solution. Thanks to the fact that the first line is exclusively non-negative, we have an admissible solution of the dual problem which leads to the same value as the solution of the original one. Thus the equilibrium theorem claims we are done!
3.1.7. Notes about linear models in economy. Our simple scheme of the black&white painter from the paragraph 3.1.1 can be used to illustrate one of the typical economical models, the model of production planning. The model tries to capture the problem completely, that is, to capture both external and internal relations. The left-hand sides of the equations (1), (2), (3) in 3.1.1, and
stork brought two newborn rabbits, male and female, to a meadow. The female, after being two months old, is able to deliver two newborns, male and female. The newborns can then start delivering after one month and then every month. Every female is pregnant for one month and then she delivers. How many pairs of rabbits
will be there after nine months (if none of them dies and none "move in")?
Solution. After one month, there is still one pair, but the female is already pregnant. After two months, first newborns are delivered, thus there are two pairs. Every next month, there are that many new pairs as there were pregnant females one month before, which equals to the number of at least one month-old pairs, which equals the number of pairs that were there two months ago. The total number of pairs pn after n months is thus the sum of the number of pairs in the previous two months. For the number of pairs we thus have the following homogeneous linear recurrent formula
$$p_{n+2} = p_{n+1} + p_n, \quad n = 1, 2, \dots, \tag{1}$$
which, along with the initial conditions $p_1 = 1$ and $p_2 = 1$, uniquely determines the number of pairs of rabbits at the meadow in individual months. Linearity of the formula means that all members of the sequence $(p_n)$ appear to the first power. Hopefully the meaning of the word recurrence is clear. For the value of the $n$-th member we can derive an explicit formula. In searching for the formula we can use the observation that for certain $r$ the function $r^n$ is a solution of the difference equation without initial conditions. This $r$ can be obtained by substitution into the recurrent relation: $r^{n+2} = r^{n+1} + r^n$; after dividing by $r^n$ we obtain
$$r^2 = r + 1.$$
This is the characteristic equation of the given recurrent formula. Thus our equation has roots $\frac{1-\sqrt{5}}{2}$ and $\frac{1+\sqrt{5}}{2}$, and the sequences $a_n = \left(\frac{1-\sqrt{5}}{2}\right)^n$ and $b_n = \left(\frac{1+\sqrt{5}}{2}\right)^n$, $n \ge 1$, satisfy the given relation. The relation is also satisfied by any linear combination, that is, any sequence $c_n = s a_n + t b_n$, $s, t \in \mathbb{R}$. The numbers $s$ and $t$ can be chosen so that the resulting combination satisfies the initial conditions, in our case $c_1 = 1$, $c_2 = 1$. For simplicity, it is convenient to define the zero-th member of the sequence as $c_0 = 0$ and compute $s$ and $t$ from the equations for $c_0$ and $c_1$. We find that $s = -\frac{1}{\sqrt{5}}$, $t = \frac{1}{\sqrt{5}}$, and thus
$$p_n = \frac{(1+\sqrt{5})^n - (1-\sqrt{5})^n}{2^n \sqrt{5}}. \tag{2}$$
Such a sequence satisfies the given recurrent formula and also the initial conditions $c_0 = 0$, $c_1 = 1$. Hence it is the unique sequence given by these requirements. Note that the value of $p_n$ in the formula (2) is an integer for any natural $n$ (all terms
the objective function $h(x_1, x_2)$ express various production relations. Depending on the character of the problem, we have on the right-hand sides either exact values (and so we solve equations) or capacity constraints and goal optimization (then we obtain linear programming problems).
Thus in general we can solve the problem of source allocation with supplier constraints and either minimize costs or maximize income. We can also interpret duality from this point of view. If our painter would like to quantify his efforts related to the total amount of his work by $y_L$ per unit, the white colour painting adds $y_W$, while the additional work related to the black colour is $y_B$, then he minimizes the objective function
$$L\,y_L + W\,y_W + B\,y_B$$
with constraints
$$y_L + w_1 y_W + b_1 y_B \ge c_1, \qquad y_L + w_2 y_W + b_2 y_B \ge c_2.$$
But that is exactly the dual problem to the original one and the theorem 3.1.5 says that the optimal state is when the objective functions have the same value.
Among economical models, we can find many modifications. One of them is the problem of financial planning, which is connected to the optimization of portfolio. We are setting up a volume of investment into individual investment possibilities with the goal to meet the given constraints for risk factors while maximizing the profit, or dually minimize the risk under the given volume.
Another common model is marketing application, for instance allocation of costs for advertisement in various media or placing advertisement into time intervals. Restrictions are in this case determined by budget, target population, etc.
Very common are models of nutrition, that is, setting up how much of different kinds of food should be eaten in order to meet total volume of specific components, e.g. minerals and vitamins.
Problems of linear programming also arise in personnel tasks, where workers with specific qualifications and other properties are distributed into working shifts. Problems of merging, problems of splitting and problems of goods distribution are also common.
2. Difference equations
We have already met difference equations in the first chapter, albeit briefly and of first order only. Now we consider a more general theory for linear equations with constant coefficients. This not only provides very practical tools but also represents a good illustration for the concepts of vector spaces and linear mappings.
in the Fibonacci sequence are integers), although it might not seem so at the first glance. □ We do some exercises about solving linear difference equation of the second order with constant coefficients. The sequence satisfying the given recurrence equation of the second order is uniquely determined whenever we prescribe any two neighbouring members. Note a further use of complex numbers: to determine the explicit formula for the n-th member of the sequence of real numbers we might require calculations with complex numbers. This happens when the characteristic polynomial of the difference equation has complex roots.
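As a quick numerical sanity check of 3.B.1, the following Python sketch compares the recurrence (1) with the closed formula (2). The function names are ours and purely illustrative; the closed form is evaluated in floating point, so it is rounded before the comparison.

```python
from math import sqrt


def fib_recurrence(n):
    """Return p_n from p_{n+2} = p_{n+1} + p_n with p_1 = p_2 = 1."""
    prev, cur = 0, 1  # p_0 = 0 and p_1 = 1, as in the solution of 3.B.1
    for _ in range(n - 1):
        prev, cur = cur, prev + cur
    return cur


def fib_closed(n):
    """Evaluate formula (2) in floating point."""
    s5 = sqrt(5)
    return ((1 + s5) ** n - (1 - s5) ** n) / (2 ** n * s5)


# The two descriptions agree on the first terms (after rounding).
for n in range(1, 20):
    assert fib_recurrence(n) == round(fib_closed(n))

print([fib_recurrence(n) for n in range(1, 10)])
```

For large $n$ the floating point evaluation of (2) loses precision, which is one reason to prefer the integer recurrence in actual computations.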
3.B.2. Find an explicit formula for the sequence satisfying the following linear difference equation with the initial conditions:
$$x_{n+2} = 2x_n + n, \quad x_1 = 2, \ x_2 = 2.$$
Solution. The homogeneous equation is
$$x_{n+2} = 2x_n.$$
Its characteristic polynomial is $x^2 - 2$, its roots are $\pm\sqrt{2}$. The solution of the homogeneous equation is of the form
$$a(\sqrt{2})^n + b(-\sqrt{2})^n, \quad \text{for any } a, b \in \mathbb{R}.$$
We look for the particular solution using the method of indeterminate coefficients. The non-homogeneous part of the equation is the linear polynomial $n$. Thus a particular solution will be of the form of a linear polynomial in the variable $n$, that is, $kn + l$, where $k, l \in \mathbb{R}$. By substituting into the original equation we obtain
$$k(n+2) + l = 2(kn + l) + n.$$
By comparing the coefficients of the variable $n$ on both sides of the equation, we obtain the relation $k = 2k + 1$, that is, $k = -1$. By comparing the absolute terms we obtain $2k + l = 2l$, that is, $l = -2$. Thus the particular solution is the sequence
$$-n - 2.$$
Thus the solution of the non-homogeneous difference equation of the second order without initial conditions is of the form
$$a(\sqrt{2})^n + b(-\sqrt{2})^n - n - 2, \quad a, b \in \mathbb{R}.$$
Homogeneous linear difference equation of order k
3.2.1. Definition. A homogeneous linear difference equation (or homogeneous linear recurrence) of order $k$ is given by the expression
$$a_0 x_n + a_1 x_{n-1} + \dots + a_k x_{n-k} = 0, \quad a_0 \ne 0, \ a_k \ne 0,$$
where the coefficients $a_i$ are scalars, which can possibly depend on $n$.
We usually denote the sequence in question as a function
$$x_n = f(n) = -\frac{a_1}{a_0} f(n-1) - \dots - \frac{a_k}{a_0} f(n-k).$$
A solution of this equation is a sequence of scalars $x_i$, for all $i \in \mathbb{N}$ (or $i \in \mathbb{Z}$), which satisfy the equation with any $n$.
By giving any $k$ consecutive values $x_i$ in the sequence, all other values of $x_i$ are determined uniquely. Indeed, we work over a field of scalars, thus the values $a_0$ and $a_k$ are invertible and hence, using the recurrent definition, any $x_n$ can be computed uniquely from the preceding $k$ values, and similarly for $x_{n-k}$. Induction thus immediately proves that all remaining values are uniquely determined.
The space of all infinite sequences $x_i$ forms a vector space, where addition and multiplication by scalars work coordinate-wise. The definition immediately implies that a sum of two solutions of a homogeneous linear difference equation or a multiple of a solution is again a solution. Analogously as with homogeneous linear systems we see that the set of all solutions forms a subspace.
Initial conditions on the values $x_0, \dots, x_{k-1}$ of the solution represent a $k$-dimensional vector in $\mathbb{K}^k$. The sum of initial conditions determines the sum of the corresponding solutions, similarly for scalar multiples. Note also that substituting zeros and ones into the initial $k$ values immediately yields $k$ linearly independent solutions of the difference equation. Thus, although the vectors are infinite sequences, the set of all solutions has finite dimension. The dimension equals the order $k$ of the equation. Moreover, we can easily obtain a basis of all those solutions. Again we speak of the fundamental system of solutions and all other solutions are its linear combinations.
As we have just checked, if we choose $k$ indices $i, i+1, \dots, i+k-1$ in sequence, the homogeneous linear difference equation gives a linear mapping $\mathbb{K}^k \to \mathbb{K}^\infty$ of $k$-dimensional vectors of initial values into infinite-dimensional sequences of the same scalars. The independence of such solutions is equivalent to the independence of the initial values, which can be easily checked by a determinant: if we have a $k$-tuple of solutions $(x_n^{[1]}, \dots, x_n^{[k]})$, it is independent if and only if the following determinant,
Now, by substitution in the initial conditions, we determine the indeterminates $a, b \in \mathbb{R}$. To simplify the calculation, we use a little trick: from the initial conditions and the given recurrence relation we compute the member $x_0$: $x_0 = \frac{1}{2}(x_2 - 0) = 1$. The given recurrence formula along with the conditions $x_0 = 1$ and $x_1 = 2$ is then clearly satisfied by the same formula that satisfies the original initial conditions. Thus we have the following relations for $a$, $b$:
$$x_0: \quad a(\sqrt{2})^0 + b(-\sqrt{2})^0 - 2 = 1, \ \text{thus } a + b = 3, \qquad x_1: \quad \sqrt{2}\,a - \sqrt{2}\,b = 5,$$
whose solution gives us $a = \frac{6 + 5\sqrt{2}}{4}$, $b = \frac{6 - 5\sqrt{2}}{4}$. The solution is thus the sequence
$$x_n = \frac{6 + 5\sqrt{2}}{4}(\sqrt{2})^n + \frac{6 - 5\sqrt{2}}{4}(-\sqrt{2})^n - n - 2.$$
□
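The result of 3.B.2 can be checked numerically. The sketch below is only an illustration with names of our choosing: it iterates the recurrence $x_{n+2} = 2x_n + n$ from $x_1 = x_2 = 2$ and compares the values with the explicit expression $a(\sqrt{2})^n + b(-\sqrt{2})^n - n - 2$ for $a = (6+5\sqrt{2})/4$, $b = (6-5\sqrt{2})/4$.

```python
from math import sqrt, isclose


def x_recurrence(n):
    """Iterate x_{m+2} = 2 x_m + m starting from x_1 = 2, x_2 = 2."""
    xs = {1: 2, 2: 2}
    for m in range(1, n - 1):
        xs[m + 2] = 2 * xs[m] + m
    return xs[n]


def x_closed(n):
    """Explicit solution with the constants a, b computed in 3.B.2."""
    r = sqrt(2)
    a = (6 + 5 * r) / 4
    b = (6 - 5 * r) / 4
    return a * r ** n + b * (-r) ** n - n - 2


# Floating point comparison, hence a tolerance instead of equality.
for n in range(1, 15):
    assert isclose(x_closed(n), x_recurrence(n), rel_tol=1e-9)
```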
3.B.3. Determine the basis of the space of all solutions of the homogeneous difference equation
$$x_{n+4} = x_{n+3} + x_{n+1} - x_n.$$
Express your solution in terms of real valued functions.
Solution. The characteristic polynomial of the given equation is $x^4 - x^3 - x + 1$. If we are looking for its roots, we solve the equation
$$x^4 - x^3 - x + 1 = 0.$$
The left side factors as
$$(x-1)^2(x^2 + x + 1)$$
with two complex roots $x_1 = -\frac{1}{2} + i\frac{\sqrt{3}}{2} = \cos(2\pi/3) + i\sin(2\pi/3)$ and $x_2 = -\frac{1}{2} - i\frac{\sqrt{3}}{2} = \cos(2\pi/3) - i\sin(2\pi/3)$ and a double root 1. Thus the basis of the vector space of the sequences that are a solution of the difference equation in question is the following quadruple of sequences: $\{(-\frac{1}{2} + i\frac{\sqrt{3}}{2})^n\}_{n=1}^{\infty}$, $\{(-\frac{1}{2} - i\frac{\sqrt{3}}{2})^n\}_{n=1}^{\infty}$, $\{1\}_{n=1}^{\infty}$ (constant sequence) and $\{n\}_{n=1}^{\infty}$. If we are looking for a basis of real valued functions, we must replace two of the generators (sequences) from this basis by some sequences that are real only. As these generators are geometric sequences whose members are complex conjugates, it suffices to take as suitable generators the sequences given by the half of the sum and by the half of the $i$-multiple of the difference of those complex generators. This yields the following real

sometimes called the Casoratian, is non-zero for one $n$,
$$C_n = \begin{vmatrix} x_n^{[1]} & \cdots & x_n^{[k]} \\ x_{n+1}^{[1]} & \cdots & x_{n+1}^{[k]} \\ \vdots & & \vdots \\ x_{n+k-1}^{[1]} & \cdots & x_{n+k-1}^{[k]} \end{vmatrix} \ne 0,$$
which then implies the non-vanishing of $C_n$ for all $n$.
3.2.2. Recurrences with constant coefficients. It is difficult to find a universal mechanism for finding a solution (that is, a directly computable expression) of general homogeneous linear difference equations. We shall come back to this problem in the end of chapter 13.
In practical models there are very often equations where the coefficients are constant. In this case it is possible to guess a suitable form for the solution and indeed to find $k$ linearly independent solutions. This would then be a complete solution of the problem, since all other solutions would be linear combinations of them.
For simplicity we start with equations of second order. Such recurrences are very often encountered in practical problems, where there are relations based on two previous values. A linear difference equation (recurrence) of second order with constant coefficients is thus a formula
$$f(n+2) = a f(n+1) + b f(n) + c, \tag{1}$$
where $a$, $b$, $c$ are known scalar coefficients.
Consider a population model. We assume that the individuals in a population mature and start breeding two seasons later (that is, they add to the value $f(n+2)$ by a multiple $b f(n)$ with positive $b > 1$), while immature individuals at the same time weaken and destroy part of the mature population (that is, the coefficient $a$ at $f(n+1)$ is negative). Furthermore, it might be that somebody destroys (uses, eats) a fixed amount $c$ of individuals every season.
A similar situation with $c = 0$ and both other coefficients positive determines the famous Fibonacci sequence of numbers $y_0, y_1, \dots$, where $y_{n+2} = y_{n+1} + y_n$, see 3.B.1.
If we have no idea how to solve a mathematical problem, we can always blindly try some known solutions of similar problems. Thus, let us substitute into the equation (1) with coefficient $c = 0$ a similar solution as with the linear equations from the first chapter (cf. 1.2.1), that is, we try $f(n) = \lambda^n$ for some scalar $\lambda$. By substitution into the equation we obtain
$$\lambda^{n+2} - a\lambda^{n+1} - b\lambda^n = \lambda^n(\lambda^2 - a\lambda - b) = 0.$$
This relation will hold either for $\lambda = 0$ or for the choice of the values
$$\lambda_1 = \tfrac{1}{2}\left(a + \sqrt{a^2 + 4b}\right), \quad \lambda_2 = \tfrac{1}{2}\left(a - \sqrt{a^2 + 4b}\right).$$
It is easy to see that such solutions work. We just had to choose the scalar $\lambda$ suitably. But we are not finished, since we want to find a solution for any two initial values $f(0)$ and
basis of the solution space: $\{1\}_{n=1}^{\infty}$ (constant sequence), $\{n\}_{n=1}^{\infty}$, $\{\cos(2n\pi/3)\}_{n=1}^{\infty}$, $\{\sin(2n\pi/3)\}_{n=1}^{\infty}$. □
3.B.4. Solve the following difference equation:
$$x_{n+4} = x_{n+3} - x_{n+2} + x_{n+1} - x_n.$$
Solution. From the theory we know that the space of the solutions of this difference equation is a four-dimensional vector space whose generators can be obtained from the roots of the characteristic polynomial of the given equation. The characteristic equation is
$$x^4 - x^3 + x^2 - x + 1 = 0.$$
It is a reciprocal equation (that means that the coefficients at the $(n-k)$-th and $k$-th power of $x$, $k = 1, \dots, n$, are equal). We can use the substitution $u = x + \frac{1}{x}$. After dividing the equation by $x^2$ (zero cannot be a root) and substituting (note that $x^2 + \frac{1}{x^2} = u^2 - 2$) we obtain
$$x^2 - x + 1 - \frac{1}{x} + \frac{1}{x^2} = u^2 - u - 1 = 0.$$
Thus we obtain the indeterminates $u_{1,2} = \frac{1 \pm \sqrt{5}}{2}$. From there, by the equation $x^2 - ux + 1 = 0$, we determine the four roots
$$x_{1,2} = \frac{1 + \sqrt{5} \pm \sqrt{-10 + 2\sqrt{5}}}{4}, \qquad x_{3,4} = \frac{1 - \sqrt{5} \pm \sqrt{-10 - 2\sqrt{5}}}{4}.$$
Note that the roots of the characteristic equation could have been "guessed" right away since
$$x^5 + 1 = (x + 1)(x^4 - x^3 + x^2 - x + 1).$$
Thus the roots of the polynomial $x^4 - x^3 + x^2 - x + 1$ are also roots of the polynomial $x^5 + 1$, whose roots are exactly the fifth roots of $-1$. By this we obtain that the roots of the characteristic polynomial are the fifth roots of $-1$ other than $-1$ itself, that is, the numbers $x_{1,2} = \cos(\frac{\pi}{5}) \pm i\sin(\frac{\pi}{5})$ and $x_{3,4} = \cos(\frac{3\pi}{5}) \pm i\sin(\frac{3\pi}{5})$. Thus a real basis of the space of solutions of the given difference equation is for instance formed by the sequences $\cos(\frac{n\pi}{5})$, $\sin(\frac{n\pi}{5})$, $\cos(\frac{3n\pi}{5})$ and $\sin(\frac{3n\pi}{5})$, which are cosines and sines of the arguments of the corresponding powers of the roots of the characteristic polynomial.
Note that we have incidentally derived the algebraic expressions $\cos(\frac{\pi}{5}) = \frac{1 + \sqrt{5}}{4}$, $\sin(\frac{\pi}{5}) = \frac{\sqrt{10 - 2\sqrt{5}}}{4}$, $\cos(\frac{3\pi}{5}) = \frac{1 - \sqrt{5}}{4}$ and $\sin(\frac{3\pi}{5}) = \frac{\sqrt{10 + 2\sqrt{5}}}{4}$ (this is because all the roots of the equation have absolute value 1, so these values are the real and imaginary parts of the corresponding roots). □
$f(1)$. So far, we have only found two specific sequences satisfying the given equation (or possibly even only one sequence if $\lambda_2 = \lambda_1$).
As we have already derived for linear recurrences, the sum of two solutions $f_1(n)$ and $f_2(n)$ of our equation $f(n+2) - a f(n+1) - b f(n) = 0$ is again a solution of the same equation. The same holds for scalar multiples of the solution. Our two specific solutions thus generate the more general solutions
$$f(n) = C_1\lambda_1^n + C_2\lambda_2^n$$
for arbitrary scalars $C_1$ and $C_2$. For a unique solution of the specific problem with given initial values $f(0)$ and $f(1)$, it remains only to find the corresponding scalars $C_1$ and $C_2$.
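3.2.3. The choice of scalars. We show how this can work with an example. Consider the problem:
$$y_{n+2} = y_{n+1} + \tfrac{1}{2} y_n, \quad y_0 = 2, \ y_1 = 0. \tag{1}$$
Here $\lambda_{1,2} = \frac{1}{2}(1 \pm \sqrt{3})$ and clearly
$$y_0 = C_1 + C_2 = 2,$$
$$y_1 = \tfrac{1}{2} C_1 (1 + \sqrt{3}) + \tfrac{1}{2} C_2 (1 - \sqrt{3}) = 0$$
is satisfied for exactly one choice of these constants. Direct calculation yields $C_1 = 1 - \frac{1}{3}\sqrt{3}$, $C_2 = 1 + \frac{1}{3}\sqrt{3}$, and our problem has the unique solution
$$f(n) = \left(1 - \tfrac{1}{3}\sqrt{3}\right)\frac{1}{2^n}(1 + \sqrt{3})^n + \left(1 + \tfrac{1}{3}\sqrt{3}\right)\frac{1}{2^n}(1 - \sqrt{3})^n.$$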
Note that even if the found solution for our equation with rational coefficients and rational initial values looks complicated and is expressed with irrational numbers, we know a priori that the solution itself is again rational. But without this "step aside" into a larger field of scalars, we would not be able to describe the general solution.
We will often meet similar phenomena. Moreover, the general solution often allows us to discuss qualitative behaviour of the sequence of numbers $f(n)$ without direct enumeration of the constants. For example, we may see whether the values approach some fixed value with increasing $n$ or oscillate in some interval or whether they are unbounded.
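The remark on rationality in 3.2.3 can be observed directly. The following sketch (illustrative names, Python standard library only) computes the sequence $y_{n+2} = y_{n+1} + \frac{1}{2}y_n$, $y_0 = 2$, $y_1 = 0$ exactly with rational arithmetic and compares it with the closed form built from $\lambda_{1,2} = \frac{1}{2}(1 \pm \sqrt{3})$ and $C_{1,2} = 1 \mp \frac{1}{3}\sqrt{3}$, evaluated in floating point.

```python
from fractions import Fraction
from math import sqrt


def y_exact(n):
    """Exact rational value y_n of the recurrence from 3.2.3."""
    prev, cur = Fraction(2), Fraction(0)  # y_0 = 2, y_1 = 0
    for _ in range(n):
        prev, cur = cur, cur + prev / 2
    return prev


def y_closed(n):
    """Closed form C_1 * lambda_1^n + C_2 * lambda_2^n (floating point)."""
    r = sqrt(3)
    c1, c2 = 1 - r / 3, 1 + r / 3
    return c1 * ((1 + r) / 2) ** n + c2 * ((1 - r) / 2) ** n


# The irrational-looking formula reproduces the rational values.
for n in range(12):
    assert abs(float(y_exact(n)) - y_closed(n)) < 1e-9

print([str(y_exact(n)) for n in range(6)])
```

The exact values stay in $\mathbb{Q}$ throughout, while the closed form passes through $\sqrt{3}$; the agreement holds only up to rounding error, which is why the comparison uses a tolerance.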
3.2.4. General homogeneous recurrences. We substitute $x_n = \lambda^n$ for some (yet unknown) scalar $\lambda$ into the general homogeneous equation from the definition 3.2.1 (with constant coefficients). For every $n$ we obtain the condition
$$\lambda^{n-k}\left(a_0\lambda^k + a_1\lambda^{k-1} + \dots + a_k\right) = 0.$$
This means that either $\lambda = 0$ or $\lambda$ is a root of the so-called characteristic polynomial in the parentheses. The characteristic polynomial is independent of $n$.
Assume that the characteristic polynomial has $k$ distinct roots $\lambda_1, \dots, \lambda_k$. For this purpose, we can extend the field of scalars we are working in, for instance $\mathbb{Q}$ into $\mathbb{R}$ or $\mathbb{C}$. Of course, if the initial conditions are in the original field then
3.B.5. Determine the explicit expression of the sequence satisfying the difference equation $x_{n+2} = 2x_{n+1} - 2x_n$ with initial values $x_1 = 2$, $x_2 = 2$.
Solution. The roots of the characteristic polynomial $x^2 - 2x + 2$ are $1 + i$ and $1 - i$. The basis of the (complex) vector space of the solutions is thus formed by the sequences $y_n = (1 + i)^n$ and $z_n = (1 - i)^n$. The sequence in question can thus be expressed as a linear combination of these sequences (with complex coefficients). It is thus $x_n = a \cdot y_n + b \cdot z_n$, where $a = a_1 + i a_2$, $b = b_1 + i b_2$. From the recurrent relation we compute $x_0 = \frac{1}{2}(2x_1 - x_2) = 1$ and by substitution of $n = 0$ and $n = 1$ into the expression of $x_n$ we obtain
$$1 = x_0 = a_1 + i a_2 + b_1 + i b_2,$$
$$2 = x_1 = (a_1 + i a_2)(1 + i) + (b_1 + i b_2)(1 - i).$$
By comparing the real and the imaginary parts of both equations, we obtain a linear system of four equations with four indeterminates
$$\begin{aligned} a_1 + b_1 &= 1, \\ a_2 + b_2 &= 0, \\ a_1 - a_2 + b_1 + b_2 &= 2, \\ a_1 + a_2 - b_1 + b_2 &= 0. \end{aligned}$$
These equations imply that $a_1 = b_1 = b_2 = \frac{1}{2}$ and $a_2 = -\frac{1}{2}$. Thus we can express the sequence in question as
$$x_n = \left(\tfrac{1}{2} - \tfrac{1}{2} i\right)(1 + i)^n + \left(\tfrac{1}{2} + \tfrac{1}{2} i\right)(1 - i)^n.$$
The sequence can also be expressed using the real basis of the (complex) vector space of solutions, that is, using the sequences $u_n = \frac{1}{2}(y_n + z_n) = (\sqrt{2})^n \cos(\frac{n\pi}{4})$ and $v_n = \frac{1}{2} i (z_n - y_n) = (\sqrt{2})^n \sin(\frac{n\pi}{4})$. The transition matrix $T$ for changing the basis from the complex one to the real one, together with its inverse $T^{-1}$, encodes the relations $y_n = u_n + i v_n$, $z_n = u_n - i v_n$. For expressing the sequence $x_n$ using the real basis, that is, for expressing the coordinates $(c, d)$ of the sequence $x_n$ under the basis $\{u_n, v_n\}$, we have $c = a + b = 1$ and $d = i(a - b) = 1$,
the solutions stay there since the recurrence equation itself does. Each of the roots gives us a single possible solution
$$x_n = \lambda_i^n.$$
We need $k$ linearly independent solutions.
Thus we should check the independence by substituting $k$ values for $n = 0, \dots, k-1$ for $k$ choices of $\lambda_i$ into the Casoratian (see 3.2.1). Thus we obtain the Vandermonde matrix. It is a good but not entirely trivial exercise to show that for every $k$ and any $k$-tuple of distinct $\lambda_i$ the determinant of such a matrix is non-zero, see 2.B.7 on the page 92. It follows that the chosen solutions are linearly independent.
Thus we have found the fundamental system of solutions of the homogeneous difference equation in the case that all the (possibly complex) roots of its characteristic polynomial are distinct.
Now we suppose $\lambda$ is a multiple root. We ask whether $x_n = n\lambda^n$ could be a solution. We arrive at the condition
$$a_0 n \lambda^n + \dots + a_k (n - k)\lambda^{n-k} = 0.$$
This condition can be rewritten as
$$\lambda\left(a_0\lambda^n + \dots + a_k\lambda^{n-k}\right)' = 0,$$
where the dash denotes differentiation with respect to $\lambda$ (cf. the infinitesimal definition in 5.1.6, and 12.2.7 for the purely algebraic treatment).
Moreover, a root $c$ of a polynomial $f$ has multiplicity greater than one if and only if it is a root of $f'$, see 12.2.7 for the proof. Our condition is thus satisfied.
With greater multiplicity $\ell$ of the root of the characteristic polynomial we can proceed similarly and use the (now obvious) fact that a root with multiplicity $\ell$ is a root of all derivatives of the polynomial up to order $\ell - 1$ (inclusively). Derivatives look like this:
$$\begin{aligned} f(\lambda) &= a_0\lambda^n + \dots + a_k\lambda^{n-k}, \\ f'(\lambda) &= a_0 n \lambda^{n-1} + \dots + a_k (n-k)\lambda^{n-k-1}, \\ f''(\lambda) &= a_0 n(n-1)\lambda^{n-2} + \dots + a_k (n-k)(n-k-1)\lambda^{n-k-2}, \\ &\ \ \vdots \\ f^{(\ell)}(\lambda) &= a_0 n \cdots (n - \ell + 1)\lambda^{n-\ell} + \dots + a_k (n-k)\cdots(n-k-\ell+1)\lambda^{n-k-\ell}. \end{aligned}$$
We look at the case of a triple root $\lambda$ and try to find a solution in the form $n^2\lambda^n$. By substitution into the definition, we obtain the equation
$$a_0 n^2 \lambda^n + \dots + a_k (n-k)^2 \lambda^{n-k} = 0.$$
Clearly the left side equals the expression $\lambda^2 f''(\lambda) + \lambda f'(\lambda)$ and because $\lambda$ is a root of both derivatives, the condition is satisfied.
Using induction, we prove that even for the general condition of the solution in the form $x_n = n^\ell\lambda^n$,
$$a_0 n^\ell \lambda^n + \dots + a_k (n-k)^\ell \lambda^{n-k} = 0,$$
thus we have again an alternative expression of the sequence $x_n$ where there are no complex numbers (but there are square roots):
$$x_n = (\sqrt{2})^n \cos\left(\tfrac{n\pi}{4}\right) + (\sqrt{2})^n \sin\left(\tfrac{n\pi}{4}\right).$$
We could have obtained these by solving two linear equations in two variables $c$, $d$, that is, $1 = x_0 = c \cdot u_0 + d \cdot v_0 = c$ and $2 = x_1 = c \cdot u_1 + d \cdot v_1 = c + d$. □
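The two expressions from 3.B.5 can be compared with the recurrence itself. The sketch below is a hedged illustration (function names are ours): it uses Python's built-in complex numbers for the basis $(1 \pm i)^n$ with coefficients $\frac{1}{2} \mp \frac{1}{2}i$, and the real form $(\sqrt{2})^n(\cos\frac{n\pi}{4} + \sin\frac{n\pi}{4})$.

```python
from math import cos, sin, pi, sqrt


def x_recurrence(n):
    """Iterate x_{m+2} = 2 x_{m+1} - 2 x_m from x_1 = 2, x_2 = 2."""
    xs = [None, 2, 2]  # index 0 unused so that xs[n] is x_n
    while len(xs) <= n:
        xs.append(2 * xs[-1] - 2 * xs[-2])
    return xs[n]


def x_complex(n):
    """Combination of the complex basis sequences (1+i)^n and (1-i)^n."""
    a, b = 0.5 - 0.5j, 0.5 + 0.5j
    return a * (1 + 1j) ** n + b * (1 - 1j) ** n


def x_real(n):
    """Expression in the real basis u_n, v_n with c = d = 1."""
    return sqrt(2) ** n * (cos(n * pi / 4) + sin(n * pi / 4))


for n in range(1, 16):
    exact = x_recurrence(n)
    assert abs(x_complex(n) - exact) < 1e-9   # imaginary part vanishes too
    assert abs(x_real(n) - exact) < 1e-9
```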
3.B.6. A simplified model for the behaviour of gross domestic product. Consider the difference equation
$$y_{k+2} - a(1 + b)\,y_{k+1} + ab\,y_k = 1, \tag{1}$$
where $y_k$ is the gross domestic product at the year $k$. The constant $a$ is the consumption tendency, which is a macroeconomic factor that gives the fraction of money that the people spend (from what they have at their disposal). The constant $b$ describes the dependence of the measure of investment of the private sector on the consumption tendency.
Further, we assume that the size of the domestic product is normalised such that the right-hand side of the equation is 1.
Compute the values $y_n$ for $a = \frac{3}{4}$, $b = \frac{1}{3}$, $y_0 = 1$, $y_1 = 1$.
Solution. Look first for the solution of the homogeneous equation (the right side being zero) in the form of $r^k$. The number $r$ must be a solution of the characteristic equation
$$x^2 - a(1 + b)x + ab = 0, \quad \text{that is,} \quad x^2 - x + \tfrac{1}{4} = 0,$$
which has a double root $\frac{1}{2}$. All the solutions of the homogeneous equation are then of the form $a(\frac{1}{2})^n + bn(\frac{1}{2})^n$.
Note also that if we find some solution of the non-homogeneous equation (the particular solution), then we can add to it any solution of the homogeneous equation to obtain another solution of the non-homogeneous equation. It can be shown that all solutions of the non-homogeneous equation can be found in this way.
In this problem, it is easy to check that the constant function $y_n = c$ is a solution provided $c = 4$. All solutions of the difference equation
$$y_{k+2} - y_{k+1} + \tfrac{1}{4}\,y_k = 1$$
are thus of the form $4 + a(\frac{1}{2})^n + bn(\frac{1}{2})^n$. We require that $y_0 = y_1 = 1$ and these two equations give $a = b = -3$.
the solution can be obtained as a linear combination of the derivatives of the characteristic polynomial starting with the expression (check the combinatorics!)
$$\lambda^\ell f^{(\ell)} + \binom{\ell}{2}\lambda^{\ell-1} f^{(\ell-1)} + \cdots.$$
We have thus come close to the complete proof of the following:
Homogeneous equations with constant coefficients
Theorem. The solution space of a homogeneous linear difference equation of order $k$ over the field of scalars $\mathbb{K} = \mathbb{C}$ is the $k$-dimensional vector space generated by the sequences $x_n = n^\ell\lambda^n$, where $\lambda$ are (complex) roots of the characteristic polynomial and the powers $\ell$ run over all natural numbers $0, \dots, r_\lambda - 1$, where $r_\lambda$ is the multiplicity of the root $\lambda$.
Proof. The relation between the multiplicity of roots and the derivatives of real polynomials will be proved later (cf. 5.3.7), while the fact that every complex polynomial has exactly as many roots (counting multiplicities) as its degree will appear in 10.2.11. It remains to prove that the $k$-tuple of solutions thus found is linearly independent. Even in this case we can prove inductively that the corresponding Casoratian is non-zero. We have done this already in the case of the Vandermonde determinant before.
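To illustrate our approach we show how the calculation looks for the case of a root $\lambda_1$ with multiplicity one and a root $\lambda_2$ with multiplicity two:
$$C(\lambda_1^n, \lambda_2^n, n\lambda_2^n) = \begin{vmatrix} \lambda_1^n & \lambda_2^n & n\lambda_2^n \\ \lambda_1^{n+1} & \lambda_2^{n+1} & (n+1)\lambda_2^{n+1} \\ \lambda_1^{n+2} & \lambda_2^{n+2} & (n+2)\lambda_2^{n+2} \end{vmatrix} = \lambda_1^n\lambda_2^{2n} \begin{vmatrix} 1 & 1 & n \\ \lambda_1 & \lambda_2 & (n+1)\lambda_2 \\ \lambda_1^2 & \lambda_2^2 & (n+2)\lambda_2^2 \end{vmatrix}$$
$$= \lambda_1^n\lambda_2^{2n} \begin{vmatrix} 1 & 1 & 0 \\ \lambda_1 & \lambda_2 & \lambda_2 \\ \lambda_1^2 & \lambda_2^2 & 2\lambda_2^2 \end{vmatrix} = \lambda_1^n\lambda_2^{2n+1}(\lambda_1 - \lambda_2)^2 \ne 0.$$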
In the general case the proof can be carried on inductively in a similar way. □
3.2.5. Real basis of the solutions. For equations with real coefficients, initial real conditions always lead to real solutions (and similarly with scalars Z or Q). However, the corresponding fundamental solutions derived using the above theorem might exist only in the complex domain.
Thus the solution of this non-homogeneous equation is
$$y_n = 4 - 3\left(\tfrac{1}{2}\right)^n - 3n\left(\tfrac{1}{2}\right)^n.$$
Again, as we know that the sequence given by this formula satisfies the given difference equation and also the given initial conditions, it is indeed the only sequence characterized by these properties.
□
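The solution of 3.B.6 can be verified exactly with rational arithmetic. In the sketch below the parameter values $a = \frac{3}{4}$, $b = \frac{1}{3}$ are an assumption on our side: they are the values consistent with the characteristic equation $x^2 - x + \frac{1}{4} = 0$ used in the solution. The function names are illustrative.

```python
from fractions import Fraction


def gdp_recurrence(n, a=Fraction(3, 4), b=Fraction(1, 3)):
    """Iterate y_{k+2} = 1 + a(1+b) y_{k+1} - a b y_k with y_0 = y_1 = 1.

    The default a, b are assumed values giving a(1+b) = 1 and ab = 1/4.
    """
    y_prev, y_cur = Fraction(1), Fraction(1)
    for _ in range(n):
        y_prev, y_cur = y_cur, 1 + a * (1 + b) * y_cur - a * b * y_prev
    return y_prev


def gdp_closed(n):
    """Closed form 4 - 3 (1/2)^n - 3 n (1/2)^n from the solution."""
    half = Fraction(1, 2)
    return 4 - 3 * half ** n - 3 * n * half ** n


# Exact equality holds because everything stays rational.
assert all(gdp_recurrence(n) == gdp_closed(n) for n in range(20))
print([str(gdp_closed(n)) for n in range(6)])
```

Since both roots of the characteristic equation equal $\frac{1}{2}$, the homogeneous part decays and the values approach the constant particular solution 4.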
3.B.7. Find a sequence which satisfies the given non-homogeneous difference equation with the initial conditions:
$$x_{n+2} = x_{n+1} + 2x_n + 1, \quad x_1 = 2, \ x_2 = 2.$$
Solution. The general solution of the homogeneous equation is of the form $a(-1)^n + b\,2^n$. A particular solution is the constant $-1/2$. The general solution of the given non-homogeneous equation without initial conditions is thus
$$a(-1)^n + b\,2^n - \tfrac{1}{2}.$$
Substituting in the initial conditions then gives the constants $a = -5/6$, $b = 5/6$. The given difference equation with initial conditions is thus satisfied by the sequence
$$x_n = -\tfrac{5}{6}(-1)^n + \tfrac{5}{3}\,2^{n-1} - \tfrac{1}{2}.$$
□
3.B.8. Determine the sequence of real numbers that satisfies the following non-homogeneous difference equation with initial conditions:
$$2x_{n+2} = -x_{n+1} + x_n + 2, \quad x_1 = 2, \ x_2 = 3.$$
Solution. The general solution of the homogeneous equation is of the form $a(-1)^n + b(1/2)^n$. A particular solution is the constant 1. The general solution of the non-homogeneous equation without initial conditions is thus
$$a(-1)^n + b\left(\tfrac{1}{2}\right)^n + 1.$$
By substitution with the initial conditions, we obtain the constants $a = 1$, $b = 4$. The given equation with initial conditions is thus satisfied by the sequence
$$x_n = (-1)^n + 4\left(\tfrac{1}{2}\right)^n + 1.$$
□
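The answers of 3.B.7 and 3.B.8 can be confirmed in the same spirit. The helper `check` below is a hypothetical utility of ours, not part of the text: it generates the sequence from a step rule and two initial values and compares it term by term with a candidate closed form, using exact rational arithmetic.

```python
from fractions import Fraction


def check(step, closed, x1, x2, count=12):
    """Compare x_1, ..., x_count from the recurrence with closed(n).

    step(cur, prev) returns the next member from the two preceding ones.
    """
    xs = [Fraction(x1), Fraction(x2)]
    while len(xs) < count:
        xs.append(step(xs[-1], xs[-2]))
    return all(x == closed(n) for n, x in enumerate(xs, start=1))


# 3.B.7: x_{n+2} = x_{n+1} + 2 x_n + 1, x_1 = 2, x_2 = 2
ok_7 = check(
    lambda cur, prev: cur + 2 * prev + 1,
    lambda n: Fraction(-5, 6) * (-1) ** n + Fraction(5, 6) * 2 ** n - Fraction(1, 2),
    2, 2,
)

# 3.B.8: 2 x_{n+2} = -x_{n+1} + x_n + 2, x_1 = 2, x_2 = 3
ok_8 = check(
    lambda cur, prev: (-cur + prev + 2) / 2,
    lambda n: (-1) ** n + 4 * Fraction(1, 2) ** n + 1,
    2, 3,
)

assert ok_7 and ok_8
```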
We try therefore to find other generators, which will be more convenient. Because the coefficients of the characteristic polynomial are real, each of its roots is either real or the roots are paired as complex conjugates.
If we describe the solution in polar form as
$$\lambda^n = |\lambda|^n(\cos n\varphi + i\sin n\varphi), \qquad \bar{\lambda}^n = |\lambda|^n(\cos n\varphi - i\sin n\varphi),$$
we see immediately that their sum and difference lead to two linearly independent solutions
$$x_n = |\lambda|^n \cos n\varphi, \qquad y_n = |\lambda|^n \sin n\varphi.$$
Difference equations very often appear as a model of dynamics of some system. A nice topic to think about is the connection between the absolute values of individual roots and the stability of the solution. We will not go into details here, because only in the fifth chapter we will speak of convergence of values to some limit value. There is space for some interesting numerical experiments: for instance with oscillations of suitable population or economical models.
3.2.6. The non-homogeneous case. As in the case of systems of linear equations we can obtain all solutions of non-homogeneous linear difference equations
$$a_0(n)x_n + a_1(n)x_{n-1} + \dots + a_k(n)x_{n-k} = b(n),$$
where the coefficients $a_i$ and $b$ are scalars which might depend on $n$, with $a_0(n) \ne 0$, $a_k(n) \ne 0$. Again, we proceed by finding one solution and adding the complete vector space of dimension $k$ of solutions to the corresponding homogeneous system. Indeed each such sum yields a solution. Since the difference of two solutions of a non-homogeneous system is a solution of the homogeneous system, we obtain all solutions in this way.
When we were working with systems of linear equations, it was possible that there was no solution. This is not possible with difference equations. But it is not always easy to find that one particular solution of a non-homogeneous system, particularly if the behaviour of the scalar coefficients in the equation is complicated. Even for linear recurrences with constant coefficients it may not be easy to find a solution if the right-hand side is complicated.
But we can always try to find a solution in a form similar to the right-hand side. Consider the case when the corresponding homogeneous system has constant coefficients and $b(n)$ is a polynomial of degree $s$. The solution can then be sought in the form of the polynomial
$$x_n = \alpha_0 + \alpha_1 n + \dots + \alpha_s n^s$$
with unknown coefficients $\alpha_i$, $i = 0, \dots, s$. By substitution into the difference equation and comparing the coefficients of the individual powers of $n$ we obtain a system of $s + 1$ equations for $s + 1$ variables $\alpha_i$. If this system has a solution, then we have found a solution of our original problem. If
3.B.9.   Determine sequences satisfying
$$x_{n+2} - 6x_{n+1} + 5x_n = n\,e^n.$$
Solution. Solve first the homogeneous part. We get:
$$x_n^{(h)} = c_1 \cdot 1^n + c_2 \cdot 5^n.$$
To find the particular solution we can use the method of variation of the constant. The Wronski determinant is
$$W_{j+1} = \det\begin{pmatrix} 1^{j+1} & 5^{j+1} \\ 1^{j+2} & 5^{j+2} \end{pmatrix} = 4 \cdot 5^{j+1}.$$
Thus,
$$x_n = c_1 + c_2 \cdot 5^n - \frac{1}{4}\sum_{j=0}^{n-1} j\,e^j + \left(\sum_{j=0}^{n-1} \frac{j\,e^j}{4 \cdot 5^{j+1}}\right) 5^n, \quad c_1, c_2 \in \mathbb{R}.$$
□
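The formula obtained in 3.B.9 by variation of constants can be tested numerically. The sketch below (our own illustrative names, with $c_1 = c_2 = 0$) evaluates the particular solution and checks that the residual $x_{n+2} - 6x_{n+1} + 5x_n$ reproduces the right-hand side $n\,e^n$ up to rounding.

```python
from math import exp, isclose


def particular(n):
    """Particular solution of 3.B.9 with c_1 = c_2 = 0."""
    f = lambda j: j * exp(j)                               # right-hand side f_j
    big_a = -sum(f(j) for j in range(n)) / 4               # coefficient at 1^n
    big_b = sum(f(j) / (4 * 5 ** (j + 1)) for j in range(n))  # coefficient at 5^n
    return big_a + big_b * 5 ** n


def residual(n):
    """Left-hand side of the equation evaluated on the particular solution."""
    return particular(n + 2) - 6 * particular(n + 1) + 5 * particular(n)


for n in range(8):
    assert isclose(residual(n), n * exp(n), rel_tol=1e-9, abs_tol=1e-9)
```

Adding any combination $c_1 + c_2 \cdot 5^n$ does not change the residual, since these sequences solve the homogeneous equation.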
3.B.10. Determine an explicit expression of the sequence satisfying the difference equation $x_{n+2} = 3x_{n+1} + 3x_n$ with members $x_1 = 1$ and $x_2 = 3$. ○
3.B.11. Determine an explicit formula for the $n$-th member of the unique solution $\{x_n\}_{n=1}^{\infty}$ that satisfies the following conditions:
Xn+2 = a;n+i - xn, X1 = \, X2 = 5.
o
3.B.12. Determine an explicit formula for the $n$-th member of the unique solution $\{x_n\}_{n=1}^{\infty}$ that satisfies the following conditions:
$$-x_{n+3} = 2x_{n+2} + 2x_{n+1} + x_n, \quad x_1 = 1, \ x_2 = 1, \ x_3 = 1.$$
○
3.B.13. Determine an explicit formula for the $n$-th member of the unique solution $\{x_n\}_{n=1}^{\infty}$ that satisfies the following conditions:
$$-x_{n+3} = 3x_{n+2} + 3x_{n+1} + x_n, \quad x_1 = 1, \ x_2 = 1, \ x_3 = 1.$$
○
C. Population models
Population models, which we consider now, have recurrence relations in vector spaces. The unknown in this case is not a sequence of numbers but a sequence of vectors. The role of coefficients is played by matrices. We begin with a simple (two-dimensional) case.
it has no solution, we can try again with an increase in the degree s of the polynomial in question.
For instance, the equation $x_n - x_{n-2} = 2$ cannot have a constant solution, because substitution of the potential solution $x_n = a_0$ yields the requirement $a_0 - a_0 = 0 = 2$. But by setting $x_n = a_0 + a_1 n$ we obtain a solution $x_n = a_0 + n$, with $a_0$ arbitrary. Thus the general solution of our equation is
\[ x_n = C_1 + C_2(-1)^n + n. \]
We use this method, the method of indeterminate coefficients, for example in 3.B.6.
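The general solution just obtained can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative choices, not part of the text.

```python
def general_solution(n, c1, c2):
    """Candidate general solution x_n = C1 + C2*(-1)^n + n."""
    return c1 + c2 * (-1) ** n + n


def left_hand_side(n, c1, c2):
    """Value of x_n - x_{n-2}, which the equation requires to equal 2."""
    return general_solution(n, c1, c2) - general_solution(n - 2, c1, c2)


# Try several (arbitrarily chosen) pairs of constants C1, C2.
for c1, c2 in [(0, 0), (3, -1), (-2, 5)]:
    assert all(left_hand_side(n, c1, c2) == 2 for n in range(2, 20))
```

The constants $C_1$, $C_2$ cancel in the difference, which is exactly why they remain free in the general solution.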
3.2.7. Variation of constants. Another possible way to solve such an equation is the variation of constants method. Here we first find a solution
\[ y(n) = \sum_{i=1}^{k} c_i f_i(n) \]
of the homogeneous equation. Then we look for a particular solution of the given equation in the same form, where we consider the constants $c_i$ as functions $c_i(n)$ of the variable $n$.
We illustrate the method on second order equations. Suppose that the homogeneous part of the second order non-homogeneous equation
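\[ x_{n+2} + a_n x_{n+1} + b_n x_n = f_n \]
has $x_n^{(1)}$ and $x_n^{(2)}$ as a basis of solutions. We will be looking for a particular solution of the non-homogeneous equation in the form
\[ x_n = A_n x_n^{(1)} + B_n x_n^{(2)} \]
with some conditions on $A_n$ and $B_n$ to be imposed. We have
\[ \begin{aligned} x_{n+1} &= A_{n+1}x_{n+1}^{(1)} + B_{n+1}x_{n+1}^{(2)} \\ &= A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)} + (A_{n+1} - A_n)x_{n+1}^{(1)} + (B_{n+1} - B_n)x_{n+1}^{(2)} \\ &= A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)} + \delta A_n x_{n+1}^{(1)} + \delta B_n x_{n+1}^{(2)}, \end{aligned} \]
where $\delta A_n = A_{n+1} - A_n$ and $\delta B_n = B_{n+1} - B_n$.
In order to be able to use the same $A_n$, $B_n$ in the expression for $x_{n+1}$, we impose for all $n$ the condition
\[ \delta A_n x_{n+1}^{(1)} + \delta B_n x_{n+1}^{(2)} = 0. \]
Thus, for all $n$,
\[ x_{n+1} = A_n x_{n+1}^{(1)} + B_n x_{n+1}^{(2)}, \]
and in particular
\[ \begin{aligned} x_{n+2} &= A_{n+1}x_{n+2}^{(1)} + B_{n+1}x_{n+2}^{(2)} \\ &= A_n x_{n+2}^{(1)} + B_n x_{n+2}^{(2)} + \delta A_n x_{n+2}^{(1)} + \delta B_n x_{n+2}^{(2)}. \end{aligned} \]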
3.C.1. Savings. A friend and I save for a holiday together by monthly payments in the following way. At the beginning I give 10 € and he gives 20 €. Every consecutive month each of us gives as much as last month plus one half of what the other has given the month before. How much will we have after one year? How much money will I pay in the twelfth month?
Solution. Let the amount of money I pay in the n-th month be denoted by $x_n$, and the amount my friend pays by $y_n$. Thus in the first month we deposit $x_1 = 10$, $y_1 = 20$. For the following payments we can write down a recurrence relation:
\[ x_{n+1} = x_n + \tfrac{1}{2} y_n, \qquad y_{n+1} = y_n + \tfrac{1}{2} x_n. \]
If we denote the common savings by $z_n = x_n + y_n$, then by summing the equations we obtain $z_{n+1} = z_n + \tfrac{1}{2}z_n = \tfrac{3}{2}z_n$. This is a geometric sequence with $z_1 = 30$, so $z_n = 30 \cdot (\tfrac{3}{2})^{n-1}$. In a year we will have $z_1 + z_2 + \dots + z_{12}$. This partial sum is easy to compute:
\[ 30 \cdot \frac{(\tfrac{3}{2})^{12} - 1}{\tfrac{3}{2} - 1} \approx 7724.8. \]
In a year we will have saved over 7724 €.
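The recurrent system of equations describing the savings system can be written by matrices as follows:
\[ \begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{pmatrix} \begin{pmatrix} x_n \\ y_n \end{pmatrix}. \]
It is thus again a geometric sequence. Its elements are now vectors and the quotient is not a scalar, but a matrix. The solution can be found analogously:
\[ \begin{pmatrix} x_n \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{pmatrix}^{n-1} \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}. \]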
The power of the matrix acting on the vector (x1,y1) can be found by expressing this vector in the basis of eigenvectors.
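The characteristic polynomial of the matrix is $(1 - \lambda)^2 - \tfrac{1}{4}$ and the eigenvalues are thus $\lambda_{1,2} = \tfrac{3}{2}, \tfrac{1}{2}$. The corresponding eigenvectors are $(1, 1)^T$ and $(1, -1)^T$. For the initial vector $(x_1, y_1)^T = (10, 20)^T$ we compute
\[ \begin{pmatrix} 10 \\ 20 \end{pmatrix} = 15 \begin{pmatrix} 1 \\ 1 \end{pmatrix} - 5 \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \]
and thus
\[ \begin{pmatrix} x_n \\ y_n \end{pmatrix} = 15 \left(\tfrac{3}{2}\right)^{n-1} \begin{pmatrix} 1 \\ 1 \end{pmatrix} - 5 \left(\tfrac{1}{2}\right)^{n-1} \begin{pmatrix} 1 \\ -1 \end{pmatrix}. \]
That means that in the 12th month I pay
\[ x_{12} = 15 \left(\tfrac{3}{2}\right)^{11} - 5 \left(\tfrac{1}{2}\right)^{11} \approx 1297 \]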
Now,
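\[ \begin{aligned} f_n &= x_{n+2} + a_n x_{n+1} + b_n x_n \\ &= A_n\big(x_{n+2}^{(1)} + a_n x_{n+1}^{(1)} + b_n x_n^{(1)}\big) + B_n\big(x_{n+2}^{(2)} + a_n x_{n+1}^{(2)} + b_n x_n^{(2)}\big) + \delta A_n x_{n+2}^{(1)} + \delta B_n x_{n+2}^{(2)} \\ &= \delta A_n x_{n+2}^{(1)} + \delta B_n x_{n+2}^{(2)}. \end{aligned} \]
Hence the variations $\delta A_n$ and $\delta B_n$ are subject to the system
\[ \begin{aligned} \delta A_n x_{n+1}^{(1)} + \delta B_n x_{n+1}^{(2)} &= 0, \\ \delta A_n x_{n+2}^{(1)} + \delta B_n x_{n+2}^{(2)} &= f_n, \end{aligned} \]
with solutions (compute the inverse matrix e.g. by means of the algebraic adjoint and the determinant)
\[ \delta A_n = A_{n+1} - A_n = -\frac{f_n x_{n+1}^{(2)}}{W_{n+1}}, \qquad \delta B_n = B_{n+1} - B_n = \frac{f_n x_{n+1}^{(1)}}{W_{n+1}}, \]
where $W_{n+1}$ is the Wronski determinant
\[ W_{n+1} = \det\begin{pmatrix} x_{n+1}^{(1)} & x_{n+1}^{(2)} \\ x_{n+2}^{(1)} & x_{n+2}^{(2)} \end{pmatrix}. \]
It follows that
\[ A_n = A_0 - \sum_{j=0}^{n-1} \frac{f_j x_{j+1}^{(2)}}{W_{j+1}}, \qquad B_n = B_0 + \sum_{j=0}^{n-1} \frac{f_j x_{j+1}^{(1)}}{W_{j+1}}. \]
Setting $A_0 = B_0 = 0$ we obtain
\[ A_n = -\sum_{j=0}^{n-1} \frac{f_j x_{j+1}^{(2)}}{W_{j+1}}, \qquad B_n = \sum_{j=0}^{n-1} \frac{f_j x_{j+1}^{(1)}}{W_{j+1}}, \]
and the acquired general solution of our recurrence equation is
\[ x_n = C_1 x_n^{(1)} + C_2 x_n^{(2)} - \left(\sum_{j=0}^{n-1} \frac{f_j x_{j+1}^{(2)}}{W_{j+1}}\right) x_n^{(1)} + \left(\sum_{j=0}^{n-1} \frac{f_j x_{j+1}^{(1)}}{W_{j+1}}\right) x_n^{(2)}. \]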
This method is used to solve the example 3.B.9.
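To see the variation of constants at work, here is a minimal Python sketch for the equation $x_{n+2} - 6x_{n+1} + 5x_n = n e^n$ of exercise 3.B.9, with the basis $1^n$, $5^n$ of the homogeneous solutions. The function names are hypothetical; the code evaluates the sums for $A_n$ and $B_n$ (with $A_0 = B_0 = 0$) and checks the equation in floating-point arithmetic.

```python
import math


def f(n):
    """Right-hand side f_n = n * e^n."""
    return n * math.exp(n)


def wronskian(j):
    """W_{j+1} for the basis x^(1)_n = 1, x^(2)_n = 5^n; equals 4 * 5^(j+1)."""
    return 1 * 5 ** (j + 2) - 5 ** (j + 1) * 1


def particular(n):
    """Particular solution A_n * 1 + B_n * 5^n with A_0 = B_0 = 0."""
    a_n = -sum(f(j) * 5 ** (j + 1) / wronskian(j) for j in range(n))
    b_n = sum(f(j) * 1 / wronskian(j) for j in range(n))
    return a_n * 1 + b_n * 5 ** n


def x(n, c1=0.0, c2=0.0):
    """General solution c1 * 1^n + c2 * 5^n + particular solution."""
    return c1 + c2 * 5 ** n + particular(n)


# The residual of the equation should vanish up to rounding errors.
for n in range(10):
    lhs = x(n + 2, 2.0, -1.0) - 6 * x(n + 1, 2.0, -1.0) + 5 * x(n, 2.0, -1.0)
    assert math.isclose(lhs, f(n), rel_tol=1e-9, abs_tol=1e-9)
```

The sums are recomputed from scratch for every $n$, which is wasteful but keeps the sketch close to the formulas; accumulating $A_n$ and $B_n$ step by step would be the natural optimisation.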
3.2.8. Linear filters. Now we consider infinite sequences
\[ x = (\dots, x_{-n}, x_{-n+1}, \dots, x_{-1}, x_0, x_1, \dots, x_n, \dots). \]
As in the case of systems of linear equations, we work with an operation $T$ that maps the sequence $x$ to the sequence $z = Tx$ with elements
\[ z_n = a_0 x_n + a_1 x_{n-1} + \dots + a_k x_{n-k}. \]
€ and my friend pays basically the same amount. □
Remark. The previous example can also be solved without matrices by rewriting the recurrent equation: $x_{n+1} = x_n + \tfrac{1}{2} y_n = \tfrac{1}{2} x_n + \tfrac{1}{2} z_n$.
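The savings recurrence can also be iterated directly. The sketch below uses exact rational arithmetic and takes the initial payments $x_1 = 10$, $y_1 = 20$ from the problem statement; since the recurrence is linear, rescaling the initial payments rescales every result by the same factor. The helper name `payments` is an illustrative assumption.

```python
from fractions import Fraction


def payments(x1, y1, months):
    """Iterate x_{n+1} = x_n + y_n/2, y_{n+1} = y_n + x_n/2 exactly."""
    xs, ys = [Fraction(x1)], [Fraction(y1)]
    for _ in range(months - 1):
        x, y = xs[-1], ys[-1]
        xs.append(x + y / 2)
        ys.append(y + x / 2)
    return xs, ys


xs, ys = payments(10, 20, 12)
total = sum(xs) + sum(ys)          # z_1 + ... + z_12
twelfth_payment = xs[-1]           # x_12

# Total saved, my payment and my friend's payment in the twelfth month.
print(float(total), float(twelfth_payment), float(ys[-1]))
```

The total agrees with the closed form of the geometric series with quotient $\tfrac{3}{2}$, and the last payments agree with the decomposition of the initial vector into the eigenvectors $(1, 1)^T$ and $(1, -1)^T$.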
The previous example was actually a model of growth (in this case, growth of saved money). We now turn to models of growth describing primarily the growth of a population. The Leslie model of population growth, which we have dealt with in great detail in the theoretical part, describes very well not only populations of sheep (for which it was developed), but can also be applied in modelling the following populations:
3.C.2. Rabbits for the second time. We show how the Leslie model can describe the population of the rabbits on a meadow with which we have worked in exercise (3.B.1). Suppose that the rabbits die after reaching the ninth month of age (in the original model the rabbits were immortal). Denote the number of rabbits according to their age in months at time $t$ (in months) as $x_1(t), x_2(t), \dots, x_9(t)$. Then the numbers of rabbits in the individual categories are described after one month by the formulas $x_1(t+1) = x_2(t) + x_3(t) + \dots + x_9(t)$, $x_i(t+1) = x_{i-1}(t)$ for $i = 2, 3, \dots, 9$, or
\[ \begin{pmatrix} x_1(t+1) \\ x_2(t+1) \\ x_3(t+1) \\ x_4(t+1) \\ x_5(t+1) \\ x_6(t+1) \\ x_7(t+1) \\ x_8(t+1) \\ x_9(t+1) \end{pmatrix} = \begin{pmatrix} 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \\ x_5(t) \\ x_6(t) \\ x_7(t) \\ x_8(t) \\ x_9(t) \end{pmatrix}. \]
The characteristic polynomial of the given matrix is $\lambda^9 - \lambda^7 - \lambda^6 - \lambda^5 - \lambda^4 - \lambda^3 - \lambda^2 - \lambda - 1$. The roots of this polynomial are hard to express explicitly, but we can estimate one of them very well: $\lambda_1 \approx 1.608$ (why must it be smaller than $(\sqrt{5} + 1)/2$?). Thus the population grows according to this model approximately as the geometric sequence $1.608^t$.
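The estimate of the dominant root can be reproduced by a simple bisection. This is a minimal sketch; the bracketing interval $[1, 2]$ and the tolerance are choices made here for illustration.

```python
def p(lam):
    """Characteristic polynomial lam^9 - lam^7 - lam^6 - ... - lam - 1."""
    return lam ** 9 - sum(lam ** i for i in range(8))


def bisect(func, lo, hi, tol=1e-12):
    """Root of func in [lo, hi], assuming func(lo) < 0 < func(hi)."""
    assert func(lo) < 0 < func(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


lam1 = bisect(p, 1.0, 2.0)
golden_ratio = (5 ** 0.5 + 1) / 2

# The dominant root lies slightly below the golden ratio, the growth
# rate of the original model with immortal rabbits.
print(lam1, golden_ratio)
```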
3.C.3. Pond. Suppose we have a simple model of a pond where there lives a population of white fish (roach, bleak, vimba, nase, etc.). Assume that 20 % of babies survive their second year and from that age on they are able to reproduce. For these young fish, approximately 60 % of them survive their third year and in the following years the mortality can be ignored. Furthermore we assume that the birth rate is three times the number of fish that can reproduce.
Such a population would clearly fill the pond very quickly.  Thus we want to maintain a balance by using a
As already noticed, the sequences $x = (x_n)$ are vectors with respect to coordinate-wise operations, and the vector space of all such sequences is infinite-dimensional. The operation $T$ is clearly a linear mapping on this space.
The sequences can be imagined as discrete values of a signal, often captured in very short time units. $T$ plays the role of a filter that works with the signal. For example, this is what the sampling of an audio signal looks like. We are interested in estimating the properties such a linear filter can have.
Signals are often a linear combination of superimposed parts, which are themselves periodic. From our definition it is clear that periodic sequences $x_n$, that is, sequences satisfying for some fixed natural number $p$
\[ x_{n+p} = x_n, \]
will also have periodic images $z = Tx$,
\[ z_{n+p} = a_0 x_{n+p} + a_1 x_{n-1+p} + \dots + a_k x_{n-k+p} = a_0 x_n + a_1 x_{n-1} + \dots + a_k x_{n-k} = z_n, \]
with the same period $p$.
We are interested in which input periodic sequences Tx remain roughly the same (up to a scalar multiple), and in which Tx will be suppressed close to zero values. Also, we are looking for the kernel of our linear mapping T. That is, the subspace of sequences given by the homogeneous difference equation
\[ a_0 x_n + a_1 x_{n-1} + \dots + a_k x_{n-k} = 0, \qquad a_0 \neq 0, \quad a_k \neq 0, \]
which we are able to solve.
3.2.9. Bad equalizer. As an example, consider a very simple linear filter given by the equation
\[ z_n = (Tx)_n = x_n + x_{n-2}. \]
Clearly, the kernel of $T$ is generated by $x_n = \cos(\tfrac{\pi}{2} n)$ and $x_n = \sin(\tfrac{\pi}{2} n)$, while the solutions to $x_{n+2} = x_n$ correspond to the requirement $(Tx)_n = 2x_n$. The results of such an operation on a signal are illustrated by the two diagrams below. There we use two different frequencies of signals and display their discrete sampling (the solid lines and the points $x_n$ on them). The dashed line represents the sampling $z_n$ of the filtered signal.
predator, for instance esox. Assume that one esox eats per year approximately 500 mature white fish. How many esox should be put into the pond in order for the population to remain constant?
Solution. If we denote by p the number of babies, by m the number of young fish and by r the number of adult fish, then the state of the population in the next year is given by:
\[ \begin{pmatrix} p \\ m \\ r \end{pmatrix} \mapsto \begin{pmatrix} 3m + 3r \\ 0.2p \\ 0.6m + \tau r \end{pmatrix}, \]
where $1 - \tau$ is the relative mortality of the adult fish caused by the esox. The corresponding matrix describing this model is then
\[ \begin{pmatrix} 0 & 3 & 3 \\ 0.2 & 0 & 0 \\ 0 & 0.6 & \tau \end{pmatrix}. \]
If the population is to stagnate (i.e. remain constant), then this matrix must have eigenvalue 1. In other words, one must be a root of the characteristic polynomial of this matrix. The characteristic equation is of the form $\lambda^2(\tau - \lambda) + 0.36 - 0.6(\tau - \lambda) = 0$. That means that $\tau$ must satisfy
\[ \tau - 1 + 0.36 - 0.6(\tau - 1) = 0, \quad\text{that is,}\quad 0.4\tau - 0.04 = 0. \]
Thus $\tau = 0.1$: in the next year only 10 % of the adult fish is allowed to survive and the rest should be eaten by the esox. If we denote the desired number of esox by $x$, then together they eat $500x$ fish, which, according to the previous computation, should be $0.9r$. The ratio of the number of white fish to the number of esox should thus be
\[ \frac{r}{x} = \frac{500}{0.9}. \]
That is, one esox for (approximately) 556 white fish. □
3.C.4. In the population model, let the number of predators be $D_k$ and the number of prey be $K_k$ in month $k$. The relation of these between month $k$ and month $k+1$ is given by one of the three linear systems
(a)
\[ D_{k+1} = 0.6\,D_k + 0.5\,K_k, \qquad K_{k+1} = -0.16\,D_k + 1.2\,K_k; \]
(b)
\[ D_{k+1} = 0.6\,D_k + 0.5\,K_k, \qquad K_{k+1} = -0.175\,D_k + 1.2\,K_k; \]
(c)
\[ D_{k+1} = 0.6\,D_k + 0.5\,K_k, \qquad K_{k+1} = -0.135\,D_k + 1.2\,K_k. \]
Analyse the behaviour of this model for large time values.
The first case shows an amplifying of the signal, while the second frequency is close to the kernel which is killed by the filter. Notice that the filtered signal suffers serious shifts in phase, which varies with the frequencies. Cheap equalisers work in such a bad way.
Notice also how badly the original signal is sampled on the second picture. This is due to the fact that the sampling frequency is not much higher than the frequency of the signal.
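The two extreme behaviours of the bad equalizer, a signal in the kernel and a signal that is doubled, can be checked with a few lines of Python. This is a minimal sketch in which signals are represented as functions of the integer index $n$; the names are illustrative.

```python
import math


def bad_equalizer(x, n):
    """Filtered value z_n = x_n + x_{n-2} of the signal x at index n."""
    return x(n) + x(n - 2)


def kernel_signal(n):
    """x_n = cos(pi*n/2), a generator of the kernel of the filter."""
    return math.cos(math.pi * n / 2)


def period_two_signal(n):
    """x_n = (-1)^n, which satisfies x_{n+2} = x_n."""
    return (-1) ** n


for n in range(-10, 11):
    # Killed by the filter (up to rounding errors).
    assert abs(bad_equalizer(kernel_signal, n)) < 1e-12
    # Amplified by the factor two.
    assert bad_equalizer(period_two_signal, n) == 2 * period_two_signal(n)
```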
3. Iterated linear processes
3.3.1. Iterated processes. In practical models we often encounter the situation where the evolution of a system in a given time interval is given by a linear process, and we are interested in the behaviour of the system after many iterations. The linear process often remains the same, thus from the mathematical point of view we are dealing with an iterated multiplication of the state vector by the same matrix.
While solving systems of linear equations requires only minimal knowledge of the properties of linear mappings, in order to understand the behaviour of an iterated system we shall exploit the features of eigenvalues, eigenvectors and other structural features.
In fact, the determination of the solution of a linear recurrence equation by a set of initial conditions can be described as an iterated process. Imagine we keep the state vector of the last $k$ values
\[ Y_n = (x_n, \dots, x_{n-k+1})^T \]
(filled by the initial condition in the beginning of the process). In the next step we update the state vector to
\[ Y_{n+1} = (x_{n+1}, x_n, \dots, x_{n-k+2})^T, \]
where the first entry $x_{n+1} = a_1 x_n + \dots + a_k x_{n-k+1}$ is computed by means of a homogeneous difference equation, while the other entries are just a shift by one position with the last one forgotten. The corresponding square matrix of order $k$ that satisfies $Y_{n+1} = A \cdot Y_n$ is as follows:
\[ A = \begin{pmatrix} a_1 & a_2 & \dots & a_{k-1} & a_k \\ 1 & 0 & \dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \dots & 1 & 0 \end{pmatrix}. \]
A while ago, we derived an explicit procedure for the complete formula for the solution of such an iterated process with a special type of matrix. In general, it will not be easy even
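Solution. Note that the individual variants differ from each other only in the value of the coefficient at $D_k$ in the second equation. Thus we can express all three cases as
\[ \begin{pmatrix} D_{k+1} \\ K_{k+1} \end{pmatrix} = \begin{pmatrix} 0.6 & 0.5 \\ -a & 1.2 \end{pmatrix} \begin{pmatrix} D_k \\ K_k \end{pmatrix}, \]
where we set $a = 0.16$, $a = 0.175$, $a = 0.135$, respectively. The value of the coefficient $a$ represents here the average number of prey killed by one predator per month. When denoting
\[ T = \begin{pmatrix} 0.6 & 0.5 \\ -a & 1.2 \end{pmatrix}, \]
we obtain
\[ \begin{pmatrix} D_k \\ K_k \end{pmatrix} = T^k \begin{pmatrix} D_0 \\ K_0 \end{pmatrix}, \qquad k \in \mathbb{N}. \]
Using the powers of the matrix $T$ we can determine the evolution of the populations of predators and prey after a very long time.
We compute the eigenvalues of the matrix $T$:
(a) $\lambda_1 = 1$, $\lambda_2 = 0.8$;
(b) $\lambda_1 = 0.95$, $\lambda_2 = 0.85$;
(c) $\lambda_1 = 1.05$, $\lambda_2 = 0.75$.
The respective eigenvectors are
(a) $(5, 4)^T$, $(5, 2)^T$;
(b) $(10, 7)^T$, $(2, 1)^T$;
(c) $(10, 9)^T$, $(10, 3)^T$.
For $k \in \mathbb{N}$ we obtain
(a)
\[ T^k = \begin{pmatrix} 5 & 5 \\ 4 & 2 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0.8 \end{pmatrix}^k \begin{pmatrix} 5 & 5 \\ 4 & 2 \end{pmatrix}^{-1}; \]
(b)
\[ T^k = \begin{pmatrix} 10 & 2 \\ 7 & 1 \end{pmatrix} \begin{pmatrix} 0.95 & 0 \\ 0 & 0.85 \end{pmatrix}^k \begin{pmatrix} 10 & 2 \\ 7 & 1 \end{pmatrix}^{-1}; \]
(c)
\[ T^k = \begin{pmatrix} 10 & 10 \\ 9 & 3 \end{pmatrix} \begin{pmatrix} 1.05 & 0 \\ 0 & 0.75 \end{pmatrix}^k \begin{pmatrix} 10 & 10 \\ 9 & 3 \end{pmatrix}^{-1}. \]
From there we have for large $k \in \mathbb{N}$ that
(a)
\[ T^k \approx \begin{pmatrix} 5 & 5 \\ 4 & 2 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 5 & 5 \\ 4 & 2 \end{pmatrix}^{-1} = \frac{1}{10} \begin{pmatrix} -10 & 25 \\ -8 & 20 \end{pmatrix}; \]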
for very similar systems. A typical case is the study of the dynamics of populations in some biological systems which we discuss below.
The characteristic polynomial $|A - \lambda E|$ of our matrix is
\[ p(\lambda) = (-1)^k(\lambda^k - a_1 \lambda^{k-1} - \dots - a_k), \]
as we can check directly or by expanding the last column and employing induction on $k$.
Thus, the eigenvalues are exactly the roots $\lambda$ of the characteristic polynomial of the linear recurrence. We should have expected this, because having a nonzero solution $x_n = \lambda^n$ to the linear recurrence means that the matrix $A$ must bring $(\lambda^k, \dots, \lambda)^T$ to its $\lambda$-multiple. Thus every such $\lambda$ must be an eigenvalue of the matrix $A$.
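The passage from a linear recurrence to an iterated matrix process can be sketched in a few lines of Python. The helper names below (`companion`, `mat_vec`, `iterate`) are illustrative assumptions; the state vector stores the most recent value first, as in $Y_n$ above.

```python
def companion(a):
    """Matrix A with first row (a_1, ..., a_k) and ones below the diagonal."""
    k = len(a)
    rows = [list(a)]
    for i in range(k - 1):
        rows.append([1 if j == i else 0 for j in range(k)])
    return rows


def mat_vec(A, v):
    """Product of the matrix A with the column vector v."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]


def iterate(a, initial, steps):
    """Apply Y_{n+1} = A * Y_n repeatedly; initial = (x_{k-1}, ..., x_0)."""
    A = companion(a)
    Y = list(initial)
    for _ in range(steps):
        Y = mat_vec(A, Y)
    return Y


# Fibonacci recurrence x_{n+1} = x_n + x_{n-1} with x_1 = 1, x_0 = 0:
# after nine steps the state vector is (x_10, x_9).
print(iterate((1, 1), (1, 0), 9))

# For x_{n+1} = 3 x_n - 2 x_{n-1} the root lambda = 2 of the characteristic
# polynomial gives the eigenvector (lambda^2, lambda) = (4, 2).
print(mat_vec(companion((3, -2)), [4, 2]))
```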
3.3.2. Leslie model for population growth. Imagine that we are dealing with some system of individuals (cattle, insects, cell cultures, etc.) divided into m groups (according to their age, evolution stage, etc.). The state Xn is thus given by the vector
\[ X_n = (u_1, \dots, u_m)^T \]
depending on the time $t_n$ in which we are observing the system. A linear model of evolution of such a system is then given by the matrix $A$ of order $m$, which gives the change of the vector $X_n$ to
\[ X_{n+1} = A \cdot X_n \]
when time changes from $t_n$ to $t_{n+1}$.
As an example, we consider the Leslie model for population growth. Here there is the matrix
\[ A = \begin{pmatrix} f_1 & f_2 & f_3 & \dots & f_{m-1} & f_m \\ \tau_1 & 0 & 0 & \dots & 0 & 0 \\ 0 & \tau_2 & 0 & \dots & 0 & 0 \\ 0 & 0 & \ddots & & 0 & 0 \\ 0 & 0 & 0 & \dots & \tau_{m-1} & 0 \end{pmatrix} \]
whose parameters are tied with the evolution of a population divided into $m$ age groups such that $f_i$ denotes the relative fertility of the corresponding age group (in the observed time shift, $N$ individuals in the $i$-th group give rise to $f_i N$ new ones, which are in the first group), while $\tau_i$ is the relative survival rate in the $i$-th group in one time interval (so that $1 - \tau_i$ is the relative mortality). Clearly such a model can be used with any number of age groups.
All coefficients are thus non-negative real numbers and the numbers $\tau_i$ are between zero and one. Note that when all $\tau_i$ are equal to one, it is actually a linear recurrence with constant coefficients and thus has either exponential growth/decay (for real roots $\lambda$ of the characteristic polynomial) or oscillation connected with potential growth/decay (for complex roots).
Before we introduce a more general theory, we consider in more detail this specific model.
Direct computation with the Laplace expansion of the last column yields the characteristic polynomial $p_m(\lambda)$ of the
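(b)
\[ T^k \approx \begin{pmatrix} 10 & 2 \\ 7 & 1 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 10 & 2 \\ 7 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}; \]
(c)
\[ T^k \approx \begin{pmatrix} 10 & 10 \\ 9 & 3 \end{pmatrix} \begin{pmatrix} 1.05^k & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 10 & 10 \\ 9 & 3 \end{pmatrix}^{-1} = \frac{1.05^k}{60} \begin{pmatrix} -30 & 100 \\ -27 & 90 \end{pmatrix}, \]
because for large $k \in \mathbb{N}$ we can set
(a)
\[ \begin{pmatrix} 1 & 0 \\ 0 & 0.8 \end{pmatrix}^k \approx \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}; \]
(b)
\[ \begin{pmatrix} 0.95 & 0 \\ 0 & 0.85 \end{pmatrix}^k \approx \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}; \]
(c)
\[ \begin{pmatrix} 1.05 & 0 \\ 0 & 0.75 \end{pmatrix}^k \approx \begin{pmatrix} 1.05^k & 0 \\ 0 & 0 \end{pmatrix}. \]
Note that in variant (b), that is for $a = 0.175$, it is not necessary to compute the eigenvectors. Thus we have
(a)
\[ \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \frac{1}{10} \begin{pmatrix} -10 & 25 \\ -8 & 20 \end{pmatrix} \begin{pmatrix} D_0 \\ K_0 \end{pmatrix} = \frac{1}{10} \begin{pmatrix} 5(-2D_0 + 5K_0) \\ 4(-2D_0 + 5K_0) \end{pmatrix}; \]
(b)
\[ \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} D_0 \\ K_0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}; \]
(c)
\[ \begin{pmatrix} D_k \\ K_k \end{pmatrix} \approx \frac{1.05^k}{60} \begin{pmatrix} -30 & 100 \\ -27 & 90 \end{pmatrix} \begin{pmatrix} D_0 \\ K_0 \end{pmatrix} = \frac{1.05^k}{60} \begin{pmatrix} 10(-3D_0 + 10K_0) \\ 9(-3D_0 + 10K_0) \end{pmatrix}. \]
These results can be interpreted as follows:
(a) If $2D_0 < 5K_0$, the sizes of both populations stabilise on non-zero sizes (we say that they are stable); if $2D_0 > 5K_0$, both populations die out.
(b) Both populations die out.
(c) For $3D_0 < 10K_0$ a population boom of both kinds begins; for $3D_0 > 10K_0$ both populations die out.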
matrix A for the model with m groups:
\[ p_m(\lambda) = -\lambda\, p_{m-1}(\lambda) + (-1)^{m-1} f_m \tau_1 \cdots \tau_{m-1}. \]
By induction we derive that this characteristic polynomial is of the form
\[ p_m(\lambda) = (-1)^m(\lambda^m - a_1 \lambda^{m-1} - \dots - a_{m-1}\lambda - a_m). \]
The coefficients $a_1, \dots, a_m$ are all positive if all parameters $\tau_i$ and $f_i$ are positive. In particular,
\[ a_m = f_m \tau_1 \cdots \tau_{m-1}. \]
Consider the distribution of the roots of the polynomial $p_m$. We write the characteristic polynomial in the form
\[ p_m(\lambda) = \pm\lambda^m(1 - q(\lambda)), \]
where $q(\lambda) = a_1 \lambda^{-1} + \dots + a_m \lambda^{-m}$ is a strictly decreasing non-negative function for $\lambda > 0$. For $\lambda$ positive but very small the value of $q$ will be arbitrarily large, while for large $\lambda$ it will be arbitrarily close to zero. Thus, evidently there exists exactly one positive $\lambda$ for which $q(\lambda) = 1$ and thus also $p_m(\lambda) = 0$. In other words, for every Leslie matrix (with all the parameters $f_i$ and $\tau_i$ positive), there exists exactly one positive real eigenvalue. For actual Leslie models of populations a typical situation is when this unique positive eigenvalue $\lambda_1$ is greater than or equal to one, while the absolute values of the other eigenvalues are strictly less than one.
If we begin with any state vector $X$, given as a sum of eigenvectors
\[ X = X_1 + \dots + X_m \]
with eigenvalues $\lambda_i$, then iterations yield
\[ A^k \cdot X = \lambda_1^k X_1 + \dots + \lambda_m^k X_m. \]
Thus under the assumption that $|\lambda_i| < 1$ for all $i \ge 2$, all components in the eigensubspaces decrease very fast, except for the component $\lambda_1^k X_1$.
The distribution of the population among the age groups thus very quickly approaches the ratios of the components of the eigenvector associated with the dominant eigenvalue $\lambda_1$.
As an example, consider the matrix below where individual coefficients are taken from the model for sheep breeding, that is, the values $\tau_i$ contain both natural deaths and activities of breeders.
\[ A = \begin{pmatrix} 0 & 0.2 & 0.8 & 0.6 & 0 \\ 0.95 & 0 & 0 & 0 & 0 \\ 0 & 0.8 & 0 & 0 & 0 \\ 0 & 0 & 0.7 & 0 & 0 \\ 0 & 0 & 0 & 0.6 & 0 \end{pmatrix} \]
The eigenvalues are approximately
\[ 1.03, \quad 0, \quad -0.5, \quad -0.27 + 0.74i, \quad -0.27 - 0.74i \]
with absolute values 1.03, 0, 0.5, 0.78, 0.78 and the eigenvector corresponding to the dominant eigenvalue is approximately
\[ X^T = (30 \ \ 27 \ \ 21 \ \ 14 \ \ 8). \]
Even a small change in the size of a can lead to a completely different result. This is caused by the constancy of the value of a: it does not depend on the size of the populations. Note that this restriction (that is, assuming a to be constant) has no interpretation in reality. But still we obtain an estimate on the sizes of a for stable populations. □
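The three variants of the predator and prey model can be compared by simply iterating the system. A minimal sketch follows; the initial sizes $D_0 = K_0 = 10$ and the number of months are arbitrary choices made here for illustration.

```python
def simulate(a, d0, k0, months):
    """Iterate D_{k+1} = 0.6 D_k + 0.5 K_k, K_{k+1} = -a D_k + 1.2 K_k."""
    d, k = d0, k0
    for _ in range(months):
        d, k = 0.6 * d + 0.5 * k, -a * d + 1.2 * k
    return d, k


for label, a in [("(a)", 0.16), ("(b)", 0.175), ("(c)", 0.135)]:
    d, k = simulate(a, 10.0, 10.0, 300)
    print(label, d, k)
```

With these initial values, variant (a) settles near $\tfrac{1}{10}\big(5(-2D_0 + 5K_0), 4(-2D_0 + 5K_0)\big) = (15, 12)$, variant (b) decays towards zero and variant (c) grows like $1.05^k$, in agreement with the limits computed from the eigenvalues.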
3.C.5. Remark. Another model for the populations of predators and preys is the model by Lotka and Volterra, which describes a relation between the populations by a system of two ordinary differential equations. Using this model both populations oscillate, which is in accord with observations.
Other interesting and well-described models of growth can be found in the collection of exercises after this chapter, (see 3.G.2
In linear models an important role is played by primitive matrices (3.3.3).
3.C.6.   Which of the matrices
\[ A = \begin{pmatrix} 0 & 1/7 \\ 1 & 6/7 \end{pmatrix}, \]
C ■
B =
D
1
E -
0 0 0 0 0 0 1
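\[ B = \begin{pmatrix} 1/2 & 0 & 1/3 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 1/6 \end{pmatrix}, \qquad D = \begin{pmatrix} 1/3 & 1/2 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/6 & 1/6 & 1/3 \\ 1/6 & 0 & 5/6 & 2/3 \end{pmatrix} \]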
0 0\
1 0
0/
are primitive? Solution.
\[ A^2 = \begin{pmatrix} 1/7 & 6/49 \\ 6/7 & 43/49 \end{pmatrix}, \]
C3 =
1/4 l/4> 3/8 1/4 3/8 1/2;
So the matrices $A$ and $C$ are primitive, since (respectively) $A^2$ and $C^3$ are positive matrices. The middle column of the matrix $B^n$ is always (for $n \in \mathbb{N}$) the vector $(0, 1, 0)^T$, which contains the entry 0. Hence the matrix $B$ cannot be primitive. The product
\[ \begin{pmatrix} 1/3 & 1/2 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/6 & 1/6 & 1/3 \\ 1/6 & 0 & 5/6 & 2/3 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ a \\ b \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ a/6 + b/3 \\ 5a/6 + 2b/3 \end{pmatrix}, \qquad a, b \in \mathbb{R}, \]
implies that the matrix $D^2$ has in the right upper corner a zero two-dimensional (square) sub-matrix. By induction, the same property is shared by the matrices $D^3 = D \cdot D^2$,
We have chosen the eigenvector whose coordinates sum to 100, thus it directly gives us the percentage distribution of the population.
Suppose instead that we wish for a constant population, and that one year old sheep are removed for consumption. Then we need to ask how to decrease $\tau_2$ so that the dominant eigenvalue would be one.
A direct check shows that the farmer could then eat about 10% more of one year old sheep to keep the population constant.
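The dominant eigenvalue and the age distribution of the sheep model can be approximated by iterating the Leslie matrix on a positive starting vector. The sketch below is one possible way to do this in plain Python (power iteration is a technique chosen here, not one prescribed by the text); the fertilities and survival rates are read off the sheep breeding matrix of this paragraph, and the helper names are assumptions.

```python
def leslie(f, tau):
    """Leslie matrix with fertilities f_1..f_m and survival rates tau_1..tau_{m-1}."""
    m = len(f)
    A = [list(f)] + [[0.0] * m for _ in range(m - 1)]
    for i, t in enumerate(tau):
        A[i + 1][i] = t
    return A


def dominant(A, steps=500):
    """Power iteration; returns (eigenvalue, eigenvector with entries summing to 1)."""
    m = len(A)
    x = [1.0 / m] * m
    lam = 0.0
    for _ in range(steps):
        y = [sum(A[i][j] * x[j] for j in range(m)) for i in range(m)]
        lam = sum(y)              # x sums to one and has non-negative entries
        x = [v / lam for v in y]
    return lam, x


f = [0, 0.2, 0.8, 0.6, 0]
tau = [0.95, 0.8, 0.7, 0.6]
lam, x = dominant(leslie(f, tau))
percent = [round(100 * v) for v in x]
print(lam, percent)

# Survival rate tau_2 for which q(1) = 1, i.e. the dominant eigenvalue is one.
tau2_stationary = (1 - f[0] - f[1] * tau[0]) / (
    tau[0] * (f[2] + f[3] * tau[2] + f[4] * tau[2] * tau[3])
)
lam_stationary, _ = dominant(leslie(f, [tau[0], tau2_stationary, tau[2], tau[3]]))
print(tau2_stationary, lam_stationary)
```

The value of `tau2_stationary` comes out close to 0.7, that is, about ten percentage points below the original 0.8.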
3.3.3. Matrices with non-negative elements. Real matrices which have no negative elements have very special properties. They are very often present in practical models. Thus we introduce the Perron-Frobenius theory which deals with such matrices. Actually, we show some results of Perron; we omit the more general situations due to Frobenius.²
We begin with some definitions in order to formulate our ideas.
Positive and primitive matrices
Definition. A positive matrix means a square matrix $A$ all of whose elements $a_{ij}$ are real and strictly positive. A primitive matrix is a square matrix $A$ whose power $A^k$ is positive for some positive $k \in \mathbb{N}$.
Recall that the spectral radius of a matrix $A$ is the maximum of the absolute values of all (complex) eigenvalues of $A$. The spectral radius of a linear mapping on a (finite dimensional) vector space coincides with the spectral radius of the corresponding matrix for some basis.
In the sequel, the norm of a real square matrix $A$ of order $n$ or of a vector $x \in \mathbb{R}^n$ will mean the sum of the absolute values of all elements. For a vector $x$ we write $|x|$ for its norm.
The following result is very useful and hopefully understandable. But the difficulty of its proof is rather untypical for this textbook. If you prefer, just read the theorem and skip the proof until later.
Perron Theorem
Theorem. If $A$ is a primitive matrix with spectral radius $\lambda \in \mathbb{R}$, then $\lambda$ is a root of the characteristic polynomial of $A$ with multiplicity one and $\lambda$ is strictly greater than the absolute value of all other eigenvalues of $A$. Furthermore, there exists an eigenvector $x$ associated with $\lambda$ such that all elements $x_i$ of $x$ are positive.
Proof. We shall present rather a sketch of the proof and we shall rely on intuition from elementary geometry.
² Oskar Perron and Ferdinand Georg Frobenius were two great German mathematicians at the turn of the 19th and 20th centuries. Even in this textbook we shall meet their names in Analysis, Number Theory, Algebra. Look up the index.
$D^4 = D \cdot D^3, \dots, D^n = D \cdot D^{n-1}$; thus the matrix $D$ is not primitive. The matrix $E$ is a permutation matrix (in every row and every column there is exactly one non-zero element, 1). It is not difficult to see that a power of a permutation matrix is again a permutation matrix. Thus the matrix $E$ is also not primitive. This is easily verified by calculating the powers $E^2$, $E^3$, $E^4$. The matrix $E^4$ is a unit matrix. □
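Primitivity of a non-negative matrix depends only on the pattern of its zero and non-zero entries, so it can be tested mechanically. The sketch below checks the powers up to the exponent $(n-1)^2 + 1$, which suffices by Wielandt's bound for primitive matrices of order $n$ (a fact used here that is not stated in the text). The matrices $A$, $B$ and $D$ are entered as read from this exercise and should be treated as assumptions of the example.

```python
from fractions import Fraction as Fr


def is_primitive(M):
    """True if some power of the non-negative square matrix M is positive."""
    n = len(M)
    pattern = [[M[i][j] > 0 for j in range(n)] for i in range(n)]
    power = pattern
    # Wielandt's bound: a primitive matrix of order n satisfies M^((n-1)^2+1) > 0.
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = [
            [any(power[i][k] and pattern[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return False


A = [[0, Fr(1, 7)], [1, Fr(6, 7)]]
B = [[Fr(1, 2), 0, Fr(1, 3)], [0, 1, Fr(1, 2)], [Fr(1, 2), 0, Fr(1, 6)]]
D = [
    [Fr(1, 3), Fr(1, 2), 0, 0],
    [Fr(1, 2), Fr(1, 3), 0, 0],
    [0, Fr(1, 6), Fr(1, 6), Fr(1, 3)],
    [Fr(1, 6), 0, Fr(5, 6), Fr(2, 3)],
]

print(is_primitive(A), is_primitive(B), is_primitive(D))
```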
D. Markov processes
3.D.1. Sweet-toothed gambler. A gambler bets on a coin: whether a flip results in a head or in a tail. At the start of the game he has three sweets. On every flip, he bets one sweet. If he wins, he gains one additional sweet. If he loses, he loses the sweet. The game ends when he loses all sweets or has at least five sweets. What is the probability that the game does not end after four bets?
Solution. Before the j-th round we can describe the state of the player by the random vector
\[ X_j = (p_0(j), p_1(j), p_2(j), p_3(j), p_4(j), p_5(j))^T, \]
where $p_i(j)$ is the probability that the player has $i$ sweets. If the player has $i$ sweets before the j-th bet ($i = 1, 2, 3, 4$), then after the bet he has $(i - 1)$ sweets with probability 1/2, and he has $(i + 1)$ sweets with probability 1/2. If he attains five sweets or loses them all, the number of sweets does not change. The vector $X_{j+1}$ is then obtained from the vector $X_j$ by multiplying it with the matrix
\[ A = \begin{pmatrix} 1 & 0.5 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0.5 & 1 \end{pmatrix}. \]
At the start,
\[ X_1 = (0, 0, 0, 1, 0, 0)^T. \]
After four bets the situation is described by the vector
\[ X_5 = A^4 X_1 = \left(\tfrac{1}{8}, \tfrac{3}{16}, 0, \tfrac{5}{16}, 0, \tfrac{3}{8}\right)^T, \]
Notice that the matrices $A$ and $A^k$ share the eigenvectors, while the corresponding eigenvalues are $\lambda$ and $\lambda^k$ respectively. Thus the assertion of the theorem holds if and only if the same is true for $A^k$. In particular, we may assume the matrix $A$ itself is positive, without any loss of generality. Many of the necessary concepts and properties will be discussed in chapter four and in the subsequent chapters devoted to analytical aspects, so the reader might come back to this proof later. The first step is to show the existence of an eigenvector which has all elements positive. Consider the standard simplex
\[ S = \{x = (x_1, \dots, x_n)^T,\ |x| = 1,\ x_i \ge 0,\ i = 1, \dots, n\}. \]
Since all elements in the matrix $A$ are positive, the image $A \cdot x$ for $x \in S$ has all coordinates positive too. The mapping
\[ x \mapsto |A \cdot x|^{-1}(A \cdot x) \]
thus maps $S$ to itself. This mapping $S \to S$ satisfies all the assumptions of the Brouwer fixed point theorem³ and thus there exists a vector $y \in S$ such that it is mapped by this mapping to itself. That means that
\[ A \cdot y = \lambda y, \qquad \lambda = |A \cdot y|, \]
and we have found an eigenvector that lies in $S$. By assumption, $A \cdot y$ has all coordinates positive, thus $y$ must have the same property. Moreover, $\lambda > 0$.
In order to prove the rest of the theorem, we consider the mapping given by the matrix $A$ in a more suitable basis, where the coordinates of the eigenvector would be $(\lambda, \dots, \lambda)$. Moreover, we multiply the mapping by the constant $\lambda^{-1}$. Thus we work with the matrix $B$,
\[ B = \lambda^{-1}(Y^{-1} \cdot A \cdot Y), \]
where $Y$ is the diagonal matrix with the coordinates of the above eigenvector $y$ on its diagonal. Evidently $B$ is also a positive matrix. By the construction, the vector $z = (1, \dots, 1)^T$ is its eigenvector with eigenvalue 1, because $Y \cdot z = y$.
It remains to prove that $\mu = 1$ is a simple root of the characteristic polynomial of the matrix $B$ and that all other roots have absolute value strictly smaller than one. Then the proof of the Perron theorem is finished.
In order to do that we use an auxiliary lemma. Consider for the moment the matrix $B$ to define the linear mapping that maps the row vectors
\[ u = (u_1, \dots, u_n) \mapsto u \cdot B = v, \]
that is, using multiplication from the right (i.e. $B$ is viewed as the matrix of a linear map on one-forms). Since $z = (1, \dots, 1)^T$ is an eigenvector of the matrix $B$, the sum of the coordinates of the row vector $v$ is
\[ \sum_{j=1}^{n} v_j = \sum_{i,j=1}^{n} u_i b_{ij} = \sum_{i=1}^{n} u_i = 1 \]
This theorem is a great example of a blend of (homological) Algebra, (differential) Topology and Analysis. We shall discuss it in Chapter 9, cf. ?? on page ??.
that is, the probability that the game ends in the fourth bet or sooner is one half.
Note that the matrix $A$ describing the evolution of the probabilistic vector $X$ is itself probabilistic, that is, in each column the sum is one. But it does not have the property required by the Perron-Frobenius theorem. By a simple computation you can check (or you can see it straight without any computation) that there exist two linearly independent eigenvectors corresponding to the eigenvalue 1. These correspond to the case that the player has no sweet, that is $x = (1, 0, 0, 0, 0, 0)^T$, or to the case when the player has 5 sweets and the game thus ends with him keeping all the sweets, that is, $x = (0, 0, 0, 0, 0, 1)^T$. All other eigenvalues (approximately 0.8, 0.3, −0.8, −0.3) are in absolute value strictly smaller than one. Thus the components in the corresponding eigensubspaces vanish under iteration of the process with an arbitrary initial distribution, and the process approaches the limiting value of the probabilistic vector of the form $(a, 0, 0, 0, 0, 1 - a)$, where the value $a$ depends on the initial number of sweets. In our case it is $a = 0.4$; if there were 4 sweets at the start, it would be $a = 0.2$, and so on. □
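The evolution of the gambler's probabilities can be followed exactly with rational arithmetic. This is a minimal sketch; the helper names `step` and `distribution` are illustrative, and the transition matrix is the one from the solution, acting on column vectors.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column j holds the transition probabilities from the state "j sweets".
A = [
    [1, half, 0, 0, 0, 0],
    [0, 0, half, 0, 0, 0],
    [0, half, 0, half, 0, 0],
    [0, 0, half, 0, half, 0],
    [0, 0, 0, half, 0, 0],
    [0, 0, 0, 0, half, 1],
]


def step(A, x):
    """One bet: multiply the probability vector x by the matrix A."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]


def distribution(start, bets):
    """Probability vector after the given number of bets, starting with `start` sweets."""
    x = [Fraction(int(i == start)) for i in range(6)]
    for _ in range(bets):
        x = step(A, x)
    return x


x5 = distribution(3, 4)
still_playing = sum(x5[1:5])      # probability of having 1 to 4 sweets
print(x5, still_playing)

# After many bets the probability of losing everything approaches the limit a.
print(float(distribution(3, 200)[0]), float(distribution(4, 200)[0]))
```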
3.D.2. Car rental. A company that rents cars every week has two branches - one in Prague and one in Brno. A car rented in Brno can be returned in Prague and vice versa. After some time it has been discovered that in Prague, roughly 80 % of the cars rented in Prague and 90 % of the cars rented in Brno are returned there. How to distribute the cars among the branches such that in both there is at the start of the week always the same number of cars as in the week before? How will the situation look like after a long time, if the cars are distributed at the start in a random way?
Solution. Denote the components of the vector in question, that is, the initial number of cars in Brno and in Prague, by $x_B$ and $x_P$ respectively. The distribution of the cars between the branches is then described by the vector $x = \begin{pmatrix} x_B \\ x_P \end{pmatrix}$. If we consider such a multiple of the vector $x$ that the sum of its components is 1, then its components give the percentage distribution of the cars. According to the statement, the state at the end of the week is described by the vector
\[ \begin{pmatrix} 0.1 & 0.2 \\ 0.9 & 0.8 \end{pmatrix} \begin{pmatrix} x_B \\ x_P \end{pmatrix}. \]
The matrix $A = \begin{pmatrix} 0.1 & 0.2 \\ 0.9 & 0.8 \end{pmatrix}$ thus describes our (linear) system of car rental. If at the end of the week there should be the same number of cars in the branches as at the beginning, we are looking for such a vector $x$ for
whenever u e S. Therefore the simplex S maps onto itself and thus has in S a (row) eigenvector w with eigenvalue one (a fixed point, by the Brouwer theorem again). Because some power B is positive by our assumption, the image of the simplex S under B lies inside of S.
We continue with the row vectors. Denote by P the shift of the simplex S into the origin by the eigenvector w we have just found. That is, P = — w+S. Evidently P is a set containing the origin and is denned by linear inequalities. Moreover, the vector subspace VcR" generated by P is invariant with respect to the action of the matrix B through multiplication of the row vectors from the right. Restriction of our mapping to P, and P itself satisfy the assumptions of the auxiliary lemma proved below and thus all its eigenvalues are strictly smaller than one.
Now, the entire space decomposes as the sum R" = V ffi span{w} of invariant subspaces, w is the eigenvector with eigenvalue 1, while all eigenvalues of the restriction to V are strictly smaller in absolute value.
The theorem is nearly proved. We have just to consider the problem that the mapping under question was given by multiplication of the row vectors from the right with the matrix B, while originally we were interested in the mapping given by the matrix B and multiplication of the column vectors were from the left. But this is equivalent to the multiplication of the transposed column vectors with the transposed matrix B in the usual way - from the left. Thus we have proven the claim about eigenvalues for the transpose of B. But transposing does not change the eigenvalues and so the proof is complete. □
A bounded polyhedron in $\mathbb{R}^n$ is a nonempty subset defined by linear inequalities and contained in some sufficiently large ball. The simplex S from the proof and any of its translations are examples.
Lemma. Consider any bounded polyhedron $P \subset \mathbb{R}^n$ containing a ball around the origin $0 \in \mathbb{R}^n$. If some iteration of the linear mapping $\psi : \mathbb{R}^n \to \mathbb{R}^n$ maps P into its interior (that is, $\psi^k(P) \subset P$ and the image does not intersect the boundary), then the spectral radius of the mapping $\psi$ is strictly less than one.
Proof. Consider the matrix A of the mapping $\psi$ in the standard basis. Because the eigenvalues of $A^k$ are the k-th powers of the eigenvalues of the matrix A, we may assume (without loss of generality) that the mapping $\psi$ already maps P into P. Clearly $\psi$ cannot have any eigenvalue with absolute value greater than one.
We argue by contradiction and assume that there exists an eigenvalue $\lambda$ with $|\lambda| = 1$. Then there are two possibilities: either $\lambda^k = 1$ for a suitable k, or there is no such k.
The image of P is a closed set (that means that if the points in the image $\psi(P)$ get arbitrarily close to some point y in $\mathbb{R}^n$, then the point y is also in the image; this is a general feature of linear maps on finite dimensional vector spaces). By our assumption, the boundary of P does not intersect the image. Thus $\psi$ cannot have a fixed point on the boundary
which $Ax = x$. That means that we are looking for an eigenvector of the matrix A associated with the eigenvalue 1.
The characteristic polynomial of the matrix A is $(0.1 - \lambda)(0.8 - \lambda) - 0.9 \cdot 0.2 = (\lambda - 1)(\lambda + 0.1)$, and 1 is indeed an eigenvalue of the matrix A. The corresponding eigenvector satisfies the equation
$$\begin{pmatrix} -0.9 & 0.2 \\ 0.9 & -0.2 \end{pmatrix} \begin{pmatrix} x_B \\ x_P \end{pmatrix} = 0.$$
It is thus a multiple of the vector $\begin{pmatrix} 0.2 \\ 0.9 \end{pmatrix}$. To determine the percentage distribution we look for the multiple such that $x_B + x_P = 1$. That is satisfied by the vector
$$\frac{1}{1.1} \begin{pmatrix} 0.2 \\ 0.9 \end{pmatrix} \doteq \begin{pmatrix} 0.18 \\ 0.82 \end{pmatrix}.$$
The suitable distribution of the cars between Prague and Brno is such that 18% of the cars are in Brno and 82% of the cars are in Prague.
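If we choose an arbitrary initial state $x = \begin{pmatrix} x_B \\ x_P \end{pmatrix}$, then the state after n weeks is described by the vector $x_n = A^n x$. It is useful to express the initial vector x in the basis of the eigenvectors of A. The eigenvector for the eigenvalue 1 has already been found. Similarly we find eigenvectors for the eigenvalue $-0.1$, for instance the vector $\begin{pmatrix} -1 \\ 1 \end{pmatrix}$.
The initial vector can be expressed as a linear combination $x = a \begin{pmatrix} 0.18 \\ 0.82 \end{pmatrix} + b \begin{pmatrix} -1 \\ 1 \end{pmatrix}$. The state after n weeks is then
$$x_n = A^n \left( a \begin{pmatrix} 0.18 \\ 0.82 \end{pmatrix} + b \begin{pmatrix} -1 \\ 1 \end{pmatrix} \right) = a \begin{pmatrix} 0.18 \\ 0.82 \end{pmatrix} + b\,(-0.1)^n \begin{pmatrix} -1 \\ 1 \end{pmatrix}.$$
The second summand approaches zero for $n \to \infty$. Thus the state stabilises at $a \begin{pmatrix} 0.18 \\ 0.82 \end{pmatrix}$, that is, at the component of the initial vector in the direction of the first eigenvector. The coefficient a can be easily expressed using the initial numbers of cars: the components of the first eigenvector sum to one and those of the second sum to zero, hence $a = x_B + x_P$. □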
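The convergence described in this solution can be checked numerically. The following is only a minimal sketch in exact rational arithmetic; the initial fleet of 70 cars in Brno and 40 in Prague is a hypothetical choice made for illustration, and the helper names are not taken from the text.

```python
from fractions import Fraction as F

# Weekly transition matrix of the car rental from the solution above.
A = [[F(1, 10), F(2, 10)],
     [F(9, 10), F(8, 10)]]

def mat_vec(matrix, vec):
    """Multiply a matrix by a column vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def state_after(weeks, x0):
    """Return A^weeks * x0 by repeated multiplication."""
    x = list(x0)
    for _ in range(weeks):
        x = mat_vec(A, x)
    return x

x0 = [F(70), F(40)]          # hypothetical initial state (Brno, Prague)
a = sum(x0)                  # coefficient a = x_B + x_P
limit = [a * F(2, 11), a * F(9, 11)]   # exact form of a * (0.18, 0.82)

x10 = state_after(10, x0)
gap = max(abs(p - q) for p, q in zip(x10, limit))
print([float(c) for c in x10], float(gap))
```

The exact eigenvector (2/11, 9/11) is used in place of the rounded values 0.18 and 0.82, so the limit state is reproduced without rounding error; the gap shrinks by a factor of ten every week, in line with the eigenvalue −0.1.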
3.D.3. In a certain game you can choose one of two opponents. The probability that you beat the better one is 1/4, while the probability that you beat the worse one is 1/2. But the opponents cannot be distinguished, thus you do not know which one is the better one. You await a large number of games. For each of them you can choose a different opponent. Consider the following two strategies:
1. For the first game choose the opponent randomly. If you win a game, carry on with the same opponent; if you lose a game, change the opponent.
and there cannot even be any point on the boundary to which some sequence of points in the image would converge.
The first argument excludes that some power of $\lambda$ is one, because a fixed point of $\psi^k$ on the boundary of P would then exist and thus it would be in the image. In the remaining case there would be a two-dimensional subspace $W \subset \mathbb{R}^n$ on which the restriction of $\psi$ acts as a rotation by an irrational angle, and thus there exists a point y in the intersection of W with the boundary of P. But then the point y could be approached arbitrarily closely by the points from the set $\psi^k(y)$ (through all iterations) and thus would have to be in the image too. This leads to a contradiction and thus the lemma is proved. □
3.3.4. Simple corollaries. Once we know the Perron theorem, the following very useful claim has a surprisingly simple proof. It shows how strong the primitivity assumption on a matrix is.
Corollary. If $A = (a_{ij})$ is a primitive matrix and $x \in \mathbb{R}^n$ is its eigenvector with all coordinates non-negative and eigenvalue $\lambda$, then $\lambda > 0$ is the spectral radius of A. Moreover,
$$\min_{j \in \{1,\dots,n\}} \sum_{i=1}^n a_{ij} \le \lambda \le \max_{j \in \{1,\dots,n\}} \sum_{i=1}^n a_{ij}.$$
Proof. Because A is primitive, we can choose k such that $A^k$ has only positive elements. Then $A^k \cdot x = \lambda^k x$ is a vector with all coordinates strictly positive. Obviously $\lambda > 0$.
According to the Perron theorem, the spectral radius $\mu$ of A is an eigenvalue and the associated eigenvectors y have positive coordinates only. Thus we may choose such an eigenvector with the property that the difference $x - y$ has only strictly positive coordinates. Then for all (positive integer) powers m we have
$$0 < A^m \cdot (x - y) = \lambda^m x - \mu^m y,$$
but also $\lambda \le \mu$. If $\mu = \lambda + \alpha$, $\alpha > 0$, then
$$0 < \lambda^m x - (\lambda + \alpha)^m y \le \lambda^m \left( x - y - m \frac{\alpha}{\lambda} y \right),$$
which is clearly negative for m large enough. Hence $\lambda = \mu$.
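It remains to estimate the spectral radius using the minimum and maximum of the sums of the individual columns of the matrix. We denote them by $b_{\min}$ and $b_{\max}$. Choose x to be the eigenvector with the sum of coordinates equal to one and compute:
$$\sum_{i,j=1}^n a_{ij} x_j = \sum_{i=1}^n \lambda x_i = \lambda,$$
$$\lambda = \sum_{j=1}^n \Big( \sum_{i=1}^n a_{ij} \Big) x_j \le \sum_{j=1}^n b_{\max} x_j = b_{\max},$$
$$\lambda = \sum_{j=1}^n \Big( \sum_{i=1}^n a_{ij} \Big) x_j \ge \sum_{j=1}^n b_{\min} x_j = b_{\min}.$$
□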
2. For the first two games, choose an opponent randomly. Then for the next two games, if you lost both the previous games, change the opponent, otherwise stay with the same opponent.
Which of the two strategies is better?
Solution. Both strategies define a Markov chain. For simplicity denote the worse opponent by A and the better opponent by B. In the first case, for the states "game with A" and "game with B" (in this order), we obtain the probabilistic transition matrix
$$\begin{pmatrix} 1/2 & 3/4 \\ 1/2 & 1/4 \end{pmatrix}.$$
This matrix has all of its elements positive. Thus it suffices to find the probabilistic vector $x_\infty$ which is associated with the eigenvalue 1. We compute
$$x_\infty = \left( \frac{3}{5}, \frac{2}{5} \right)^T.$$
Its components correspond to the probabilities that after a long sequence of games the opponent is the player A or the player B. Thus we can expect that 60 % of the games will be played against the worse of the two opponents. Because
$$\frac{2}{5} = \frac{3}{5} \cdot \frac{1}{2} + \frac{2}{5} \cdot \frac{1}{4},$$
roughly 40 % of the games will be won.
For the second strategy, use the states "two games in a row with A" and "two games in a row with B", which lead to the probabilistic transition matrix
$$\begin{pmatrix} 3/4 & 9/16 \\ 1/4 & 7/16 \end{pmatrix}.$$
It is easily determined that now
$$x_\infty = \left( \frac{9}{13}, \frac{4}{13} \right)^T.$$
Against the worse opponent one would then play (9/4)-times more frequently than against the better one. Recall that for the first strategy it is (3/2)-times more frequently. The second strategy is thus better. Note also that for the second strategy, roughly 42.3 % of the games are won. It suffices to compute
$$0.423 \doteq \frac{11}{26} = \frac{9}{13} \cdot \frac{1}{2} + \frac{4}{13} \cdot \frac{1}{4}.$$
□
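The comparison of the two strategies can be reproduced in a few lines. The sketch below assumes only the two transition matrices and the winning probabilities 1/2 and 1/4 from the exercise; the closed formula for the stationary vector of a 2×2 column-stochastic matrix and all function names are illustrative choices, not part of the text.

```python
from fractions import Fraction as F

def stationary_2x2(T):
    """Stochastic eigenvector for eigenvalue 1 of a 2x2 column-stochastic matrix.

    For T = [[1-p, q], [p, 1-q]] the fixed vector is (q, p) / (p + q).
    """
    p, q = T[1][0], T[0][1]
    return [q / (p + q), p / (p + q)]

def win_rate(x_inf):
    """Long-run share of won games: 1/2 against A (worse), 1/4 against B (better)."""
    return x_inf[0] * F(1, 2) + x_inf[1] * F(1, 4)

T1 = [[F(1, 2), F(3, 4)], [F(1, 2), F(1, 4)]]     # first strategy
T2 = [[F(3, 4), F(9, 16)], [F(1, 4), F(7, 16)]]   # second strategy

x1, x2 = stationary_2x2(T1), stationary_2x2(T2)
print(x1, win_rate(x1))   # distribution and win rate, strategy 1
print(x2, win_rate(x2))   # distribution and win rate, strategy 2
```

Working with `Fraction` keeps the results exact, so the fractions 2/5 and 11/26 from the solution appear directly instead of rounded decimals.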
Note that, for instance, all Leslie matrices from 3.3.2 are primitive as soon as all their parameters $f_i$ and $\tau_j$ are strictly positive. Thus we can apply the results just derived to them. (Compare this with the ad hoc analysis of the roots of the characteristic polynomial from 3.3.2.)
3.3.5. Markov chains. A very frequent and interesting case of linear processes with only non-negative elements in a matrix is a mathematical model of a system which can be in one of m states with various probabilities. At a given point of time the system is in state i with probability $x_i$. The transition from the state j to the state i happens with probability $t_{ij}$.
We can write the process as follows: at time n the system is described by the stochastic vector (we also say probability vector) $x_n = (u_1(n), \dots, u_m(n))^T$.
This means that all components of the vector $x_n$ are real non-negative numbers and their sum equals one. The components give the probability distribution over the possible states of the system. The distribution of the probabilities at time n + 1 is given via multiplication by the transition matrix $T = (t_{ij})$, that is,
$$x_{n+1} = T \cdot x_n.$$
Since we assume that the vector x captures all possible states of the system, and the system moves again to some of these states with total probability one, all columns of T are also stochastic vectors. We call such matrices stochastic matrices. Note that every stochastic matrix maps every stochastic vector x to a stochastic vector Tx again:
$$\sum_{i,j} t_{ij} x_j = \sum_j \Big( \sum_i t_{ij} \Big) x_j = \sum_j x_j = 1.$$
Such a sequence $x_{n+1} = T x_n$ is called a (discrete) Markov process and the resulting sequence of vectors $x_0, x_1, \dots$ is called a Markov chain $x_n$.
Now we can exploit the Perron-Frobenius theory in its full power. Because the sum of the rows of the matrix is always equal to the vector (1,..., 1), we see that the matrix T — E is singular and thus one is an eigenvalue of the matrix T. Furthermore, if T is a primitive matrix (for instance, when all elements are non-zero), we know from the corollary 3.3.4 that one is a simple root of the characteristic polynomial and all others have absolute value strictly smaller than one. This leads to:
Ergodic Theorem
Theorem. Markov processes with primitive matrices T satisfy:
• there exists a unique eigenvector $x_\infty$ of the matrix T with the eigenvalue 1, which is stochastic,
• the iterations $T^k x_0$ approach the vector $x_\infty$ for any initial stochastic vector $x_0$.
3.D.4. Absent-minded professor. Consider the following situation. An absent-minded professor carries an umbrella with him, but with probability 1/2 he leaves it behind at the place he is departing from.
In the morning, he leaves home to go to his office. From his office, he goes for lunch at a restaurant, and then goes back to his office. After he is finished with his work at the office, he leaves for home. Suppose (for simplicity) that he does not go anywhere else. Suppose also that if he leaves it in the restaurant, it will remain there until the next time. Consider this situation as a Markov process and write down its matrix. What is the probability that after many days in the morning the umbrella is located in the restaurant? It is convenient to choose one day as a time unit: from morning to morning.
Solution.
$$A = \begin{pmatrix} 11/16 & 3/8 & 1/4 \\ 3/16 & 3/8 & 1/4 \\ 1/8 & 1/4 & 1/2 \end{pmatrix}$$
Compute, for instance, the element $a_{11}$, that is, the probability that the umbrella starts its day at home and stays there, that is, it will be there the next morning. There are three distinct possibilities for the umbrella:
D: the professor forgets it when leaving home in the morning: $p_1 = \frac{1}{2}$.
DPD: the professor takes it to the office, then he forgets to take it on to lunch and in the evening he takes it home: $p_2 = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}$.
DPRPD: the professor takes the umbrella with him all the time and does not forget it anywhere: $p_3 = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{16}$.
In total $a_{11} = p_1 + p_2 + p_3 = \frac{11}{16}$.
The eigenvector of this matrix corresponding to the dominant eigenvalue 1 is (2,1,1), and thus the desired probability
is 1/(2 + 1 + 1) = 1/4. □
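The whole matrix can be generated the same way the single element was computed above, by following the umbrella through the four legs of the day. The sketch below rests on a modelling assumption read off from the enumeration in the solution: on each leg the umbrella moves only if it is at the place the professor is leaving, and then with probability 1/2. The state order (home, office, restaurant) and all names are illustrative.

```python
from fractions import Fraction as F

STATES = ["home", "office", "restaurant"]
# The professor's legs during one day, in order.
LEGS = [("home", "office"), ("office", "restaurant"),
        ("restaurant", "office"), ("office", "home")]

def leg_step(dist, src, dst):
    """Move half of the probability mass at src to dst (umbrella taken along)."""
    new = dict(dist)
    moved = dist[src] * F(1, 2)
    new[src] = dist[src] - moved
    new[dst] = dist[dst] + moved
    return new

def day_column(start):
    """Distribution of the umbrella next morning if it starts the day at `start`."""
    dist = {s: F(0) for s in STATES}
    dist[start] = F(1)
    for src, dst in LEGS:
        dist = leg_step(dist, src, dst)
    return [dist[s] for s in STATES]

columns = [day_column(s) for s in STATES]
A = [[columns[j][i] for j in range(3)] for i in range(3)]   # column-stochastic

v = [F(2), F(1), F(1)]                                      # claimed eigenvector
Av = [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]
print(A, Av, v[2] / sum(v))
```

Under this assumption the computed columns agree with the matrix of the solution, and the vector (2, 1, 1) is fixed by it, which gives the probability 1/4 for the restaurant.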
3.D.5. Algorithm for determining the importance of pages. Internet search engines can find (almost) all pages containing a given word or phrase on the Internet. But how can the found pages be sorted according to their relevance? One of the possibilities is the following algorithm: the collection of all found pages is considered to be a system, and each of the found pages is one of its states. We describe a random walk on these pages as a Markov process. The probabilities of transitions between pages are given by the hyperlinks: each link, say from page
Proof. The first claim follows directly from the positivity of the coordinates of the eigenvector derived in the Perron theorem.
Next, assume that the algebraic and geometric multiplicities of the eigenvalues of the matrix T are the same. Then every stochastic vector $x_0$ can be written (in the complex extension $\mathbb{C}^n$) as a linear combination
$$x_0 = c_1 x_\infty + c_2 y_2 + \dots + c_n y_n,$$
where $y_2, \dots, y_n$ extend $x_\infty$ to a basis of eigenvectors. But then the k-th iteration gives again a stochastic vector
$$x_k = T^k \cdot x_0 = c_1 x_\infty + \lambda_2^k c_2 y_2 + \dots + \lambda_n^k c_n y_n.$$
Now all eigenvalues $\lambda_2, \dots, \lambda_n$ are in absolute value strictly smaller than one. So all components of the vector $x_k$ but the first one approach (in norm) zero. But $x_k$ is still stochastic, thus the only possibility is that $c_1 = 1$, and the second claim is proved.
In fact, even if the algebraic and geometric multiplicities of eigenvalues do not coincide, we reach the same conclusion using a more detailed study of the root subspaces of the matrix T. (We meet them when discussing the Jordan matrix decomposition later in this chapter.) Consequently, even in the general case the eigensubspace $\operatorname{span}\{x_\infty\}$ comes with a unique invariant (n − 1)-dimensional complement, on which all eigenvalues are in absolute value smaller than one, and the corresponding components in $x_k$ approach zero as before. See the note 3.4.11 where we finish this argument in detail. □
3.3.6. Iteration of the stochastic matrices. We reformulate the previous theorem into a simple, but surprising result. By convergence to a limit matrix in the following theorem we mean the following: for any bound $\varepsilon > 0$ on the admissible error, we can find a number of iterations k after which all the components of the matrix differ from those of the limit matrix by less than $\varepsilon$.
Corollary. Let T be a primitive stochastic matrix from a Markov process and let $x_\infty$ be the stochastic eigenvector for the dominant eigenvalue 1 (as in the Ergodic Theorem above). Then the iterations $T^k$ converge to the limit matrix $T_\infty$, all of whose columns are equal to $x_\infty$.
Proof. Columns in the matrix $T^k$ are images of the vectors of the standard basis under the corresponding iterated linear mapping. But these are images of stochastic vectors and thus they all converge to $x_\infty$. □
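The corollary is easy to observe on a small example. The following sketch takes, purely for illustration, the 2×2 primitive stochastic matrix with columns (1/2, 1/2) and (3/4, 1/4), whose stochastic eigenvector is (3/5, 2/5), and measures how far the powers $T^k$ are from the limit matrix. The function names are not from the text.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power(T, k):
    """k-th power of T by repeated multiplication (k >= 0)."""
    n = len(T)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, T)
    return result

T = [[F(1, 2), F(3, 4)],
     [F(1, 2), F(1, 4)]]            # illustrative primitive stochastic matrix
x_inf = [F(3, 5), F(2, 5)]          # its stochastic eigenvector for eigenvalue 1

def max_error(k):
    """Largest entrywise distance between T^k and the matrix with columns x_inf."""
    Tk = power(T, k)
    return max(abs(Tk[i][j] - x_inf[i]) for i in range(2) for j in range(2))

for k in (1, 2, 5, 10, 20):
    print(k, float(max_error(k)))
```

The error decreases geometrically, here by a factor of four per step, because the second eigenvalue of this particular matrix is −1/4.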
Before leaving the Markov processes, we think about their more general versions with matrices which are not primitive. Here we would need the full Frobenius-Perron theory. Without going into technicalities, consider a process with a block-wise diagonal or an upper triangular matrix T,
$$T = \begin{pmatrix} P & R \\ 0 & Q \end{pmatrix},$$
A to page B, determines the probability 1/(total number of links from the page A) with which the process moves from page A to page B. If some page has no outgoing links, we treat it as a page from which a link leads to every other page. This gives a probabilistic matrix M (the element $m_{ij}$ is the probability with which we move from the j-th page to the i-th page, so that every column of M sums to one). Thus if one randomly clicks on links in the found pages (and from a linkless page one just chooses the next one randomly), the probability that after a sufficiently long time one is located on the i-th page corresponds to the i-th component of the eigenvector of the matrix M for the eigenvalue 1, normalized so that its components sum to one. The sizes of these probabilities define the importance of the individual pages.
This algorithm can be modified by assuming that the user stops following links after a certain time and starts again on a random page. Suppose that with probability d he chooses a new page randomly, and with probability (1 − d) keeps on clicking. In such a situation the probability of transition between any two pages $S_i$ and $S_j$ is non-zero: with n denoting the number of pages, it is d/n + (1 − d)/(total number of links at the page $S_i$) if there is a link from $S_i$ to $S_j$, and d/n otherwise (if there are no links at $S_i$, then it is 1/n). According to the Perron-Frobenius theorem the eigenvalue 1 is dominant and has multiplicity one, and thus the corresponding eigenvector is unique up to a multiple (if we chose the transition probabilities only as described in the previous paragraph, this would not have to be so).
For an illustration, consider pages A, B, C and D. The links lead from A to B and to C, from B to C and from C to A, from D nowhere. Suppose that the probability that the user chooses a random new page is 1/5. Then the matrix M looks as follows:
$$M = \begin{pmatrix} 1/20 & 1/20 & 17/20 & 1/4 \\ 9/20 & 1/20 & 1/20 & 1/4 \\ 9/20 & 17/20 & 1/20 & 1/4 \\ 1/20 & 1/20 & 1/20 & 1/4 \end{pmatrix}$$
and imagine first that P, Q are primitive and R = 0. Here we can again apply the above results block-wise. In words, if we start in a state $x_0$ with all probability concentrated in the first four coordinates, the process converges to the value $x_\infty$ which again has all the probability distributed among the first block of coordinates, and the same holds for the other block.
If R > 0 then we can always jump to the states corresponding to the first block from those in the second block with a non-zero probability and the iterations get more complicated:
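$$T^2 = \begin{pmatrix} P^2 & P \cdot R + R \cdot Q \\ 0 & Q^2 \end{pmatrix}, \qquad T^3 = \begin{pmatrix} P^3 & P^2 \cdot R + P \cdot R \cdot Q + R \cdot Q^2 \\ 0 & Q^3 \end{pmatrix}.$$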
An interesting special case is when P = E and R is positive. Then Q − E must be a regular matrix and a simple computation yields the general iteration (notice that E and Q commute and thus $(E - Q)(E + Q + \dots + Q^{k-1}) = E - Q^k$)
$$T^k = \begin{pmatrix} E & R (E - Q)^{-1} (E - Q^k) \\ 0 & Q^k \end{pmatrix}.$$
Thus, the entire first block of states is formed by eigenvectors with eigenvalue 1 (so these states stay constant with probability 1), while the behavior on the other block is more complicated.
4. More matrix calculus
We have seen that understanding the inner structure of matrices is a strong tool for both computation and analysis. It is even more true when considering numerical calculations with matrices. Therefore we return now to the abstract theory.
We introduce special types of linear mappings on vector spaces. We consider general linear mappings whose structure is understood in terms of the Jordan normal form (see 3.4.10). In all these cases, complex scalars are essential. So we extend our discussion of scalar product to complex vector spaces. Actually, in many areas the complex vector spaces are the essential platform necessary for introducing the mathematical models. For instance, this is the case in the so-called quantum computing, which became a very active area of theoretical computer science. Many people hope to construct an effective quantum computer soon.
The eigenvector corresponding to the eigenvalue 1 is (305/53,175/53,315/53,1), the importance of the pages is thus given according to the order of the sizes of the corresponding components, that is, C > A > B > D.
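The illustration with the pages A, B, C and D can be recomputed directly from the rule for the transition probabilities. The sketch below is one possible reading of that rule, with d = 1/5 and the link structure taken from the example; the dictionary layout and function names are assumptions made for this illustration only.

```python
from fractions import Fraction as F

pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
d = F(1, 5)   # probability of jumping to a random page

def transition_matrix(pages, links, d):
    """Column-stochastic matrix: entry [i][j] is the probability of moving j -> i."""
    n = len(pages)
    M = [[F(0)] * n for _ in range(n)]
    for j, src in enumerate(pages):
        out = links[src]
        for i, dst in enumerate(pages):
            if not out:                      # page without links: uniform choice
                M[i][j] = F(1, n)
            elif dst in out:
                M[i][j] = d / n + (1 - d) / len(out)
            else:
                M[i][j] = d / n
    return M

M = transition_matrix(pages, links, d)

# Eigenvector from the solution, scaled by 53 to clear denominators.
v = [F(305), F(175), F(315), F(53)]
Mv = [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

ranking = [p for _, p in sorted(zip(v, pages), reverse=True)]
print(Mv == v, ranking)
```

The check `Mv == v` confirms in exact arithmetic that the stated vector is fixed by M, and sorting its components reproduces the order of importance of the pages.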
Various other applications of Markov chains can be found in the additional exercises after this chapter, see 3.G.3.
3.4.1. Unitary spaces and mappings. The definitions of scalar product and orthogonality easily extend to the complex case. But we do not mean the complex bilinear symmetric forms $\alpha$, since there the quadratic expressions $\alpha(v, v)$ are not real in general and thus we would not get the right definition of length of vectors. Instead, we define:
E. Unitary spaces
In the previous chapter we defined the scalar product for real vector spaces (2.3.18). In this chapter we extend its definition to the complex spaces (3.4.1).
3.E.1. Groups O(n) and U(n). Consider all linear mappings from $\mathbb{R}^n$ to $\mathbb{R}^n$ which preserve the given scalar product, that is, with respect to the definitions of the lengths of vectors and of the angles between two vectors, all linear mappings that preserve lengths and angles. These mappings form a group (see 1.1.1) with respect to the operation of composition. The composition of two such mappings is, by definition, also a mapping that preserves lengths and angles, the unit element of the group is the identity mapping, and the inverse element for a given mapping is its inverse mapping. The inverse mapping exists by the condition on the preservation of lengths. The matrices of such mappings thus form a group with the operation of matrix multiplication; it is called the orthogonal group and is denoted by O(n). It is a subgroup of the group of all invertible mappings from $\mathbb{R}^n$ to $\mathbb{R}^n$.
Moreover, if we require that the matrices have determinant one, then we speak of the special orthogonal group SO(n). In general the determinant of a matrix in O(n) can be either 1 or −1. Similarly we define the unitary group U(n) as the group of all (complex) matrices that correspond to the complex linear mappings from $\mathbb{C}^n$ to $\mathbb{C}^n$ which preserve a given scalar product in a unitary space. Analogously, SU(n) denotes the subgroup of matrices in U(n) with determinant one. In general, the determinant of a matrix in U(n) can be any complex unit.
3.E.2. Consider the vector space V of functions $\mathbb{R} \to \mathbb{C}$. Determine whether the mapping $\varphi$ from the unitary space V is linear when:
i) $\varphi(u) = \lambda u$ where $\lambda \in \mathbb{C}$
ii) $\varphi(u) = u^*$
iii) $\varphi(u) = u^2 \ (= u \cdot u)$
iv) $\varphi(u) = \frac{du}{dx}$
For suitable functions V is a unitary space of infinite dimension. The scalar product is then defined by the relation
$$f \cdot g = \int_{-\infty}^{\infty} f(x) \overline{g(x)} \, dx.$$
3.E.3. Show that if H is a Hermitian matrix, then $U = \exp(iH) = \sum_{n=0}^{\infty} \frac{1}{n!} (iH)^n$ is a unitary matrix and compute its determinant.
Unitary spaces
Unitary space is a complex vector space V along with the mapping $V \times V \to \mathbb{C}$, $(u, v) \mapsto u \cdot v$, called scalar product and satisfying for all vectors $u, v, w \in V$ and scalars $a \in \mathbb{C}$ the following axioms:
(1) $u \cdot v = \overline{v \cdot u}$ (the bar stands for complex conjugation),
(2) $(au) \cdot v = a(u \cdot v)$,
(3) $(u + v) \cdot w = u \cdot w + v \cdot w$,
(4) if $u \neq 0$, then $u \cdot u > 0$ (notice $u \cdot u$ is always real).
The real number $\sqrt{v \cdot v}$ is called the norm of the vector v and a vector is normalized, if its norm equals one. Vectors u and v are said to be orthogonal if their scalar product is zero. A basis composed of mutually orthogonal and normalized vectors is called an orthonormal basis of V.
At first sight this is an extension of the definition of Euclidean vector spaces into the complex domain. We will continue to use the alternative notation $\langle u, v \rangle$ for the scalar product of vectors u and v. As in the real domain, we obtain immediately from the definition the following simple properties of the scalar product for all vectors in V and scalars in $\mathbb{C}$:
$$u \cdot u \in \mathbb{R},$$
$$u \cdot u = 0 \ \text{ if and only if } \ u = 0,$$
$$u \cdot (av) = \bar{a}(u \cdot v),$$
$$u \cdot (v + w) = u \cdot v + u \cdot w,$$
$$u \cdot 0 = 0 \cdot u = 0,$$
$$\Big( \sum_i a_i u_i \Big) \cdot \Big( \sum_j b_j v_j \Big) = \sum_{i,j} a_i \bar{b}_j (u_i \cdot v_j),$$
where the last equality holds for all finite linear combinations. It is a simple exercise to prove everything formally. For instance, the first property follows from (1) since the product $u \cdot u$ has to be the complex conjugate of itself.
A standard example of the scalar product over the complex vector space Cn is
$$(x_1, \dots, x_n)^T \cdot (y_1, \dots, y_n)^T = x_1 \bar{y}_1 + \dots + x_n \bar{y}_n.$$
This expression is also called the standard (positive definite) Hermitian form on $\mathbb{C}^n$. Thanks to the conjugation of the coordinates of the second argument, this mapping satisfies all the required properties. The space $\mathbb{C}^n$ with this scalar product is called the standard unitary space of dimension n. We can denote this scalar product of vectors x and y in matrix notation as $\bar{y}^T \cdot x$ (here the complex conjugation indicated by the bar is performed on all components of y).
As usual, those mappings which leave the additional structure invariant are of great importance.
Unitary mappings
A linear mapping $\varphi : V \to W$ between unitary spaces is called a unitary mapping, if for all vectors $u, v \in V$
$$u \cdot v = \varphi(u) \cdot \varphi(v).$$
Unitary isomorphism is a bijective unitary mapping.
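Solution. From the definition of exp we can show that $\exp(A + B) = \exp(A) \cdot \exp(B)$ for commuting matrices A and B, just as with the exponential mapping in the domain of real numbers. Because $(u + v)^* = u^* + v^*$ and $(cv)^* = \bar{c} v^*$, we obtain
$$U^* = \Big( \sum_{n=0}^{\infty} \frac{1}{n!} (iH)^n \Big)^* = \sum_{n=0}^{\infty} \frac{1}{n!} (-i)^n (H^*)^n,$$
and since $H^* = H$, then
$$U^* = \sum_{n=0}^{\infty} (-1)^n \frac{1}{n!} (iH)^n = \exp(-iH).$$
Thus
$$U^* U = \exp(iH) \exp(-iH) = \exp(0) = E.$$
For the determinant we obtain
$$\det(U) = e^{\operatorname{trace}(iH)}.$$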
□
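Both claims of this exercise can be tested numerically on a small example. The sketch below sums the defining series of exp for a hypothetical 2×2 Hermitian matrix H using plain complex numbers; the truncation after 40 terms and the chosen matrix are assumptions made for illustration, not a general-purpose matrix exponential.

```python
import cmath

def matmul(X, Y):
    """Product of two square complex matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def exp_series(X, terms=40):
    """Partial sum of exp(X) = sum_n X^n / n! (adequate for small matrices)."""
    n = len(X)
    result = [[complex(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms):
        term = [[entry / m for entry in row] for row in matmul(term, X)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def adjoint(X):
    """Conjugate transpose of a square matrix."""
    n = len(X)
    return [[X[j][i].conjugate() for j in range(n)] for i in range(n)]

# Hypothetical Hermitian matrix: real diagonal, conjugate off-diagonal entries.
H = [[2 + 0j, 1 - 1j],
     [1 + 1j, -1 + 0j]]
iH = [[1j * h for h in row] for row in H]

U = exp_series(iH)
UstarU = matmul(adjoint(U), U)
det_U = U[0][0] * U[1][1] - U[0][1] * U[1][0]
trace_H = (H[0][0] + H[1][1]).real
print(UstarU, det_U, cmath.exp(1j * trace_H))
```

Up to rounding, the product of the adjoint of U with U is the unit matrix and the determinant agrees with $e^{i\,\operatorname{trace}(H)}$, a complex unit because the trace of a Hermitian matrix is real.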
3.E.4. Hermitian matrices A, B, C satisfy $[A, C] = [B, C] = 0$ and $[A, B] \neq 0$, where $[\,,\,]$ is the commutator of matrices defined by the relation $[A, B] = AB - BA$. Show that at least one eigensubspace of the matrix C must have dimension > 1.
Solution. We prove it by contradiction. Assume that all eigensubspaces of the operator C have dimension 1. Then for any vector u we can write $u = \sum_k c_k u_k$, where $u_k$ are linearly independent eigenvectors of the operator C associated with the eigenvalues $\lambda_k$ (and $c_k = u \cdot u_k$). For these eigenvectors
$$0 = [A, C] u_k = A C u_k - C A u_k = \lambda_k A u_k - C(A u_k).$$
From there it follows that $A u_k$ is an eigenvector of the matrix C with the eigenvalue $\lambda_k$ (or the zero vector). But then $A u_k = \lambda_k^A u_k$ for some number $\lambda_k^A$. Similarly, $B u_k = \lambda_k^B u_k$ for some number $\lambda_k^B$. For the commutator of the matrices A and B we then obtain
$$[A, B] u_k = A B u_k - B A u_k = \lambda_k^A \lambda_k^B u_k - \lambda_k^B \lambda_k^A u_k = 0,$$
so that
$$[A, B] u = [A, B] \sum_k c_k u_k = \sum_k c_k [A, B] u_k = 0.$$
Because u is arbitrary, it follows that $[A, B] = 0$, which is a contradiction. □
3.E.5. Applications to quantum physics. In quantum physics we do not use numbers as in classical physics, but a Hermitian operator. This is nothing but a Hermitian mapping, which can (and often does) lead to a linear transformation between unitary spaces of infinite dimension. We can imagine this as a matrix
3.4.2. Real and complex spaces with scalar product. In
the previous chapter we have already derived some simple properties of spaces with scalar products. The properties and proofs are very similar to the complex case.
In the sequel we shall work with real and complex spaces simultaneously and write $\mathbb{K}$ for $\mathbb{R}$ or $\mathbb{C}$. In the real case the conjugation is just the identity mapping (it is the restriction of the conjugation in the complex plane to the real line). As in the real case, we define the orthogonal complement for a vector subspace $U \subset V$ in the unitary space V as
$$U^\perp = \{ v \in V ;\ u \cdot v = 0 \text{ for all } u \in U \},$$
which is clearly also a vector subspace in V.
Although we deal exclusively with finite-dimensional spaces now, the results in the next two theorems have a natural generalization to Hilbert spaces, which are infinite-dimensional spaces with scalar products. We shall meet them later, in connection with approximation in vector spaces of real or complex valued functions.
Theorem. For every finite-dimensional space V of dimension n with scalar product we have:
(1) There exists an orthonormal basis in V.
(2) Every system of non-zero orthogonal vectors in V is linearly independent and can be extended to an orthogonal basis.
(3) For every system of linearly independent vectors $(u_1, \dots, u_k)$ there exists an orthonormal basis $(v_1, \dots, v_n)$ such that $\operatorname{span}\{v_1, \dots, v_i\} = \operatorname{span}\{u_1, \dots, u_i\}$ for all $1 \le i \le k$, i.e. its vectors consecutively generate the same subspaces as the vectors $u_j$.
(4) If $(u_1, \dots, u_n)$ is an orthonormal basis of V, then the coordinates of every vector $u \in V$ are expressed via
$$u = (u \cdot u_1) u_1 + \dots + (u \cdot u_n) u_n.$$
(5) In any orthonormal basis, the scalar product has the coordinate form
$$u \cdot v = \bar{y}^T \cdot x = x_1 \bar{y}_1 + \dots + x_n \bar{y}_n,$$
where x and y are the columns of coordinates of the vectors u and v in the chosen basis. Notably, every n-dimensional space with scalar product is isomorphic to the standard Euclidean $\mathbb{R}^n$ or the unitary $\mathbb{C}^n$.
(6) The orthogonal sum of unitary subspaces $V_1 + \dots + V_k$ in V is always a direct sum.
(7) If $A \subset V$ is an arbitrary subset, then $A^\perp \subset V$ is a vector subspace (and thus also unitary), and $(A^\perp)^\perp \subset V$ is exactly the subspace generated by A. Furthermore, $V = \operatorname{span} A \oplus A^\perp$.
(8) V is an orthogonal sum of n one-dimensional unitary subspaces.
of infinite dimension. Vectors in this unitary space then represent the states of the given physical system. When measuring a given physical quantity we obtain only values that are eigenvalues of the corresponding operator.
For instance, instead of the coordinate x we have an operator $\hat{x}$ of the coordinate, which acts as multiplication by x. If the state of the system is described by the vector v, then $\hat{x}(v) = xv$. This corresponds to the multiplication of the vector by the real number x. At first glance this Hermitian operator is different from the cases of finite dimension. Evidently every real number is an eigenvalue ($\hat{x}$ has a continuous spectrum). Similarly, instead of speed (more precisely, momentum) we have the operator $\hat{p} = -i \frac{d}{dx}$. The eigenvectors are the solutions of the differential equation $-i \frac{dv}{dx} = \lambda v$. Even in this case the spectrum is continuous. This expresses the fact that the corresponding physical quantity is continuous and can attain any real value. On the other hand, we have physical quantities, for instance energy, that can attain only discrete values (energy exists in quanta). The corresponding operators are then really similar to the Hermitian matrices. They have infinitely many eigenvalues.
3.E.6. Show that $\hat{x}$ and $\hat{p}$ are Hermitian and that
$$[\hat{x}, \hat{p}] = i.$$
Solution. For any vector v
$$[\hat{x}, \hat{p}] v = \hat{x} \hat{p} v - \hat{p} \hat{x} v = x \Big( -i \frac{dv}{dx} \Big) + i \frac{d(xv)}{dx} = iv,$$
from which the result follows. □
3.E.7. Show that
$$[\hat{x} - \hat{p}, \hat{x} + \hat{p}] = 2i.$$
Solution. Evidently $[\hat{x}, \hat{x}] = 0$ and $[\hat{p}, \hat{p}] = 0$. The result follows from the linearity of the commutator in both arguments and from the previous exercise. □
3.E.8. Jordan form. Find the Jordan form of the following matrices. What is the geometric interpretation of this decomposition of the matrix?
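i) $A = \begin{pmatrix} -1 & 1 \\ -6 & 4 \end{pmatrix}$, ii) $A = \begin{pmatrix} -1 & 1 \\ -4 & 3 \end{pmatrix}$.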
Proof. (1), (2), (3): First we extend the given system of vectors into any basis $(u_1, \dots, u_n)$ of the space V and then start the Gram-Schmidt orthogonalization from 2.3.20. This procedure works in the complex case as well. It yields an orthogonal basis with the properties required in (3). But from the Gram-Schmidt orthogonalization algorithm it is clear that if the original k vectors formed an orthogonal system of vectors, then they continue to do so after the orthogonalization process is applied. Thus we have also proved (2) and (1).
(4): If $u = a_1 u_1 + \dots + a_n u_n$, then
$$u \cdot u_i = a_1 (u_1 \cdot u_i) + \dots + a_n (u_n \cdot u_i) = a_i \|u_i\|^2 = a_i.$$
(5): If $u = x_1 u_1 + \dots + x_n u_n$, $v = y_1 u_1 + \dots + y_n u_n$, then
$$u \cdot v = (x_1 u_1 + \dots + x_n u_n) \cdot (y_1 u_1 + \dots + y_n u_n) = x_1 \bar{y}_1 + \dots + x_n \bar{y}_n.$$
(6): We need to show that for any pair $V_i$, $V_j$ of the given subspaces their intersection contains only the zero vector. If $u \in V_i$ and $u \in V_j$, then $u \perp u$, that is, $u \cdot u = 0$. This is possible only for the zero vector $u \in V$.
(7): Let $u, v \in A^\perp$. Then $(au + bv) \cdot w = 0$ for all $w \in A$, $a, b \in \mathbb{K}$ (from the distributivity of the scalar product). Thus $A^\perp$ is a subspace in V. Let $(v_1, \dots, v_k)$ be a basis of span A chosen among the elements of A, and let $(u_1, \dots, u_k)$ be the orthonormal basis resulting from the Gram-Schmidt orthogonalization of the vectors $(v_1, \dots, v_k)$. We extend it to an orthonormal basis of the whole V (both exist by the already proven parts of this proposition). Because it is an orthogonal basis, necessarily $\operatorname{span}\{u_{k+1}, \dots, u_n\} = \operatorname{span}\{u_1, \dots, u_k\}^\perp = A^\perp$ and $A \subset \operatorname{span}\{u_{k+1}, \dots, u_n\}^\perp$ (this follows from expressing the coordinates in the orthonormal basis). If $u \perp \operatorname{span}\{u_{k+1}, \dots, u_n\}$, then u is necessarily a linear combination of the vectors $u_1, \dots, u_k$, which happens if and only if it is a linear combination of the vectors $v_1, \dots, v_k$, which is equivalent to u being in span A.
(8): This is equivalent to the formulation of the existence of the orthonormal basis. □
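The Gram-Schmidt procedure used in the proof can be sketched for the standard unitary space as follows. This is only an illustration with the scalar product conjugating the second argument, as in the definition above; the three input vectors are a hypothetical linearly independent system and the function names are not from the text.

```python
import math

def dot(u, v):
    """Standard scalar product on C^n: linear in u, conjugate-linear in v."""
    return sum(a * complex(b).conjugate() for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthonormalize a linearly independent list of complex vectors."""
    basis = []
    for u in vectors:
        w = [complex(c) for c in u]
        for e in basis:
            coeff = dot(w, e)              # coordinate of w along the unit vector e
            w = [wc - coeff * ec for wc, ec in zip(w, e)]
        norm = math.sqrt(dot(w, w).real)
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([wc / norm for wc in w])
    return basis

# Hypothetical linearly independent vectors in C^3.
vectors = [[1, 1j, 0], [1, 0, 1], [0, 1, 1]]
basis = gram_schmidt(vectors)

# Gram matrix of the result; it should be the unit matrix up to rounding.
gram = [[dot(basis[i], basis[j]) for j in range(3)] for i in range(3)]
print(gram)
```

Since each new vector is only modified by multiples of the previous ones, the first i output vectors span the same subspace as the first i input vectors, which is the property stated in item (3) of the theorem.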
3.4.3. Important properties of the norm. Now we have everything prepared for basic properties related to our definition of the norm of vectors. We speak also of the length of vectors defined by the scalar product. Note also that all claims always consider finite sets of vectors. Their validity does not depend on the dimension of the space V where it all takes place.
Properties of norm
Theorem. Let V be a vector space with scalar product, u and v vectors in V. Then
(1) $\|u + v\| \le \|u\| + \|v\|$. Equality holds if and only if u and v are linearly dependent. This is called the triangle inequality.
Solution. i) First compute the characteristic polynomial of the matrix $A$:
$$|A - \lambda E| = \begin{vmatrix} -1-\lambda & 1 \\ -6 & 4-\lambda \end{vmatrix} = \lambda^2 - 3\lambda + 2.$$
The eigenvalues of the matrix $A$ are the roots of this polynomial, that is, $\lambda_{1,2} = 1, 2$. Since the matrix is of order two and has two distinct eigenvalues, its Jordan form is the diagonal matrix $J = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$. The eigenvector $(x, y)$ associated with the eigenvalue 1 satisfies
$$0 = (A - E)x = \begin{pmatrix} -2 & 1 \\ -6 & 3 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix},$$
that is, $-2x + y = 0$. So the eigenvectors are the multiples of the vector $(1,2)$.
Similarly, the eigenvector associated with the eigenvalue 2 is $(1,3)$. The matrix $P$ is then obtained by writing these eigenvectors into the columns, that is, $P = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}$. For the matrix $A$, $A = P \cdot J \cdot P^{-1}$. The inverse of $P$ is $P^{-1} = \begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}$, and
$$\begin{pmatrix} -1 & 1 \\ -6 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}.$$
This decomposition says that the matrix $A$ determines a linear mapping that has, in the basis of the eigenvectors $(1,2)$, $(1,3)$, the aforementioned diagonal form. Geometrically, this means that in the direction $(1,2)$ nothing changes and in the direction $(1,3)$ every vector is stretched by the factor two.
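ii) The characteristic polynomial of the matrix $A$ is in this case
$$|A - \lambda E| = \begin{vmatrix} -1-\lambda & 1 \\ -4 & 3-\lambda \end{vmatrix} = \lambda^2 - 2\lambda + 1.$$
There is a double root $\lambda = 1$ and the corresponding eigenvector $(x, y)$ satisfies
$$0 = (A - E)x = \begin{pmatrix} -2 & 1 \\ -4 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}.$$
The solutions are, as in the previous case, multiples of the vector $(1,2)$. The fact that the system does not have two linearly independent vectors as a solution says that the Jordan form in this case is not diagonal; it is the matrix $\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. The basis for which $A$ has this form consists of the eigenvector $(1,2)$ and a vector that is mapped onto this vector by the mapping $A - E$. Thus it is a solution of the system of equations
$$\left(\begin{array}{cc|c} -2 & 1 & 1 \\ -4 & 2 & 2 \end{array}\right) \sim \left(\begin{array}{cc|c} -2 & 1 & 1 \\ 0 & 0 & 0 \end{array}\right).$$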
(2) $|u \cdot v| \le \|u\|\,\|v\|$. Equality holds if and only if $u$ and $v$ are linearly dependent. This property is called the Cauchy inequality.
(3) If $(e_1, \dots, e_k)$ is an orthonormal system of vectors, then
$$\|u\|^2 \ge |u \cdot e_1|^2 + \dots + |u \cdot e_k|^2.$$
This property is called the Bessel inequality.
(4) If $(e_1, \dots, e_k)$ is an orthonormal system of vectors, then $u \in \operatorname{span}\{e_1, \dots, e_k\}$ if and only if
$$\|u\|^2 = |u \cdot e_1|^2 + \dots + |u \cdot e_k|^2.$$
This is called the Parseval equality.
(5) If $(e_1, \dots, e_k)$ is an orthonormal system of vectors and $u \in V$, then the vector
$$w = (u \cdot e_1)e_1 + \dots + (u \cdot e_k)e_k$$
is the only vector which minimizes the norm $\|u - v\|$ among all $v \in \operatorname{span}\{e_1, \dots, e_k\}$.
Proof. The verifications are all based on direct computations:
(2): The result is obvious if $v = 0$. Otherwise, define the vector $w = u - \frac{u \cdot v}{\|v\|^2}v$, that is, $w \perp v$, and compute
$$\|w\|^2 = \|u\|^2 - \frac{\overline{(u \cdot v)}}{\|v\|^2}(u \cdot v) - \frac{(u \cdot v)}{\|v\|^2}(v \cdot u) + \frac{(u \cdot v)\overline{(u \cdot v)}}{\|v\|^4}\|v\|^2,$$
$$\|w\|^2\|v\|^2 = \|u\|^2\|v\|^2 - 2(u \cdot v)\overline{(u \cdot v)} + (u \cdot v)\overline{(u \cdot v)}.$$
These are non-negative real values and thus $\|u\|^2\|v\|^2 \ge |u \cdot v|^2$, and the equality holds if and only if $w = 0$, that is, whenever $u$ and $v$ are linearly dependent.
(1): It suffices to compute
$$\|u + v\|^2 = \|u\|^2 + \|v\|^2 + u \cdot v + v \cdot u = \|u\|^2 + \|v\|^2 + 2\operatorname{Re}(u \cdot v)$$
$$\le \|u\|^2 + \|v\|^2 + 2|u \cdot v| \le \|u\|^2 + \|v\|^2 + 2\|u\|\,\|v\| = (\|u\| + \|v\|)^2.$$
Since we deal with squares of non-negative real numbers, this means that $\|u + v\| \le \|u\| + \|v\|$. Furthermore, equality implies that in all previous inequalities equality also holds. This is equivalent to the condition that $u$ and $v$ are linearly dependent (using the previous part).
(3), (4): Let $(e_1, \dots, e_k)$ be an orthonormal system of vectors. We extend it to an orthonormal basis $(e_1, \dots, e_n)$ (that is always possible by the previous theorem). Then, again using the previous theorem, we have for every vector $u \in V$
$$\|u\|^2 = \sum_{i=1}^n (u \cdot e_i)\overline{(u \cdot e_i)} = \sum_{i=1}^n |u \cdot e_i|^2 \ge \sum_{i=1}^k |u \cdot e_i|^2.$$
But that is the Bessel inequality. Furthermore, equality holds if and only if $u \cdot e_i = 0$ for all $i > k$, which proves the Parseval equality.
(5): Choose an arbitrary $v \in \operatorname{span}\{e_1, \dots, e_k\}$ and extend the given orthonormal system to the orthonormal basis $(e_1, \dots, e_n)$. Let $(u_1, \dots, u_n)$ and $(x_1, \dots, x_k, 0, \dots, 0)$ be
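A particular solution is the vector $(1,3)$; the general solution is $(1,3) + t(1,2)$. We obtain the same basis as in the previous case and we can write
$$\begin{pmatrix} -1 & 1 \\ -4 & 3 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 3 & -1 \\ -2 & 1 \end{pmatrix}.$$
The mapping now acts on a vector as follows: the component in the direction $(1,3)$ stays the same, while the coefficient in the direction $(1,2)$ is replaced by the sum of the coefficients that determine the components in the directions $(1,3)$ and $(1,2)$. □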
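The two decompositions can be checked by multiplying the factors back together. The following is a minimal sketch in exact arithmetic; the helper names `matmul` and `inverse_2x2` are illustrative choices, not part of the text.

```python
from fractions import Fraction

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inverse_2x2(M):
    """Exact inverse of an invertible 2x2 matrix."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

P = [[1, 1], [2, 3]]          # columns (1,2) and (1,3)
J_diag = [[1, 0], [0, 2]]     # Jordan form in case i)
J_block = [[1, 1], [0, 1]]    # Jordan form in case ii)

# Rebuild the original matrices as P * J * P^{-1}.
A_i = matmul(matmul(P, J_diag), inverse_2x2(P))
A_ii = matmul(matmul(P, J_block), inverse_2x2(P))
print(A_i)
print(A_ii)
```

Both products should reproduce the matrices of the exercise, which confirms that the same basis serves in both cases.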
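3.E.9. Find the Jordan form of the matrices $A_1$ and $A_2$, and write down the decomposition. What is the geometric interpretation of this decomposition?
$$A_1 = \frac13\begin{pmatrix} 5 & -1 \\ -2 & 4 \end{pmatrix} \quad\text{and}\quad A_2 = \frac13\begin{pmatrix} 5 & -1 \\ 4 & 1 \end{pmatrix},$$
and show how the vectors $v = (3,0)$, $A_1v$ and $A_2v$ decompose with respect to the basis of the eigenvectors of the matrix $A_1$.
Solution. The matrices have the same Jordan forms as the matrices in the previous exercise. In the basis of the vectors $(1,2)$ and $(1,-1)$,
$$\frac13\begin{pmatrix} 5 & -1 \\ -2 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix}^{-1}$$
and
$$\frac13\begin{pmatrix} 5 & -1 \\ 4 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix}^{-1}.$$
For the vector $v = (3,0)$, $v = (1,2) + 2(1,-1)$. For its images, $A_1v = (5,-2) = (1,2) + 2 \cdot 2 \cdot (1,-1)$ and $A_2v = (5,4) = (2+1) \cdot (1,2) + 2 \cdot (1,-1)$. □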
F. Matrix decompositions
3.F.1. Prove or disprove:
• Let $A$ be a square matrix $n \times n$. Then the matrix $A^TA$ is symmetric.
• Let $A$ be a square matrix with only real positive eigenvalues. Then $A$ is symmetric.
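3.F.2. Find an LU-decomposition of the following matrix:
$$\begin{pmatrix} -2 & 1 & 0 \\ -4 & 4 & 2 \\ -6 & 1 & -1 \end{pmatrix}$$
Solution.
$$\begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & -1 & 1 \end{pmatrix}\begin{pmatrix} -2 & 1 & 0 \\ 0 & 2 & 2 \\ 0 & 0 & 1 \end{pmatrix}$$
First multiply the matrices that correspond to the Gaussian elimination; we thus obtain for the original matrix $A$, $XA =$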
coordinates of $u$ and $v$ under this basis. Then
$$\|u - v\|^2 = |u_1 - x_1|^2 + \dots + |u_k - x_k|^2 + |u_{k+1}|^2 + \dots + |u_n|^2$$
and this expression is clearly minimized when choosing the individual coordinates to be $x_1 = u_1, \dots, x_k = u_k$. □
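The inequalities of the theorem are easy to test numerically. The sketch below uses hypothetical vectors in $\mathbb{C}^3$ and assumes the scalar product is conjugate-linear in the second argument; the function names are illustrative only.

```python
import math

def dot(u, v):
    """Scalar product on C^n, conjugate-linear in the second argument."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u).real)

def project(u, system):
    """Return w = sum_i (u . e_i) e_i for an orthonormal system (e_i)."""
    w = [0j] * len(u)
    for e in system:
        c = dot(u, e)
        w = [wi + c * ei for wi, ei in zip(w, e)]
    return w

# Hypothetical data: an orthonormal system (e1, e2) and two test vectors.
e1 = [1 + 0j, 0j, 0j]
e2 = [0j, 1j, 0j]
u = [2 + 1j, 1 - 1j, 3 + 0j]
v = [1j, 2 + 0j, -1 + 0j]

cauchy_ok = abs(dot(u, v)) <= norm(u) * norm(v)
triangle_ok = norm([a + b for a, b in zip(u, v)]) <= norm(u) + norm(v)
bessel_sum = sum(abs(dot(u, e)) ** 2 for e in (e1, e2))
w = project(u, [e1, e2])
residual = norm([a - b for a, b in zip(u, w)])
print(cauchy_ok, triangle_ok, bessel_sum, residual)
```

The squared residual plus the Bessel sum should equal $\|u\|^2$, which is the computation used in the proof of (5).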
3.4.4. Unitary and orthogonal mappings. The properties of orthogonal mappings have direct analogues in the complex domain. We can easily formulate them and prove them together:
Proposition. Consider the linear mapping (endomorphism) $\varphi : V \to V$ on the (real or complex) space with scalar product. Then the following conditions are equivalent.
(1) $\varphi$ is a unitary or orthogonal transformation,
(2) $\varphi$ is a linear isomorphism and for every $u, v \in V$
$$\varphi(u) \cdot v = u \cdot \varphi^{-1}(v),$$
(3) the matrix $A$ of the mapping $\varphi$ in any orthonormal basis satisfies $A^{-1} = \bar A^T$ (for Euclidean spaces this means that $A^{-1} = A^T$),
(4) the matrix $A$ of the mapping $\varphi$ in some orthonormal basis satisfies $A^{-1} = \bar A^T$,
(5) the rows of the matrix $A$ of the mapping $\varphi$ in an orthonormal basis form an orthonormal basis of the space $\mathbb{K}^n$ with standard scalar product,
(6) the columns of the matrix $A$ of the mapping $\varphi$ in an orthonormal basis form an orthonormal basis of the space $\mathbb{K}^n$ with standard scalar product.
Proof. (1) $\Rightarrow$ (2): The mapping $\varphi$ is injective, therefore it must be onto. Also $\varphi(u) \cdot v = \varphi(u) \cdot \varphi(\varphi^{-1}(v)) = u \cdot \varphi^{-1}(v)$.
(2) $\Rightarrow$ (3): In coordinates with respect to an orthonormal basis, the scalar product is the standard one in $\mathbb{K}^n$. It is given for columns $x$, $y$ of scalars by the expression $x \cdot y = \bar y^T E x = \bar y^T x$, where $E$ is the unit matrix. Property (2) thus means that the matrix $A$ of the mapping $\varphi$ is invertible and $\bar y^T A x = \overline{(A^{-1}y)}^T x$. This means that $(\bar y^T A - \overline{(A^{-1}y)}^T)x = 0$ for all $x \in \mathbb{K}^n$. By substituting the complex conjugate of the expression in the parentheses for $x$ we find that equality is possible only when $\bar A^T = A^{-1}$. (We may also rewrite the expression as $\bar y^T(A - \overline{(A^{-1})}^T)x$ and see the conclusion by substituting the basis vectors for $x$ and $y$.)
(3) $\Rightarrow$ (4): This is an obvious implication.
(4) $\Rightarrow$ (5): In the relevant basis, the claim is expressed via the matrix $A$ of the mapping $\varphi$ as the equation $A\bar A^T = E$, which is ensured by (4).
(5) $\Rightarrow$ (6): We have $|\bar A^TA| = |E| = |A\bar A^T| = |A|\,|\bar A| = 1$, so there exists the inverse matrix $A^{-1}$. But we also have $A\bar A^TA = A$, therefore also $\bar A^TA = E$, which is expressed exactly by (6).
(6) $\Rightarrow$ (1): In the chosen orthonormal basis
$$\varphi(u) \cdot \varphi(v) = \overline{(Ay)}^T Ax = \bar y^T \bar A^T A x = \bar y^T E x = \bar y^T x,$$
where $x$ and $y$ are columns of coordinates of the vectors $u$ and $v$. That ensures that the scalar product is preserved. □
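Conditions (3), (5) and (6) can be tested directly on a concrete matrix. The following is a small sketch with a hypothetical $2 \times 2$ unitary matrix; the helper names are illustrative and floating-point comparisons use a tolerance.

```python
import math

def conj_transpose(A):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def is_identity(M, tol=1e-12):
    n = len(M)
    return all(abs(M[i][j] - (1 if i == j else 0)) < tol
               for i in range(n) for j in range(n))

# Hypothetical example of a unitary matrix.
s = 1 / math.sqrt(2)
A = [[s + 0j, s + 0j], [1j * s, -1j * s]]

rows_orthonormal = is_identity(matmul(A, conj_transpose(A)))   # condition (5)
cols_orthonormal = is_identity(matmul(conj_transpose(A), A))   # condition (6)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
print(rows_orthonormal, cols_orthonormal, abs(det))
```

Both products with the conjugate transpose give the unit matrix, so the conjugate transpose is the inverse, and the determinant has absolute value one.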
$U$, where $X$ is a lower triangular matrix given by the Gaussian reduction, and $U$ is upper triangular. From this equality, $A = X^{-1}U$, which is the desired decomposition. (Thus we have to compute the inverse of $X$.) □
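The elimination can also be carried out so that the multipliers are stored directly in the lower triangular factor, which avoids inverting $X$ afterwards. The sketch below is a plain Doolittle-style elimination without row exchanges in exact arithmetic; the function name `lu_decompose` is an illustrative choice, and the code assumes that no zero pivot occurs.

```python
from fractions import Fraction

def lu_decompose(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.

    No row exchanges are performed, so a zero pivot raises an error.
    """
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        if U[k][k] == 0:
            raise ValueError("zero pivot: row exchanges would be needed")
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # elimination multiplier
            L[i][k] = m                # stored in L, so no inverse is needed
            U[i] = [U[i][j] - m * U[k][j] for j in range(n)]
    return L, U

A = [[-2, 1, 0], [-4, 4, 2], [-6, 1, -1]]
L, U = lu_decompose(A)
print(L)
print(U)
```

For the matrix of this exercise the two factors should agree with the ones stated in the solution.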
3.F.3.   Find   the   LU-decomposition   of   the matrix
C -!1 !)■ 0
3.F.4. Ray-tracing. In computer 3D-graphics the image is very often displayed using the Ray-tracing algorithm. The basis of this algorithm is an approximation of the light waves by a ray (line) and an approximation of the displayed objects by polyhedra. These are bounded by planes and it is necessary to compute where exactly the light rays are reflected from these planes. From physics we know how the rays are reflected: the angle of impact equals the angle of reflection. We have already met this topic in the exercise 1.E.10.
The ray of light in the direction $v = (1,2,3)$ hits the plane given by the equation $x + y + z = 1$. In what direction is it reflected?
Solution. The unit normal vector to the plane is $n = \frac{1}{\sqrt3}(1,1,1)$. The vector $v_r$ that gives the direction of the reflected ray lies in the plane given by the vectors $v$, $n$. We can express it as a linear combination of these vectors. Furthermore, the rule for the angle of reflection says that $\langle v, n\rangle = -\langle v_r, n\rangle$. From there we obtain a quadratic equation for the coefficient of the linear combination.
This exercise can be solved in an easier, more geometric way. From the diagram we can derive directly that
$$v_r = v - 2\langle v, n\rangle n.$$
In our case, $v_r = (-3, -2, -1)$.
□
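The reflection formula translates into a few lines of code. In this sketch the function `reflect` is a hypothetical helper that accepts any integer normal vector and divides by its squared length, so no square roots are needed.

```python
from fractions import Fraction

def reflect(v, normal):
    """Reflect the direction v in the plane with the given normal vector.

    Implements v - 2 <v, n> n with n = normal / |normal|, written as
    v - 2 (<v, normal> / <normal, normal>) normal to stay in exact arithmetic.
    Integer coordinates are assumed.
    """
    nn = sum(c * c for c in normal)
    vn = sum(a * b for a, b in zip(v, normal))
    factor = Fraction(2 * vn, nn)
    return [a - factor * b for a, b in zip(v, normal)]

v_r = reflect([1, 2, 3], [1, 1, 1])
print(v_r)
```

The reflected direction has the same length as the incoming one and the opposite component along the normal, which is exactly the rule for the angle of reflection.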
3.F.5. Singular decomposition, polar decomposition, pseudoinverse. Compute the singular decomposition of the matrix
$$A = \begin{pmatrix} 0 & 0 & -\frac12 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
Then compute its polar decomposition and find its pseudoinverse.
Solution. First compute $A^TA$:
$$A^TA = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \frac14 \end{pmatrix}$$
Characterizations from the previous theorem deserve some notes. The matrices $A \in \operatorname{Mat}_n(\mathbb{K})$ with the property $A^{-1} = \bar A^T$ are called unitary matrices for complex scalars (in the case $\mathbb{R}$ we have already used the name orthogonal matrices for them). The definition itself immediately implies that a product of unitary (orthogonal) matrices is again unitary (orthogonal). The same is true for inverses. Unitary matrices thus form a subgroup $U(n) \subset Gl_n(\mathbb{C})$ in the group of all invertible complex matrices with the product operation. Orthogonal matrices form a subgroup $O(n) \subset Gl_n(\mathbb{R})$ in the group of real invertible matrices. We speak of a unitary group and of an orthogonal group.
The simple calculation
$$1 = \det E = \det(A\bar A^T) = \det A\,\overline{\det A} = |\det A|^2$$
shows that the determinant of a unitary matrix has norm equal to one. For real scalars the determinant is $\pm1$. Furthermore, if $Ax = \lambda x$ for a unitary or orthogonal matrix, then $(Ax) \cdot (Ax) = x \cdot x = |\lambda|^2(x \cdot x)$. Therefore the real eigenvalues of orthogonal matrices in the real domain are $\pm1$. The eigenvalues of unitary matrices are always complex units in the complex plane.
The same argument as we have seen with the orthogonal mappings implies that orthogonal complements of invariant subspaces with respect to unitary mappings $\varphi : V \to V$ are also invariant. Indeed, if $\varphi(U) \subset U$, $u \in U$ and $v \in U^\perp$ are arbitrary, then
$$\varphi(v) \cdot \varphi(\varphi^{-1}(u)) = v \cdot \varphi^{-1}(u).$$
Because the restriction $\varphi|_U$ is also unitary, it is a bijection. Notably $\varphi^{-1}(u) \in U$. But then $\varphi(v) \cdot u = 0$, because $v \in U^\perp$. Thus $\varphi(v) \in U^\perp$.
This leads to an immediate useful corollary in the complex domain.
Corollary. Let $\varphi : V \to V$ be a unitary mapping of complex vector spaces. Then $V$ is an orthogonal sum of one-dimensional eigensubspaces.
Proof. There exists at least one eigenvector $v \in V$, since complex eigenvalues always exist. Then the restriction of $\varphi$ to the invariant subspace $\langle v\rangle^\perp$ is again unitary and also has an eigenvector. After $n$ such steps we obtain the desired orthogonal basis of eigenvectors. After normalising the vectors we obtain an orthonormal basis. □
Now it is possible to understand the details of the proof of the spectral decomposition of the orthogonal mapping from 2.4.7 at the end of the second chapter. The real matrix of an orthogonal mapping is interpreted as a matrix of a unitary mapping on a complex extension of Euclidean space. We observe the corollaries of the structure of the roots of the real characteristic polynomial over the complex domain. Automatically we obtain invariant two-dimensional subspaces given by pairs of complex conjugated eigenvalues and hence the corresponding rotation for the restricted original real mapping.
to obtain a diagonal matrix. We need to find an orthonormal basis under which the matrix is diagonal and the zero row is the last one. This can be obtained by rotating about the $x$-axis through a right angle. The $y$-coordinate then goes to $z$ and $z$ goes to $-y$. This rotation is an orthogonal transformation given by the matrix
$$V = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix}.$$
By this, we have found the decomposition $A^TA = VBV^T$. Here, $B$ is diagonal with eigenvalues $(1, \frac14, 0)$ on the diagonal. Because $B = (AV)^T(AV)$, the columns of the matrix
$$AV = \begin{pmatrix} 0 & \frac12 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
form an orthogonal system of vectors, which we normalise and extend to a basis. That is then of the form $(0,-1,0)$, $(1,0,0)$, $(0,0,1)$. The transition matrix of changing from this basis to the standard one is then
$$U = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Finally, we obtain the decomposition $A = U\sqrt{B}\,V^T$:
$$A = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & 0 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}.$$
The geometrical interpretation of the decomposition is the following: first, everything is rotated through a right angle about the $x$-axis, then follows a projection to the $xy$ plane such that the unit ball is mapped on the ellipse with half-axes 1 and $\frac12$. The result is then rotated through a right angle about the $z$-axis.
The polar decomposition $A = P \cdot W$ can be obtained from the singular one: $P := U\sqrt{B}\,U^T$ and $W := UV^T$, that is,
$$P = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac12 & 0 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
3.4.5. Dual and adjoint mappings. When discussing vector spaces and linear mappings in the second chapter, we mentioned briefly the dual vector space $V^*$ of all linear forms over the vector space $V$, see 2.3.17. This duality extends to mappings:
Dual mappings
For any linear mapping $\psi : V \to W$, the expression
(1) $\langle v, \psi^*(\alpha)\rangle = \langle \psi(v), \alpha\rangle$,
where $\langle\ ,\ \rangle$ denotes the evaluation of the linear forms (the second argument) on the vectors (the first argument), while $v \in V$ and $\alpha \in W^*$ are arbitrary, defines the mapping $\psi^* : W^* \to V^*$ called the dual mapping to $\psi$.
Choose bases $\underline{v}$ in $V$, $\underline{w}$ in $W$ and write $A$ for the matrix of the mapping $\psi$ in these bases. Then we compute the matrix of the mapping $\psi^*$ in the corresponding dual bases in the dual spaces. Indeed, the definition says that if we represent the vectors from $W^*$ in the coordinates as rows of scalars, then the mapping $\psi^*$ is given by the same matrix as $\psi$, if we multiply by it the row vectors from the right:
$$\langle \psi(v), \alpha\rangle = (\alpha_1, \dots, \alpha_n) \cdot A \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix} = \langle v, \psi^*(\alpha)\rangle.$$
This means that the matrix of the dual mapping $\psi^*$ is the transpose $A^T$, because $\alpha \cdot A = (A^T \cdot \alpha^T)^T$.
Assume further that we have a vector space with scalar product. Then we can naturally identify $V$ and $V^*$ using the scalar product. Indeed, choosing one fixed vector $w \in V$, we substitute this vector into the second argument in the scalar product in order to obtain the identification $V \simeq V^* = \operatorname{Hom}(V; \mathbb{K})$:
$$V \ni w \mapsto (v \mapsto \langle v, w\rangle) \in V^*.$$
The non-degeneracy condition on the scalar product ensures that this mapping is a bijection. Notice it is important to use $w$ as the fixed second argument in the case $\mathbb{K} = \mathbb{C}$ in order to obtain linear forms. Since factorizing complex multiples in the second argument yields complex conjugated scalars, the identification $V \simeq V^*$ is linear over real scalars only.
It is clear that the vectors of an orthonormal basis are mapped to forms that constitute the dual basis, i.e. orthonormal bases are self-dual under our identification. Moreover, every vector is automatically understood as a linear form, by means of the scalar product.
How does the above dual mapping $W^* \to V^*$ look in terms of our identification? We use the same notation $\psi^* : W \to V$ for the resulting mapping, which is uniquely given as follows:
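$$= \begin{pmatrix} \frac12 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
From this it follows that
$$\begin{pmatrix} 0 & 0 & -\frac12 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} \frac12 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & 0 & -1 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$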
The pseudoinverse matrix is then given by the expression $A^{(-1)} := VSU^T$, where $S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$. Thus,
Adjoint mapping
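$$A^{(-1)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & 0 \\ -2 & 0 & 0 \end{pmatrix}.$$
□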
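A pseudoinverse can be checked against the four Moore-Penrose conditions. The following sketch assumes that the matrix of the exercise is $A$ with rows $(0,0,-\frac12)$, $(-1,0,0)$, $(0,0,0)$, as read from the solution, and that the candidate pseudoinverse is the product $VSU^T$; the helper names are illustrative.

```python
from fractions import Fraction as F

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Matrix of the exercise (assumed reading) and its factors.
A = [[0, 0, F(-1, 2)], [-1, 0, 0], [0, 0, 0]]
V = [[1, 0, 0], [0, 0, 1], [0, -1, 0]]
U = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
S = [[1, 0, 0], [0, 2, 0], [0, 0, 0]]     # inverted non-zero singular values

A_pinv = matmul(matmul(V, S), transpose(U))

# The four Moore-Penrose conditions (real case).
penrose = [
    matmul(matmul(A, A_pinv), A) == A,
    matmul(matmul(A_pinv, A), A_pinv) == A_pinv,
    transpose(matmul(A, A_pinv)) == matmul(A, A_pinv),
    transpose(matmul(A_pinv, A)) == matmul(A_pinv, A),
]
print(A_pinv, penrose)
```

All four conditions should hold, which characterises the pseudoinverse uniquely.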
3.F.6. QR decomposition. The QR decomposition of a matrix $A$ is very useful when we are given a system of linear equations $Ax = b$ which has no solution, but an approximation as good as possible is needed. That is, we want to minimize $\|Ax - b\|$. According to the Pythagorean theorem, $\|Ax - b\|^2 = \|Ax - b_\parallel\|^2 + \|b_\perp\|^2$, where $b$ is decomposed into $b_\parallel$, which belongs to the range of the linear transformation $A$, and $b_\perp$, which is perpendicular to this range. The projection on the range of $A$ can be written in the form $QQ^T$ for a suitable matrix $Q$. Specifically, we obtain this matrix by the Gram-Schmidt orthonormalisation of the columns of the matrix $A$. Then $Ax - b_\parallel = Q(Q^TAx - Q^Tb)$. The system in the parentheses has a solution, for which $\|Ax - b\| = \|b_\perp\|$, which is the minimal value. Furthermore, the matrix $R := Q^TA$ is upper triangular and therefore the approximate solution can be found easily.
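Find an approximate solution of the system
$$\begin{aligned} x + 2y &= 1 \\ 2x + 4y &= 4 \end{aligned}$$
Solution. Consider the system $Ax = b$ with $A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$ and $b = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$, which evidently has no solution. We orthonormalise the columns of $A$. We take the first of them and divide it by its norm. This yields the first vector of the orthonormal basis, $\frac{1}{\sqrt5}\begin{pmatrix} 1 \\ 2 \end{pmatrix}$. But the second column is twice the first, and thus it vanishes in the orthonormalisation. Therefore $Q = \frac{1}{\sqrt5}\begin{pmatrix} 1 \\ 2 \end{pmatrix}$.
The projector on the range of $A$ is then
$$QQ^T = \frac15\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}.$$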
For every linear mapping $\psi : V \to W$ between spaces with scalar products, there is the adjoint mapping $\psi^*$ uniquely determined by the formula
(2) $\langle \psi(u), v\rangle = \langle u, \psi^*(v)\rangle$.
The brackets denote the scalar products on $W$ or $V$, respectively.
Notice that the use of the same brackets for evaluation of one-forms and scalar products (which reflects the identification above) makes the defining formulae of dual and adjoint mappings look the same.
Equivalently we can understand the relation (2) to be the definition of the adjoint mapping $\psi^*$. By substituting all pairs of vectors from an orthonormal basis for the vectors $u$ and $v$ we obtain directly all the values of the matrix of the mapping $\psi^*$. Using the coordinate expression for the scalar product, the formula (2) reveals the coordinate expression of the adjoint mapping:
$$\langle \psi(v), w\rangle = \bar w^T A v = \overline{(\bar A^T w)}^T v = \langle v, \psi^*(w)\rangle.$$
It follows that if $A$ is the matrix of the mapping $\psi$ in an orthonormal basis, then the matrix of the adjoint mapping $\psi^*$ is the transposed and conjugated matrix $A$; we denote this by
$$A^* = \bar A^T.$$
The matrix $A^*$ is called the adjoint matrix of the matrix $A$. Note that the adjoint matrix is well defined for any rectangular matrix. We should not confuse adjoint matrices with algebraic adjoints, which we used for square matrices when working with determinants.
We can summarise. For any linear mapping $\psi : V \to W$ between unitary spaces, with matrix $A$ in some bases on $V$ and $W$, its dual mapping has the matrix $A^T$ in the dual bases. If there are scalar products on $V$ and $W$, we identify them (via the scalar products) with their duals. Then the dual mapping coincides with the adjoint mapping $\psi^* : W \to V$, which has the matrix $A^*$. The distinction between the matrix of the dual mapping and the matrix of the adjoint mapping is thus in the additional conjugation. This is of course a consequence of the fact that our identification of the unitary space with its dual is not a linear mapping over complex scalars.
3.4.6. Self-adjoint mappings. Those linear mappings which coincide with their adjoints, $\psi^* = \psi$, are of particular interest. They are called self-adjoint mappings. Equivalently we can say that they are the mappings whose matrix $A$ satisfies $A = A^*$ in some (and thus in all) orthonormal basis.
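Next,
$$R = Q^TA = \frac{1}{\sqrt5}\begin{pmatrix} 5 & 10 \end{pmatrix} \quad\text{and}\quad Q^Tb = \frac{9}{\sqrt5}.$$
The approximate solution then satisfies $Rx = Q^Tb$, and here that means $5x + 10y = 9$. (The approximate solution is not unique.) The QR decomposition of the matrix $A$ is then
$$A = QR = \frac{1}{\sqrt5}\begin{pmatrix} 1 \\ 2 \end{pmatrix} \cdot \frac{1}{\sqrt5}\begin{pmatrix} 5 & 10 \end{pmatrix}.$$
□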
/2    -1 -1\ 3.F.7.   Minimise ||    - 6|| for A =   -1    2    -1 and
V-l   -1 2y/
& = 10 1. Hence write down the QR decomposition of the
w
matrix A.
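Solution. The normalised first column of the matrix $A$ is $e_1 = \frac{1}{\sqrt6}(2, -1, -1)^T$. From the second column, subtract its component in the direction $e_1$. Then
$$(-1, 2, -1)^T - \big((-1, 2, -1)^T \cdot e_1\big)e_1 = \left(0, \tfrac32, -\tfrac32\right)^T.$$
By this we have created an orthogonal vector, which we normalise to obtain $e_2 = \frac{1}{\sqrt2}(0, 1, -1)^T$. The third column of the matrix $A$ is already linearly dependent (verify this by computing the determinant, or otherwise). The desired column-orthogonal matrix is then
$$Q = \frac{1}{\sqrt6}\begin{pmatrix} 2 & 0 \\ -1 & \sqrt3 \\ -1 & -\sqrt3 \end{pmatrix}.$$
Next,
$$R = Q^TA = \frac{1}{\sqrt6}\begin{pmatrix} 6 & -3 & -3 \\ 0 & 3\sqrt3 & -3\sqrt3 \end{pmatrix}$$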
and
In the case of Euclidean spaces the self-adjoint mappings are those with symmetric matrices (in an orthonormal basis). They are often called symmetric mappings.
In the complex domain the matrices that satisfy $A = A^*$ are called Hermitian matrices or also Hermitian symmetric matrices. Sometimes they are also called self-adjoint matrices. Note that Hermitian matrices form a real vector subspace in the space of all complex matrices, but it is not a vector subspace in the complex domain.
Remark. The next observation is of special interest. If we multiply a Hermitian matrix $A$ by the imaginary unit, we obtain the matrix $B = iA$, which has the property
$$B^* = \bar i\,\bar A^T = -B.$$
Such matrices are called anti-Hermitian or Hermitian skew-symmetric. Every real square matrix can be written as a sum of its symmetric part and its anti-symmetric part,
$$A = \tfrac12(A + A^T) + \tfrac12(A - A^T).$$
In the complex domain we have analogously
$$A = \tfrac12(A + A^*) + i\,\tfrac{1}{2i}(A - A^*).$$
In particular, we may express every complex square matrix in a unique way as a sum
$$A = B + iC$$
with Hermitian symmetric matrices $B$ and $C$. This is an analogy of the decomposition of a complex number into its real and purely imaginary component, and in the literature we often encounter the notation
$$B = \operatorname{re} A = \tfrac12(A + A^*), \qquad C = \operatorname{im} A = \tfrac{1}{2i}(A - A^*).$$
In the language of linear mappings this means that every complex linear automorphism can be uniquely expressed by means of two self-adjoint mappings playing the role of the real and imaginary parts of the original mapping.
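The decomposition $A = B + iC$ is straightforward to compute. The sketch below uses a hypothetical complex $2 \times 2$ matrix; `hermitian_parts` and `is_hermitian` are illustrative names.

```python
def conj_transpose(A):
    """Adjoint matrix A* (conjugate transpose)."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]

def hermitian_parts(A):
    """Return (B, C) with B = (A + A*)/2 and C = (A - A*)/(2i)."""
    S = conj_transpose(A)
    n = len(A)
    B = [[(A[i][j] + S[i][j]) / 2 for j in range(n)] for i in range(n)]
    C = [[(A[i][j] - S[i][j]) / 2j for j in range(n)] for i in range(n)]
    return B, C

def is_hermitian(M, tol=1e-12):
    S = conj_transpose(M)
    n = len(M)
    return all(abs(M[i][j] - S[i][j]) < tol
               for i in range(n) for j in range(n))

# Hypothetical example matrix.
A = [[1 + 2j, 3 + 0j], [4j, 5 - 1j]]
B, C = hermitian_parts(A)
print(B)
print(C)
```

Both parts should be Hermitian and recombine to the original matrix, in the same way as a complex number is recovered from its real and imaginary parts.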
3.4.7. Spectral decomposition. Consider a self-adjoint mapping $\psi : V \to V$ with the matrix $A$ in some orthonormal basis. Proceed similarly as in 2.4.7 when we diagonalized the matrix of orthogonal mappings.
Again, consider arbitrary invariant subspaces of self-adjoint mappings and their orthogonal complements. If a self-adjoint mapping $\psi : V \to V$ leaves a subspace $W \subset V$ invariant, i.e. $\psi(W) \subset W$, then for every $v \in W^\perp$, $w \in W$
$$\langle \psi(v), w\rangle = \langle v, \psi(w)\rangle = 0.$$
Thus also $\psi(W^\perp) \subset W^\perp$.
Next, consider the matrix $A$ of a self-adjoint mapping in an orthonormal basis and an eigenvector $x \in \mathbb{C}^n$, i.e. $A \cdot x = \lambda x$. We obtain
$$\lambda\langle x, x\rangle = \langle Ax, x\rangle = \langle x, Ax\rangle = \langle x, \lambda x\rangle = \bar\lambda\langle x, x\rangle.$$
The solution of the equation $Rx = Q^Tb$ is $x = y = z$. Thus, multiples of the vector $(1,1,1)$ minimize $\|Ax - b\|$.
The mapping given by the matrix $A$ is, up to the factor 3, the orthogonal projection on the plane with normal vector $(1,1,1)$.
□
3.F.8. Linear regression. The knowledge obtained in this chapter can be successfully used in practice for solving problems with linear regression. It is about finding the best approximation of some functional dependence using a linear function.
Given a functional dependence at some points, that is, $f(a^1_1, \dots, a^1_n) = y_1, \dots, f(a^k_1, \dots, a^k_n) = y_k$, $k > n$ (we have thus more equations than unknowns), we wish to find the "best possible" approximation of this dependency using a linear function. That is, we want to express the value of the property as a linear function $f(x_1, \dots, x_n) = b_1x_1 + b_2x_2 + \dots + b_nx_n + c$. We choose to define "best possible" by the minimisation of
$$\sum_{i=1}^k \Big(y_i - \Big(\sum_{j=1}^n b_ja^i_j + c\Big)\Big)^2$$
with regard to the real constants $b_1, \dots, b_n, c$. The goal is to find such a linear combination of the columns of the matrix $A = (a^i_j)$ (with coefficients $b_1, \dots, b_n$) that is closest to the vector $(y_1, \dots, y_k)$ in $\mathbb{R}^k$. Thus it is about finding an orthogonal projection of the vector $(y_1, \dots, y_k)$ on the subspace generated by the columns of the matrix $A$. Using the theorem 3.5.7, the coefficients of this projection are $(b_1, \dots, b_n)^T = A^{(-1)}(y_1, \dots, y_k)^T$.
3.F.9. Using the least squares method, solve the system
$$\begin{aligned} 2x + y + 2z &= 1 \\ x + y + 3z &= 2 \\ 2x + y + z &= 0 \\ x + z &= -1 \end{aligned}$$
Solution. The system has no solution, since its matrix has rank 3, and the extended matrix has rank 4. The best approximation of the vector $b = (1, 2, 0, -1)$ can thus be obtained using the theorem 3.5.7 by the vector $A^{(-1)}b$. $AA^{(-1)}b$ is then the best approximation, the perpendicular projection
The positive real number $\langle x, x\rangle$ can be cancelled on both sides and thus $\lambda = \bar\lambda$, and we see that eigenvalues of Hermitian matrices are always real.
The characteristic polynomial $\det(A - \lambda E)$ has as many complex roots as is the dimension of the square matrix $A$ (including multiplicities), and all of them are actually real. Thus we have proved the important general result:
Proposition. The orthogonal complements of invariant subspaces of self-adjoint mappings are also invariant. Furthermore, the eigenvalues of a Hermitian matrix $A$ are always real.
The very definition ensures that the restriction of a self-adjoint mapping to an invariant subspace is again self-adjoint. Thus the latter proposition implies that there always exists an orthonormal basis of $V$ composed of eigenvectors. Indeed, start with any eigenvector $v_1$, normalize it, consider its linear hull $V_1$ and restrict the mapping to $V_1^\perp$. Consider next another eigenvector $v_2 \in V_1^\perp$, take $V_2 = \operatorname{span}(V_1 \cup \{v_2\})$, which is again invariant. Continue and construct the sequence of invariant subspaces $V_1 \subset V_2 \subset \dots \subset V_n = V$, building the orthonormal basis of eigenvectors, as expected.
Actually, it is easy to see directly that eigenvectors associated with different eigenvalues are perpendicular to each other. Indeed, if $\psi(u) = \lambda u$, $\psi(v) = \mu v$, then we obtain
$$\lambda\langle u, v\rangle = \langle \psi(u), v\rangle = \langle u, \psi(v)\rangle = \bar\mu\langle u, v\rangle = \mu\langle u, v\rangle.$$
Since $\lambda \ne \mu$, this is possible only if $\langle u, v\rangle = 0$.
Usually this result is formulated using projections onto eigensubspaces. Recall the properties of projections along subspaces, as discussed in 2.3.19. A projection $P : V \to V$ is a linear mapping satisfying $P^2 = P$. This means that the restriction of $P$ to its image is the identity and the projector is completely determined by choosing the subspaces $\operatorname{Im} P$ and $\operatorname{Ker} P$.
A projection $P : V \to V$ is called orthogonal if $\operatorname{Im} P \perp \operatorname{Ker} P$. Two orthogonal projections $P$, $Q$ are called mutually perpendicular if $\operatorname{Im} P \perp \operatorname{Im} Q$.
Spectral decomposition of self-adjoint mappings
Theorem (Spectral decomposition). For every self-adjoint mapping $\psi : V \to V$ on a vector space with scalar product there exists an orthonormal basis composed of eigenvectors. If $\lambda_1, \dots, \lambda_k$ are all distinct eigenvalues of $\psi$ and if $P_1, \dots, P_k$ are the corresponding orthogonal and mutually perpendicular projectors onto the eigenspaces corresponding to the eigenvalues, then
$$\psi = \lambda_1P_1 + \dots + \lambda_kP_k.$$
The dimensions of the images of these projections $P_i$ equal the algebraic multiplicities of the eigenvalues $\lambda_i$.
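The statement can be illustrated on a small symmetric matrix. The example below is hypothetical (it does not come from the text): the matrix with rows $(2,1)$ and $(1,2)$ has eigenvalues 1 and 3 with eigenvectors $(1,-1)$ and $(1,1)$, and the helper `projector` builds the orthogonal projector onto the line spanned by a real vector.

```python
from fractions import Fraction

def projector(v):
    """Orthogonal projector onto span{v} for a real vector v: v v^T / (v . v)."""
    vv = sum(c * c for c in v)
    return [[Fraction(a * b, vv) for b in v] for a in v]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[2, 1], [1, 2]]          # hypothetical self-adjoint (symmetric) matrix
P1 = projector([1, -1])       # eigenspace of the eigenvalue 1
P3 = projector([1, 1])        # eigenspace of the eigenvalue 3

# Spectral decomposition: A = 1 * P1 + 3 * P3.
recombined = [[1 * P1[i][j] + 3 * P3[i][j] for j in range(2)]
              for i in range(2)]
print(recombined)
```

The projectors are idempotent, mutually perpendicular, and their weighted sum returns the matrix itself.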
of the vector $b$ on the space generated by the columns of the matrix $A$.
Because the columns of the matrix $A$ are linearly independent, its pseudoinverse is given by the relation $A^{(-1)} = (A^TA)^{-1}A^T$. Hence the desired $x$ is
$$A^{(-1)}b = (-6/5, 7/3, 1/3)^T.$$
The projection (the best possible approximation to the column of the right side) is then the vector $(3/5, 32/15, 4/15, -13/15)$. □
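The same result can be reproduced by solving the normal equations $A^TAx = A^Tb$, which is equivalent to applying $(A^TA)^{-1}A^T$ when the columns of $A$ are linearly independent. This is a minimal sketch in exact arithmetic; the functions `solve` and `least_squares` are illustrative helpers.

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve the square system M x = rhs by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(r)]
           for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def least_squares(A, b):
    """Return (x, A x) where x solves the normal equations A^T A x = A^T b."""
    cols = list(zip(*A))
    AtA = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    Atb = [sum(p * q for p, q in zip(ci, b)) for ci in cols]
    x = solve(AtA, Atb)
    proj = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return x, proj

A = [[2, 1, 2], [1, 1, 3], [2, 1, 1], [1, 0, 1]]
b = [1, 2, 0, -1]
x, proj = least_squares(A, b)
print(x)
print(proj)
```

The residual $b - Ax$ is perpendicular to every column of $A$, which is the defining property of the orthogonal projection.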
3.4.8. Orthogonal diagonalization. Linear mappings which allow for orthonormal bases as in the latter theorem on spectral decomposition are called orthogonally diagonalizable. Of course, they are exactly the mappings for which we can find an orthonormal basis in which the matrix of the mapping is diagonal. We ask what they look like.
In the Euclidean case, this is simple: diagonal matrices are first of all symmetric, thus they are the self-adjoint mappings. As a corollary we note that an orthogonal mapping of a Euclidean space into itself is orthogonally diagonalizable if and only if it is self-adjoint. They are exactly the self-adjoint mappings with eigenvalues $\pm1$.
The situation is much more interesting on unitary spaces. Consider any linear mapping $\varphi : V \to V$ on a unitary space. Let $\varphi = \psi + i\eta$ be the (unique) decomposition of $\varphi$ into its Hermitian and anti-Hermitian part. If $\varphi$ has a diagonal matrix $D$ in a suitable orthonormal basis, then $D = \operatorname{Re} D + i\operatorname{Im} D$, where the real and the imaginary parts are exactly the matrices of $\psi$ and $\eta$. This follows from the uniqueness of the decomposition. Knowing this in the particular coordinates, we conclude the following commutation relations at the level of mappings: $\psi \circ \eta = \eta \circ \psi$ (i.e. the real and imaginary parts of $\varphi$ commute), and $\varphi \circ \varphi^* = \varphi^* \circ \varphi$ (since this clearly holds for all diagonal matrices). The mappings $\varphi : V \to V$ with the latter property are called the normal mappings.
A detailed characterization is given by the following theorem (stated in the notation of this paragraph):
Theorem. The following conditions on a mapping $\varphi : V \to V$ on a unitary space $V$ are equivalent:
(1) $\varphi$ is orthogonally diagonalizable,
(2) $\varphi^* \circ \varphi = \varphi \circ \varphi^*$ ($\varphi$ is a normal mapping),
(3) $\psi \circ \eta = \eta \circ \psi$ (the Hermitian and anti-Hermitian parts commute),
(4) if $A = (a_{ij})$ is the matrix of $\varphi$ in some orthonormal basis, and $\lambda_i$ are the $m = \dim V$ eigenvalues of $A$, then
$$\sum_{i,j=1}^m |a_{ij}|^2 = \sum_{i=1}^m |\lambda_i|^2.$$
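Proof. The implication (1) $\Rightarrow$ (2) was discussed above.
(2) $\Leftrightarrow$ (3): it suffices to calculate
$$\varphi \circ \varphi^* = (\psi + i\eta)(\psi - i\eta) = \psi^2 + \eta^2 + i(\eta\psi - \psi\eta),$$
$$\varphi^* \circ \varphi = (\psi - i\eta)(\psi + i\eta) = \psi^2 + \eta^2 + i(\psi\eta - \eta\psi).$$
Subtraction of the two lines yields
$$\varphi\varphi^* - \varphi^*\varphi = 2i(\eta\psi - \psi\eta).$$
(2) $\Rightarrow$ (1): If $\varphi$ is normal, then
$$\langle \varphi(u), \varphi(u)\rangle = \langle \varphi^*\varphi(u), u\rangle = \langle \varphi\varphi^*(u), u\rangle = \langle \varphi^*(u), \varphi^*(u)\rangle,$$
thus $\|\varphi(u)\| = \|\varphi^*(u)\|$.
Next, notice $(\varphi - \lambda\operatorname{id}_V)^* = (\varphi^* - \bar\lambda\operatorname{id}_V)$. Thus, if $\varphi$ is normal, then $(\varphi - \lambda\operatorname{id}_V)$ is normal too.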
If $\varphi(u) = \lambda u$, then $u$ is in the kernel of $\varphi - \lambda\operatorname{id}_V$. Thus the latter equality of norms of values for normal mappings and their adjoints ensures that $u$ is also in the kernel of $\varphi^* - \bar\lambda\operatorname{id}_V$. It follows that $\varphi^*(u) = \bar\lambda u$. We have proved, under the assumption (2), that $\varphi$ and $\varphi^*$ have the same eigenvectors and that they are associated to conjugated eigenvalues.
Similarly to our procedure with self-adjoint mappings, we now prove orthogonal diagonalizability. The latter procedure is based on the fact that the orthogonal complements to sums of eigenspaces are invariant subspaces.
Consider an eigenvector $u \in V$ with eigenvalue $\lambda$, and any $v \in \langle u\rangle^\perp$. We have
$$\langle \varphi(v), u\rangle = \langle v, \varphi^*(u)\rangle = \langle v, \bar\lambda u\rangle = \lambda\langle v, u\rangle = 0.$$
Thus $\varphi(v) \in \langle u\rangle^\perp$. The same occurs if $u$ is replaced by a sum of eigenvectors instead.
(1) $\Rightarrow$ (4): the expression $\sum_{i,j} |a_{ij}|^2$ is the trace of the matrix $AA^*$, which is the matrix of the mapping $\varphi \circ \varphi^*$. Therefore its value does not depend on the choice of the orthonormal basis. Thus if $\varphi$ is diagonalizable, this expression equals exactly $\sum_i |\lambda_i|^2$.
(4) $\Rightarrow$ (1): This part of the proof is a direct corollary of the Schur theorem on unitary triangulation of an arbitrary linear mapping $V \to V$, which we prove later in 3.4.15. This theorem says that for every linear mapping $\varphi : V \to V$ there exists an orthonormal basis under which $\varphi$ has an upper triangular matrix. Then all the eigenvalues of $\varphi$ appear on its diagonal. Since we have already shown that the expression $\sum_{i,j} |a_{ij}|^2$ does not depend on the choice of the orthonormal basis, all elements in the upper triangular matrix which are not on the diagonal must be zero. □
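Conditions (2) and (4) can be compared on two small examples. The sketch below is restricted to $2 \times 2$ matrices so that the eigenvalues follow from the quadratic formula; both matrices and all function names are illustrative assumptions.

```python
import cmath

def conj_transpose(A):
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def is_normal(A):
    """Condition (2): A A* = A* A."""
    S = conj_transpose(A)
    return matmul(A, S) == matmul(S, A)

def eigenvalues_2x2(A):
    """Roots of the characteristic polynomial of a 2x2 matrix."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def entries_sq(A):
    """Sum of |a_ij|^2, the trace of A A*."""
    return sum(abs(a) ** 2 for row in A for a in row)

def eigen_sq(A):
    return sum(abs(lam) ** 2 for lam in eigenvalues_2x2(A))

normal = [[1, -1], [1, 1]]    # normal, eigenvalues 1 + i and 1 - i
jordan = [[1, 1], [0, 1]]     # not normal, double eigenvalue 1
print(is_normal(normal), entries_sq(normal), eigen_sq(normal))
print(is_normal(jordan), entries_sq(jordan), eigen_sq(jordan))
```

For the normal matrix the two sums agree, while for the Jordan block the off-diagonal entry makes the sum of squared entries strictly larger.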
Remark. We can rephrase the main statement of the latter theorem in terms of matrices. A mapping is normal if and only if its matrix $A$ satisfies $AA^* = A^*A$ in some orthonormal basis (and equivalently in any orthonormal basis). Such matrices are called normal. Moreover, we can consider the last theorem as a generalization of standard calculations with complex numbers. The linear mappings appear similar to complex numbers in their algebraic form. The role of real numbers is played by self-adjoint mappings, and the unitary mappings play the role of the complex units $\cos t + i \sin t \in \mathbb{C}$. The following consequence of the theorem shows the link to the property $\cos^2 t + \sin^2 t = 1$.
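The matrix criterion and condition (4) of the theorem can be checked numerically on small examples. The following is only a sketch for $2 \times 2$ matrices, using the Python standard library; the helper names (`is_normal`, `entry_sum`, `spectrum_sum`) and the two sample matrices are assumptions made for this illustration.

```python
import cmath

def adjoint(a):
    """Conjugate transpose A* of a square matrix given as nested lists."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_normal(a, tol=1e-12):
    """Check A A* = A* A entry by entry."""
    s = adjoint(a)
    left, right = matmul(a, s), matmul(s, a)
    n = len(a)
    return all(abs(left[i][j] - right[i][j]) < tol
               for i in range(n) for j in range(n))

def eigenvalues_2x2(a):
    """Roots of the characteristic polynomial t^2 - tr(A) t + det(A)."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2

def entry_sum(a):
    """Sum of |a_ij|^2, i.e. the trace of A A*."""
    return sum(abs(x) ** 2 for row in a for x in row)

def spectrum_sum(a):
    """Sum of |lambda_i|^2 over the spectrum (2x2 case only)."""
    return sum(abs(lam) ** 2 for lam in eigenvalues_2x2(a))

normal = [[1, -1], [1, 1]]   # eigenvalues 1 + i and 1 - i
shear = [[1, 1], [0, 1]]     # a single Jordan block, not normal

# normal: both sums equal 4; shear: entry sum 3, spectrum sum 2
```

For the normal matrix the two sums agree, while for the shear the entry sum is strictly larger, in line with the upper triangular argument in the proof.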
Corollary. The unitary mappings on a unitary space $V$ are exactly those normal mappings $\varphi$ on $V$ for which the unique decomposition $\varphi = \psi + i\eta$ into Hermitian and anti-Hermitian parts satisfies $\psi^2 + \eta^2 = \mathrm{id}_V$.
Proof. If $\varphi$ is unitary, then $\varphi\varphi^* = \mathrm{id}_V = \varphi^*\varphi$ and thus $\varphi\varphi^* = (\psi + i\eta)(\psi - i\eta) = \psi^2 + 0 + \eta^2 = \mathrm{id}_V$. On the other hand, if $\varphi$ is normal, we can read the latter computation backwards, which proves the other implication. □
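The corollary can be illustrated on a concrete matrix. The sketch below assumes a sample unitary matrix (a plane rotation multiplied by a complex unit) and splits it as $\psi + i\eta$ with $\psi = (A + A^*)/2$ and $\eta = (A - A^*)/(2i)$; all function names are hypothetical and only the standard library is used.

```python
import cmath
import math

def adjoint(a):
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hermitian_parts(a):
    """Return (psi, eta), both Hermitian, with A = psi + i * eta."""
    s = adjoint(a)
    n = len(a)
    psi = [[(a[i][j] + s[i][j]) / 2 for j in range(n)] for i in range(n)]
    eta = [[(a[i][j] - s[i][j]) / 2j for j in range(n)] for i in range(n)]
    return psi, eta

def psi2_plus_eta2(a):
    psi, eta = hermitian_parts(a)
    p2, e2 = matmul(psi, psi), matmul(eta, eta)
    n = len(a)
    return [[p2[i][j] + e2[i][j] for j in range(n)] for i in range(n)]

def distance_from_identity(m):
    """Largest entrywise deviation of m from the unit matrix."""
    n = len(m)
    return max(abs(m[i][j] - (1 if i == j else 0))
               for i in range(n) for j in range(n))

# Sample data (an assumption of this sketch): a rotation times a complex unit.
phase = cmath.exp(0.7j)
c, s = math.cos(0.3), math.sin(0.3)
unitary = [[phase * c, -phase * s], [phase * s, phase * c]]
# The same matrix scaled by 2 is still normal but no longer unitary.
scaled = [[2 * x for x in row] for row in unitary]
```

For `unitary` the sum $\psi^2 + \eta^2$ is the unit matrix up to rounding, while for `scaled` it is four times the unit matrix, so the condition really singles out the unitary mappings among the normal ones.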
3.4.9. Roots of matrices. Non-negative real numbers are exactly those which are squares of real numbers (and thus we may find their square roots). At the same time, their non-negative square roots are uniquely defined. Now we observe a similar behaviour of matrices of the form $B = A^*A$. Of course, these are the matrices of the compositions of mappings $\varphi$ with their adjoints.
By definition,
(1)        $\langle Bx, x \rangle = \langle A^*Ax, x \rangle = \langle Ax, Ax \rangle \ge 0$
for all vectors $x$. Furthermore, we clearly have
$B^* = (A^*A)^* = A^*A = B.$
Hermitian matrices $B$ with the property $\langle Bx, x \rangle \ge 0$ for all $x$ are called positive semidefinite matrices. If the zero value is attained only for $x = 0$, they are called positive definite. Analogously, we speak of positive definite and positive semidefinite (self-adjoint) mappings $\varphi : V \to V$.
For every mapping $\varphi : V \to V$ we can define its square root as a mapping $\psi$ such that $\psi \circ \psi = \varphi$. The next theorem completely describes the situation when restricting to positive semidefinite mappings.
Positive semidefinite square roots
Theorem. For each positive semidefinite square matrix $B$, there is a uniquely defined positive semidefinite square root $\sqrt{B}$.
If $P$ is any matrix such that $P^{-1}BP = D$ is diagonal, then $\sqrt{B} = P\sqrt{D}P^{-1}$, where $D$ has the (non-negative) eigenvalues of $B$ on its diagonal and $\sqrt{D}$ is the matrix with the non-negative square roots of these values on its diagonal.
Proof. Since $B$ is a matrix of a self-adjoint mapping $\varphi$, there is even an orthonormal $P$ as in the theorem with all eigenvalues in the diagonal of $D$ non-negative. Consider $C = \sqrt{B}$ as defined in the second claim and notice that indeed
$C^2 = P\sqrt{D}P^{-1}P\sqrt{D}P^{-1} = PDP^{-1} = B.$
Thus the mapping $\psi$ given by $C$ must have the same eigenvectors as $\varphi$ and thus these two mappings share the decompositions of $\mathbb{K}^n$ into mutually orthogonal eigenspaces. In particular, both of them will share the bases in which they have diagonal matrices and thus the definition of $\sqrt{D}$ must be unique in each such basis. This proves that the definition of $\sqrt{B}$ does not depend on our particular choice of the diagonalization of $\varphi$. □
Notice there could be a lot of different roots, if we relax the positivity condition on $\sqrt{B}$, see ??.
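A small exact computation may help to see the formula $\sqrt{B} = P\sqrt{D}P^{-1}$ at work. The sketch below assumes the sample matrix $B$ with rows $(5, 4)$ and $(4, 5)$, whose eigenvalues $9$ and $1$ happen to be perfect squares, so that rational arithmetic suffices; the matrices `P`, `P_inv` and the helper name are choices made for this example only.

```python
from fractions import Fraction
from math import isqrt

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def sqrt_via_diagonalization(p, p_inv, d_diag):
    """Return P * sqrt(D) * P^{-1}; assumes the entries of D are perfect squares."""
    n = len(p)
    roots = [Fraction(isqrt(x)) for x in d_diag]
    sqrt_d = [[roots[i] if i == j else Fraction(0) for j in range(n)]
              for i in range(n)]
    return matmul(matmul(p, sqrt_d), p_inv)

B = [[5, 4], [4, 5]]
P = [[1, 1], [1, -1]]            # columns are eigenvectors of B
P_inv = [[Fraction(1, 2), Fraction(1, 2)],
         [Fraction(1, 2), Fraction(-1, 2)]]

C = sqrt_via_diagonalization(P, P_inv, [9, 1])   # C == [[2, 1], [1, 2]]

# A different root of B that is not positive semidefinite
# (its eigenvalues are 3 and -1):
other_root = [[1, 2], [2, 1]]
```

Both `C` and `other_root` square to `B`, but only `C` is positive semidefinite, which illustrates why the positivity condition is needed for uniqueness.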
3.4.10. Spectra and nilpotent mappings. We return to the behavior of linear mappings in full generality. We continue to work with real or complex vector spaces, but without necessarily fixing a scalar product there.
Recall that the spectrum of a linear mapping $f : V \to V$ is a sequence of roots of the characteristic polynomial of the mapping $f$, counting multiplicities. The algebraic multiplicity of an eigenvalue is its multiplicity as a root of the characteristic polynomial. The geometric multiplicity of an eigenvalue is the dimension of the corresponding subspace of eigenvectors.
A linear mapping $f : V \to V$ is called nilpotent if there exists an integer $k \ge 1$ such that the iterated mapping $f^k$ is identically zero. The smallest $k$ with this property is called the degree of nilpotency of the mapping $f$. The mapping $f : V \to V$ is called cyclic if there exists a basis $(u_1, \dots, u_n)$ of the space $V$ such that $f(u_1) = 0$ and $f(u_i) = u_{i-1}$ for all $i = 2, \dots, n$. In other words, the matrix of $f$ in this basis is of the form
$$A = \begin{pmatrix} 0 & 1 & 0 & \cdots \\ 0 & 0 & 1 & \cdots \\ \vdots & & \ddots & \ddots \end{pmatrix}.$$
If $f(v) = a\,v$, then $f^k(v) = a^k \cdot v$ for every natural $k$. Note that the spectrum of a nilpotent mapping can contain only the zero scalar (and this is always present).
By the definition, every cyclic mapping is nilpotent. Moreover, its degree of nilpotency equals the dimension of the space $V$. The derivative operator on polynomials, $D(x^k) = kx^{k-1}$, is an example of a cyclic mapping on the spaces $\mathbb{K}_n[x]$ of all polynomials of degree at most $n$ over the scalars $\mathbb{K}$.
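The claim about the derivative operator is easy to test by machine. The following sketch assumes the monomial basis $1, x, \dots, x^n$ (so the matrix carries the factors $k$ above the diagonal; rescaling the basis to $x^k/k!$ would turn them into ones) and computes the degree of nilpotency by repeated multiplication. The function names are hypothetical.

```python
def derivative_matrix(n):
    """Matrix of D on polynomials of degree <= n in the basis 1, x, ..., x^n."""
    size = n + 1
    m = [[0] * size for _ in range(size)]
    for k in range(1, size):
        m[k - 1][k] = k          # D(x^k) = k * x^(k-1)
    return m

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def nilpotency_degree(m):
    """Smallest k with m^k = 0, or None if no power up to the size vanishes."""
    size = len(m)
    power = [row[:] for row in m]
    for k in range(1, size + 1):
        if all(x == 0 for row in power for x in row):
            return k
        power = matmul(power, m)
    return None

D3 = derivative_matrix(3)        # acts on polynomials of degree at most 3
degree = nilpotency_degree(D3)   # 4, the dimension of the space
```

The degree of nilpotency equals $n + 1$, the dimension of $\mathbb{K}_n[x]$, as expected for a cyclic mapping.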
Perhaps surprisingly, this is also true the other way round - every nilpotent mapping is a direct sum of cyclic mappings. A proof of this claim takes much work. So we formulate first the results we are aiming at, and only then come back to the technical work.
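In the resulting theorem describing the Jordan decomposition, the crucial role is played by vector (sub)spaces and linear mappings with a single eigenvalue $\lambda$ given by the matrix
(1)   $$J = \begin{pmatrix} \lambda & 1 & 0 & \cdots & 0 \\ 0 & \lambda & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 0 & \cdots & \lambda \end{pmatrix}.$$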
These matrices (and the corresponding invariant subspaces) are called Jordan blocks.^4
^4 Camille Jordan was a famous French mathematician working in analysis and algebra at the end of the 19th and the beginning of the 20th centuries.
Jordan canonical form
Theorem. Let $V$ be a real or complex vector space of dimension $n$. Let $f : V \to V$ be a linear mapping with $n$ eigenvalues (in the chosen domain of scalars), counting algebraic multiplicities. Then there exists a unique decomposition of the space $V$ into the direct sum of subspaces
$V = V_1 \oplus \cdots \oplus V_k$
where not only $f(V_i) \subset V_i$, but the restriction of $f$ to each $V_i$ has a single eigenvalue $\lambda_i$, and the restriction of $f - \lambda_i\,\mathrm{id}_{V_i}$ to $V_i$ is either cyclic or is the zero mapping. In particular, there is a suitable basis in which $f$ has a block-diagonal matrix $J$ with Jordan blocks along the diagonal.
We say that the matrix $J$ from the theorem is in Jordan canonical form. In the language of matrices, we can rephrase the theorem as follows:
Corollary. For each square matrix $A$ over complex scalars, there is an invertible matrix $P$ such that $A = P^{-1} J P$ and $J$ is in canonical Jordan form.
The matrix $P$ is the transition matrix to the basis from the theorem above. Notice that the total number of ones over the diagonal in $J$ equals the difference between the total algebraic and geometric multiplicity of the eigenvalues. The ordering of the blocks in the matrix corresponds to the chosen ordering of the subspaces $V_i$ in the direct sum. Thus, the matrix $J$ is unique up to the ordering of the Jordan blocks. There is, however, freedom in the choice of the basis for such a Jordan canonical form.
3.4.11. Remarks. The Jordan canonical form theorem is already proved for the cases when all eigenvalues are either distinct or when the geometric and algebraic multiplicities of the eigenvalues are the same. In particular, it is proved for all unitary, normal and self-adjoint mappings on unitary vector spaces.
A consequence of the Jordan canonical form theorem is that for every linear mapping $f$, every eigenvalue of $f$ uniquely determines an invariant subspace that corresponds to all Jordan blocks with this particular eigenvalue. We shall call this subspace the root subspace corresponding to the given eigenvalue.
We mention one useful corollary of the Jordan theorem (which is already used in the discussion about the behavior of Markov chains). Assume that the eigenvalues of our mapping $f$ are all of absolute value less than one. Then repeated application of the linear mapping to any vector $v \in V$ makes all coordinates of $f^k(v)$ tend to zero as $k$ grows.
Indeed, assume $f$ has only one eigenvalue $\lambda$ on all the complex space $V$ and that $f - \lambda\,\mathrm{id}_V$ is cyclic (that is, we consider only one Jordan block separately). Let $v_1, \dots, v_\ell$ be the corresponding basis. Then the theorem says that $f(v_2) = \lambda v_2 + v_1$, $f^2(v_2) = \lambda^2 v_2 + \lambda v_1 + \lambda v_1 = \lambda^2 v_2 + 2\lambda v_1$, and similarly for other $v_i$'s and higher powers. In any case, the iteration of $f$ results in higher and higher powers of $\lambda$ for all non-zero components. The smallest of these powers can differ from the largest one only by less than the dimension of $V$. The coefficients are binomial coefficients, so they grow only polynomially in the number of iterations and are dominated by the geometric decay of the powers of $\lambda$.
This proves the claim. The same argument can be used to prove that a mapping with all eigenvalues of absolute value strictly greater than one leads to unbounded growth of all coordinates of the iterations $f^k(v)$.
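The behavior of the iterations can be observed on a single Jordan block. The sketch below is only an illustration under assumed data: a $3 \times 3$ block with eigenvalue $0.9$ or $1.1$ and the starting vector $(1, 1, 1)$; the helper names are hypothetical.

```python
def jordan_block(lam, size):
    """Jordan block with eigenvalue lam and ones above the diagonal."""
    return [[lam if i == j else (1 if j == i + 1 else 0) for j in range(size)]
            for i in range(size)]

def apply(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def iterate(m, v, steps):
    """Return m^steps applied to v."""
    for _ in range(steps):
        v = apply(m, v)
    return v

start = [1.0, 1.0, 1.0]

# |lambda| < 1: the coordinates may grow at first (binomial coefficients),
# but the powers of lambda win in the long run.
early = iterate(jordan_block(0.9, 3), start, 5)
contracting = iterate(jordan_block(0.9, 3), start, 400)

# |lambda| > 1: all coordinates grow without bound.
expanding = iterate(jordan_block(1.1, 3), start, 400)
```

After five steps the first coordinate is still larger than at the start, while after 400 steps all coordinates of `contracting` are negligible and all coordinates of `expanding` are huge.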
The remainder of this part of the third chapter is devoted to the proof of the Jordan theorem and a few necessary lemmas. It is much more difficult than anything so far. The reader can skip it, until the beginning of the fifth part of this chapter in case of any problems with reading it.
3.4.12. Root spaces. We have already seen by explicit examples that the eigensubspaces completely describe geometric properties for some linear mappings only. Thus we now introduce a more subtle tool, the root subspaces.
Definition. A non-zero vector $u \in V$ is called a root vector of the linear mapping $\varphi : V \to V$ if there exists an $a \in \mathbb{K}$ and an integer $k > 0$ such that $(\varphi - a\,\mathrm{id}_V)^k(u) = 0$. This means that the $k$-th iteration of the given mapping sends $u$ to zero. The set of all root vectors corresponding to a fixed scalar $\lambda$ along with the zero vector is called the root subspace associated with the scalar $\lambda \in \mathbb{K}$. We denote it by $\mathcal{R}_\lambda$.
If $u$ is a root vector and the integer $k$ from the definition is chosen as the smallest possible one for $u$, then $(\varphi - a\,\mathrm{id}_V)^{k-1}(u)$ is an eigenvector with the eigenvalue $a$. Thus we have $\mathcal{R}_\lambda = \{0\}$ for all scalars $\lambda$ which are not in the spectrum of the mapping $\varphi$.
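Proposition. Let $\varphi : V \to V$ be a linear mapping. Then
(1) $\mathcal{R}_\lambda \subset V$ is a vector subspace for every $\lambda \in \mathbb{K}$,
(2) for every $\lambda, \mu \in \mathbb{K}$, the subspace $\mathcal{R}_\lambda$ is invariant with respect to the linear mapping $(\varphi - \mu\,\mathrm{id}_V)$. In particular, $\mathcal{R}_\lambda$ is invariant with respect to $\varphi$,
(3) if $\mu \ne \lambda$, then $(\varphi - \mu\,\mathrm{id}_V)|_{\mathcal{R}_\lambda}$ is invertible,
(4) the mapping $(\varphi - \lambda\,\mathrm{id}_V)|_{\mathcal{R}_\lambda}$ is nilpotent.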
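Proof. (1) Checking the properties of a vector subspace is easy and is left to the reader.
(2) Assume that $(\varphi - \lambda\,\mathrm{id}_V)^k(u) = 0$ and put $v = (\varphi - \mu\,\mathrm{id}_V)(u)$. Then
$(\varphi - \lambda\,\mathrm{id}_V)^k(v) = (\varphi - \lambda\,\mathrm{id}_V)^k\big((\varphi - \lambda\,\mathrm{id}_V) + (\lambda - \mu)\,\mathrm{id}_V\big)(u)$
$= (\varphi - \lambda\,\mathrm{id}_V)^{k+1}(u) + (\lambda - \mu) \cdot (\varphi - \lambda\,\mathrm{id}_V)^k(u) = 0.$
(3) If $u \in \operatorname{Ker}(\varphi - \mu\,\mathrm{id}_V)|_{\mathcal{R}_\lambda}$, then
$(\varphi - \lambda\,\mathrm{id}_V)(u) = (\varphi - \mu\,\mathrm{id}_V)(u) + (\mu - \lambda)\,u = (\mu - \lambda)\,u.$
This implies $0 = (\varphi - \lambda\,\mathrm{id}_V)^k(u) = (\mu - \lambda)^k\,u$ and thus also $u = 0$ for $\lambda \ne \mu$.
(4) Choose a basis $e_1, \dots, e_p$ of the subspace $\mathcal{R}_\lambda$. By definition, there exist integers $k_i$ such that $(\varphi - \lambda\,\mathrm{id}_V)^{k_i}(e_i) = 0$. In particular, the entire mapping $(\varphi - \lambda\,\mathrm{id}_V)|_{\mathcal{R}_\lambda}$ must be nilpotent, since its power with the exponent $\max_i k_i$ vanishes on the whole basis. □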
3.4.13. Quotient spaces. Our next aim is to show that the dimension of the root spaces always equals the algebraic multiplicity of the corresponding eigenvalues. First, we introduce some general useful technical tools.
Quotient spaces
Definition. Let $U \subset V$ be a vector subspace. Define an equivalence relation on the set of all vectors in $V$ by $v_1 \sim v_2$ if and only if $v_1 - v_2 \in U$. Axioms of equivalence are easy to check. The set $V/U$ of the classes of this equivalence is equipped with the operations defined by using representatives. That is, for classes $[v]$ and $[w]$ determined by the vectors $v$ and $w$, set $[v] + [w] = [v + w]$, $a\,[v] = [a\,v]$. This is a well defined vector space called the quotient vector space of the space $V$ by the subspace $U$.
Check the correctness of the definition of the operations and verify all axioms of the vector space in detail!
The classes (vectors) in the quotient space $V/U$ will often be denoted as formal sums of one representative with all vectors in the subspace $U$, for instance $u + U \in V/U$, $u \in V$. The class $0 + U$ is the zero vector in $V/U$, i.e. the vector $u \in V$ represents the zero element in $V/U$ if and only if $u \in U$.
Trivial examples are $V/\{0\} \cong V$, $V/V \cong \{0\}$. Another example is the quotient space of the plane $\mathbb{R}^2$ factored by any one-dimensional subspace (here, every one-dimensional subspace $U \subset \mathbb{R}^2$ is a line passing through the origin). Then the equivalence classes are all the lines parallel to this line.
Proposition. Let $U \subset V$ be a vector subspace and $(u_1, \dots, u_n)$ be a basis of $V$, such that $(u_1, \dots, u_k)$ is a basis of $U$. Then $\dim V/U = n - k$ and the vectors
$u_{k+1} + U, \dots, u_n + U$
form a basis of $V/U$.
Proof. $V = \operatorname{span}\{u_1, \dots, u_n\}$, so $V/U = \operatorname{span}\{u_1 + U, \dots, u_n + U\}$. But the first $k$ generators are zero, thus $V/U = \operatorname{span}\{u_{k+1} + U, \dots, u_n + U\}$. Assume that the linear combination $a_{k+1}(u_{k+1} + U) + \cdots + a_n(u_n + U) = (a_{k+1}u_{k+1} + \cdots + a_n u_n) + U = 0 \in V/U$ vanishes. Equivalently, this linear combination of the vectors $u_{k+1}, \dots, u_n$ belongs to the subspace $U$. Since $U$ is generated by the remaining vectors in the basis of $V$, the latter linear combination is necessarily zero, and so all coefficients $a_i$ are zero. This proves the linear independence of the generators of $V/U$. □
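3.4.14. Induced mappings on quotient spaces. Assume that $U \subset V$ is an invariant subspace of a linear mapping $\varphi : V \to V$ and that $(u_1, \dots, u_n)$ is a basis of $V$ whose first $k$ vectors form a basis of $U$. In this basis, the matrix of $\varphi$ has the block form $A = \begin{pmatrix} B & C \\ 0 & D \end{pmatrix}$. We prove the following lemma: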
Lemma. (1) the mapping $\varphi$ induces a linear mapping
$\varphi_{V/U} : V/U \to V/U$, $\varphi_{V/U}(v + U) = \varphi(v) + U$ with the matrix $D$ under the induced basis $u_{k+1} + U, \dots, u_n + U$ on $V/U$,
(2) the characteristic polynomial of $\varphi_{V/U}$ divides the characteristic polynomial of $\varphi$.
Proof. For $v, w \in V$, $u \in U$, $a \in \mathbb{K}$ we have $\varphi(v + u) \in \varphi(v) + U$ (because $U$ is invariant), $(\varphi(v) + U) + (\varphi(w) + U) = \varphi(v + w) + U$ and $a\,(\varphi(v) + U) = a\,\varphi(v) + U = \varphi(a\,v) + U$ (because $\varphi$ is linear), thus the mapping $\varphi_{V/U}$ is well-defined and linear. Moreover the very definition of the matrix of a mapping in a basis implies that the matrix of $\varphi_{V/U}$ in the induced basis on $V/U$ is exactly the matrix $D$ (when computing the images of the basis elements, the coefficients of the matrix $C$ contribute only to the class $U$).
The characteristic polynomial of the induced mapping $\varphi_{V/U}$ is thus $|D - \lambda E|$, while the characteristic polynomial of the original mapping $\varphi$ is $|A - \lambda E| = |B - \lambda E|\,|D - \lambda E|$. □
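Corollary. Let $V$ be a vector space over $\mathbb{K}$ of dimension $n$ and let $\varphi : V \to V$ be a linear mapping whose spectrum contains $n$ elements (that is, all roots of the characteristic polynomial lie in $\mathbb{K}$ and we count their multiplicities). Then there exists a sequence of invariant subspaces $\{0\} = V_0 \subset V_1 \subset \cdots \subset V_n = V$ with dimensions $\dim V_i = i$. Consider a basis $u_1, \dots, u_n$ of the space $V$ such that $V_i = \operatorname{span}\{u_1, \dots, u_i\}$. In this basis, the matrix of the mapping $\varphi$ is an upper triangular matrix
$$\begin{pmatrix} \lambda_1 & * & \cdots & * \\ 0 & \lambda_2 & \cdots & * \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \lambda_n \end{pmatrix}$$
with the spectrum $\lambda_1, \dots, \lambda_n$ on the diagonal.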
Proof. The subspaces $V_i$ are constructed inductively. Let $\{\lambda_1, \dots, \lambda_n\}$ be the spectrum of the mapping $\varphi$. Thus the characteristic polynomial of the mapping $\varphi$ is of the form $\pm(\lambda - \lambda_1) \cdots (\lambda - \lambda_n)$. We choose $V_0 = \{0\}$, $V_1 = \operatorname{span}\{u_1\}$, where $u_1$ is an eigenvector with eigenvalue $\lambda_1$. According to the previous lemma, the characteristic polynomial of the mapping $\varphi_{V/V_1}$ is of the form $\pm(\lambda - \lambda_2) \cdots (\lambda - \lambda_n)$. Assume that we have already constructed linearly independent vectors $u_1, \dots, u_k$ and invariant subspaces $V_i = \operatorname{span}\{u_1, \dots, u_i\}$, $i = 1, \dots, k < n$, such that the characteristic polynomial of $\varphi_{V/V_k}$ is of the form $\pm(\lambda - \lambda_{k+1}) \cdots (\lambda - \lambda_n)$ and $\varphi(u_i) \in (\lambda_i \cdot u_i + V_{i-1})$ for all $i = 1, \dots, k$.
We want to add one more vector $u_{k+1}$ with analogous properties. There exists an eigenvector $u_{k+1} + V_k \in V/V_k$ of the mapping $\varphi_{V/V_k}$ with the eigenvalue $\lambda_{k+1}$. Consider the space $V_{k+1} = \operatorname{span}\{u_1, \dots, u_{k+1}\}$. If the vector $u_{k+1}$ were a linear combination of the vectors $u_1, \dots, u_k$, then $u_{k+1} + V_k$ would be the zero class in $V/V_k$. But this is not possible. Thus $\dim V_{k+1} = k + 1$. It remains to study the induced mapping $\varphi_{V/V_{k+1}}$. The characteristic polynomial of this mapping is of degree $n - k - 1$ and divides the characteristic polynomial of the mapping $\varphi$. But completing the vectors $u_1, \dots, u_{k+1}$ to a basis of $V$ yields a block matrix of the mapping $\varphi$ with an upper triangular submatrix $B$ in the left upper corner and zero in the left lower corner. The diagonal elements are exactly the scalars $\lambda_1, \dots, \lambda_{k+1}$. Therefore the roots of the characteristic polynomial of the induced mapping have the required properties. □
Remark. If $V$ decomposes into the direct sum of eigensubspaces for $\varphi$, the latter results do not say anything new. But their significance consists in the fact that only the existence of $\dim V$ roots of the characteristic polynomial (counting multiplicities) is assumed. This is ensured whenever the field $\mathbb{K}$ is algebraically closed, for instance the complex numbers $\mathbb{C}$. As a direct consequence we see that the determinant and the trace of the mapping $\varphi$ are always the product and the sum of the elements in the spectrum, respectively.
This can be also used for all real matrices. Just consider them to be complex, calculate the determinant or the trace as the product or sum of eigenvalues and because both determinant and the trace are algebraic expressions in terms of the elements of the matrix, the results will be correct.
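As a quick illustration of the last remark, the sketch below takes an assumed real $2 \times 2$ matrix without real eigenvalues, computes its complex eigenvalues from the characteristic polynomial, and compares their product and sum with the determinant and the trace. The matrix and the helper name are choices made for this example.

```python
import cmath

def eigenvalues_2x2(a):
    """Complex roots of t^2 - tr(A) t + det(A)."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2

A = [[1, -2], [3, 1]]            # real matrix, trace 2, determinant 7
lam1, lam2 = eigenvalues_2x2(A)  # 1 + i*sqrt(6) and 1 - i*sqrt(6)

product = lam1 * lam2            # equals det(A) up to rounding
total = lam1 + lam2              # equals tr(A) up to rounding
```

Although both eigenvalues are non-real, their product and sum are the real numbers $7$ and $2$.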
3.4.15. Orthogonal triangulation. If we are given a scalar product on a vector space $V$ and $U \subset V$ is a subspace, then clearly $V/U \cong U^\perp$, where $v \in U^\perp$ is identified with $v + U$. Moreover, each class of the quotient space $V/U$ contains exactly one vector from $U^\perp$ (the difference of two such vectors is in $U \cap U^\perp$). We can exploit this observation in every inductive step of the proof of the corollary above. Choose the representative $u_{k+1} \in V_k^\perp$ of the eigenvector of $\varphi_{V/V_k}$. This modification leads to an orthogonal basis with the properties required in the claim about triangulation in the corollary above. Therefore there exists such an orthonormal basis, and we arrive at a very important theorem:
Schur's orthogonal triangulation theorem
Theorem. Let $\varphi : V \to V$ be a linear mapping on a vector space with scalar product. Let there be $m = \dim V$ eigenvalues, counting multiplicities. Then there exists an orthonormal basis of the space $V$ such that the matrix of $\varphi$ in this basis is upper triangular with eigenvalues $\lambda_1, \dots, \lambda_m$ on the diagonal.
3.4.16. Theorem. Let $\varphi : V \to V$ be a linear mapping and $\lambda_1, \dots, \lambda_k$ be all distinct eigenvalues. Then the sum of the root spaces $\mathcal{R}_{\lambda_1}, \dots, \mathcal{R}_{\lambda_k}$ is direct. Furthermore, for every eigenvalue $\lambda$ the dimension of the subspace $\mathcal{R}_\lambda$ equals the algebraic multiplicity of $\lambda$.
Proof. We prove first the independence of nonzero vectors from different root spaces. We proceed by induction over the number $k$ of root spaces. The claim is obvious if $k = 1$. Assume that the theorem holds for fewer than $k > 1$ spaces and assume that vectors $u_1 \in \mathcal{R}_{\lambda_1}, \dots, u_k \in \mathcal{R}_{\lambda_k}$ satisfy $u_1 + \cdots + u_k = 0$. Then $(\varphi - \lambda_k\,\mathrm{id}_V)^j(u_k) = 0$ for suitable $j$, and moreover all $v_i = (\varphi - \lambda_k\,\mathrm{id}_V)^j(u_i)$ are non-zero vectors in $\mathcal{R}_{\lambda_i}$, $i = 1, \dots, k - 1$, whenever the $u_i$ are non-zero, by Proposition 3.4.12. But at the same time
$v_1 + \cdots + v_{k-1} = (\varphi - \lambda_k\,\mathrm{id}_V)^j\Big(\sum_{i=1}^{k} u_i\Big) = 0$
and, according to the inductive assumption, all $v_i$ are zero. But then also all $u_i$, $1 \le i < k$, must vanish and thus $u_k = 0$, too. This proves the first claim.
It remains to consider the dimensions of the root spaces $\mathcal{R}_\lambda$. Consider an eigenvalue $\lambda$ of $\varphi$, use the same notation $\varphi$ for the restriction $\varphi|_{\mathcal{R}_\lambda}$ and write $\psi : V/\mathcal{R}_\lambda \to V/\mathcal{R}_\lambda$ for the mapping induced by $\varphi$ on the quotient space.
Assume that the dimension of $\mathcal{R}_\lambda$ is strictly smaller than the algebraic multiplicity of the root $\lambda$ of the characteristic polynomial. In view of Lemma 3.4.14, $\lambda$ is also an eigenvalue of the mapping $\psi$. Let $(v + \mathcal{R}_\lambda) \in V/\mathcal{R}_\lambda$ be the corresponding eigenvector, that is, $\psi(v + \mathcal{R}_\lambda) = \lambda\,(v + \mathcal{R}_\lambda)$. Then $v \notin \mathcal{R}_\lambda$ and $\varphi(v) = \lambda v + w$ for suitable $w \in \mathcal{R}_\lambda$. Thus $w = (\varphi - \lambda\,\mathrm{id}_V)(v)$ and $(\varphi - \lambda\,\mathrm{id}_V)^j(w) = 0$ for suitable $j$. We conclude that $(\varphi - \lambda\,\mathrm{id}_V)^{j+1}(v) = 0$, which contradicts the choice $v \notin \mathcal{R}_\lambda$.
It follows that the dimension of $\mathcal{R}_\lambda$ equals the algebraic multiplicity of the root $\lambda$ of the characteristic polynomial of the mapping $\varphi : V \to V$. □
Combining the latter theorem with the triangulation result from Corollary 3.4.14, we can formulate:
Corollary. Consider a linear mapping $\varphi : V \to V$ on a vector space $V$ over scalars $\mathbb{K}$, whose entire spectrum is in $\mathbb{K}$. Then $V = \mathcal{R}_{\lambda_1} \oplus \cdots \oplus \mathcal{R}_{\lambda_n}$ is the direct sum of the root subspaces. If we choose suitable bases for these subspaces, then under this basis $\varphi$ has block-diagonal form with upper triangular matrices in the blocks and eigenvalues $\lambda_i$ on the diagonal.
3.4.17. Nilpotent and cyclic mappings. Now almost everything is prepared for the discussion about canonical forms of matrices. It only remains to clarify the relation between cyclic and nilpotent mappings and to combine the results already proved.
Theorem. Let $\varphi : V \to V$ be a nilpotent linear mapping. Then there exists a decomposition of $V$ into a direct sum of subspaces $V = V_1 \oplus \cdots \oplus V_k$ such that the restriction of $\varphi$ to each summand $V_i$ is cyclic.
Proof. We provide a straightforward construction of a basis of the space $V$ such that the action of the mapping $\varphi$ on the basis vectors directly shows the decomposition into the cyclic mappings. Let $k$ be the degree of nilpotency of the mapping $\varphi$ and write $P_i = \operatorname{Im}(\varphi^i)$, $i = 0, \dots, k$. Thus,
$\{0\} = P_k \subset P_{k-1} \subset \cdots \subset P_1 \subset P_0 = V.$
Choose a basis $e^{k-1}_1, \dots, e^{k-1}_{p_{k-1}}$ of the space $P_{k-1}$, where $p_{k-1} > 0$ is the dimension of $P_{k-1}$. By definition, $P_{k-1} \subset \operatorname{Ker}\varphi$, i.e. $\varphi(e^{k-1}_j) = 0$ for all $j$.
Assume that $P_{k-1} \ne V$. Since $P_{k-1} = \varphi(P_{k-2})$, there necessarily exist vectors $e^{k-2}_j$, $j = 1, \dots, p_{k-1}$, in $P_{k-2}$, such that $\varphi(e^{k-2}_j) = e^{k-1}_j$. Assume
$a_1 e^{k-1}_1 + \cdots + a_{p_{k-1}} e^{k-1}_{p_{k-1}} + b_1 e^{k-2}_1 + \cdots + b_{p_{k-1}} e^{k-2}_{p_{k-1}} = 0.$
Applying $\varphi$ to this linear combination yields $b_1 e^{k-1}_1 + \cdots + b_{p_{k-1}} e^{k-1}_{p_{k-1}} = 0$. This is a linear combination of independent vectors, therefore all $b_j = 0$. But then also all $a_j = 0$. Thus the linear independence of all $2p_{k-1}$ chosen vectors is established. Next, extend them to a basis
(1)   $$\begin{array}{l} e^{k-1}_1, \dots, e^{k-1}_{p_{k-1}} \\ e^{k-2}_1, \dots, e^{k-2}_{p_{k-1}},\ e^{k-2}_{p_{k-1}+1}, \dots, e^{k-2}_{p_{k-2}} \end{array}$$
of the space $P_{k-2}$. The images of the added basis vectors are in $P_{k-1}$. Necessarily they must be linear combinations of the basis elements $e^{k-1}_1, \dots, e^{k-1}_{p_{k-1}}$. We can thus adjust the chosen vectors $e^{k-2}_{p_{k-1}+1}, \dots, e^{k-2}_{p_{k-2}}$ by adding the appropriate linear combinations of the vectors $e^{k-2}_1, \dots, e^{k-2}_{p_{k-1}}$ with the result that they are in the kernel of $\varphi$. Thus we may assume our choice in the scheme (1) has this property.
Assume that we have already constructed a basis of the subspace $P_{k-\ell}$ such that we can directly arrange it into a scheme similar to (1)
$$\begin{array}{l} e^{k-1}_1, \dots, e^{k-1}_{p_{k-1}} \\ e^{k-2}_1, \dots, e^{k-2}_{p_{k-1}},\ e^{k-2}_{p_{k-1}+1}, \dots, e^{k-2}_{p_{k-2}} \\ e^{k-3}_1, \dots, e^{k-3}_{p_{k-1}},\ e^{k-3}_{p_{k-1}+1}, \dots, e^{k-3}_{p_{k-2}},\ e^{k-3}_{p_{k-2}+1}, \dots, e^{k-3}_{p_{k-3}} \\ \vdots \\ e^{k-\ell}_1, \dots, e^{k-\ell}_{p_{k-1}},\ e^{k-\ell}_{p_{k-1}+1}, \dots, e^{k-\ell}_{p_{k-2}},\ \dots,\ e^{k-\ell}_{p_{k-\ell}} \end{array}$$
where the value of the mapping $\varphi$ on any basis vector is located above it. The value is zero if there is nothing above that basis vector.
If $P_{k-\ell} \ne V$, then again there must exist vectors $e^{k-\ell-1}_1, \dots, e^{k-\ell-1}_{p_{k-\ell}}$ which map to $e^{k-\ell}_1, \dots, e^{k-\ell}_{p_{k-\ell}}$. We can extend them to a basis of $P_{k-\ell-1}$, say, by the vectors
$e^{k-\ell-1}_{p_{k-\ell}+1}, \dots, e^{k-\ell-1}_{p_{k-\ell-1}}.$
Again, exactly as when adjusting (1) above, we choose the additional basis vectors from the kernel of $\varphi$, and analogously as before we verify that we indeed obtain a basis for $P_{k-\ell-1}$.
After $k$ steps we obtain a basis for the whole $V$, which has the properties given for the basis of the subspace $P_{k-\ell}$. Individual columns of the resulting scheme then generate the subspaces $V_i$. Additionally we have found the bases of these subspaces which show that the corresponding restrictions of $\varphi$ are cyclic mappings. □
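3.4.18. Proof of the Jordan theorem. Let $\lambda_1, \dots, \lambda_k$ be all the distinct eigenvalues of the mapping $\varphi$. From the assumptions of the Jordan theorem it follows that
$V = \mathcal{R}_{\lambda_1} \oplus \cdots \oplus \mathcal{R}_{\lambda_k}.$
The mappings $\varphi_i = (\varphi|_{\mathcal{R}_{\lambda_i}} - \lambda_i\,\mathrm{id}_{\mathcal{R}_{\lambda_i}})$ are nilpotent and thus each of the root spaces is a direct sum
$\mathcal{R}_{\lambda_i} = P_{1,\lambda_i} \oplus \cdots \oplus P_{j_i,\lambda_i}$
of spaces on which the restriction of the mapping $\varphi - \lambda_i\,\mathrm{id}_V$ is cyclic. The matrices of these restricted mappings on $P_{r,\lambda_i}$ are Jordan blocks corresponding to the zero eigenvalue, and so the restricted mapping $\varphi|_{P_{r,\lambda_i}}$ has for its matrix the Jordan block with the eigenvalue $\lambda_i$.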
For the proof of the Jordan theorem it remains to verify the claim about uniqueness (up to reordering the blocks). Because the diagonal values $\lambda_i$ are given as roots of the characteristic polynomial, their uniqueness is immediate. The decomposition into root spaces is unique as well. Thus, without loss of generality we may assume that there is just one eigenvalue $\lambda$ and we are going to express the dimensions of the individual Jordan blocks using the ranks $r_k$ of the mappings $(\varphi - \lambda\,\mathrm{id}_V)^k$. This will show that the blocks are uniquely determined (up to their order). On the other hand, changing the order of the blocks corresponds to renumbering the vectors of the basis, thus we can obtain them in any order.
If $\psi$ is a cyclic operator on an $m$-dimensional space, then the defect of the iterated mapping $\psi^k$ is $k$ for $0 \le k \le m$, while the defect is $m$ for all $k \ge m$. This implies that if our matrix $J$ of the mapping $\varphi$ on the $n$-dimensional space $V$ (recall that we assume $V = \mathcal{R}_\lambda$) contains $d_k$ Jordan blocks of the order $k$, then the defect $D_\ell = n - r_\ell$ of the matrix $(J - \lambda E)^\ell$ is
$D_\ell = d_1 + 2d_2 + \cdots + \ell\,d_\ell + \ell\,d_{\ell+1} + \cdots.$
Now, taking the combination $2D_k - D_{k-1} - D_{k+1}$ we cancel all those terms in the latter expression which coincide for $\ell = k - 1, k, k + 1$ and we are left with
$d_k = 2D_k - D_{k-1} - D_{k+1}.$
Substituting for the $D_\ell$'s, we finally arrive at
$d_k = 2n - 2r_k - n + r_{k-1} - n + r_{k+1} = r_{k-1} - 2r_k + r_{k+1}.$
This is the requested expression for the sizes of the Jordan blocks and the theorem is proved.
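The formula $d_k = r_{k-1} - 2r_k + r_{k+1}$ can be turned directly into a small computation. The sketch below assumes, as in the proof, a matrix with a single eigenvalue $\lambda$; it computes the ranks $r_k$ of $(A - \lambda E)^k$ exactly over the rationals. The function names and both sample matrices are choices made for this illustration.

```python
from fractions import Fraction

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rank(m):
    """Rank computed by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def jordan_block_counts(a, lam):
    """Return {k: d_k}, where d_k = r_{k-1} - 2 r_k + r_{k+1}."""
    n = len(a)
    shifted = [[a[i][j] - (lam if i == j else 0) for j in range(n)]
               for i in range(n)]
    ranks = [n]                                   # r_0 = rank of the unit matrix
    power = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(n + 1):
        power = matmul(power, shifted)
        ranks.append(rank(power))
    return {k: ranks[k - 1] - 2 * ranks[k] + ranks[k + 1] for k in range(1, n + 1)}

# One block of order 3 and one of order 1, eigenvalue 2:
J = [[2, 1, 0, 0], [0, 2, 1, 0], [0, 0, 2, 0], [0, 0, 0, 2]]
# A matrix not in Jordan form, eigenvalue 2:
A = [[2, 0, 1], [0, 2, 1], [0, 0, 2]]
```

For `J` the ranks are $4, 2, 1, 0, \dots$, which gives $d_1 = 1$, $d_2 = 0$, $d_3 = 1$; for `A` the ranks are $3, 1, 0, \dots$, so there is one block of order one and one block of order two.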
3.4.19. Remarks. The proof of the theorem about the existence of the Jordan canonical form was constructive, but it does not give an efficient algorithmic approach for the construction. Now we show how our results can be used for explicit computation of the basis in which the given mapping $\varphi : V \to V$ has its matrix in the canonical Jordan form.^5
(1) Find the roots of the characteristic polynomial.
(2) If there are fewer than $n = \dim V$ roots (counting multiplicities), then there is no canonical form.
(3) If there are $n$ linearly independent eigenvectors, there is a basis of $V$ composed of eigenvectors under which $\varphi$ has a diagonal matrix.
(4) Let $\lambda$ be an eigenvalue with geometric multiplicity strictly smaller than the algebraic multiplicity and $v_1, \dots, v_k$ be the corresponding eigenvectors. They should be the vectors on the upper boundary of the scheme from the proof of Theorem 3.4.17. We need to complete the basis by application of iterations of $\varphi - \lambda\,\mathrm{id}_V$. By doing this we also find in which row the vectors should be located. Hence we find the linearly independent solutions $w_i$ of the equations $(\varphi - \lambda\,\mathrm{id})(w_i) = v_i$ from the rows below it. Repeat the procedure iteratively (that is, for $w_i$ and so on). In this way, we find the "chains" of basis vectors that give invariant subspaces, where $\varphi - \lambda\,\mathrm{id}$ is cyclic (the columns from the scheme in the proof).
The procedure is practical for matrices when the multiplicities of the eigenvalues are small, or at least when the degrees of nilpotency are small. For instance, for the matrix
$$A = \begin{pmatrix} 2 & 0 & 1 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{pmatrix}$$
we obtain the two-dimensional subspace of eigenvectors
$\operatorname{span}\{(1, 0, 0)^T, (0, 1, 0)^T\},$
but we still do not know which of them are the "ends of the chains". We need to solve the equations $(A - 2E)x = (a, b, 0)^T$ for (yet unknown) constants $a$, $b$. This system is solvable if and only if $a = b$, and one of the possible solutions is $x = (0, 0, 1)^T$, $a = b = 1$. The entire basis is then composed of $(1, 1, 0)^T$, $(0, 0, 1)^T$, $(1, 0, 0)^T$. Note that we have free choices on the way and thus there are many such bases.
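The example can be double-checked by expressing the mapping in the basis found above. The sketch below assumes the matrix $A$ with rows $(2, 0, 1)$, $(0, 2, 1)$, $(0, 0, 2)$, which is the matrix determined by the eigenvectors and the solution $x$ quoted in the example, and computes $P^{-1}AP$ exactly, where the columns of $P$ are the basis vectors. The helper names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inverse(m):
    """Inverse by Gauss-Jordan elimination over the rationals (m must be invertible)."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for i in range(n):
            if i != col:
                factor = aug[i][col]
                aug[i] = [x - factor * y for x, y in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]

A = [[2, 0, 1], [0, 2, 1], [0, 0, 2]]       # assumed matrix of the example
chain_basis = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]

# P has the basis vectors as its columns.
P = [[chain_basis[j][i] for j in range(3)] for i in range(3)]
J = matmul(matmul(inverse(P), A), P)
# J == [[2, 1, 0], [0, 2, 0], [0, 0, 2]]
```

The result has one Jordan block of order two (the chain $(1,1,0)^T$, $(0,0,1)^T$) and one block of order one (the eigenvector $(1,0,0)^T$).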
^5 There is a beautiful purely algebraic approach to compute the Jordan canonical form efficiently, but it does not give any direct information about the right basis. This algebraic approach is based on polynomial matrices and Weierstrass divisors. We shall not go into details in this textbook.
5. Decompositions of the matrices and pseudoinversions
Previously we concentrated on the geometric description of the structure of a linear mapping. Now we translate our results into the language of matrix decomposition. This is an important topic for numerical methods and matrix calculus in general.
Even when computing effectively with real numbers we use decompositions into products. The simplest one is the unique expression of every real number in the form
$a = \operatorname{sgn}(a) \cdot |a|,$
that is, as a product of the sign and the absolute value. Proceeding in the same way with complex numbers, we obtain their polar form. That is, we write $z = (\cos\varphi + i \sin\varphi)\,|z|$. Here the complex unit plays the role of the sign and the other factor is a non-negative real multiple.
In the following paragraphs we list briefly some useful decompositions for distinct types of matrices. Recall that we met suitable decompositions earlier, for instance for positive semidefinite matrices in paragraph 3.4.9 when finding the square roots. We shall start with similar simple examples.
3.5.1. LU-decomposition. In paragraphs 2.1.7 and 2.1.8 we transformed matrices over scalars from any field into row echelon form. For this we use elementary row transformations, based on successive multiplication of our matrix by invertible lower triangular matrices $P_i$. In this way we add multiples of the rows above the currently transformed one.
Sometimes we interchange the rows, which corresponds to multiplication by a permutation matrix. That is a square matrix in which all elements are zero except exactly one value 1 in each row and column. To imagine why, consider a matrix with just one non-zero element in the first column but not in the first row. If we want to obtain the matrix blockwise in the form
then we may need to interchange columns as well. This is achieved by multiplying by a permutation matrix from the right hand side.
For simplicity, assume we have a square matrix $A$ of size $m$ and that Gaussian elimination does not force a row interchange. Thus all matrices $P_i$ can be lower triangular with ones on the diagonal. Finally we note that inverses of such $P_i$ are again lower triangular with ones on the diagonal (either remember the algorithm 2.1.10 or the formula in 2.2.11). We obtain
$U = P \cdot A = P_k \cdots P_1 \cdot A,$
where $U$ is an upper triangular matrix. Thus
$A = L \cdot U,$
where $L = P^{-1}$ is a lower triangular matrix with ones on the diagonal and $U$ is upper triangular. This decomposition is called the LU-decomposition of the matrix $A$. We can also absorb the diagonal values of $U$ into a diagonal matrix $D$ and obtain the LDU-decomposition $A = L \cdot D \cdot U$, where both $U$ and $L$ have just ones along the diagonal.
For a general matrix $A$, we need to add the potential permutations of rows during Gaussian elimination. Then we obtain the general result. (Think why we can always move the necessary permutation matrices to the leftmost and rightmost positions!)
LU-decomposition
Let $A$ be any square matrix of size $m$ over a field of scalars. Then we can find a lower triangular matrix $L$ with ones on its diagonal, an upper triangular matrix $U$ and permutation matrices $P$ and $Q$, all of size $m$, such that
$A = P \cdot L \cdot U \cdot Q.$
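The elimination described above translates into a few lines of code. The following is a minimal sketch of the simple case in which no row interchange is needed (so $P$ and $Q$ are unit matrices); it records the multipliers used during the elimination as the entries of $L$. The function name and the sample matrix are assumptions made for this illustration, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def lu_no_pivot(a):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.

    Raises ValueError if the elimination would need a row interchange.
    """
    n = len(a)
    u = [[Fraction(x) for x in row] for row in a]
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n - 1):
        for row in range(col + 1, n):
            if u[row][col] == 0:
                continue
            if u[col][col] == 0:
                raise ValueError("zero pivot: a row interchange would be needed")
            factor = u[row][col] / u[col][col]
            l[row][col] = factor          # multiplier used to clear u[row][col]
            u[row] = [x - factor * y for x, y in zip(u[row], u[col])]
    return l, u

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
L, U = lu_no_pivot(A)
# L == [[1, 0, 0], [2, 1, 0], [4, 3, 1]]
# U == [[2, 1, 1], [0, 1, 1], [0, 0, 2]]
```

For a matrix such as the one with rows $(0, 1)$ and $(1, 0)$ the function raises an error, which is exactly the situation where the permutation matrices in the general statement become necessary.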
3.5.2. Remarks. As one direct corollary of the Gaussian elimination we can observe that, up to a choice of suitable bases on the domain and codomain, every linear mapping $f : V \to W$ is given by a matrix in block-diagonal form with the unit matrix of the size equal to the dimension of the image of $f$, and with zero blocks all around. This can be reformulated as follows: every matrix $A$ of the type $m/n$ over a field of scalars $\mathbb{K}$ can be decomposed into the product
$$A = P \cdot \begin{pmatrix} E & 0 \\ 0 & 0 \end{pmatrix} \cdot Q,$$
where $P$ and $Q$ are suitable invertible matrices.
Previously (in 3.4.10) we discussed properties of linear mappings $f : V \to V$ over complex vector spaces. We showed that every square matrix $A$ of dimension $m$ can be decomposed into the product
$$A = P \cdot J \cdot P^{-1},$$
where $J$ is block-diagonal with Jordan blocks associated with the eigenvalues of $A$ on the diagonal. Indeed, this is just a reformulation of the Jordan theorem, because multiplying by the matrix $P$ and by its inverse from the other side corresponds in this case just to the change of the basis on the vector space $V$ (with transition matrix $P$). The quoted theorem says that every mapping has Jordan canonical form in a suitable basis.
Analogously, when discussing the self-adjoint mappings we proved that for real symmetric matrices or for complex Hermitian matrices there exists a decomposition into the product
$$A = P \cdot D \cdot P^*,$$
where D is the diagonal matrix with all (always real) eigenvalues on the diagonal, counting multiplicities. Indeed, we
proved that there is an orthonormal basis consisting of eigenvectors. Thus the transition matrix $P$ reflecting the appropriate change of the basis must be unitary (orthogonal in the real case). In particular,
$$P^{-1} = P^*.$$
For real orthogonal mappings we derived an expression analogous to the one for the symmetric ones, i.e. $A = P \cdot B \cdot P^*$. But in this case the matrix $B$ is block-diagonal with blocks of size two or one, expressing rotations, mirror symmetries and identities with respect to the corresponding subspaces.
3.5.3. Singular decomposition theorem. We return to general linear mappings $f : V \to W$ between vector spaces (generally distinct). We assume that scalar products are defined on both spaces and we restrict ourselves to orthonormal bases only. If we want a similar decomposition result as above, we must proceed in a more refined way than in the case of arbitrary bases. But the result is surprisingly similar and strong:
Singular decomposition
Theorem. Let $A$ be a matrix of the type $m/n$ over real or complex scalars. Then there exist square unitary matrices $U$ and $V$ of dimensions $m$ and $n$, and a real diagonal matrix $D$ with non-negative elements of dimension $r$, $r \le \min\{m, n\}$, such that
$$A = USV^*, \qquad S = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix},$$
and $r$ is the rank of the matrix $AA^*$.
The matrix $S$ is determined uniquely up to the order of the diagonal elements in $D$. Moreover, the diagonal elements of $D$ are the square roots of the positive eigenvalues of the matrix $AA^*$.
If A is a real matrix, then the matrices U and V are orthogonal.
Proof. Assume first that $m \le n$. Denote by $\varphi : \mathbb{K}^n \to \mathbb{K}^m$ the mapping between real or complex spaces with standard scalar products, given by the matrix $A$ in the standard bases.
We can reformulate the statement of the theorem as follows: there exist orthonormal bases on $\mathbb{K}^n$ and $\mathbb{K}^m$ in which the mapping $\varphi$ is given by the matrix $S$ from the statement of the theorem.
As noted before, the matrix $A^*A$ is positive semidefinite. Therefore it has only real non-negative eigenvalues and there exists an orthonormal basis $w$ of $\mathbb{K}^n$ in which the corresponding mapping $\varphi^* \circ \varphi$ is given by a diagonal matrix with eigenvalues on the diagonal. In other words, there exists a unitary matrix $V$ such that $A^*A = VBV^*$ for a real diagonal matrix $B$ with non-negative eigenvalues $(d_1, d_2, \dots, d_r, 0, \dots, 0)$ on the diagonal, $d_i \ne 0$ for all $i = 1, \dots, r$. Thus
$$B = V^*A^*AV = (AV)^*(AV).$$
This is equivalent to the claim that the first r columns of the matrix AV are orthogonal, while the remaining columns vanish because they have zero norm.
Next, we denote the first $r$ columns of $AV$ as $v_1, \dots, v_r \in \mathbb{K}^m$. Thus, $(v_i, v_i) = d_i$, $i = 1, \dots, r$, and the normalized vectors $u_i = \frac{1}{\sqrt{d_i}} v_i$ form an orthonormal system of non-zero vectors. Extend them to an orthonormal basis $u = (u_1, \dots, u_m)$ for the entire $\mathbb{K}^m$. Expressing the original mapping $\varphi$ in the bases $w$ of $\mathbb{K}^n$ and $u$ of $\mathbb{K}^m$ yields the matrix $\sqrt{B}$. The transformations from the standard bases to the newly chosen ones correspond to the multiplication from the left by a unitary (orthogonal) matrix $U$ and from the right by $V^{-1} = V^*$. This is the claim of the theorem.
If m > n, we can apply the previous part of the proof to the matrix A* which implies the desired result.
All the previous steps in the proof are also valid in the real domain with real scalars. □
This proof of the theorem about singular decomposition is constructive and we can indeed use it for computing the unitary (orthogonal) matrices U and V and the non-zero diagonal elements of the matrix S.
The diagonal values of the matrix D from the previous theorem are called singular values of the matrix A.
3.5.4. Further comments. When dealing with real scalars, the singular values of a linear mapping $\varphi : \mathbb{R}^n \to \mathbb{R}^m$ have a simple geometric meaning:
Let $K \subset \mathbb{R}^n$ be the unit ball in the standard scalar product. The image $\varphi(K)$ is always an $m$-dimensional ellipsoid (possibly degenerate). The singular values of the matrix $A$ are then the norms of the main half-axes. The theorem says further that the original ball allows an orthogonal set of diameters, whose images are exactly the half-axes of this ellipsoid.
For square matrices it can be seen that A is invertible if and only if all singular values are non-zero. The ratio of the greatest to the smallest singular value is an important parameter for the robustness of many numerical computations with matrices, for instance the computation of the inverse matrix. Note that there are fast methods of computation (approximations) for eigenvalues. Thus the singular decomposition is a very effective tool to work with.
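For a real $2 \times 2$ matrix the singular values can be computed by hand, which the following Python sketch mirrors. It works with $A^TA$ as in the proof above (its non-zero eigenvalues coincide with those of $AA^T$). The function name `singular_values_2x2` and the sample matrix are illustrative assumptions, not part of the text.

```python
import math

def singular_values_2x2(a):
    """Singular values of a real 2x2 matrix: square roots of the eigenvalues of A^T A."""
    (p, q), (r, s) = a
    # entries of the symmetric matrix A^T A
    g11, g12, g22 = p * p + r * r, p * q + r * s, q * q + s * s
    tr, det = g11 + g22, g11 * g22 - g12 * g12
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    lam_max, lam_min = (tr + disc) / 2, (tr - disc) / 2
    return math.sqrt(lam_max), math.sqrt(max(lam_min, 0.0))

A = [[3, 0], [4, 5]]                 # hypothetical sample matrix with det A = 15
s1, s2 = singular_values_2x2(A)
print(s1, s2)                        # both non-zero, so A is invertible
print(s1 / s2)                       # ratio of the greatest to the smallest singular value
```

For this matrix the eigenvalues of $A^TA$ are 45 and 5, so the product of the singular values equals $|\det A| = 15$ and their ratio is 3.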
3.5.5. Polar decomposition theorem. The singular decomposition theorem is the starting point for many other useful tools. We present several direct corollaries (which by themselves are non-trivial and important).
The statement of the singular decomposition theorem saying that for any matrix A, real or complex, A = USW* with S diagonal with non-negative real numbers on the diagonal and U and W unitary, can be rephrased as
A = USU*UW*
and let us denote $P = USU^*$, $V = UW^*$. The first of the matrices, $P$, is Hermitian (in the real case, symmetric) and positive semidefinite (because $P$ and $S$ are matrices of the same mapping in different orthonormal bases). At the same time, $V$ is the product of two unitary matrices and thus again is unitary (in the real case orthogonal).
Next, assume that $A = PV = QZ$ are two such decompositions of the matrix $A$ into the product of a positive semidefinite Hermitian matrix and a unitary matrix. Clearly, $A^* = WSU^*$. Thus $AA^* = USSU^* = P^2$, and the matrix $P$ is actually the square root of the easily computable Hermitian matrix $AA^*$. In particular, this proves that $P$ is uniquely determined, cf. 3.4.9.
Further, assume that $A$ is invertible. Then $P$ is also invertible and $Z = V = P^{-1}A$.
We have derived a very useful analogy of the decomposition of a real number into a sign and the absolute value:
Polar decomposition
Theorem. Every square complex matrix $A$ of dimension $n$ can be expressed in the form $A = PV$, where $P$ is a Hermitian positive semidefinite square matrix of the same dimension, while $V$ is unitary.
The matrix $P = \sqrt{AA^*}$ is uniquely given, and if $A$ is invertible, the decomposition is unique and $V = (\sqrt{AA^*})^{-1}A$.
If $A$ is a matrix of real scalars, then $P$ is symmetric and $V$ is orthogonal.
If we apply the same theorem to $A^*$ instead of $A$, we obtain the same result, but with the order of the Hermitian and unitary matrices reversed. This means $A = VP$ with $V$ unitary and $P = \sqrt{A^*A}$ positive semidefinite. The matrices in the corresponding right and left polar decompositions will in general be different.
Actually, if $A$ is invertible, it is easy to check that the matrices in the left and right polar decompositions coincide if and only if $A$ is normal. Look at theorem 3.4.8 and verify it yourself.
In the complex case the analogy with the decomposition of numbers is even more entertaining. The positive semidefinite $P$ again plays the role of the absolute value of the complex number. The unitary matrix $V$ uniquely allows the expression as a sum $V = \operatorname{re} V + i \operatorname{im} V$ with Hermitian real and imaginary parts and the property $(\operatorname{re} V)^2 + (\operatorname{im} V)^2 = E$. We obtain a full analogy for the polar form of complex numbers (see the final remark and corollary in 3.4.8). But note that in the higher dimensional case, it is important in which order this "polar form" of a matrix is written. It is possible in both ways, but the results differ in general.
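A small numerical sketch of the left polar decomposition $A = PV$ for an invertible real $2 \times 2$ matrix may help. It relies on an assumption that is not taken from the text: the closed formula $\sqrt{M} = (M + \sqrt{\det M}\, E)/\sqrt{\operatorname{tr} M + 2\sqrt{\det M}}$ for a positive definite $2 \times 2$ matrix $M$. The function name `polar_2x2` and the sample matrix are hypothetical.

```python
import math

def polar_2x2(a):
    """Left polar decomposition A = P V of an invertible real 2x2 matrix."""
    (p, q), (r, s) = a
    m = [[p * p + q * q, p * r + q * s], [p * r + q * s, r * r + s * s]]  # A A^T
    d = math.sqrt(m[0][0] * m[1][1] - m[0][1] ** 2)    # sqrt(det(A A^T)) = |det A|
    t = math.sqrt(m[0][0] + m[1][1] + 2 * d)
    P = [[(m[0][0] + d) / t, m[0][1] / t],             # P = sqrt(A A^T), 2x2 formula
         [m[1][0] / t, (m[1][1] + d) / t]]
    det_p = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    P_inv = [[P[1][1] / det_p, -P[0][1] / det_p],
             [-P[1][0] / det_p, P[0][0] / det_p]]
    V = [[sum(P_inv[i][k] * a[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]                            # V = P^{-1} A
    return P, V

A = [[3, 0], [4, 5]]
P, V = polar_2x2(A)
print(P)   # symmetric positive definite factor
print(V)   # orthogonal factor, here a rotation
```

For this sample matrix one obtains $P = \frac{1}{\sqrt5}\begin{pmatrix} 6 & 3 \\ 3 & 14 \end{pmatrix}$ and $V = \frac{1}{\sqrt5}\begin{pmatrix} 2 & -1 \\ 1 & 2 \end{pmatrix}$.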
3.5.6. QR decomposition. For many practical applications it is faster to use another decomposition of matrices, which is an analogy of the Schur orthogonal triangulation theorem:
QR decomposition
Theorem. For every complex matrix A of the type m/n there exists a unitary matrix Q and an upper triangular matrix R such that A = QR.
If all the scalars are real, then both Q and R are real (i.e. Q orthogonal).
Proof. In the geometric formulation we need to prove that for every mapping $\varphi : \mathbb{K}^n \to \mathbb{K}^m$ with the matrix $A$ in the standard bases we can choose a new orthonormal basis on $\mathbb{K}^m$ for which $\varphi$ has an upper triangular matrix.
Consider the images $\varphi(e_1), \dots, \varphi(e_n) \in \mathbb{K}^m$ of the vectors of the standard orthonormal basis of $\mathbb{K}^n$. Choose from them a maximal linearly independent system $v_1, \dots, v_k$ in such a way that the removed dependent vectors are always linear combinations of the previous vectors. Extend it into a basis $v_1, \dots, v_m$. Let $u_1, \dots, u_m$ be an orthonormal basis of $\mathbb{K}^m$ obtained by the Gram-Schmidt orthogonalization of this system of vectors.
For every $e_i$, $\varphi(e_i)$ is either one of the $v_j$, $j \le i$, or it is a linear combination of $v_1, \dots, v_{i-1}$. Therefore in the expression of $\varphi(e_i)$ in the basis $u$ only the vectors $u_1, \dots, u_i$ appear. Thus, in the standard basis on $\mathbb{K}^n$ and $u$ on $\mathbb{K}^m$, the mapping $\varphi$ has an upper triangular matrix $R$. The change of the basis $u$ on $\mathbb{K}^m$ corresponds to the multiplication by a unitary matrix $Q^*$ from the left. That is, $R = Q^*A$, equivalently $A = QR$.
The last claim is clear from the construction. □
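The construction in the proof can be sketched in Python for the simplest situation, a square real matrix with linearly independent columns, so that no dependent vectors have to be removed and no extension of the basis is needed. The function name `qr_gram_schmidt` and the sample matrix are illustrative assumptions.

```python
import math

def qr_gram_schmidt(a):
    """QR decomposition of a square real matrix with independent columns."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]    # columns = images of e_j
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        w = list(v)
        for i, u in enumerate(q_cols):
            r[i][j] = sum(x * y for x, y in zip(u, v))        # coordinate of v along u_i
            w = [x - r[i][j] * y for x, y in zip(w, u)]       # remove that component
        r[j][j] = math.sqrt(sum(x * x for x in w))            # norm of the remainder
        q_cols.append([x / r[j][j] for x in w])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]  # u_j are the columns of Q
    return q, r

A = [[3, 0], [4, 5]]
Q, R = qr_gram_schmidt(A)
print(Q)   # orthogonal matrix
print(R)   # upper triangular matrix
```

The entries of $R$ are exactly the coordinates of the columns of $A$ in the orthonormal basis $u$, which is why everything below the diagonal stays zero.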
3.5.7. Pseudoinversions. Finally, we discuss an especially useful extension of the inversion concept, which is of great importance for numerical procedures and also in statistics. Technically, the following quite straightforward application of singular decompositions of matrices allows us to define the pseudoinverse. However, we should beware that the singular decomposition is not unique and thus we must verify that such a definition is consistent. We shall see that in the next theorem.
Pseudoinverse matrices
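Let $A$ be a real or complex matrix of the type $m/n$. Let
$$A = USV^*, \qquad S = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}$$
be its singular decomposition (in particular, $D$ is invertible). The matrix
$$A^\dagger := VS^\dagger U^*, \qquad S^\dagger = \begin{pmatrix} D^{-1} & 0 \\ 0 & 0 \end{pmatrix}$$
is called the pseudoinverse matrix of the matrix $A$.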
In geometric terms, we may view the linear mapping $\varphi$ given by the matrix $A$ in the two special orthonormal bases, where $\varphi$ has the matrix $S$ with non-negative diagonal entries. We take the inverse of the "invertible part" of $\varphi$ and complete it trivially to the pseudoinverse⁶ mapping $\varphi^\dagger$. The result is then viewed in the original bases and yields $A^\dagger$.
As the following theorem shows, the pseudoinverse is an important generalization of the notion of inverse matrix, together with direct applications. At the same time, property (3), together with property (2), verifies the appropriateness of the definition.
Properties of pseudoinverse matrices
Theorem. Let $A$ be a real or complex matrix of the type $m/n$ and let $A^\dagger$ be its pseudoinverse. Then:
(1) if $A$ is invertible (necessarily square), then
$$A^\dagger = A^{-1},$$
(2) $A^\dagger A$ and $AA^\dagger$ are Hermitian (in the real case symmetric) and
$$AA^\dagger A = A, \qquad A^\dagger AA^\dagger = A^\dagger,$$
(3) the pseudoinverse matrix $A^\dagger$ is uniquely defined by the four properties from (2). Thus if some matrix $B$ of the type $n/m$ has the properties that $BA$ and $AB$ are both Hermitian, $ABA = A$ and $BAB = B$, then $B = A^\dagger$,
(4) if $A$ is the matrix of the system of linear equations $Ax = b$ with $b \in \mathbb{K}^m$, then the vector $y = A^\dagger b \in \mathbb{K}^n$ minimizes the norm $\|Ax - b\|$ among all vectors $x \in \mathbb{K}^n$,
(5) the system of linear equations $Ax = b$ with $b \in \mathbb{K}^m$ is solvable if and only if $AA^\dagger b = b$. In this case all solutions are given by the expression
$$x = A^\dagger b + (E - A^\dagger A)u,$$
where $u \in \mathbb{K}^n$ is arbitrary.
Proof. (1): If $A$ is invertible, then the matrix $S = U^*AV$ is also invertible and directly from the definition $S^\dagger = S^{-1}$. Consequently, $A^\dagger A = AA^\dagger = E$.
(2): Direct computation yields $SS^\dagger S = S$ and $S^\dagger SS^\dagger = S^\dagger$, therefore
$$AA^\dagger A = USV^*VS^\dagger U^*USV^* = USS^\dagger SV^* = USV^* = A$$
and analogously for the second equation. Furthermore,
$$(AA^\dagger)^* = (USS^\dagger U^*)^* = U(S^\dagger)^*S^*U^* = U(SS^\dagger)^*U^* = USS^\dagger U^* = AA^\dagger.$$
It can be proved similarly that $(A^\dagger A)^* = A^\dagger A$.
⁶ This concept was introduced by Eliakim Hastings Moore, an American mathematician, around 1920. It was reinvented by Roger Penrose and others later. In the literature, it is often called the Moore-Penrose pseudoinverse. Roger Penrose is an extremely influential mathematical physicist and philosopher of science working in Oxford, known also for his many best-selling popular books such as The Emperor's New Mind: Concerning Computers, Minds, and The Laws of Physics (1989); Shadows of the Mind: A Search for the Missing Science of Consciousness (1994); The Road to Reality: A Complete Guide to the Laws of the Universe (2004); Cycles of Time: An Extraordinary New View of the Universe (2010).
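(3): The claim can be proved by direct computation. Of course we can consider the matrices $A$, $A^\dagger$, $B$ as the matrices of the mappings $\varphi$, $\varphi^\dagger$, and $\psi$ in the standard bases on $\mathbb{K}^n$ and $\mathbb{K}^m$, or in any other pair of orthonormal bases. The requested equality is equivalent to the equality $\varphi^\dagger = \psi$ independently of the choice of the bases. We choose a couple of orthonormal bases from the singular decomposition of $A$. Then the mapping $\varphi$ has the matrix $S$ from the definition of the pseudoinverse $A^\dagger$, so we write directly
$$A = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}, \qquad A^\dagger = \begin{pmatrix} D^{-1} & 0 \\ 0 & 0 \end{pmatrix}$$
with the diagonal matrix $D$ consisting of all non-zero singular values. We write again $B$ for the matrix of $\psi$ in these bases. Clearly $B$ and $A$ satisfy the assumptions of the claim (3). Thus
$$A = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} = ABA$$
and we obtain
$$\begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} B \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}.$$
Consequently
$$B = \begin{pmatrix} D^{-1} & P \\ Q & R \end{pmatrix}$$
for suitable matrices $P$, $Q$ and $R$. Next,
$$BA = \begin{pmatrix} D^{-1} & P \\ Q & R \end{pmatrix} \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} E & 0 \\ QD & 0 \end{pmatrix}$$
is Hermitian. Thus $QD = 0$, which implies $Q = 0$ (the matrix $D$ is diagonal and invertible). Analogously, the assumption that $AB$ is Hermitian implies that $P$ is zero. Finally, we compute
$$B = BAB = \begin{pmatrix} D^{-1} & 0 \\ 0 & R \end{pmatrix} \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} D^{-1} & 0 \\ 0 & R \end{pmatrix} = \begin{pmatrix} D^{-1} & 0 \\ 0 & 0 \end{pmatrix}.$$
On the right-hand side there is zero in the lower right corner, and thus also $R = 0$ and the claim is proved.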
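(4): Consider the mapping $\varphi : \mathbb{K}^n \to \mathbb{K}^m$, $x \mapsto Ax$, and the direct sums $\mathbb{K}^n = (\operatorname{Ker}\varphi)^\perp \oplus \operatorname{Ker}\varphi$, $\mathbb{K}^m = \operatorname{Im}\varphi \oplus (\operatorname{Im}\varphi)^\perp$. The restricted mapping $\tilde\varphi := \varphi|_{(\operatorname{Ker}\varphi)^\perp} : (\operatorname{Ker}\varphi)^\perp \to \operatorname{Im}\varphi$ is a linear isomorphism. If we choose suitable orthonormal bases on $(\operatorname{Ker}\varphi)^\perp$ and $\operatorname{Im}\varphi$ and extend them to orthonormal bases on the whole spaces, the mapping $\varphi$ will have the matrix $S$ and $\tilde\varphi$ the matrix $D$ from the theorem about the singular decomposition. In the next chapter, we shall discuss in detail that for any given $b \in \mathbb{K}^m$, there is a unique vector which minimizes the distance $\|b - z\|$ among all $z \in \operatorname{Im}\varphi$ (in analytic geometry we shall say that the point $z$ realises the distance of $b$ from the affine subspace $\operatorname{Im}\varphi$, see 4.1.16). The properties of the norm proved in theorem 3.4.3 directly imply that this is exactly the component $z = b_1$ of the decomposition $b = b_1 + b_2$, $b_1 \in \operatorname{Im}\varphi$, $b_2 \in (\operatorname{Im}\varphi)^\perp$.
Now, in our choice of bases, the mapping $\varphi^\dagger$ is given by the matrix $S^\dagger$ from the definition of the pseudoinverse. In particular, $\varphi^\dagger(\operatorname{Im}\varphi) = (\operatorname{Ker}\varphi)^\perp$, $D^{-1}$ is the matrix of the restriction $\varphi^\dagger|_{\operatorname{Im}\varphi}$, and $\varphi^\dagger|_{(\operatorname{Im}\varphi)^\perp}$ is zero. Indeed,
$$\varphi \circ \varphi^\dagger(b) = \varphi(\varphi^\dagger(z)) = z$$
and the proof is finished.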
(5): Evidently, the equality $Ax = b$, with $x$ fixed, implies
$$b = Ax = AA^\dagger Ax = AA^\dagger b.$$
Thus the condition is necessary. On the other hand, if this condition holds, then the choice $x = A^\dagger b + (E - A^\dagger A)u$ as in (5) implies
$$Ax = A(A^\dagger b + (E - A^\dagger A)u) = b + (A - AA^\dagger A)u = b.$$
The rank of the matrix E — A^A gives the correct size of the image of the corresponding mapping according to the Kronecker-Capelli theorem (cf. 2.3.5) about the solution of the system of linear equations, and thus we obtain all solutions in this way. □
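As a small illustration of claim (5), consider the hypothetical $1 \times 2$ matrix $A = (1 \ \ 1)$ and a scalar right-hand side $b$ (this example is not from the text). Its singular decomposition has $U = (1)$, $S = (\sqrt2 \ \ 0)$ and $V = \frac{1}{\sqrt2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$, so the formulas above give:

```latex
\[
A^\dagger = V S^\dagger U^* = \frac12 \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
AA^\dagger = (1),
\qquad
E - A^\dagger A = \frac12 \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},
\]
\[
x = A^\dagger b + (E - A^\dagger A)u
  = \begin{pmatrix} b/2 \\ b/2 \end{pmatrix}
  + \frac{u_1 - u_2}{2} \begin{pmatrix} 1 \\ -1 \end{pmatrix},
\qquad u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \in \mathbb{K}^2 .
\]
```

Since $AA^\dagger b = b$ for every $b$, the equation $x_1 + x_2 = b$ is always solvable, and the formula describes the whole line of its solutions. The particular solution $A^\dagger b$ is the one of the smallest norm.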
Remark. Notice that the last computation in the proof verifies that $(E - A^\dagger A)$ is the matrix of the projection of $\mathbb{R}^n$ onto the subspace of all solutions of the homogeneous system
$$Ax = 0.$$
It can be also shown that the matrix $A^\dagger$ minimizes the square of the norm of the expression
$$AA^\dagger - E,$$
that is, the sum of squares of all elements of the given matrix.
The claim (4) of the theorem can be also interpreted as follows. $AA^\dagger$ is the matrix of the orthogonal projection from the vector space $\mathbb{R}^m$ onto the subspace generated by the columns of the matrix $A$ ($m$ is the number of rows of the matrix $A$). This interpretation is especially meaningful for matrices having more rows than columns. Moreover, for matrices $A$ whose columns are independent vectors, the expression $(A^TA)^{-1}A^T$ makes sense and it is not hard to verify that this matrix satisfies all the properties from (2) in the previous theorem. Thus it is the pseudoinverse $A^\dagger$ of the matrix $A$.
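The verification mentioned above can be sketched in Python with exact rational arithmetic. The $3 \times 2$ sample matrix and the helper names `matmul`, `transpose` and `inverse_2x2` are illustrative assumptions. The code forms $(A^TA)^{-1}A^T$ and checks the four properties from (2).

```python
from fractions import Fraction

def matmul(x, y):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# hypothetical matrix with independent columns (more rows than columns)
A = [[Fraction(1), Fraction(0)], [Fraction(1), Fraction(1)], [Fraction(1), Fraction(2)]]
At = transpose(A)
A_pinv = matmul(inverse_2x2(matmul(At, A)), At)    # (A^T A)^{-1} A^T

AAp = matmul(A, A_pinv)                            # projection onto the column space
ApA = matmul(A_pinv, A)                            # here the 2x2 unit matrix
penrose = (
    matmul(AAp, A) == A,                           # A A+ A = A
    matmul(ApA, A_pinv) == A_pinv,                 # A+ A A+ = A+
    transpose(AAp) == AAp,                         # A A+ symmetric
    transpose(ApA) == ApA,                         # A+ A symmetric
)
print(A_pinv)
print(penrose)
```

By claim (3) of the theorem, passing these four checks is enough to conclude that the computed matrix is the pseudoinverse.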
3.5.8. Linear regression. The approximation property (4) from the previous theorem is very useful in the cases where we are to find as good an approximation as possible for the (non-existent) solution of a given system $Ax = b$, where $A$ is a real matrix of the type $m/n$ and $m > n$.
For instance, an experiment gives many measured real values $b_i$, $i = 1, \dots, m$. We want to find a linear combination of only a few fixed functions $f_j$, $j = 1, \dots, n$, which approximates the values $b_i$ as well as possible. The actual values of the fixed functions at the relevant points $y_i \in \mathbb{R}$ define the matrix $a_{ij} = f_j(y_i)$. The columns of the matrix are given by the values of the individual functions $f_j$ at the considered points. The goal is to determine the coefficients $x_j \in \mathbb{R}$ so that the sum of the squares of the deviations from the actual values
$$\sum_{i=1}^{m} \Bigl( b_i - \sum_{j=1}^{n} x_j f_j(y_i) \Bigr)^2 = \|b - Ax\|^2$$
is minimized. By the previous theorem, the optimal coefficients are $A^\dagger b$.
As an example, consider just three functions $f_0(y) = 1$, $f_1(y) = y$, $f_2(y) = y^2$. Assume that the "measured values" of their unknown combination $g(y) = x_0 + x_1y + x_2y^2$ at the integral values of $y$ between 1 and 5 are $b^T = (1, 12, 6, 27, 33)$. This vector arose by computing the values $1 + y + y^2$ at the given points, adjusted by random integral values in the range $\pm 10$. This leads in our case to the matrix $A = (a_{ij})$ with
$$A^T = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \\ 1 & 4 & 9 & 16 & 25 \end{pmatrix}.$$
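The requested optimal coefficients for the combination are
$$A^\dagger \cdot b = \begin{pmatrix} \frac{9}{5} & 0 & -\frac{4}{5} & -\frac{3}{5} & \frac{3}{5} \\ -\frac{37}{35} & \frac{23}{70} & \frac{6}{7} & \frac{37}{70} & -\frac{23}{35} \\ \frac{1}{7} & -\frac{1}{14} & -\frac{1}{7} & -\frac{1}{14} & \frac{1}{7} \end{pmatrix} \begin{pmatrix} 1 \\ 12 \\ 6 \\ 27 \\ 33 \end{pmatrix} = \begin{pmatrix} 0.600 \\ 0.614 \\ 1.214 \end{pmatrix}.$$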
The resulting approximation can be seen in the picture, where the given values $b$ are shown by the diamonds, while the dashed curve stands for the resulting approximation $g(y) = x_0 + x_1y + x_2y^2$.
[Figure: the measured values $b$ (diamonds) and the approximating quadratic curve $g(y)$ (dashed).]
The computation was produced in Maple. Taking 15 points $y_i = i$ and a random vector of deviations from the same range produced the following picture:
[Figure: the analogous approximation for 15 points.]
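The coefficients of the five-point example above can be reproduced exactly in Python. This sketch does not form $A^\dagger$ explicitly. Instead it solves the normal equations $A^TAx = A^Tb$, which give the same $x = (A^TA)^{-1}A^Tb$ because the columns of $A$ are independent. The helper name `solve` is an illustrative assumption.

```python
from fractions import Fraction

def solve(m, rhs):
    """Solve m x = rhs exactly by Gauss-Jordan elimination (m assumed invertible)."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(m, rhs)]
    for k in range(n):
        piv = next(i for i in range(k, n) if aug[i][k] != 0)
        aug[k], aug[piv] = aug[piv], aug[k]
        aug[k] = [v / aug[k][k] for v in aug[k]]               # normalize the pivot row
        for i in range(n):
            if i != k:
                aug[i] = [v - aug[i][k] * w for v, w in zip(aug[i], aug[k])]
    return [row[-1] for row in aug]

ys = [1, 2, 3, 4, 5]
b = [1, 12, 6, 27, 33]
A = [[1, y, y * y] for y in ys]                                # rows (f0(y), f1(y), f2(y))

AtA = [[sum(row[i] * row[j] for row in A) for j in range(3)] for i in range(3)]
Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(3)]
x = solve(AtA, Atb)
print(x)                                   # exact coefficients x0, x1, x2
print([round(float(v), 3) for v in x])     # rounded as in the text
```

The exact values are $x_0 = 3/5$, $x_1 = 43/70$ and $x_2 = 17/14$, in agreement with the rounded result $(0.600, 0.614, 1.214)$.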
G. Additional exercises for the whole chapter
3.G.1. Solve the following LP problem:
$$\text{minimize } \{7x - 5y + 3z\}, \qquad 0 \le x \le 6, \quad -2 \le y \le 7, \quad -4 \le z \le 9.$$
Solution. Introduce new variables $y'$ and $z'$ by setting $y = y' - 2$, $z = z' - 4$. In the variables $\{x, y', z'\}$ the original LP problem takes the form
$$\text{maximize } \{-7x + 5y' - 3z'\}, \qquad x \le 6, \quad y' \le 9, \quad z' \le 13, \quad x, y', z' \ge 0.$$
Dropping the primes at $y$ and $z$, we obtain the initial LP tableau (the pivot is in square brackets):
	x	y	z	t	s	r	
Objective	7	-5	3	0	0	0	0
t	1	0	0	1	0	0	6
s	0	[1]	0	0	1	0	9
r	0	0	1	0	0	1	13
The second and final tableau is
	x	y	z	t	s	r	
Objective	7	0	3	0	5	0	45
t	1	0	0	1	0	0	6
y	0	1	0	0	1	0	9
r	0	0	1	0	0	1	13
which provides the solution
x = 0, y' = 9, z' = 0. In the original notation this means
x = 0, y = 7, z = —4.
This solution can be easily guessed from the beginning. □
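The "guess" can be made explicit: when the only constraints are bounds on the individual variables, a linear function is minimized coordinate by coordinate. The following Python sketch, with the hypothetical helper `minimize_on_box`, shows this shortcut. It is not the simplex method used in the solution.

```python
def minimize_on_box(costs, bounds):
    """Minimize sum(c_i * x_i) subject to lo_i <= x_i <= hi_i."""
    # a positive cost pushes the variable to its lower bound, otherwise to the upper one
    point = [lo if c > 0 else hi for c, (lo, hi) in zip(costs, bounds)]
    value = sum(c * v for c, v in zip(costs, point))
    return point, value

point, value = minimize_on_box([7, -5, 3], [(0, 6), (-2, 7), (-4, 9)])
print(point)   # the minimizing point (x, y, z)
print(value)   # the minimal value of 7x - 5y + 3z
```

The minimal value $-47$ agrees with the final tableau: the maximum 45 of $-7x + 5y' - 3z'$ corresponds to the maximum $45 + 2 = 47$ of $-7x + 5y - 3z$.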
3.G.2. Solve the following LP problem: A small firm specializes in making five types of spare automotive parts. Each part is first cast from iron in the casting shop and is then sent to the finishing shop where holes are drilled, surfaces are turned, and edges are ground. The required worker-hours (per 100 units), for each type of part in each of the two shops, are shown below:
Part type	1	2	3	4	5
Casting	2	1	3	3	1
Finishing	2	2	1	1	1
The profits from the five parts are $30, $20, $40, $25 and $10 (per 100 units), respectively. The capacities of the casting and finishing shops over the next month are 700 and 1000 worker-hours respectively. Determine the quantities of each type of spare part to be made during the month, so as to maximize the firm's profit. Assume that there is sufficient demand for the firm to sell whatever it is capable of producing.
Solution. Let $x_j$ be the number of produced units (in multiples of ten thousand) of the part of type $j = 1, \dots, 5$. Then the LP problem can be formulated as
$$\text{maximize } \{30x_1 + 20x_2 + 40x_3 + 25x_4 + 10x_5\},$$
$$2x_1 + x_2 + 3x_3 + 3x_4 + x_5 + t = 7,$$
$$2x_1 + 2x_2 + x_3 + x_4 + x_5 + s = 10,$$
$$x_j \ge 0, \quad j = 1, \dots, 5.$$
The four successive LP tableaux (with the objective row divided by 10 and the pivots in square brackets) are
	x1	x2	x3	x4	x5	t	s	
Objective	-3	-2	-4	-5/2	-1	0	0	0
t	[2]	1	3	3	1	1	0	7
s	2	2	1	1	1	0	1	10

	x1	x2	x3	x4	x5	t	s	
Objective	0	-1/2	1/2	2	1/2	3/2	0	21/2
x1	1	1/2	3/2	3/2	1/2	1/2	0	7/2
s	0	[1]	-2	-2	0	-1	1	3

	x1	x2	x3	x4	x5	t	s	
Objective	0	0	-1/2	1	1/2	1	1/2	12
x1	1	0	[5/2]	5/2	1/2	1	-1/2	2
x2	0	1	-2	-2	0	-1	1	3

	x1	x2	x3	x4	x5	t	s	
Objective	1/5	0	0	3/2	3/5	6/5	2/5	62/5
x3	2/5	0	1	1	1/5	2/5	-1/5	4/5
x2	4/5	1	0	0	2/5	-1/5	3/5	23/5
The final tableau provides the solution
$$x_2 = \frac{23}{5}, \quad x_3 = \frac{4}{5}, \quad x_j = 0, \quad j = 1, 4, 5.$$
Thus, optimal profit is achieved by producing 46,000 parts of type 2 and 8,000 parts of type 3. □
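The result can be cross-checked independently of the tableaux. The feasible region is bounded, so the maximum is attained at a vertex, and with only two equality constraints every vertex has at most two non-zero variables (including the slacks $t$ and $s$). The following Python sketch simply enumerates all such basic solutions. It is a brute-force check under these assumptions, not the simplex method, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

profit = [3, 2, 4, Fraction(5, 2), 1, 0, 0]    # divided by 10; slacks t, s earn nothing
rows = [[2, 1, 3, 3, 1, 1, 0],                 # casting:   ... + t = 7
        [2, 2, 1, 1, 1, 0, 1]]                 # finishing: ... + s = 10
rhs = [7, 10]

best_value, best_point = None, None
for i, j in combinations(range(7), 2):         # choose two basic variables
    a, b = rows[0][i], rows[0][j]
    c, d = rows[1][i], rows[1][j]
    det = a * d - b * c
    if det == 0:
        continue                               # the two columns are dependent
    xi = Fraction(rhs[0] * d - b * rhs[1], det)   # Cramer's rule
    xj = Fraction(a * rhs[1] - rhs[0] * c, det)
    if xi < 0 or xj < 0:
        continue                               # not feasible
    value = profit[i] * xi + profit[j] * xj
    if best_value is None or value > best_value:
        best_value, best_point = value, {i: xi, j: xj}

print(best_point)   # indices 1 and 2 stand for x2 and x3
print(best_value)   # optimal objective value (in the scaled units of the tableaux)
```

The best vertex has $x_2 = 23/5$ and $x_3 = 4/5$ with the scaled objective value $62/5$, the same as in the final tableau.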
3.G.3. Model of spreading of annual plants. Consider some plants that blossom at the beginning of summer, then produce seeds and die at the peak of summer. Some of the seeds burst into flowers at the end of the autumn. Some survive the winter in the ground and burst into flowers at the start of the spring. The flowers that burst out in autumn and survive the winter are usually larger in the spring, and usually produce more seeds. After this, the whole cycle repeats itself.
The year is thus divided into four parts and in each of these parts we distinguish between some "forms" of the flower:
Part	Stage
beginning of spring	small and big seedlings
beginning of summer	small, medium and big blossoming flowers
peak of summer	seeds
autumn	seedlings and seeds
Denote by $x_1(t)$ and by $x_2(t)$ the number of small and large seedlings respectively at the start of the spring in year $t$. Denote by $y_1(t)$, $y_2(t)$ and $y_3(t)$ the number of small, medium and large flowers respectively in the summer of that year. From the small seedlings either small or medium flowers grow. From the large seedlings either medium or large flowers grow. Each of the seedlings can of course die (bad weather, being eaten by a cow, etc.) and nothing grows out of it. Denote by $b_{ij}$ the probability that the seedling of the $j$-th size, $j = 1, 2$, grows into a flower of the $i$-th size, $i = 1, 2, 3$. Then we have
$$0 < b_{11} < 1, \quad b_{12} = 0, \quad 0 < b_{21} < 1, \quad 0 < b_{22} < 1,$$
$$b_{31} = 0, \quad 0 < b_{32} < 1, \quad b_{11} + b_{21} < 1, \quad b_{22} + b_{32} < 1$$
(think in detail about what each of these inequalities expresses). If we consider classical probability, we can compute $b_{11}$ as the ratio of the positive results (a small seedling grows into a small flower) and of all possible results (the number of small seedlings). That is, $b_{11} = y_1(t)/x_1(t)$, or
$$y_1(t) = b_{11}x_1(t).$$
Analogously,
$$y_3(t) = b_{32}x_2(t).$$
Denote for a while by $y_{2,1}(t)$ and $y_{2,2}(t)$ the number of medium flowers that grow out of small and large seedlings respectively. Then $y_2(t) = y_{2,1}(t) + y_{2,2}(t)$ and $b_{21} = y_{2,1}(t)/x_1(t)$, $b_{22} = y_{2,2}(t)/x_2(t)$, and thus
$$y_2(t) = b_{21}x_1(t) + b_{22}x_2(t).$$
Write
$$B = \begin{pmatrix} b_{11} & 0 \\ b_{21} & b_{22} \\ 0 & b_{32} \end{pmatrix}, \quad x(t) = \begin{pmatrix} x_1(t) \\ x_2(t) \end{pmatrix}, \quad y(t) = \begin{pmatrix} y_1(t) \\ y_2(t) \\ y_3(t) \end{pmatrix}$$
and rewrite the previous equations in matrix notation
y(t) = Bx(t).
Denote by $c_{11}$, $c_{12}$ and $c_{13}$ the number of seeds produced by small, medium and large flowers respectively. Denote by $z(t)$ the total number of produced seeds in the summer of year $t$. Then
$$z(t) = c_{11}y_1(t) + c_{12}y_2(t) + c_{13}y_3(t).$$
In matrix calculus
$$z(t) = Cy(t)$$
with the notation
$$C = (c_{11} \ \ c_{12} \ \ c_{13}).$$
If we want the matrix $C$ to describe the modelled reality, we assume that the inequalities
$$0 < c_{11} < c_{12} < c_{13}$$
hold.
Finally, denote by $w_1(t)$ and $w_2(t)$ the number of seeds that burst in the autumn and the number of seeds that stay in the ground during the winter respectively. Denote by $d_{11}$ and $d_{21}$ the probabilities that the seeds burst out in the autumn and that the seeds do not burst respectively. Denote by $f_{11}$ and $f_{22}$ the probabilities that the seedling and the seed do not die during the winter respectively. The probabilities $d_{11}$, $d_{21}$ must satisfy
$$0 < d_{11}, \quad 0 < d_{21}, \quad d_{11} + d_{21} = 1.$$
Since a seedling dies in the winter more easily than a seed hidden in the ground, we assume that
$$0 < f_{11} < f_{22} < 1.$$
When denoting
$$w(t) = \begin{pmatrix} w_1(t) \\ w_2(t) \end{pmatrix}, \quad D = \begin{pmatrix} d_{11} \\ d_{21} \end{pmatrix}, \quad F = \begin{pmatrix} f_{11} & 0 \\ 0 & f_{22} \end{pmatrix},$$
we obtain, with similar ideas as before, the equalities
$$w(t) = Dz(t), \qquad x(t+1) = Fw(t).$$
Because matrix multiplication is associative, we can compose the recurrent formulas:
$$\begin{aligned}
x(t+1) &= Fw(t) = F(Dz(t)) = (FD)z(t) = (FD)(Cy(t)) = (FDC)y(t) \\
&= (FDC)(Bx(t)) = (FDCB)x(t), \\
y(t+1) &= Bx(t+1) = B(Fw(t)) = (BF)w(t) = (BF)(Dz(t)) = (BFD)z(t) \\
&= (BFD)(Cy(t)) = (BFDC)y(t), \\
z(t+1) &= Cy(t+1) = C(Bx(t+1)) = (CB)x(t+1) = (CB)(Fw(t)) = (CBF)w(t) \\
&= (CBF)(Dz(t)) = (CBFD)z(t), \\
w(t+1) &= Dz(t+1) = D(Cy(t+1)) = (DC)y(t+1) = (DC)(Bx(t+1)) = (DCB)x(t+1) \\
&= (DCB)(Fw(t)) = (DCBF)w(t).
\end{aligned}$$
Using the notation
$$A_x = FDCB, \quad A_y = BFDC, \quad A_z = CBFD, \quad A_w = DCBF,$$
we simplify them into the formulas
$$x(t+1) = A_xx(t), \quad y(t+1) = A_yy(t), \quad z(t+1) = A_zz(t), \quad w(t+1) = A_ww(t).$$
From these formulas we can compute the distribution of the population of the flowers in any part of any year, if we know the starting distribution of the population (that is, in the year zero).
For instance, let the distribution of the population be known in the summer, that is, z(0) of seeds. The distribution of the population at the beginning of the spring in the t-th year is
$$x(t) = A_xx(t-1) = A_x^2x(t-2) = \dots = A_x^{t-1}x(1) = A_x^{t-1}Fw(0) = A_x^{t-1}FDz(0).$$
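Note that the matrix $A_z = CBFD$ is of the type $1 \times 1$; it is not really a matrix but just a scalar. We can denote $\lambda = A_z$ and compute
$$\begin{aligned}
\lambda = CBFD &= (c_{11} \ \ c_{12} \ \ c_{13}) \begin{pmatrix} b_{11} & 0 \\ b_{21} & b_{22} \\ 0 & b_{32} \end{pmatrix} \begin{pmatrix} f_{11} & 0 \\ 0 & f_{22} \end{pmatrix} \begin{pmatrix} d_{11} \\ d_{21} \end{pmatrix} \\
&= (c_{11}b_{11} + c_{12}b_{21} \quad c_{12}b_{22} + c_{13}b_{32}) \begin{pmatrix} f_{11}d_{11} \\ f_{22}d_{21} \end{pmatrix} \\
&= b_{11}c_{11}d_{11}f_{11} + b_{21}c_{12}d_{11}f_{11} + b_{22}c_{12}d_{21}f_{22} + b_{32}c_{13}d_{21}f_{22}
\end{aligned} \tag{1}$$
and order the previous computation into a suitable form
$$x(t) = (FDCB)^{t-1}FDz(0) = FD(CBFD)^{t-2}CBFDz(0) = FD(CBFD)^{t-1}z(0) = FD\lambda^{t-1}z(0) = \lambda^{t-1}FDz(0).$$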
In this way only two matrix multiplications remain.
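We list concrete values of the matrices $B, C, D, F$; they are the parameters of a hypothetical flower, which were inspired by the actual grass Vulpia ciliata:
$$B = \begin{pmatrix} 0.3 & 0 \\ 0.1 & 0.6 \\ 0 & 0.2 \end{pmatrix}, \quad C = (1 \ \ 10 \ \ 100), \quad D = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \quad F = \begin{pmatrix} 0.05 & 0 \\ 0 & 0.1 \end{pmatrix}.$$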
Now we can compute the individual matrices, which map the vector describing the distribution of the population in some vegetative part of the year on the vector of the distribution of the population in the same part of the next year:
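$$A_x = \begin{pmatrix} 0.0325 & 0.6500 \\ 0.0650 & 1.3000 \end{pmatrix}, \quad A_y = \begin{pmatrix} 0.0075 & 0.0750 & 0.7500 \\ 0.0325 & 0.3250 & 3.2500 \\ 0.0100 & 0.1000 & 1.0000 \end{pmatrix},$$
$$A_z = 1.3325, \quad A_w = \begin{pmatrix} 0.0325 & 1.3000 \\ 0.0325 & 1.3000 \end{pmatrix}.$$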
The value of $\lambda = A_z = 1.3325$ expresses the relative increment of the population between two years. Check for yourself that each of the matrices $A_x$, $A_y$, $A_w$ has only one non-zero eigenvalue $\lambda = 1.3325$. The other eigenvalues are equal to 0.
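The suggested check can be sketched in Python with exact fractions. Each of the matrices is a product of a column and a row, so it has rank one and its only possibly non-zero eigenvalue equals its trace. The code below verifies this for $A_x$ and $A_w$ (zero determinant, trace equal to $\lambda$). The helper name `matmul` and the variable names are illustrative.

```python
from fractions import Fraction as Fr

def matmul(x, y):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

B = [[Fr(3, 10), 0], [Fr(1, 10), Fr(6, 10)], [0, Fr(2, 10)]]
C = [[1, 10, 100]]
D = [[Fr(1, 2)], [Fr(1, 2)]]
F = [[Fr(5, 100), 0], [0, Fr(1, 10)]]

A_x = matmul(matmul(F, D), matmul(C, B))         # FDCB, a 2x2 matrix
A_w = matmul(D, matmul(matmul(C, B), F))         # DCBF, a 2x2 matrix
lam = matmul(matmul(C, B), matmul(F, D))[0][0]   # CBFD, a scalar

def trace_and_det(m):
    return m[0][0] + m[1][1], m[0][0] * m[1][1] - m[0][1] * m[1][0]

print(float(lam))                                # the yearly increment
print(trace_and_det(A_x), trace_and_det(A_w))    # trace = lambda, determinant = 0
```

A $2 \times 2$ matrix with zero determinant has the eigenvalues 0 and its trace, which confirms the claim for $A_x$ and $A_w$. The $3 \times 3$ matrix $A_y$ can be treated in the same way through its rank.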
We show one more application of the given model. We are interested in the "flexibility" of the reaction of the relative increment $\lambda$ on the change of the individual "demographic parameters", for instance, how the change of the probabilities of survival of the seeds changes the yearly increment. We reformulate the question. By the flexibility of the reaction of the characteristic $\lambda$ on the parameter $s$, denoted by $e(\lambda, s)$, we mean the relative change of the value $\lambda$ related to the relative change of the parameter $s$. Even more precisely: by $\lambda(s)$ we denote the yearly increment in dependence on the parameter $s$. Then $\Delta\lambda(s) = \lambda(s + \Delta s) - \lambda(s)$ expresses the absolute change of the relative increment $\lambda$ with the absolute change of the parameter $s$ by $\Delta s$. The relative change of the parameter $s$ is $\Delta s/s$. The flexibility is then the ratio of these two relative changes, that is,
$$e(\lambda, s) = \frac{\Delta\lambda(s)/\lambda(s)}{\Delta s/s} = \frac{s}{\lambda(s)} \cdot \frac{\lambda(s + \Delta s) - \lambda(s)}{\Delta s}.$$
Specifically, the yearly relative increment of the population depending on the survival of the seeds over the winter is, according to (1),
$$\lambda(f_{22}) = d_{21}(b_{22}c_{12} + b_{32}c_{13})f_{22} + d_{11}(b_{11}c_{11}f_{11} + b_{21}c_{12}f_{11})$$
and for the specific values of the other parameters
$$\lambda(f_{22}) = 13f_{22} + 0.0325.$$
Because $f_{22} = 0.1$, we can compute
$$\lambda(0.1) = 1.3325, \quad \lambda(0.1 + \Delta s) = 1.3325 + 13\Delta s, \quad \Delta\lambda(0.1) = 13\Delta s,$$
therefore
$$e(\lambda, 0.1) = \frac{0.1}{1.3325} \cdot \frac{13\Delta s}{\Delta s} = 0.976.$$
Analogically we can compute the flexibility of the reaction of the relative increment A of the population on the other "demographic parameters". The results are summarised in the table
parameter	flexibility	parameter	flexibility
$b_{11}$	0.006	$c_{11}$	0.006
$b_{21}$	0.019	$c_{12}$	0.244
$b_{22}$	0.225	$c_{13}$	0.751
$b_{32}$	0.750	$f_{11}$	0.024
$d_{11}$	0.024	$f_{22}$	0.976
$d_{21}$	0.976		
From it we can see that the increment $\lambda$ is mostly influenced by the number of the seeds that overwinter (parameter $d_{21}$) and their survivability (parameter $f_{22}$). This revelation is not surprising. Farmers have been aware of this fact since neolithic times. The result shows that the mathematical model adequately describes the reality.
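The flexibilities can be recomputed with a short Python sketch based on formula (1) and the definition of $e(\lambda, s)$. The function names `growth_rate` and `flexibility`, the dictionary layout and the relative step are illustrative assumptions. As in the table, $d_{11}$ and $d_{21}$ are varied independently. Since $\lambda$ is linear in each single parameter, the size of the step does not matter.

```python
def growth_rate(p):
    """Yearly increment lambda = CBFD written out as in formula (1)."""
    return (p["d21"] * (p["b22"] * p["c12"] + p["b32"] * p["c13"]) * p["f22"]
            + p["d11"] * (p["b11"] * p["c11"] + p["b21"] * p["c12"]) * p["f11"])

def flexibility(p, name, rel_step=1e-3):
    """Relative change of lambda divided by the relative change of one parameter."""
    q = dict(p)
    q[name] = p[name] * (1 + rel_step)
    return (growth_rate(q) - growth_rate(p)) / growth_rate(p) / rel_step

params = {"b11": 0.3, "b21": 0.1, "b22": 0.6, "b32": 0.2,
          "c11": 1, "c12": 10, "c13": 100,
          "d11": 0.5, "d21": 0.5, "f11": 0.05, "f22": 0.1}
table = {name: round(flexibility(params, name), 3) for name in params}
for name, value in table.items():
    print(name, value)
```

The output reproduces the table up to rounding in the last digit.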
3.G.4. Consider the following Leslie model in which a farmer breeds sheep. The birth-rate of sheep depends only on their age and on average is 2 lambs per sheep between one and two years of age, 5 lambs per sheep between two and three years of age and 2 lambs per sheep between three and four years of age. Younger sheep do not deliver any lambs. Every year, half of the sheep die, uniformly distributed among all age groups. Every sheep older than four years is sent to the butchery. The farmer would like to sell (living) lambs younger than one year for their skin. What proportion of the lambs can be sold every year to ensure that the size of the herd remains the same? In what ratio will the sheep then be distributed among individual age categories?
Solution. The matrix of the model (without action of the farmer) is
$$L = \begin{pmatrix} 0 & 2 & 5 & 2 \\ \frac12 & 0 & 0 & 0 \\ 0 & \frac12 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
The farmer can influence how many sheep younger than one year stay in his herd to the next year, that is, he can influence the element $l_{21}$ of the matrix $L$. Thus we are dealing with the model
$$L = \begin{pmatrix} 0 & 2 & 5 & 2 \\ a & 0 & 0 & 0 \\ 0 & \frac12 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
We are looking for an $a$ such that the matrix has the eigenvalue 1 (we know that it has only one real positive eigenvalue). The characteristic polynomial of this matrix is
$$\lambda^4 - 2a\lambda^2 - \tfrac{5}{2}a\lambda - \tfrac{1}{2}a.$$
If we require it to have 1 as a root, then $1 - 2a - \tfrac52 a - \tfrac12 a = 1 - 5a = 0$, that is, $a = \tfrac15$. The farmer can thus sell $\tfrac12 - \tfrac15 = \tfrac{3}{10}$ of the lambs that are born that year. The corresponding eigenvector for the eigenvalue 1 of the given matrix is $(20, 4, 2, 1)$, and in these ratios the population stabilises.
□
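The value of $a$ and the stable age structure can be double-checked with exact rational arithmetic. The following Python sketch is only an illustration; the helper names `leslie`, `char_poly_at` and `apply_matrix` are chosen here and do not come from the text.

```python
from fractions import Fraction

def leslie(a):
    """Leslie matrix of the herd; a is the fraction of lambs kept in the herd."""
    h = Fraction(1, 2)  # yearly survival of the older age groups
    return [[0, 2, 5, 2],
            [a, 0, 0, 0],
            [0, h, 0, 0],
            [0, 0, h, 0]]

def char_poly_at(a, lam):
    """Characteristic polynomial lam^4 - 2a*lam^2 - (5a/2)*lam - a/2."""
    return lam**4 - 2 * a * lam**2 - Fraction(5, 2) * a * lam - Fraction(1, 2) * a

def apply_matrix(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

a = Fraction(1, 5)
sold = Fraction(1, 2) - a          # fraction of lambs that can be sold
v = [20, 4, 2, 1]                  # claimed eigenvector for eigenvalue 1

print("p(1) =", char_poly_at(a, 1))
print("sold fraction =", sold)
print("L v == v:", apply_matrix(leslie(a), v) == v)
```

Using `Fraction` avoids rounding, so the eigenvector check is an exact equality rather than a numerical approximation.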
3.G.5. Consider the Leslie population growth model for the population of rats, divided into three groups according to age: younger than one year, between one year and two years and between two years and three years. Assume that there exists no rat older than three years. The average birth-rate of one rat in individual age categories is the following: in the first group it is zero, in the second and in the third it is 2 rats. The mortality in the second group is zero, that is, the rats that survive their first year die after three years of life. Determine the mortality in the first group, if you know that the population stagnates (the total number of rats does not change). O
3.G.6. Model of evolution of a whale population. For the evolution of a population, females are important. The important factor is not age but fertility. From this point of view we can divide the females into newborns (juvenile), that is, females who are not yet fertile; young fertile females; adult females with the highest fertility; and postclimacterial females who are no longer fertile, but are still important with respect to taking care of newborns and food gathering.
We model the evolution of such a population in time. As the time unit, we choose the time it takes to reach adulthood. A newborn female who survives this interval becomes fertile. The evolution of a young female to full fertility and to the postclimacterial state depends on the environment; that is, the transition to the next category is a random event. Analogously, the death of an individual is also a random event. A young fertile female has fewer children per unit interval than an adult female. We formalise these statements.
Denote by $x_1(t)$, $x_2(t)$, $x_3(t)$, $x_4(t)$ the number of juvenile, young, adult and postclimacterial females at time $t$, respectively. The amount can be expressed as a number of individuals, but also as a number of individuals per unit area (population density), or as a total biomass. Denote further by $p_1$ the probability that a juvenile female survives the unit time interval and becomes fertile, and by $p_2$ and $p_3$ the respective probabilities that a young female becomes adult and that an adult female becomes old. Another random event is the death (positively formulated: survival) of females who do not move to the next category; we denote the survival probabilities by $q_2$, $q_3$ and $q_4$ for young, adult and old females, respectively. Each of the numbers $p_1, p_2, p_3, q_2, q_3, q_4$ is a probability, that is, a number from the interval $[0, 1]$.
A young female can survive without changing her category, reach adulthood, or die; these events are mutually exclusive and together they form a certain event. Thus $p_2 + q_2 \le 1$. For similar reasons $p_3 + q_3 \le 1$. Finally, we denote by $f_2$ and $f_3$ the average number of daughters of a young and of an adult female, respectively. These parameters satisfy $0 < f_2 < f_3$.
The expected number of newborn females in the next time interval is the sum of the daughters of young and of the adult females, that is
$$x_1(t+1) = f_2 x_2(t) + f_3 x_3(t).$$
Denote temporarily by $x_{2,1}(t+1)$ the number of young females at time $t+1$ who were juvenile in the previous time interval.
Denote temporarily by $x_{2,2}(t+1)$ the number of young females who were already fertile at time $t$, survived that time interval, but did not move into adulthood.
The probability $p_1$ that a juvenile female survives the interval can be expressed by classical probability, that is, by the ratio $x_{2,1}(t+1)/x_1(t)$. Similarly, the probability $q_2$ can be expressed as the ratio $x_{2,2}(t+1)/x_2(t)$. The young females at time $t+1$ are exactly those who survived the juvenile stage, together with those who were already fertile, survived and did not move on. Hence
$$x_2(t+1) = x_{2,1}(t+1) + x_{2,2}(t+1) = p_1 x_1(t) + q_2 x_2(t).$$
Similarly, the expected number of fully fertile females is
$$x_3(t+1) = p_2 x_2(t) + q_3 x_3(t)$$
and the expected number of postclimacterial females is
$$x_4(t+1) = p_3 x_3(t) + q_4 x_4(t).$$
Figure 1. Evolution of a population of orca whales. The horizontal axis shows the time in years, the vertical axis the size of the population. The individual areas depict, from below, the numbers of juvenile, young, adult and old females, respectively.
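Now we can denote
$$A = \begin{pmatrix} 0 & f_2 & f_3 & 0 \\ p_1 & q_2 & 0 & 0 \\ 0 & p_2 & q_3 & 0 \\ 0 & 0 & p_3 & q_4 \end{pmatrix}, \qquad x(t) = \begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \end{pmatrix},$$
and rewrite the previous recurrent formulas in matrix form
$$x(t+1) = A\,x(t).$$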
Using this matrix difference equation we can compute the expected number of females in individual categories, if we know the distribution of the population at some initial time.
Specifically, for the population of orca whales the following parameters were observed:
$$p_1 = 0.9775, \quad q_2 = 0.9111, \quad f_2 = 0.0043,$$
$$p_2 = 0.0736, \quad q_3 = 0.9534, \quad f_3 = 0.1132,$$
$$p_3 = 0.0452, \quad q_4 = 0.9804.$$
The time interval is in this case one year.
If we start at the time t = 0 with a unit measure of young females in some unoccupied area, that is, with the vector
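$x(0) = (0, 1, 0, 0)^T$, we can compute
$$x(1) = \begin{pmatrix} 0 & 0.0043 & 0.1132 & 0 \\ 0.9775 & 0.9111 & 0 & 0 \\ 0 & 0.0736 & 0.9534 & 0 \\ 0 & 0 & 0.0452 & 0.9804 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.0043 \\ 0.9111 \\ 0.0736 \\ 0 \end{pmatrix},$$
$$x(2) = \begin{pmatrix} 0 & 0.0043 & 0.1132 & 0 \\ 0.9775 & 0.9111 & 0 & 0 \\ 0 & 0.0736 & 0.9534 & 0 \\ 0 & 0 & 0.0452 & 0.9804 \end{pmatrix} \begin{pmatrix} 0.0043 \\ 0.9111 \\ 0.0736 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.01224925 \\ 0.83430646 \\ 0.13722720 \\ 0.00332672 \end{pmatrix}.$$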
The results of the computation can also be expressed graphically; see Figure 1. Try a computation and a graphical depiction of the results for a different initial distribution of the population. You should observe that the total population grows exponentially, while the ratios of the sizes of the individual groups stabilise at constant values. The matrix $A$ has the eigenvalues
$$\lambda_1 = 1.025441326, \quad \lambda_2 = 0.980400000, \quad \lambda_3 = 0.834222976, \quad \lambda_4 = 0.004835698.$$
The eigenvector associated with the largest eigenvalue $\lambda_1$ is
$$w = (0.03697187, 0.31607121, 0.32290968, 0.32404724);$$
this vector is normalised so that the sum of its components equals 1.
Compare the evolution of the size of the population with the exponential function $F(t) = \lambda_1^t x_0$, where $x_0$ is the total size of the initial population. Compute also the relative distribution of the individual categories in the population after a certain time of evolution, and compare it with the components of the eigenvector $w$. They will be very close. This is caused by the fact that $A$ has only a single eigenvalue with the greatest absolute value. Also, the vector space generated by the eigenvectors associated with the eigenvalues $\lambda_2$, $\lambda_3$, $\lambda_4$ intersects the non-negative orthant only in the zero vector. The structure of the matrix $A$ itself does not ensure such an easily predictable evolution, since it is a reducible matrix (see ??).
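The comparison suggested above can be tried numerically. The following Python sketch iterates $x(t+1) = A x(t)$ with the orca parameters from the text; rescaling the vector to total size 1 after each step is a convenience added here (it does not change the ratios), and the variable names are illustrative.

```python
# Orca parameters as given in the text (time step: one year).
f2, f3 = 0.0043, 0.1132
p1, p2, p3 = 0.9775, 0.0736, 0.0452
q2, q3, q4 = 0.9111, 0.9534, 0.9804

A = [
    [0.0, f2,  f3,  0.0],
    [p1,  q2,  0.0, 0.0],
    [0.0, p2,  q3,  0.0],
    [0.0, 0.0, p3,  q4],
]

def step(matrix, x):
    """One time step x(t+1) = A x(t)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in matrix]

x0 = [0.0, 1.0, 0.0, 0.0]            # unit measure of young females
x2 = step(A, step(A, x0))
print("x(2) =", [round(v, 8) for v in x2])

# Long-run behaviour: rescale to total size 1 every year.
dist = x0
growth = None
for _ in range(2000):
    nxt = step(A, dist)
    growth = sum(nxt) / sum(dist)    # yearly growth factor of the total
    total = sum(nxt)
    dist = [v / total for v in nxt]

print("yearly growth factor ~", round(growth, 6))
print("relative distribution ~", [round(v, 6) for v in dist])
```

The growth factor should settle near $\lambda_1$ and the relative distribution near the components of $w$.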
3.G.7. Model of growth of a population of teasels Dipsacus sylvestris. This plant can be seen in four stages: either as a blossoming plant, or as a rosette of leaves of one of three sizes (small, medium and large). The life cycle of this monoecious perennial plant can be described as follows.
A blossoming plant produces a number of seeds in late summer and dies. Some of the seeds sprout in that year into a rosette of leaves, usually of medium size. Other seeds spend the winter in the ground. Some of the seeds in the ground sprout in the spring into a rosette, but because they were weakened during the winter, the size is usually small. After three or more winters the "sleeping" (formally, dormant) seeds die as they lose the ability to sprout. Depending on the environment of the plant, a small or medium rosette can grow during the year, and any rosette can stay in its category or die (wither, be eaten by insects, etc.). A medium or large rosette can burst into a flower in the next year. A blossoming flower then produces seeds and the cycle repeats.
In order to be able to predict the spreading of the population of teasels, we need to quantify the described events. Botanists discovered that a blossoming plant produces on average 431 seeds. The probabilities that a seed sprouts, that a rosette grows or bursts into a flower are summarised in the following table:
event	probability
seed produced by a flower dies	0.172
seed sprouts into a small rosette in the current year	0.008
seed sprouts into a medium rosette in the current year	0.070
seed sprouts into a large rosette in the current year	0.002
seed sprouts into a small rosette after spending the winter	0.013
seed sprouts into a medium rosette after spending the winter	0.007
seed sprouts into a large rosette after spending the winter	0.001
seed sprouts into a small rosette after spending two winters	0.010
seed dies after spending one winter	0.013
small rosette survives but does not grow	0.125
medium rosette survives but does not grow	0.238
large rosette survives but does not grow	0.167
small rosette grows into a medium one	0.125
small rosette grows into a large one	0.036
medium rosette grows into a large one	0.245
medium rosette bursts into a flower	0.023
large rosette bursts into a flower	0.750
Note that all the relevant events in the life cycle have their probabilities given and that the events are mutually exclusive.
Imagine that we always observe the population at the beginning of the vegetative year, say in March, and that all considered events take place in the rest of the year, say from April to February. In the population there are blossoming flowers, rosettes of three sizes, produced seeds and seeds that have been dormant for a year or two. This leads to a division of the population into seven classes: just-produced seeds, seeds dormant for one year, seeds dormant for two years, small, medium and large rosettes, and blossoming flowers. But the just-produced seeds either turn into rosettes or become dormant within the same year, so they do not form a separate category. We denote:
$x_1(t)$: the number of seeds dormant for one year in the spring of the year $t$,
$x_2(t)$: the number of seeds dormant for two years in the spring of the year $t$,
$x_3(t)$: the number of small rosettes in the spring of the year $t$,
$x_4(t)$: the number of medium rosettes in the spring of the year $t$,
$x_5(t)$: the number of large rosettes in the spring of the year $t$,
$x_6(t)$: the number of blossoming flowers in the spring of the year $t$.
The number of seeds produced in the year $t$ is $431\,x_6(t)$. The probability that a seed stays dormant for the first year equals the probability that the seed does not sprout into a rosette and does not die, that is, $1 - (0.008 + 0.070 + 0.002 + 0.172) = 0.748$.
The expected number of seeds dormant for winter in the next year is thus
$$x_1(t+1) = 0.748 \cdot 431\,x_6(t) = 322.388\,x_6(t).$$
The probability that a seed that has been dormant for one year stays dormant for the second year equals the probability that the dormant seed does not sprout into a rosette and that it does not die, that is, $1 - 0.013 - 0.007 - 0.001 - 0.013 = 0.966$. The expected number of seeds dormant for two winters is thus
$$x_2(t+1) = 0.966\,x_1(t).$$
A small rosette can sprout from a seed immediately, from a seed dormant for one year or from a seed dormant for two years. The expected number of small rosettes sprouted from non-dormant seeds in the year $t$ equals $0.008 \cdot 431\,x_6(t) = 3.448\,x_6(t)$. The expected numbers of small rosettes sprouted from the seeds dormant for one and two years are $0.013\,x_1(t)$ and $0.010\,x_2(t)$, respectively. Besides these newly sprouted small rosettes, the population also contains the older small rosettes that have not grown yet; of those there are $0.125\,x_3(t)$. The total expected number of small rosettes is thus
$$x_3(t+1) = 0.013\,x_1(t) + 0.010\,x_2(t) + 0.125\,x_3(t) + 3.448\,x_6(t).$$
Analogously we determine the expected numbers of medium and large rosettes:
$$x_4(t+1) = 0.007\,x_1(t) + 0.125\,x_3(t) + 0.238\,x_4(t) + 0.070 \cdot 431\,x_6(t) = 0.007\,x_1(t) + 0.125\,x_3(t) + 0.238\,x_4(t) + 30.170\,x_6(t),$$
$$x_5(t+1) = 0.245\,x_4(t) + 0.167\,x_5(t) + 0.002 \cdot 431\,x_6(t) = 0.245\,x_4(t) + 0.167\,x_5(t) + 0.862\,x_6(t).$$
A blossoming flower can arise either from a medium or from a large rosette. The expected number of blossoming flowers is thus
$$x_6(t+1) = 0.023\,x_4(t) + 0.750\,x_5(t).$$
We have thus obtained six recurrent formulas for the individual components of the investigated population. We now denote
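$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 322.388 \\ 0.966 & 0 & 0 & 0 & 0 & 0 \\ 0.013 & 0.010 & 0.125 & 0 & 0 & 3.448 \\ 0.007 & 0 & 0.125 & 0.238 & 0 & 30.170 \\ 0.008 & 0 & 0.038 & 0.245 & 0.167 & 0.862 \\ 0 & 0 & 0 & 0.023 & 0.750 & 0 \end{pmatrix}, \qquad x(t) = \begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \\ x_5(t) \\ x_6(t) \end{pmatrix},$$
and write the previous equalities in matrix form suitable for the computation
$$x(t+1) = A\,x(t).$$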
If we know the distribution of the individual components of the population in some initial year $t = 0$, then we can compute the expected numbers of flowers and seeds in the following years. We can also compute the total number of individuals $n(t)$ at the time $t$,
$$n(t) = \sum_{i=1}^{6} x_i(t).$$
We can compute the relative distribution of the individual components $x_i(t)/n(t)$, $i = 1, 2, 3, 4, 5, 6$, and the yearly relative change in the population $n(t+1)/n(t)$. The results of such calculations for fifteen years, starting from a single blossoming flower, are given in Table 1. Unlike for the whale population, a picture would not be very clear, as
t	x1(t)	x2(t)	x3(t)	x4(t)	x5(t)	x6(t)	n(t)
0	0.00	0.00	0.00	0.00	0.00	1.00	1.00
1	322.39	0.00	3.45	30.17	0.86	0.00	356.87
2	0.00	311.43	4.62	9.87	10.25	1.34	337.50
3	432.13	0.00	8.31	43.37	5.46	7.91	497.18
4	2550.50	417.44	33.93	253.07	22.13	5.09	3282.16
5	1641.69	2463.78	59.13	235.96	91.78	22.42	4514.76
6	7227.10	1585.88	130.67	751.37	107.84	74.26	9877.12
7	23941.29	6981.37	382.20	2486.25	328.89	98.16	34218.17
8	31646.56	23127.29	767.29	3768.67	954.73	303.85	60568.39
9	97958.56	30570.58	1786.27	10381.63	1627.01	802.72	143126.78
10	258788.42	94627.97	4570.24	27597.99	4358.70	1459.04	391402.36
11	470376.19	249989.61	9912.57	52970.28	10991.08	3903.78	798143.52
12	1258532.41	454383.40	23314.10	134915.73	22317.98	9461.62	1902925.24
13	3050314.29	1215742.31	56442.70	329291.15	55891.57	19841.54	4727523.56
14	6396675.73	2946603.60	127280.49	705398.22	133660.97	49492.37	10359111.38
15	15955747.76	6179188.75	299182.59	1721756.52	293816.44	116469.89	24566161.94
t	x1(t)/n(t)	x2(t)/n(t)	x3(t)/n(t)	x4(t)/n(t)	x5(t)/n(t)	x6(t)/n(t)	n(t+1)/n(t)
0	0.000	0.000	0.000	0.000	0.000	1.000	356.868
1	0.903	0.000	0.010	0.085	0.002	0.000	0.946
2	0.000	0.923	0.014	0.029	0.030	0.004	1.473
3	0.869	0.000	0.017	0.087	0.011	0.016	6.602
4	0.777	0.127	0.010	0.077	0.007	0.002	1.376
5	0.364	0.546	0.013	0.052	0.020	0.005	2.188
6	0.732	0.161	0.013	0.076	0.011	0.008	3.464
7	0.700	0.204	0.011	0.073	0.010	0.003	1.770
8	0.522	0.382	0.013	0.062	0.016	0.005	2.363
9	0.684	0.214	0.012	0.073	0.011	0.006	2.735
10	0.661	0.242	0.012	0.071	0.011	0.004	2.039
11	0.589	0.313	0.012	0.066	0.014	0.005	2.384
12	0.661	0.239	0.012	0.071	0.012	0.005	2.484
13	0.645	0.257	0.012	0.070	0.012	0.004	2.191
14	0.617	0.284	0.012	0.068	0.013	0.005	2.371
15	0.650	0.252	0.012	0.070	0.012	0.005	
Table 1. Modelled evolution of the population of teasels Dipsacus sylvestris. Sizes of the individual components of the population, the total size of the population, the relative distribution of the individual components of the population and the relative increments of the sizes.
the numbers of flowers are negligible compared to the numbers of seeds (the individual areas for flowers would merge in the picture).
The matrix $A$ has the eigenvalues
$$\lambda_1 = 2.3339, \quad \lambda_2 = -0.9569 + 1.4942i, \quad \lambda_3 = -0.9569 - 1.4942i,$$
$$\lambda_4 = 0.1187 + 0.1953i, \quad \lambda_5 = 0.1187 - 0.1953i, \quad \lambda_6 = -0.1274.$$
The eigenvector associated with the eigenvalue $\lambda_1$ is
$$w = (0.6377, 0.2640, 0.0122, 0.0693, 0.0122, 0.0046);$$
this vector is normalised so that the sum of its components is one. With increasing time $t$, the relative increment in the size of the population approaches the eigenvalue $\lambda_1$, and the relative distribution of the components in the population approaches the components of the normalised eigenvector associated with the eigenvalue $\lambda_1$. Every non-negative matrix that has non-zero elements in the same positions as $A$ is primitive. The evolution of the population necessarily approaches a stable structure.
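The table and the limiting behaviour can be reproduced by iterating the recurrence directly. The Python sketch below uses the matrix $A$ exactly as printed above and starts from one blossoming flower; the function name `step`, the number of iterations and the rescaling in the long-run loop are choices made here for illustration.

```python
A = [
    [0.0,   0.0,   0.0,   0.0,   0.0,   322.388],
    [0.966, 0.0,   0.0,   0.0,   0.0,   0.0],
    [0.013, 0.010, 0.125, 0.0,   0.0,   3.448],
    [0.007, 0.0,   0.125, 0.238, 0.0,   30.170],
    [0.008, 0.0,   0.038, 0.245, 0.167, 0.862],
    [0.0,   0.0,   0.0,   0.023, 0.750, 0.0],
]

def step(matrix, x):
    """One year of evolution, x(t+1) = A x(t)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in matrix]

# Fifteen years starting from a single blossoming flower.
x = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
totals = [sum(x)]
for t in range(15):
    x = step(A, x)
    totals.append(sum(x))
for t in range(4):
    print(t, round(totals[t], 2), round(totals[t + 1] / totals[t], 3))

# Long run: rescale to total 1 each year and watch the growth factor.
dist = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
growth = None
for _ in range(500):
    nxt = step(A, dist)
    growth = sum(nxt) / sum(dist)
    total = sum(nxt)
    dist = [v / total for v in nxt]

print("growth factor ~", round(growth, 4))
print("distribution ~", [round(v, 4) for v in dist])
```

The first printed rows correspond to the columns $n(t)$ and $n(t+1)/n(t)$ of the table; the long-run values can be compared with $\lambda_1$ and $w$.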
3.G.8. Nonlinear model of population. Investigate in detail the evolution of the population for a non-linear model from the text book (1.12), with K = 1 and
i) rate of growth $r = 1$ and the initial state $p(1) = 0.2$,
ii) rate of growth $r = 1$ and the initial state $p(1) = 2$,
iii) rate of growth $r = 1$ and the initial state $p(1) = 3$,
iv) rate of growth $r = 2.2$ and the initial state $p(1) = 0.2$,
v) rate of growth $r = 3$ and the initial state $p(1) = 0.2$.
Compute some members of the sequence and predict the future growth of the population.
Solution.
i) The first seven members of the sequence $p(n)$ are in the following table. We can see that the size of the population converges to the value 1.
n	P(n)
1	0.2
2	0.36
3	0.5904
4	0.83222784
5	0.971852502
6	0.999207718
7	0.999999372
Graph of the evolution of the population for $r = 1$ and $p(1) = 0.2$:
ii) For the initial value $p(1) = 2$ we obtain $p(2) = 0$ and after that the population does not change.
iii) For $p(1) = 3$ we obtain
n	P(n)
1	3
2	-3
3	-15
4	-255
5	-65535
and from there we see that the population decreases below all bounds.
iv) For the rate of growth $r = 2.2$ and the initial state $p(1) = 0.2$ we obtain
n	P(n)
1	0.2
2	0.552
3	1.0960512
4	0.864441727
5	1.122242628
6	0.820433675
7	1.144542647
8	0.780585155
9	1.157383491
10	0.756646772
11	1.161738128
12	0.748363958
13	1.162657716
14	0.74660417
Instead of convergence we obtain in this case an oscillation: after some time the population jumps between the values of approximately 1.16 and 0.75. The graph of the evolution of the population for $r = 2.2$ and $p(1) = 0.2$ then looks as follows:
v) For the rate of growth $r = 3$ and the initial state $p(1) = 0.2$ we obtain
n	P(n)
1	0.2
2	0.68
3	1.3328
4	0.00213248
5	0.008516278
6	0.033847529
7	0.131953152
8	0.475577705
9	1.223788359
10	0.402179593
11	1.123473097
12	0.707316989
13	1.328375987
14	0.019755658
15	0.077851775
16	0.293224403
17	0.91495596
18	1.148390614
19	0.63715945
20	1.330721306
21	0.010427642
22	0.041384361
23	0.160399447
In this case the situation is more complicated - the population starts oscillating between more values. In order to be able to see between what values, we would need to compute more members. For the members from the table we have the following graph:
□
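The tables in this solution can be regenerated, and extended to more members, with a few lines of Python. The sketch assumes that the model (1.12) has the form $p(n+1) = p(n) + r\,p(n)\,(1 - p(n)/K)$; this form is not quoted in this section, but it reproduces the tabulated values. The function name `logistic_orbit` is chosen here.

```python
def logistic_orbit(r, p1, steps, K=1.0):
    """Return [p(1), ..., p(steps)] for the assumed recurrence
    p(n+1) = p(n) + r * p(n) * (1 - p(n) / K)."""
    p = p1
    orbit = [p]
    for _ in range(steps - 1):
        p = p + r * p * (1 - p / K)
        orbit.append(p)
    return orbit

cases = [(1, 0.2, 7), (1, 2, 4), (1, 3, 5), (2.2, 0.2, 14), (3, 0.2, 23)]
for r, p1, steps in cases:
    orbit = logistic_orbit(r, p1, steps)
    print(f"r = {r}, p(1) = {p1}:", [round(v, 9) for v in orbit])
```

Increasing `steps` for $r = 3$ is an easy way to look for the values between which the population eventually oscillates.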
3.G.9. In a laboratory, an experiment is carried out with the same probability of success and failure. If the experiment succeeds, the probability of success of the second experiment is 0.7. If the first experiment fails, the probability of success of the second experiment is only 0.6.
This process is continued indefinitely. For any $n \in \mathbb{N}$ determine the probability that the $n$-th experiment is successful.
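Solution. Introduce the probabilistic vector
$$x_n = (x_n^1, x_n^2)^T, \quad n \in \mathbb{N},$$
where $x_n^1$ is the probability of the success of the $n$-th experiment and $x_n^2 = 1 - x_n^1$ is the probability of its failure. According to the statement,
$$x_1 = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}$$
and hence also
$$x_2 = \begin{pmatrix} 0.7 & 0.6 \\ 0.3 & 0.4 \end{pmatrix} \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix} = \begin{pmatrix} 13/20 \\ 7/20 \end{pmatrix}.$$
Using the notation
$$T = \begin{pmatrix} 7/10 & 3/5 \\ 3/10 & 2/5 \end{pmatrix},$$
it holds that
$$(1) \qquad x_{n+1} = T \cdot x_n, \quad n \in \mathbb{N},$$
because the probabilistic vector $x_{n+1}$ depends only on $x_n$, and this dependency is the same as the dependency of $x_2$ on $x_1$. From the relation (1) we have directly
$$(2) \qquad x_{n+1} = T \cdot T \cdot x_{n-1} = \cdots = T^n \cdot x_1, \quad n \ge 2,\ n \in \mathbb{N}.$$
Therefore we express $T^n$, $n \in \mathbb{N}$. This is a Markov process, and thus 1 is an eigenvalue of the matrix $T$. The second eigenvalue is $0.1$; this follows for instance from the fact that the trace (the sum of the elements on the diagonal) equals the sum of the eigenvalues (every eigenvalue is counted with its algebraic multiplicity). To these eigenvalues then correspond the eigenvectors
$$\begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$
We thus obtain
$$T = \begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 1/10 \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix}^{-1},$$
that is, for $n \in \mathbb{N}$ we have
$$T^n = \begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} 1^n & 0 \\ 0 & 10^{-n} \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix}^{-1}.$$
Substitution of
$$\begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix}^{-1} = \frac13 \begin{pmatrix} 1 & 1 \\ 1 & -2 \end{pmatrix}$$
and multiplication yields
$$T^n = \frac13 \begin{pmatrix} 2 + 10^{-n} & 2 - 2 \cdot 10^{-n} \\ 1 - 10^{-n} & 1 + 2 \cdot 10^{-n} \end{pmatrix}, \quad n \in \mathbb{N}.$$
From there, from (1) and from (2) it follows that
$$x_{n+1} = \left( \frac23 - \frac{1}{6 \cdot 10^n},\ \frac13 + \frac{1}{6 \cdot 10^n} \right)^T, \quad n \in \mathbb{N}.$$
In particular, we see that for large $n$ the probability of success of the $n$-th experiment is close to $2/3$. □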
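The closed formula for the probability of success can be compared with a direct iteration of the Markov chain. The sketch below is an illustration in exact rational arithmetic; the transition probabilities are those of the exercise, while the names `step` and `success` are introduced here.

```python
from fractions import Fraction as F

# Column-stochastic transition matrix: columns = previous success / failure.
T = [[F(7, 10), F(3, 5)],
     [F(3, 10), F(2, 5)]]

def step(matrix, v):
    """One step x_{n+1} = T x_n."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

x = [F(1, 2), F(1, 2)]          # x_1: first experiment is a fair coin
success = []                    # success[k] = probability for experiment k+1
for _ in range(8):
    success.append(x[0])
    x = step(T, x)

# Closed form derived above: success of experiment n+1 is 2/3 - 1/(6*10^n).
closed = [F(2, 3) - F(1, 6 * 10**n) for n in range(8)]

print([str(p) for p in success])
print("iteration agrees with closed form:", success == closed)
```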
3.G.10. A student in a student dormitory is very "socially tired". As a result, he is not able to fully perceive the universe around him and coordinate his movements. In this state he decides to invite his friend who lives at the end of the hall to the party-in-progress. But at the other end of the hall there lives somebody he definitely does not wish to invite.
He is so "tired" that he manages to make a step in the desired direction only in 53 of 100 attempts (in the remaining 47, he makes a step in exactly the opposite direction).
Assuming that he starts in the middle of the hall and that the distance to both of the doors at the ends corresponds to twenty of his awkward steps, determine the probability that he first reaches the desired door.
o
3.G.11. Let $n \in \mathbb{N}$ persons play the game of "silent post". For simplicity, assume that the first person whispers to the second person exactly one (arbitrarily chosen) of the words "yes", "no". The second person then whispers to the third person whichever of the words "yes", "no" the second person thinks the first person whispered. This continues up to the $n$-th person. If the probability that the word changes (on purpose or accidentally) to the alternative word during one transmission is $p \in (0, 1)$, determine for large $n \in \mathbb{N}$ the probability that the $n$-th person correctly receives the word transmitted by the first person.
Solution. We can view this problem as a Markov chain with two states called "yes" and "no". We say that the process is in the "yes" state at time $m \in \mathbb{N}$ if the $m$-th person thinks that the received word is "yes". For the order of the states "yes", "no", the probabilistic matrix is
$$T = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}.$$
The product of the matrix $T^{m-1}$ and the probabilistic vector of the initial choice of the first person then gives the probabilities of what the $m$-th person thinks. We do not have to compute the powers of this matrix, because all the elements of the matrix $T$ are positive numbers. Furthermore, this matrix is doubly stochastic. Thus for large $n \in \mathbb{N}$ the probabilistic vector is close to the vector $(1/2, 1/2)^T$. The probability that the $n$-th person says "yes" is thus approximately the same as the probability that the $n$-th person says "no", independently of the initial word. For a large number of participants, roughly half of them hear "yes". We repeat that this does not depend on the initial word.
For completeness, we determine what the result would be if we assumed that the probability of a change from "yes" to "no" is for every person equal to $p \in (0, 1)$ and the probability of a change from "no" to "yes" equals a (generally distinct) $q \in (0, 1)$. In this case, for the same order of the states, we obtain the probabilistic matrix
$$T = \begin{pmatrix} 1-p & q \\ p & 1-q \end{pmatrix}.$$
This leads (for large $n \in \mathbb{N}$) to a probabilistic vector close to the vector
$$\left( \frac{q}{p+q},\ \frac{p}{p+q} \right)^T,$$
which follows for instance from the expression of the matrix
$$T^n = \frac{1}{p+q} \left[ \begin{pmatrix} q & q \\ p & p \end{pmatrix} + (1-p-q)^n \begin{pmatrix} p & -q \\ -p & q \end{pmatrix} \right].$$
Again, with sufficiently many people the result does not depend on the initial choice of the word. Simply speaking, in this model the people themselves decide what the transmitted information is; more precisely, they decide the frequency of appearance of "yes" and "no", provided that there are enough of them and no checking is present.
The obtained result was experimentally confirmed. In a psychological experiment, an individual was repeatedly exposed to an event that could be interpreted in two ways. This was done in time intervals that ensured that the subject still remembered the previous event. See for instance "T. Havránek et al.: Matematika pro biologické a lékařské vědy, Praha, Academia 1981", where there is an experiment in which an ambiguous object (say, a drawing of a cube which can be perceived as seen both from the bottom and from the top) is lit up in fixed time intervals. Such a process is a Markov chain with the transition matrix
$$\begin{pmatrix} 1-p & q \\ p & 1-q \end{pmatrix},$$
where $p, q \in (0, 1)$. □
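The formula for $T^n$ in the two-parameter case can be checked against repeated matrix multiplication. The following Python sketch does so in exact arithmetic for one sample pair $p = 1/5$, $q = 1/3$; these sample values and the helper names are assumptions made here for illustration only.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(T, n):
    """T^n by repeated multiplication (n >= 0)."""
    R = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
    for _ in range(n):
        R = mat_mul(R, T)
    return R

def closed_form(p, q, n):
    """(1/(p+q)) * ([[q, q], [p, p]] + (1-p-q)^n * [[p, -q], [-p, q]])."""
    s = p + q
    c = (1 - s) ** n
    return [[(q + c * p) / s, (q - c * q) / s],
            [(p - c * p) / s, (p + c * q) / s]]

p, q = Fraction(1, 5), Fraction(1, 3)      # sample values, chosen arbitrarily
T = [[1 - p, q], [p, 1 - q]]

agree = all(mat_pow(T, n) == closed_form(p, q, n) for n in range(12))
limit = [q / (p + q), p / (p + q)]         # limiting probabilities of yes / no

print("closed form agrees with T^n:", agree)
print("limit vector:", [str(v) for v in limit])
```

Since $|1 - p - q| < 1$, the second term of the closed form tends to zero, which is why both columns of $T^n$ approach the limit vector.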
3.G.12. Petr regularly meets his friend, but he is "well-known" for his bad timekeeping. He is trying to change, though: given that he was late for the previous meeting, he arrives on time in half of the cases, and in one tenth of the cases he even comes early. But if he was on time or early for the last meeting, he returns to his "carelessness": with probability 0.8 he arrives late, and only with probability 0.2 is he on time. What is the probability that he arrives late for the 20th meeting, given that he was on time for the eleventh?
Solution. This is a Markov process with the states "Petr comes late", "Petr comes on time", "Petr comes early", with the probabilistic transition matrix (with the given order of states)
$$T = \begin{pmatrix} 0.4 & 0.8 & 0.8 \\ 0.5 & 0.2 & 0.2 \\ 0.1 & 0 & 0 \end{pmatrix}.$$
The eleventh meeting is determined by the probabilistic vector $(0, 1, 0)^T$ (Petr comes on time). The twentieth meeting corresponds to the vector
$$T^9 \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.571578368 \\ 0.371316224 \\ 0.057105408 \end{pmatrix}.$$
The desired probability is thus $0.571578368$ (exactly). We add that
$$T^9 = \begin{pmatrix} 0.571316224 & 0.571578368 & 0.571578368 \\ 0.371512832 & 0.371316224 & 0.371316224 \\ 0.057170944 & 0.057105408 & 0.057105408 \end{pmatrix}.$$
From this it is seen that the result hardly depends on whether Petr came to the eleventh meeting late (first column), on time (second column) or early (third column). □
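The vector for the twentieth meeting can be recomputed by applying the transition matrix nine times. The Python sketch below uses exact fractions, so the result can be compared digit by digit with the value in the solution; the name `step` is chosen here.

```python
from fractions import Fraction as F

# Order of states: late, on time, early (columns = state at previous meeting).
T = [[F(4, 10), F(8, 10), F(8, 10)],
     [F(5, 10), F(2, 10), F(2, 10)],
     [F(1, 10), F(0),     F(0)]]

def step(matrix, v):
    """One meeting later: x_{n+1} = T x_n."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

x = [F(0), F(1), F(0)]      # eleventh meeting: Petr is on time
for _ in range(9):          # nine transitions lead to the twentieth meeting
    x = step(T, x)

print("late, on time, early:", [float(v) for v in x])
print("components sum to", sum(x))
```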
3.G.13. Two students A and B spend every Monday morning playing a certain computer game. The person who wins then pays for both of them in the evening in the restaurant. The game can also be a draw; then each pays for half the meal. The result of the previous game partially determines the next game. If student A won a week ago, then with probability 3/4 he wins again and with probability 1/4 the game is a draw. A draw is repeated with probability 2/3, and with probability 1/3 the next game is won by B. If student B won a game, then with probability 1/2 he wins again and with probability 1/4 student A wins the next game. Determine the probability that today each of them pays half of the costs, if the first game, played a long time ago, was won by A.
Solution. This is a Markov process with the states "student A wins", "the game ends with a draw", "student B wins" (in this order), with the probabilistic transition matrix
$$T = \begin{pmatrix} 3/4 & 0 & 1/4 \\ 1/4 & 2/3 & 1/4 \\ 0 & 1/3 & 1/2 \end{pmatrix}.$$
We want to find the probability of the transition from the first state to the second after a large number $n \in \mathbb{N}$ of steps (weeks). The matrix $T$ is primitive, because all the elements of
$$T^2 = \begin{pmatrix} 9/16 & 1/12 & 5/16 \\ 17/48 & 19/36 & 17/48 \\ 1/12 & 7/18 & 1/3 \end{pmatrix}$$
are positive. Thus it suffices to find the probabilistic eigenvector $x_\infty$ of the matrix $T$ associated with the eigenvalue 1. It is straightforward to compute
$$x_\infty = \left( \frac27, \frac37, \frac27 \right)^T.$$
For large $n$, the probabilistic vector differs only very slightly from the vector $x_\infty$, and it does not depend on the initial state. For large $n \in \mathbb{N}$, we obtain
$$T^n \approx \begin{pmatrix} 2/7 & 2/7 & 2/7 \\ 3/7 & 3/7 & 3/7 \\ 2/7 & 2/7 & 2/7 \end{pmatrix}.$$
The desired probability is the element of this matrix in the second row of the first column (the second component of the vector $x_\infty$). Hence the result is $3/7$. □
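Both ingredients of this argument, the positivity of $T^2$ and the stationary vector, are easy to confirm by machine. The following Python sketch is one way to do it with exact fractions; the helper names `mat_mul` and `apply_matrix` are assumptions of this illustration.

```python
from fractions import Fraction as F

# Order of states: A wins, draw, B wins (columns = result a week ago).
T = [[F(3, 4), F(0),    F(1, 4)],
     [F(1, 4), F(2, 3), F(1, 4)],
     [F(0),    F(1, 3), F(1, 2)]]

def mat_mul(A, B):
    """Product of two square matrices of the same size."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apply_matrix(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

T2 = mat_mul(T, T)
primitive_witness = all(entry > 0 for row in T2 for entry in row)

x_inf = [F(2, 7), F(3, 7), F(2, 7)]
is_stationary = apply_matrix(T, x_inf) == x_inf

print("all entries of T^2 positive:", primitive_witness)
print("T x_inf == x_inf:", is_stationary)
print("probability of a draw in the long run:", x_inf[1])
```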
3.G.14. Popularity of the media. In a certain country there are two television channels. From a public survey it follows that in one year 1/6 of the viewers of the first channel move to the second, and 1/5 of the viewers of the second move to the first channel.
Determine the time evolution of the number of viewers watching given channels using Markov processes. Write down a matrix of the process, and find its eigenvalues and eigenvectors. O
3.G.15. Students at the lecture. Students can be divided into, say, three groups - those that are present at a lecture and pay attention, those that are present but pay no attention and those who are in a pub instead. Now observe, lecture after lecture, how the numbers in the individual groups change. The first step is to observe what are the probabilities that a student changes his state. Suppose that it is as follows:
A student who pays attention: with probability 50% stays in the same state, with 40% stops paying attention and with 10% moves to the pub. A student who pays no attention: starts paying attention with 10%, with 50% stays in the same state and with 40% moves to the pub. A student who is in the pub has zero probability of returning to the lectures.
How does the model evolve in time? How does the situation change if we assume at least ten percent probability that a student returns from the pub to the lecture (but is not going to pay any attention)?
Solution. The matrix of the Markov process is
$$\begin{pmatrix} 0.5 & 0.1 & 0 \\ 0.4 & 0.5 & 0 \\ 0.1 & 0.4 & 1 \end{pmatrix}.$$
Its characteristic equation is $(0.5 - \lambda)^2 (1 - \lambda) - 0.04\,(1 - \lambda) = 0$. Evidently, 1 is an eigenvalue of this matrix (the other roots are 0.3 and 0.7). In the course of time, the students divide into groups as described by the corresponding eigenvector, which is a solution of the equation
$$\begin{pmatrix} -0.5 & 0.1 & 0 \\ 0.4 & -0.5 & 0 \\ 0.1 & 0.4 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = 0.$$
The solutions are the multiples of the vector $(0, 0, 1)$. In other words, all the students end up in the pub.
Such a result is clear even without any computation: as the probability of returning from the pub is zero, all students eventually end up in the pub. Adding a 10 percent possibility of leaving the pub changes this. The corresponding matrix is now
$$\begin{pmatrix} 0.5 & 0.1 & 0 \\ 0.4 & 0.5 & 0.1 \\ 0.1 & 0.4 & 0.9 \end{pmatrix}.$$
Again the state stabilises on the eigenvector associated with the eigenvalue 1. In this case we want a solution of the equation
$$\begin{pmatrix} -0.5 & 0.1 & 0 \\ 0.4 & -0.5 & 0.1 \\ 0.1 & 0.4 & -0.1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = 0.$$
A solution is for instance the vector $(1, 5, 21)$. The distribution of the students among the individual groups is then given by the multiple of this vector whose coordinates sum to one, that is, by the vector $\left( \frac{1}{27}, \frac{5}{27}, \frac{7}{9} \right)$. Again, most of the students end up in the pub, but some will be at school. □
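The two variants of the model can be compared numerically. The Python sketch below checks the stationary vector of the second matrix exactly and iterates the first matrix from a state in which every student pays attention; this initial state and the number of lectures (100) are assumptions made here for illustration.

```python
from fractions import Fraction as F

def step(matrix, v):
    """One lecture later: x_{n+1} = T x_n."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

# Order of states: attentive, inattentive, in the pub.
no_return = [[F(5, 10), F(1, 10), F(0)],
             [F(4, 10), F(5, 10), F(0)],
             [F(1, 10), F(4, 10), F(1)]]

with_return = [[F(5, 10), F(1, 10), F(0)],
               [F(4, 10), F(5, 10), F(1, 10)],
               [F(1, 10), F(4, 10), F(9, 10)]]

# Stationary distribution claimed for the model with returns from the pub.
stationary = [F(1, 27), F(5, 27), F(7, 9)]
is_fixed = step(with_return, stationary) == stationary

# Without returns: everybody drifts to the pub.
x = [F(1), F(0), F(0)]
for _ in range(100):
    x = step(no_return, x)
not_in_pub = 1 - x[2]

print("stationary vector confirmed:", is_fixed)
print("share not yet in the pub after 100 lectures:", float(not_in_pub))
```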
3.G.16. Roulette. A roulette player has the following strategy: he comes to play with €10. He always bets everything he has. He always bets on black (there are 37 numbers in the roulette: 18 black, 18 red and zero). The player stops as soon as he has either nothing or €80. Consider this problem as a Markov process and write down its matrix.
Solution. In the course of the game and at its end, the player can have only one of the following amounts of money (in €): 0, 10, 20, 40, 80. If we view the situation as a Markov process, then these amounts corresponds to its states, and we construct the matrix:
/l   a   a   a 0\
0 0 0 0 0 A=   0   b   0   0 0
0 0 & 0 0 \0   0   0   b 1/
where $a = \frac{19}{37}$ and $b = \frac{18}{37}$. Note that the matrix is probabilistic and singular. The eigenvalue 1 is a double one. The game does not converge to a single vector $x_\infty$, but ends in one of the eigenvectors associated with the eigenvalue 1, that is, either $(1,0,0,0,0)$ (the player loses it all), or $(0,0,0,0,1)$ (the player wins €80). Furthermore we observe that the game ends after three bets, that is, the sequence $\{A^n\}_{n=1}^{\infty}$ is constant for $n \geq 3$:
$$A^{\infty} := A^3 = A^n = \begin{pmatrix} 1 & a + ab + ab^2 & a + ab & a & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & b^3 & b^2 & b & 1 \end{pmatrix}.$$
We easily determine that the game ends with the probability $a + ab + ab^2 \approx 0.885$ as a loss and with the probability roughly 0.115 as a win of €80. (We multiply the initial vector $(0,1,0,0,0)$ by the matrix $A^{\infty}$ and obtain the vector $(a + ab + ab^2, 0, 0, 0, b^3)$.) □
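The claim that the powers of $A$ are constant from the third one on can be verified mechanically. The sketch below assumes $a = 19/37$ (18 red numbers and the zero) and $b = 18/37$; the helper `mat_mul` and the variable names are illustrative only.

```python
from fractions import Fraction as F

a, b = F(19, 37), F(18, 37)   # probability of losing / winning a single bet

# Columns correspond to the states 0, 10, 20, 40, 80 (in euros).
A = [[1, a, a, a, 0],
     [0, 0, 0, 0, 0],
     [0, b, 0, 0, 0],
     [0, 0, b, 0, 0],
     [0, 0, 0, b, 1]]

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A2 = mat_mul(A, A)
A3 = mat_mul(A2, A)
A4 = mat_mul(A3, A)

p_loss = A3[0][1]   # start with 10 euros (second column), end with nothing
p_win = A3[4][1]    # start with 10 euros, end with 80 euros

print(A3 == A4, float(p_loss), float(p_win))
```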
3.G.17. Consider the situation from the previous case and assume that the probability of both win and loss is 1/2. Denote by $A$ the matrix of the process. Without using any computational software determine $A^{100}$. ○
3.G.18. Based on the temperature at 14:00, the days are divided into warm, average, and cold. From the all-year statistics, after a warm day the next day is warm in 50 % of the cases and is average in 30 % of the cases, after an average day the next day is average in 40 % of the cases, and cold in 30 % of the cases, and after a cold day, the next day is cold in 50 % of the cases, and average in 30 % of the cases.
Without any further information, derive how many warm, cold, and average days can be expected in a year.
Solution. For each day exactly one of the states warm day, average day, cold day is attained. If the vector $x_n$ has as its components the probabilities that a certain ($n$-th) day is warm, average and cold (respectively), then the components of the vector
$$x_{n+1} = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} \cdot x_n$$
show the probabilities that the next day is warm, average and cold respectively. To verify, it suffices to substitute successively
$$x_n = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad x_n = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad x_n = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},$$
while for instance for the third choice we must obtain the probabilities that after a cold day there follows a warm, average and cold day respectively. We see that the problem is a Markov chain problem with the probabilistic transition matrix
$$T = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}.$$
Because all the elements of this matrix are positive, there exists a probabilistic vector
$$x_\infty = \left(x_\infty^1, x_\infty^2, x_\infty^3\right)^T$$
to which the vector $x_n$ approaches as $n$ grows, independently of the vector $x_n$ for small $n$. Furthermore, by the corollary of the Perron-Frobenius theorem, $x_\infty$ is the eigenvector of the matrix $T$ with the eigenvalue 1. Thus
$$\begin{aligned} x_\infty^1 &= 0.5\,x_\infty^1 + 0.3\,x_\infty^2 + 0.2\,x_\infty^3, \\ x_\infty^2 &= 0.3\,x_\infty^1 + 0.4\,x_\infty^2 + 0.3\,x_\infty^3, \\ x_\infty^3 &= 0.2\,x_\infty^1 + 0.3\,x_\infty^2 + 0.5\,x_\infty^3, \\ 1 &= x_\infty^1 + x_\infty^2 + x_\infty^3, \end{aligned}$$
where the last condition means that the vector $x_\infty$ is probabilistic. It is easy to see that this system has a unique solution
$$x_\infty^1 = x_\infty^2 = x_\infty^3 = \frac{1}{3}.$$
Thus we can expect roughly the same number of warm, average and cold days.
We emphasise that the sum of the numbers from any column of the matrix $T$ must equal 1, otherwise it would not be a Markov process. Because $T^T = T$ (the matrix $T$ is symmetric), the sum of all numbers from any row also equals 1. A matrix with non-negative elements in which the sum of the numbers in every column and in every row equals one is called doubly stochastic. An important property of every doubly stochastic primitive matrix (for any dimension, that is, any number of states) is that the corresponding vector $x_\infty$ has all of its components identical. That is, after sufficiently many iterations all the states in the corresponding Markov chain are attained with the same frequency. □
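For the weather matrix, both the doubly stochastic property and the fact that the uniform vector is fixed can be checked directly. This is a small illustrative sketch; the variable names are not taken from the text.

```python
from fractions import Fraction as F

# Transition matrix of the weather example (states: warm, average, cold).
T = [[F(5, 10), F(3, 10), F(2, 10)],
     [F(3, 10), F(4, 10), F(3, 10)],
     [F(2, 10), F(3, 10), F(5, 10)]]

n = len(T)
row_sums = [sum(row) for row in T]
col_sums = [sum(T[i][j] for i in range(n)) for j in range(n)]

# The vector with all components equal to 1/n.
uniform = [F(1, n)] * n
image = [sum(T[i][j] * uniform[j] for j in range(n)) for i in range(n)]

print(row_sums, col_sums, image == uniform)
```

The check `image == uniform` only uses that every row sums to one, which is why the same conclusion holds for any doubly stochastic matrix.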
3.G.19. John goes running every evening. He has three tracks: short, medium and long. Whenever he chooses the short track, the next day he feels bad about it and chooses with equal probabilities between long and medium. Whenever he chooses the long track, the next day he chooses arbitrarily among all three. Whenever he chooses the medium track, the next day he feels good about it, and again chooses with equal probabilities between medium and long. Assume that he has been running like this for very many days. How often does he choose the short track and how often the long track? What is the probability that he chooses the long track when he picked it a week before?
Solution. Clearly it is a Markov process with three possible states - choices for a short, medium or long track. This order of the states gives a probabilistic transition matrix
$$T = \begin{pmatrix} 0 & 0 & 1/3 \\ 1/2 & 1/2 & 1/3 \\ 1/2 & 1/2 & 1/3 \end{pmatrix}.$$
It suffices to observe that (for instance) the second column corresponds to the choice of the medium track during the previous day. This means that with the probability 1/2, a medium track will be chosen (the second row), and with probability 1/2 a long track will be chosen (the third row). Since
$$T^2 = \begin{pmatrix} 1/6 & 1/6 & 1/9 \\ 5/12 & 5/12 & 4/9 \\ 5/12 & 5/12 & 4/9 \end{pmatrix},$$
we can use the corollary of the Perron-Frobenius theorem for Markov chains. It is not difficult to compute the eigenvector corresponding to the eigenvalue 1. It is a probabilistic vector, namely:
$$\left(\frac{1}{7}, \frac{3}{7}, \frac{3}{7}\right)^T.$$
The numbers 1/7, 3/7, 3/7 are then respectively the probabilities that on a randomly chosen day he chooses the short, medium or long track.
Suppose that on a certain day (that is, at time $n \in \mathbb{N}$) John chooses the long track. This corresponds to the probabilistic vector
$$x_n = (0, 0, 1)^T.$$
For the following day,
$$x_{n+1} = \begin{pmatrix} 0 & 0 & 1/3 \\ 1/2 & 1/2 & 1/3 \\ 1/2 & 1/2 & 1/3 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},$$
and after seven days
$$x_{n+7} = T^7 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$
The enumeration gives us as components of $x_{n+7}$ the values 0.142 861 225...; 0.428 569 387...; 0.428 569 387...
Thus the probability that he chooses the long track under the condition that he chose it seven days ago is roughly 0.428 569, which is very close to 3/7 ≈ 0.428 571. □
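The seven-step computation is easy to reproduce. The sketch below applies $T$ seven times to the vector $(0,0,1)^T$ with exact fractions; `mat_vec` is a helper introduced here for illustration.

```python
from fractions import Fraction as F

# States in the order short, medium, long; columns sum to 1.
T = [[F(0),    F(0),    F(1, 3)],
     [F(1, 2), F(1, 2), F(1, 3)],
     [F(1, 2), F(1, 2), F(1, 3)]]

def mat_vec(M, v):
    """Product of a matrix (list of rows) with a column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

x = [F(0), F(0), F(1)]      # the long track is chosen on day n
for _ in range(7):          # advance the chain by seven days
    x = mat_vec(T, x)

print([float(c) for c in x])
```

The exact third component differs from 3/7 only in the sixth decimal place, in agreement with the values quoted above.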
3.G.20. A production line is not reliable: individual products differ significantly in quality. A certain worker tries to improve the quality of the products and intervenes in the process. The products are divided into classes I, II, III according to their quality, and a report found that
• after a product of class I, the next product is of class I in 80 % of the cases and of class II in 10 % of the cases;
• after a product of class II, the next product is of class II in 60 % of the cases and of class I in 20 % of the cases;
• after a product of class III, the next product is of class III in 50 % of the cases and of class II in 25 % of the cases.
Compute the probability that the 18-th product is of the quality I, given that the 16-th product is of quality III.
Solution. First we solve the problem without using a Markov chain. Since the 16-th product is of class III, the event in question consists of the cases
• the 17-th product is of class I and the 18-th product is of class I;
• the 17-th product is of class II and the 18-th product is of class I;
• the 17-th product is of class III and the 18-th product is of class I,
with probabilities respectively
• 0.25 · 0.8 = 0.2;
• 0.25 · 0.2 = 0.05;
• 0.5 · 0.25 = 0.125.
Thus the solution is
0.375 = 0.2 + 0.05 + 0.125.
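Now view the problem as a Markov process. The statement yields the probabilistic matrix
$$\begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix}.$$
The situation that the product is in class III is given by the probabilistic vector $(0, 0, 1)^T$. For the next product we obtain the probabilistic vector
$$\begin{pmatrix} 0.25 \\ 0.25 \\ 0.5 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$
For the next product in order there follows the vector
$$\begin{pmatrix} 0.375 \\ 0.3 \\ 0.325 \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 & 0.25 \\ 0.1 & 0.6 & 0.25 \\ 0.1 & 0.2 & 0.5 \end{pmatrix} \cdot \begin{pmatrix} 0.25 \\ 0.25 \\ 0.5 \end{pmatrix}.$$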
The first component is the desired probability.
Notice that the first method of the solution (without using the Markov process) led to the result faster and easier. But notice also how unclear it would become if we wanted to compute, say, the 22-nd or 30-th product. For the second method one can in a sense restrict the computations to the relevant parts of the matrices only instead of mindlessly multiplying the whole matrix. When using the Markov process, we have also directly obtained the probabilities that the 18-th product belongs to the class II and III. □
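The Markov chain method becomes convenient as soon as more steps are needed, because only a matrix-vector product has to be repeated. A minimal sketch follows; the names `x16`, `x17`, `x18` and the helper `mat_vec` are illustrative.

```python
from fractions import Fraction as F

# Columns: previous product of class I, II, III; rows: next product of class I, II, III.
T = [[F(80, 100), F(20, 100), F(25, 100)],
     [F(10, 100), F(60, 100), F(25, 100)],
     [F(10, 100), F(20, 100), F(50, 100)]]

def mat_vec(M, v):
    """Product of a matrix (list of rows) with a column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

x16 = [F(0), F(0), F(1)]    # the 16-th product is of class III
x17 = mat_vec(T, x16)
x18 = mat_vec(T, x17)

print([float(p) for p in x18])
```

To reach, say, the 22-nd product one would simply apply `mat_vec` four more times.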
3.G.21. Repeated dice casting. Write down the probabilistic transition matrix $T$ for the Markov chain with states "maximum resulting number after $n$ attempts" with the order of the states 1, ..., 6. Then determine $T^n$ for every $n \in \mathbb{N}$.
Solution. We can immediately write
$$T = \begin{pmatrix} 1/6 & 0 & 0 & 0 & 0 & 0 \\ 1/6 & 2/6 & 0 & 0 & 0 & 0 \\ 1/6 & 1/6 & 3/6 & 0 & 0 & 0 \\ 1/6 & 1/6 & 1/6 & 4/6 & 0 & 0 \\ 1/6 & 1/6 & 1/6 & 1/6 & 5/6 & 0 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1 \end{pmatrix}.$$
The first column is determined by the state 1: with probability 1/6 it is preserved (that is, the next result is one) and with probability 1/6 there is a transition into any of the other states 2, ..., 6 (the result on the dice would be 2, ..., 6). The second column is given by the state 2: with probability 2/6 it is preserved (the result is 1 or 2) and with probability 1/6 there is a transition into any of the other states 3, ..., 6 (the result would be 3, ..., 6). The last column is derived from the fact that the state 6 is persistent. That is, if 6 has already been cast, no greater result is possible. For $n \in \mathbb{N}$, we can directly determine $T^n$:
$$T^n = \begin{pmatrix} a^n & 0 & 0 & 0 & 0 & 0 \\ b^n - a^n & b^n & 0 & 0 & 0 & 0 \\ c^n - b^n & c^n - b^n & c^n & 0 & 0 & 0 \\ d^n - c^n & d^n - c^n & d^n - c^n & d^n & 0 & 0 \\ e^n - d^n & e^n - d^n & e^n - d^n & e^n - d^n & e^n & 0 \\ 1 - e^n & 1 - e^n & 1 - e^n & 1 - e^n & 1 - e^n & 1 \end{pmatrix},$$
where $a = 1/6$, $b = 2/6$, $c = 3/6$, $d = 4/6$, $e = 5/6$.
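The closed form for $T^n$ can be compared with repeated matrix multiplication. The sketch below is only an illustration with freely chosen function names; indices are shifted by one, so the index $j$ stands for the state $j + 1$.

```python
from fractions import Fraction as F

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix():
    T = [[F(0)] * 6 for _ in range(6)]
    for j in range(6):                 # column j: the current maximum is j + 1
        T[j][j] = F(j + 1, 6)          # a throw of at most j + 1 keeps the maximum
        for i in range(j + 1, 6):
            T[i][j] = F(1, 6)          # a throw of i + 1 becomes the new maximum
    return T

def closed_form(n):
    """Entries of T^n: ((j+1)/6)^n on the diagonal, ((i+1)/6)^n - (i/6)^n below it."""
    P = [[F(0)] * 6 for _ in range(6)]
    for j in range(6):
        P[j][j] = F(j + 1, 6) ** n
        for i in range(j + 1, 6):
            P[i][j] = F(i + 1, 6) ** n - F(i, 6) ** n
    return P

def power(M, n):
    """M^n for n >= 1 by repeated multiplication."""
    R = M
    for _ in range(n - 1):
        R = mat_mul(R, M)
    return R

T = transition_matrix()
agree = all(power(T, n) == closed_form(n) for n in range(1, 8))
print(agree)
```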
The numbers in the first column correspond successively to the probabilities that $n$ times in a row the result is 1; that $n$ times in a row the result is 1 or 2 and there was at least one 2 (therefore we subtract the probability given in the first row); that $n$ times in a row the result is 1, 2 or 3 and at least once the result is 3; and so on, up to the last row, where there is the probability that at least once during $n$ throws the result is 6 (this can be easily derived from the probability of the complementary event). Similarly, in the fourth column are the non-zero probabilities of the events
"$n$ times in a row the result is 1, 2, 3 or 4",
"$n$ times in a row the result is 1, 2, 3, 4 or 5 and at least once it is 5",
"at least once during $n$ attempts the result is 6".
Interpretation of the matrix $T$ as the probabilistic transition matrix of a Markov process allows for a quick expression of the powers $T^n$, $n \in \mathbb{N}$. □
3.G.22. In this problem we deal with a certain property of an animal species which is determined independently of sex but just by a certain gene - a pair of alleles. Every individual gains one allele from each parent, randomly and independently. There are forms of the gene given by various alleles a, A - they form three possible states aa, aA = Aa and AA of the property.
(a) Assume that each individual of a certain population mates only with an individual of another population, where there appears only the property caused by the pair aA. Exactly one of their offspring (a randomly chosen one) will be left on the spot and he will also mate only with an individual of that specific population, and so on. Determine the probabilities of appearance of aa, aA, AA in the considered population after certain time.
(b) Solve the problem given in the case (a), if the other population is composed only of individuals with the pair AA.
(c) Two randomly chosen individuals of opposite sex are bred. From their progeny, again two individuals of opposite sex are randomly chosen and bred. If this goes on for a long time, compute the probability that both bred individuals have the pair of alleles AA, or aa, when the process of breeding ends.
(d) Solve the problem from case (c) without the condition that the individuals have the same parent. Thus just breed random individuals from a population among them, then breed among their progeny, and so on.
Solution. Case (a). This is a Markov process given by the matrix
$$T = \begin{pmatrix} 1/2 & 1/4 & 0 \\ 1/2 & 1/2 & 1/2 \\ 0 & 1/4 & 1/2 \end{pmatrix}.$$
The order of the states corresponds to the order of the pairs of alleles aa, aA, AA. The numbers in the first column follow from the fact that an offspring of parents with alleles aa and aA has probability 1/2 for the pair aa and probability 1/2 for the pair aA. Similarly for the third column. The numbers in the second column follow from the fact that each of the four cases of the pairs of alleles aa, aA, Aa, AA has the same probability for an individual whose both parents have the pair aA.
Note that there is a difference between counting probability, where we must distinguish between aA and Aa (which allele comes from which parent), and investigating just the properties caused by the pairs aA and Aa, which are then the same. For determining the resulting state it thus suffices to find the probabilistic vector associated with the eigenvalue 1 of the matrix $T$, because the matrix
$$T^2 = \begin{pmatrix} 3/8 & 1/4 & 1/8 \\ 1/2 & 1/2 & 1/2 \\ 1/8 & 1/4 & 3/8 \end{pmatrix}$$
satisfies the condition of the Perron-Frobenius theorem, that is, all of its elements are positive. The probabilistic vector is
$$\left(\frac{1}{4}, \frac{1}{2}, \frac{1}{4}\right)^T,$$
which gives the probabilities 1/4, 1/2, 1/4 of the appearance of the combinations aa, aA and AA respectively after a very long (theoretically infinite) time.
Case (b). For the order of the pairs of alleles AA, aA, aa we obtain the probabilistic matrix
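$$T = \begin{pmatrix} 1 & 1/2 & 0 \\ 0 & 1/2 & 1 \\ 0 & 0 & 0 \end{pmatrix}.$$
The eigenvalues are 1, 1/2 and 0. To these eigenvalues correspond respectively the eigenvectors
$$\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix}, \quad \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}.$$
Therefore
$$T = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{pmatrix}.$$
From there for arbitrary $n \in \mathbb{N}$ it follows that
$$T^n = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \end{pmatrix}^n \cdot \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2^{-n} & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{pmatrix}.$$
Since $\lim_{n \to \infty} 2^{-n} = 0$,
$$T^n \approx \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & -2 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$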
for large n. Thus if individuals of the original population procreate exclusively with the member of the specific population (the one which has only AA), then necessarily after a sufficient amount of breeding there is a total elimination of the pairs aA and aa.
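The diagonalisation used in case (b) can be checked with exact arithmetic. In the sketch below the columns of `P` are the eigenvectors $(1,0,0)^T$, $(-1,1,0)^T$, $(1,-2,1)^T$ for the eigenvalues 1, 1/2, 0; the names `P`, `P_inv` and `power_by_diagonalisation` are illustrative assumptions.

```python
from fractions import Fraction as F

# Case (b): states in the order AA, aA, aa; columns sum to 1.
T = [[F(1), F(1, 2), F(0)],
     [F(0), F(1, 2), F(1)],
     [F(0), F(0),    F(0)]]

P     = [[1, -1,  1], [0, 1, -2], [0, 0, 1]]   # eigenvectors as columns
P_inv = [[1,  1,  1], [0, 1,  2], [0, 0, 1]]   # inverse of P

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power_by_diagonalisation(n):
    """T^n = P * diag(1, 2^(-n), 0) * P^(-1) for n >= 1."""
    D_n = [[1, 0, 0], [0, F(1, 2) ** n, 0], [0, 0, 0]]
    return mat_mul(mat_mul(P, D_n), P_inv)

# T^5 by plain repeated multiplication, for comparison.
T5 = T
for _ in range(4):
    T5 = mat_mul(T5, T)

print(T5 == power_by_diagonalisation(5))
```

Already for $n = 5$ the first row is $(1, 31/32, 15/16)$, which illustrates how quickly the pairs aA and aa are eliminated.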
Case (c). There are 6 possible states (in this order)
AA, AA;   aA, AA;   aa, AA;   aA, aA;   aa, aA;   aa, aa,
where the states are given by the genotypes of the parents. The matrix of the corresponding Markov chain is
$$T = \begin{pmatrix} 1 & 1/4 & 0 & 1/16 & 0 & 0 \\ 0 & 1/2 & 0 & 1/4 & 0 & 0 \\ 0 & 0 & 0 & 1/8 & 0 & 0 \\ 0 & 1/4 & 1 & 1/4 & 1/4 & 0 \\ 0 & 0 & 0 & 1/4 & 1/2 & 0 \\ 0 & 0 & 0 & 1/16 & 1/4 & 1 \end{pmatrix}.$$
If we consider for instance the situation (second column), where one of the parents has the pair AA and the other has aA, then each of the four cases (we are talking about the pairs of alleles of two randomly chosen offspring)
AA,AA;   AA,aA;   aA,AA; aA,aA
occurs with the same probability. The probability of staying in the second state is thus 1/2 and the probability for transition from the second state to the first is 1/4 and to the fourth state also 1/4.
Again we determine the powers $T^n$ for large $n \in \mathbb{N}$. Considering the form of the first and of the last column we find that 1 is an eigenvalue of the matrix $T$ with corresponding eigenvectors
$$(1,0,0,0,0,0)^T, \quad (0,0,0,0,0,1)^T.$$
By considering only a four-dimensional submatrix of the matrix $T$ (omitting the first and sixth row and column) we find the remaining eigenvalues
$$\frac{1}{2}, \quad \frac{1}{4}, \quad \frac{1 - \sqrt{5}}{4}, \quad \frac{1 + \sqrt{5}}{4}.$$
Recalling the solution of the exercise called the sweet-toothed gambler, we do not have to compute $T^n$. In that exercise we obtained the same eigenvectors corresponding to the eigenvalue 1 and the other eigenvalues also had their absolute value strictly smaller than 1 (the exact values were not used). Thus we obtain an identical conclusion: the process approaches the probabilistic vector
$$(a, 0, 0, 0, 0, 1 - a)^T,$$
where $a \in [0,1]$ is given by the initial state. Because there is a non-zero number only at the first and sixth position of the resulting vector, the states
aA, AA;   aa, AA;   aA, aA;   aa, aA
disappear after many breedings. Notice that the probability that the process ends with AA, AA equals the relative ratio of the appearance of A in the initial state.
Case (d). Let the values $a, b, c \in [0,1]$ give, in this order, the relative ratios of the occurrence of the pairs of alleles AA, aA, aa in the given population. We wish to express the relative ratios of the pairs AA, aA, aa in the offspring of the population. If the choice of pairs for breeding is random, then for a suitably large population it can be expected that the relative ratio of breeding of individuals that both have AA is $a^2$. Similarly, the relative ratio for the pair aA and AA is $2ab$, the relative ratio for aA (both of them) is $b^2$, and so on. The offspring of parents with pairs AA, AA must inherit AA. The probability that the offspring of parents with pairs AA, aA has AA is 1/2, and the probability that the offspring of parents with pairs aA, aA has AA is 1/4. There are no other cases for an offspring with the pair AA (if one of the parents has the pair aa, then the offspring cannot have AA). The relative frequency of AA in the progeny is thus
$$a^2 \cdot 1 + 2ab \cdot \frac{1}{2} + b^2 \cdot \frac{1}{4} = a^2 + ab + \frac{b^2}{4}.$$
Similarly we set the relative frequencies of the pairs aA and aa in the progeny:
$$ab + bc + 2ac + \frac{b^2}{2} \quad \text{and} \quad c^2 + bc + \frac{b^2}{4}.$$
This process can be written as the mapping
$$T : \begin{pmatrix} a \\ b \\ c \end{pmatrix} \mapsto \begin{pmatrix} a^2 + ab + b^2/4 \\ ab + bc + 2ac + b^2/2 \\ c^2 + bc + b^2/4 \end{pmatrix}.$$
We would like to describe the operation $T$ by multiplying the vector by some constant matrix. But that is clearly not possible since the mapping $T$ is not linear. Thus it is not a Markov process and the determination of what happens after a long time cannot be simplified as in the previous cases. But we can determine what happens if we apply the mapping $T$ twice in succession. For the second step,
$$T : \begin{pmatrix} a^2 + ab + b^2/4 \\ ab + bc + 2ac + b^2/2 \\ c^2 + bc + b^2/4 \end{pmatrix} \mapsto \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix},$$
where, writing $a' = a^2 + ab + \frac{b^2}{4}$, $b' = ab + bc + 2ac + \frac{b^2}{2}$ and $c' = c^2 + bc + \frac{b^2}{4}$ for brevity,
$$\begin{aligned} t_1 &= a'^2 + a'b' + \frac{1}{4}\,b'^2, \\ t_2 &= a'b' + b'c' + 2a'c' + \frac{1}{2}\,b'^2, \\ t_3 &= c'^2 + b'c' + \frac{1}{4}\,b'^2. \end{aligned}$$
Using $a + b + c = 1$, it can be shown that
$$t_1 = a^2 + ab + \frac{b^2}{4}, \quad t_2 = ab + bc + 2ac + \frac{b^2}{2}, \quad t_3 = c^2 + bc + \frac{b^2}{4},$$
that is,
$$T : \begin{pmatrix} a^2 + ab + b^2/4 \\ ab + bc + 2ac + b^2/2 \\ c^2 + bc + b^2/4 \end{pmatrix} \mapsto \begin{pmatrix} a^2 + ab + b^2/4 \\ ab + bc + 2ac + b^2/2 \\ c^2 + bc + b^2/4 \end{pmatrix}.$$
We have obtained a surprising result: further application of the transform $T$ does not change the vector obtained in the first step. This means that the relative frequencies of the considered pairs are, after an arbitrarily long time, the same as in the first generation of offspring. For a large population, we have thus shown that the evolution takes place during the first generation, unless there is a mutation or selection. □
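The stability after the first generation is easy to observe on a concrete example. The following sketch applies the non-linear mapping from case (d) twice to an arbitrarily chosen initial distribution; the function name `next_generation` and the sample values are assumptions made only for illustration.

```python
from fractions import Fraction as F

def next_generation(a, b, c):
    """Relative frequencies of AA, aA, aa in the progeny under random mating."""
    return (a * a + a * b + b * b / 4,
            a * b + b * c + 2 * a * c + b * b / 2,
            c * c + b * c + b * b / 4)

v0 = (F(1, 2), F(1, 5), F(3, 10))   # sample initial ratios of AA, aA, aa
v1 = next_generation(*v0)           # first generation of offspring
v2 = next_generation(*v1)           # second generation

print(v1, v1 == v2)
```

The reason is that the frequency $a + b/2$ of the allele A is preserved by the mapping, and the new vector depends only on this frequency.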
3.G.23. Let there be two boxes, which contain between them n white and n black balls. Each box contains n balls. At regular time intervals a ball is taken from each box and moved to the other box. For this Markov process, find its probabilistic transition matrix T.
Solution. This problem is often used in physics as a model for blending two incompressible liquids (already introduced by D. Bernoulli in the year 1769) or analogously, as a model of diffusion of gases.
Let the states 0, 1, ..., $n$ correspond to the number of white balls in the first box. This information already says how many black balls are in the first box (the remaining balls are then in the second box). If, for a certain step, the state changes from $j \in \{1, \dots, n\}$ to $j - 1$, then from the first box a white ball was drawn and from the second a black ball was drawn. This happens with probability
$$\frac{j}{n} \cdot \frac{j}{n} = \frac{j^2}{n^2}.$$
The transition from the state $j \in \{0, \dots, n - 1\}$ to the state $j + 1$ corresponds to drawing a black ball from the first box and a white ball from the second box, with probability
$$\frac{n - j}{n} \cdot \frac{n - j}{n} = \frac{(n - j)^2}{n^2}.$$
The system stays in the state $j \in \{1, \dots, n - 1\}$ if balls of the same colour are drawn from both boxes, which has the probability
$$\frac{j}{n} \cdot \frac{n - j}{n} + \frac{n - j}{n} \cdot \frac{j}{n} = \frac{2j(n - j)}{n^2}.$$
Notice that from the state 0 it is necessary (with probability 1) to go to the state 1 and similarly from the state $n$ with probability one to the state $n - 1$. In summary we obtain the matrix $n^2 T$:
$$n^2 T = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ n^2 & 2 \cdot 1(n-1) & 2^2 & \cdots & 0 & 0 & 0 \\ 0 & (n-1)^2 & 2 \cdot 2(n-2) & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 2 \cdot (n-2)2 & (n-1)^2 & 0 \\ 0 & 0 & 0 & \cdots & 2^2 & 2 \cdot (n-1)1 & n^2 \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \end{pmatrix}$$
for the order of the states 0, 1, ..., $n$.
When using this model in physics we are of course interested in the distribution of balls in boxes after a certain time (the number of drawings). If the initial state is for instance 0, we can use the powers of the matrix T to observe with what probability the number of white balls in the first box is increasing. We can confirm the expected result that the initial distribution of the balls influences their distribution after a certain time in a very negligible way.
If we number the individual balls, then instead of drawing balls we would draw one of the numbers 1, 2, ..., $2n$, and the ball whose number was drawn would move to the other box. We would obtain a Markov process with states 0, 1, ..., $2n$ (the number of balls in the first box), where we are not distinguishing the colour any more. This Markov chain is also very important in physics (P. and T. Ehrenfest introduced it in 1907). It is used as a model for the interchange of heat between two isolated bodies. □
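The transition matrix of the two-box model from 3.G.23 can be generated for any concrete $n$ from the three probabilities $j^2/n^2$, $(n-j)^2/n^2$ and $2j(n-j)/n^2$. The function name `mixing_matrix` in this sketch is hypothetical.

```python
from fractions import Fraction as F

def mixing_matrix(n):
    """Transition matrix T for the states 0, 1, ..., n (white balls in the first box)."""
    size = n + 1
    T = [[F(0)] * size for _ in range(size)]
    for j in range(size):                          # column j: current state j
        if j >= 1:
            T[j - 1][j] = F(j * j, n * n)          # white leaves, black arrives
        if j <= n - 1:
            T[j + 1][j] = F((n - j) ** 2, n * n)   # black leaves, white arrives
        T[j][j] = F(2 * j * (n - j), n * n)        # balls of the same colour drawn
    return T

T = mixing_matrix(4)
column_sums = [sum(T[i][j] for i in range(5)) for j in range(5)]
print(column_sums)
```

Each column sums to one because $j^2 + (n-j)^2 + 2j(n-j) = n^2$.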
3.G.24. Two players, A and B, gamble for money repeatedly in a certain game, which can result only in a victory for one of the players. The winning probability for player A in each individual game is $p \in [0, 1/2)$. Both always bet only €1. Consequently after each game player B gives €1 to player A with probability $p$, and player A gives €1 to player B with probability $1 - p$. They play as long as both have some money. If player A has €$x$ at the start of the game, and player B has €$y$ at the start of the game, determine the probability that player A loses all the money he has.
Solution. This problem is called Ruining of a player. It is a special Markov chain (see also the exercise Sweet-toothed gambler) with many important applications. The probability in question is
$$\frac{1 - \left(\frac{p}{1-p}\right)^{y}}{1 - \left(\frac{p}{1-p}\right)^{x+y}}. \qquad (1)$$
We investigate what this value is for specific choices of $p$, $x$, $y$. If player B wants to be almost sure and requires that the probability that player A loses €1 000 000 to him is at least 0.999, then it suffices for him to have €346 if $p = 0.495$ (or €1 727 if $p = 0.499$). Therefore it is possible in big casinos that "passionate" players play almost fair games. □
3.G.25. In a certain company there exist two competing departments. The management has decided that every week they will measure relative (with respect to the number of employees) incomes attained by these two departments. Two employees will then be moved to the more successful department from the other department. This process will go on for as long as both departments have some employees. You have gained a position in this company and you can choose one of these two departments where you will work. You want to choose the department which will not be cancelled due to the employee movement. What will be your choice, if one of the departments has 40 employees, the other 10 employees and you estimate that the second one will have a greater income than the first one in 54 % of the cases? ○
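The ruin probability from 3.G.24 is easy to evaluate numerically. The sketch below assumes the form $\bigl(1 - r^{y}\bigr) / \bigl(1 - r^{x+y}\bigr)$ with $r = p/(1-p)$ for the probability that player A is ruined; the function names `ruin_probability` and `smallest_bankroll` are hypothetical.

```python
def ruin_probability(p, x, y):
    """Probability that player A (starting with x, winning each game with
    probability p < 1/2) loses everything to player B (starting with y)."""
    r = p / (1 - p)
    return (1 - r ** y) / (1 - r ** (x + y))

def smallest_bankroll(p, x, target):
    """Smallest y for which A is ruined with probability at least `target`."""
    y = 1
    while ruin_probability(p, x, y) < target:
        y += 1
    return y

print(smallest_bankroll(0.495, 10 ** 6, 0.999))
print(smallest_bankroll(0.499, 10 ** 6, 0.999))
print(round(ruin_probability(0.47, 20, 20), 3))
```

The simple linear search is sufficient here because the ruin probability increases with $y$; for very large bankrolls a bisection would be a natural alternative.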
Solutions of the exercises
3.A.2. The daily diet should contain 3.9 kg of hay and 4.3 kg of oats. The costs per foal are then €13.82.
3.B.10.
1_ f3 + V21 21
1
21
3.B.11. $x_n = 2\sqrt{3}\sin(n \cdot (\pi/6)) - 4\cos(n \cdot (\pi/6))$.
3.B.12. $x_n = -3(-1)^n - 2\cos(n \cdot (2\pi/3)) - 2\sqrt{3}\sin(n \cdot (2\pi/3))$.
3.B.13. $x_n = (-1)^n(-2n^2 + 8n - 7)$.
3.E.2. yes, no, no, yes
3.F.1.
• The claim is true. ($B := A^T A$, $b_{ij}$ = ($i$-th row of $A^T$) · ($j$-th column of $A$) = ($j$-th column of $A$) · ($i$-th row of $A^T$) = ($j$-th row of $A^T$) · ($i$-th column of $A$) = $b_{ji}$.)
• The claim is not true. Consider for instance $A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.
3.F.3.
'10 0
A   1 0
V0   1 1
-2   0     0 1
-1
k0   0 0
3.G.5. The Leslie matrix of the given model is (the mortality of the first group is denoted by $a$)
$$\begin{pmatrix} 0 & 2 & 2 \\ a & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$
The stagnation condition corresponds to the fact that the matrix has 1 as an eigenvalue, that is, the polynomial $\lambda^3 - 2a\lambda - 2a$ has 1 as its root, that is, $a = 1/4$.
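3.G.10. Again it is a special case of the Ruining of the player. It suffices to reformulate the statement accordingly. For $p = 0.47$, $y = 20$ and $x = 20$, the result follows from (1):
$$\frac{1 - \left(\frac{0.47}{1 - 0.47}\right)^{20}}{1 - \left(\frac{0.47}{1 - 0.47}\right)^{40}} \approx 0.917.$$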
3.G.14. The matrix has the dominant eigenvalue 1; the corresponding eigenvector is $\left(\frac{6}{5}, 1\right)$. Because the eigenvalue is dominant, the ratio of the viewers stabilises on 6 : 5.
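3.G.17. As in 3.G.16, the game ends after three bets. Thus all the powers of $A$, starting with $A^3$, are identical:
$$A^{100} = A^3 = \begin{pmatrix} 1 & 7/8 & 3/4 & 1/2 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1/8 & 1/4 & 1/2 & 1 \end{pmatrix}.$$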
3.G.25. You can use the result of the exercise called Ruining of the player. According to this exercise, the probability that the first department is cancelled equals
$$\frac{1 - \left(\frac{0.46}{1 - 0.46}\right)^{5}}{1 - \left(\frac{0.46}{1 - 0.46}\right)^{25}} \approx 0.56.$$
It was enough to substitute $p = 1 - 0.54$, $y = 10/2$ and $x = 40/2$ into (1). It is thus better to choose the smaller department.
CHAPTER 4
Analytic geometry
position, incidence, projection - and we return to matrices again.
A. Affine geometry
4.A.1. Find a parametric equation for the line in $\mathbb{R}^3$ given by the equations
$$\begin{aligned} x - 2y + z &= 2, \\ 2x + y - z &= 5. \end{aligned}$$
Solution. It is sufficient to solve the equation system. However, there is an alternative approach. Find a non-zero direction vector orthogonal to the normal vectors $(1, -2, 1)$, $(2, 1, -1)$. The cross product
$$(1, -2, 1) \times (2, 1, -1) = (1, 3, 5)$$
is such a vector. The triple
$$[x, y, z] = [2, -1, -2]$$
satisfies the respective system, so a solution is
$$[2, -1, -2] + t\,(1, 3, 5), \quad t \in \mathbb{R}.$$
□
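The cross product approach from 4.A.1 can be verified by substituting points of the resulting line back into both equations. This is only a short sketch; the helpers `cross`, `on_line` and `satisfies` are named here for illustration.

```python
def cross(u, v):
    """Cross product of two vectors in R^3."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

n1, n2 = (1, -2, 1), (2, 1, -1)   # normal vectors of the two planes
direction = cross(n1, n2)
point = (2, -1, -2)               # one particular solution of the system

def on_line(t):
    """Point of the line for the parameter value t."""
    return tuple(p + t * d for p, d in zip(point, direction))

def satisfies(P):
    """Check both implicit equations of the line."""
    x, y, z = P
    return x - 2 * y + z == 2 and 2 * x + y - z == 5

print(direction, all(satisfies(on_line(t)) for t in range(-3, 4)))
```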
4.A.2. A plane in $\mathbb{R}^4$ is given by its parametric equation
$$\varrho : [0, 3, 2, 5] + t\,(1, 0, 1, 0) + s\,(2, -1, -2, 2), \quad t, s \in \mathbb{R}.$$
Find its implicit equation.
We return to the view on geometry that we had when we studied positions of points in the plane in the 5th part of the first chapter, cf. 1.5.1. We are interested in the properties of objects in the Euclidean space, delimited by points, straight lines, planes etc. The essential point is to clarify how their properties are related to the notion of vectors, and whether they depend on the notion of the length of vectors.
In the next part, we use linear algebra to study objects defined in a nonlinear way. To do this we need more from the theory of matrices. The results are important in discussions of the technique of optimization, or of searching for extrema of functions.
At the end of this chapter we show how the projectivization of affine spaces helps us to obtain a simplification and stability of algorithms typical for computer graphics.
1. Affine and Euclidean geometry
While clarifying the structure of solutions of linear equations in the first part of the previous chapter we find in paragraph 2.3.5 that the set of all solutions of a nonhomogeneous system of linear equations does not form a vector space. However, the solutions always arise in such a way that to one particular solution we can add the vector space of solutions to the corresponding homogeneous system. On the other hand, the difference of any two solutions of the nonhomogeneous system is always a solution of the homogeneous system. This behaviour is similar to the behaviour of linear difference equations. We see this already in paragraph 3.2.6.
4.1.1. Affine spaces. A hint as to how to deal with the theory is given already in the discussion of the geometry of the plane, c.f. paragraph 1.5.3 and further on. There we describe straight lines and points as sets of solutions of systems of linear equations. A line is considered as a one-dimensional sub-space, although its points are described by two coordinates. Parametrically, the line is defined by the sum of a single point (that is, a pair of coordinates) and multiples of a fixed direction vector. We proceed now in the same way for arbitrary dimensions.
Solution. The task is to find a system of equations in 4 variables $x, y, z, u$ (the dimension of the space is 4) which is satisfied by the coordinates of exactly those points which lie in the plane. The desired system must contain $2 = 4 - 2$ linearly independent equations. Solve the problem by elimination of parameters. The points $[x, y, z, u] \in \varrho$ satisfy
$$\begin{aligned} x &= t + 2s, \\ y &= 3 - s, \\ z &= 2 + t - 2s, \\ u &= 5 + 2s, \end{aligned}$$
where $t, s \in \mathbb{R}$. Write the system as a matrix
$$\left(\begin{array}{cc|cccc|c} 1 & 2 & -1 & 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & -1 & 0 & 0 & 3 \\ 1 & -2 & 0 & 0 & -1 & 0 & 2 \\ 0 & 2 & 0 & 0 & 0 & -1 & 5 \end{array}\right).$$
The first two columns are direction vectors of the plane, followed by the negative identity matrix. The last column is the vector of coordinates of the point $[0, 3, 2, 5]$. This is now a system in $t, s, x, y, z, u$. Transform the obtained matrix using elementary row operations in order to have as many zero rows on the left-hand side of the first vertical line as possible. Adding $(-1)$-times the first row and $(-4)$-times the second row to the third row and adding twice the second row to the fourth row gives
$$\left(\begin{array}{cc|cccc|c} 1 & 2 & -1 & 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & -1 & 0 & 0 & 3 \\ 0 & 0 & 1 & 4 & -1 & 0 & -10 \\ 0 & 0 & 0 & -2 & 0 & -1 & 11 \end{array}\right).$$
The bottom two rows, both with only zeros to the left of the first vertical line, imply
$$\begin{aligned} x + 4y - z - 10 &= 0, \\ -2y - u + 11 &= 0. \end{aligned}$$
Note that the original system can be written as
$$\left(\begin{array}{cccc|cc|c} 1 & 0 & 0 & 0 & 1 & 2 & 0 \\ 0 & 1 & 0 & 0 & 0 & -1 & 3 \\ 0 & 0 & 1 & 0 & 1 & -2 & 2 \\ 0 & 0 & 0 & 1 & 0 & 2 & 5 \end{array}\right),$$
where $x, y, z, u$ remain on the left-hand side of the equations. A similar transformation gives
$$\left(\begin{array}{cccc|cc|c} 1 & 0 & 0 & 0 & 1 & 2 & 0 \\ 0 & 1 & 0 & 0 & 0 & -1 & 3 \\ -1 & -4 & 1 & 0 & 0 & 0 & -10 \\ 0 & 2 & 0 & 1 & 0 & 0 & 11 \end{array}\right),$$
from which
$$\begin{aligned} -x - 4y + z &= -10, \\ 2y + u &= 11. \end{aligned}$$
Standard affine space
The standard affine space $\mathcal{A}_n$ is the set of all points in $\mathbb{R}^n = \mathcal{A}_n$ together with an operation which assigns the point
$$A + v = (a_1 + v_1, \dots, a_n + v_n) \in \mathbb{R}^n = \mathcal{A}_n$$
to a point $A = (a_1, \dots, a_n) \in \mathcal{A}_n$ and a vector $v = (v_1, \dots, v_n) \in \mathbb{R}^n = V$.
This operation satisfies the following three properties:
(1) $A + 0 = A$ for all points $A \in \mathcal{A}_n$ and the null vector $0 \in V$,
(2) $A + (v + w) = (A + v) + w$ for all vectors $v, w \in V$ and points $A \in \mathcal{A}_n$,
(3) for every two points $A, B \in \mathcal{A}_n$ there exists exactly one vector $v \in V$ such that $A + v = B$. This vector is denoted by $v = B - A$, sometimes also $\overrightarrow{AB}$.
The underlying vector space $\mathbb{R}^n$ is called the difference space of the standard affine space $\mathcal{A}_n$.
Notice that care is needed about several formal ambiguities. In particular, the symbol "+" is used for two different operations. "+" is used for adding a vector from the difference space to a point in the affine space. "+" is also used for summing vectors in the difference space $V = \mathbb{R}^n$. We do not introduce specific letters for the set of points in the affine space. $\mathcal{A}_n$ denotes both this set of points as well as the whole structure defining the affine space.
Why distinguish between the set of points in the affine space $\mathcal{A}_n$ and its difference space $V$ when both spaces can be viewed as $\mathbb{R}^n$? It is a fundamental formal step to understanding the geometry in $\mathbb{R}^n$: the issue is that geometric objects, namely straight lines, points, planes etc., do not depend directly on the vector space structure of the set $\mathbb{R}^n$. They do not depend at all on the fact that we work with $n$-tuples of scalars. We need to know only what it means to move "straight in a given direction". For instance, we can consider the affine plane as an unbounded board without chosen coordinates, but with the possibility of moving by a given vector. When we switch to such an abstract view, we can discuss the "plane geometry" for two-dimensional subspaces, without the need to work with $k$-tuples of coordinates.
This point of view underlies the following definition:
4.1.2. Definition. The affine space $\mathcal{A}$ with the difference space $V$ is a set of points $\mathcal{P}$, together with the map
$$\mathcal{P} \times V \to \mathcal{P}, \quad (A, v) \mapsto A + v.$$
$V$ is a vector space. The map satisfies the properties (1)-(3) from the definition of the standard affine space.
For a fixed vector $v \in V$, there is a translation $\tau_v : \mathcal{A} \to \mathcal{A}$ as the restricted map
$$\tau_v : \mathcal{P} \cong \mathcal{P} \times \{v\} \to \mathcal{P}, \quad A \mapsto A + v.$$
By the dimension of an affine space $\mathcal{A}$ is meant the dimension of its difference space.
As seen in this exercise, parameter elimination can be long-winded. It is not difficult to make a mistake along the way.
Another solution. All that is needed are two linearly independent vectors perpendicular to $(1, 0, 1, 0)$, $(2, -1, -2, 2)$. If we "guessed" that these vectors could be for example $(0, 2, 0, 1)$, $(-1, 0, 1, 2)$, then putting $x = 0$, $y = 3$, $z = 2$, $u = 5$ into the equations
$$\begin{aligned} 2y + u &= a, \\ -x + z + 2u &= b \end{aligned}$$
yields $a = 11$, $b = 12$. The desired implicit expression is
$$\begin{aligned} 2y + u &= 11, \\ -x + z + 2u &= 12. \end{aligned}$$
Another solution. Start from
$$\begin{aligned} x &= t + 2s, \\ y &= 3 - s, \\ z &= 2 + t - 2s, \\ u &= 5 + 2s. \end{aligned}$$
Eliminate $t$ to get
$$\begin{aligned} z - x &= 2 - 4s, \\ y &= 3 - s, \\ u &= 5 + 2s. \end{aligned}$$
Eliminate $s$ to obtain two equations, namely
$$\begin{aligned} z - x + 2u &= 12, \\ u + 2y &= 11, \end{aligned}$$
which solves the problem.
□
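The "guessed" normal vectors can be checked mechanically. The following Python sketch uses the point and the direction vectors that appear in the solution above; the helper names `dot` and `implicit_description` are illustrative choices, not notation from the text. It verifies that each normal is perpendicular to both directions and then evaluates the right-hand sides at the given point.

```python
# Data taken from the solution above: a point of the plane in R^4
# (coordinates x, y, z, u) and its two direction vectors.
point = (0, 3, 2, 5)
directions = [(1, 0, 1, 0), (2, -1, -2, 2)]
# The "guessed" normal vectors; each one yields one implicit equation.
normals = [(0, 2, 0, 1), (-1, 0, 1, 2)]


def dot(v, w):
    """Standard scalar product of two coordinate tuples."""
    return sum(a * b for a, b in zip(v, w))


def implicit_description(point, directions, normals):
    """Return the right-hand sides b_i of the equations n_i . X = b_i.

    Raises ValueError if some normal is not perpendicular to a direction.
    """
    for n in normals:
        for d in directions:
            if dot(n, d) != 0:
                raise ValueError(f"{n} is not perpendicular to {d}")
    return [dot(n, point) for n in normals]


rhs = implicit_description(point, directions, normals)
print(rhs)
```

Any other point of the plane, for instance the one with parameters $t = s = 1$, gives the same right-hand sides, which is a quick sanity check of the result.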
4.A.3. Find a parametric equation of the plane passing through the points
$$A = [2,1,1], \quad B = [3,4,5], \quad C = [4,-2,3].$$
Hence find a parametric equation of the open half-plane containing the point $C$ and bounded by the line passing through the points $A$, $B$.
Solution. We need one point and two (linearly independent) vectors lying in the plane. It is enough to choose $A$ together with the vectors $B - A = (1,3,4)$ and $C - A = (2,-3,2)$, which are clearly independent. A point $[x,y,z]$ lies in the plane if and only if there exist numbers $t, s \in \mathbb{R}$ so that
$$x = 2 + 1\cdot t + 2\cdot s, \quad y = 1 + 3\cdot t - 3\cdot s, \quad z = 1 + 4\cdot t + 2\cdot s.$$
Consequently a parametric equation is
$$[x,y,z] = [2,1,1] + t\,(1,3,4) + s\,(2,-3,2), \quad t, s \in \mathbb{R}.$$
Setting $s = 0$ gives the line passing through the points $A$ and $B$. Setting $t = 0$ and $s > 0$ defines a ray passing through $C$ with initial point $A$. A particular but arbitrarily chosen
In the sequel, we do not distinguish accurately between denoting the set of points A and the set of vectors V. We talk instead about points and vectors of the affine space A.
It follows immediately from the axioms that for arbitrary points $A, B, C$ in the affine space $\mathcal{A}$
(1) $A - A = 0 \in V$,
(2) $B - A = -(A - B)$,
(3) $(C - B) + (B - A) = C - A$.
Indeed, (1) follows from the fact that $A + 0 = A$ and that such a vector is unique (the first and third defining properties). By adding successively $B - A$ and $A - B$ to $A$, we obtain $A$ again, according to the second defining property. Hence we have added the zero vector, which proves (2). Similarly, (3) follows from the defining property 4.1.1 (2) and the uniqueness.
Notice that the choice of one fixed point $A_0 \in \mathcal{A}$ determines a bijection between $V$ and $\mathcal{A}$. So for a fixed basis $u$ in $V$ there is a unique expression
$$A = A_0 + x_1u_1 + \dots + x_nu_n$$
for every point $A \in \mathcal{A}$. We talk about an affine coordinate system $(A_0; u_1, \dots, u_n)$ given by the origin $A_0$ of the affine coordinate system and the basis $u$ of the corresponding difference space. This is sometimes called an affine frame $(A_0, u)$.
To summarize: Affine coordinates of a point $A$ in the frame $(A_0, u)$ are the coordinates of the vector $A - A_0$ in the basis $u$ of the difference space $V$.
The choice of an affine coordinate system identifies each $n$-dimensional affine space $\mathcal{A}$ with the standard affine space $\mathcal{A}_n$.
4.1.3. Affine subspaces. If we choose only those points in $\mathcal{A}$ which have some chosen coordinates equal to zero (for instance the last one), we obtain again a set which behaves as an affine space. This is the spirit of the following definition of an affine subspace.
Subspaces of an affine space
Definition. The nonempty subset $\mathcal{Q} \subset \mathcal{A}$ of an affine space $\mathcal{A}$ with a difference space $V$ is called an affine subspace in $\mathcal{A}$ if the subset $W = \{B - A;\ A, B \in \mathcal{Q}\} \subset V$ is a vector subspace and $A + v \in \mathcal{Q}$ for any $A \in \mathcal{Q}$, $v \in W$.
It is important to include both of the conditions in the definition, since there are examples of sets which satisfy the first condition but not the second. One such set consists of a straight line in the plane with one point removed.
For an arbitrary set of points $M \subset \mathcal{A}$ in an affine space with a difference space $V$, we define the vector space
$$Z(M) = \langle \{B - A;\ B, A \in M\} \rangle \subset V$$
of all vectors generated by the differences of points in $M$.
In particular, $V = Z(\mathcal{A})$. Every affine subspace $\mathcal{Q} \subset \mathcal{A}$ itself satisfies the axioms for an affine space with the difference space $Z(\mathcal{Q})$.
$t \in \mathbb{R}$ and variable $s > 0$ gives a ray initiated on the border line, going through the half-plane in which the point $C$ lies. That means that the desired open half-plane can be expressed parametrically as
$$[x,y,z] = [2,1,1] + t\,(1,3,4) + s\,(2,-3,2), \quad t \in \mathbb{R},\ s > 0.$$
□
4.A.4. Determine the relative position of the lines
$$p: [1,0,3] + t\,(2,-1,-3),\ t \in \mathbb{R}, \qquad q: [1,1,3] + s\,(1,-1,-2),\ s \in \mathbb{R}.$$
Solution. Search for common points of the given lines (the intersection of the subspaces). We have the system
$$\begin{aligned} 1 + 2t &= 1 + s, \\ 0 - t &= 1 - s, \\ 3 - 3t &= 3 - 2s. \end{aligned}$$
From the first two equations, $t = 1$, $s = 2$. This does not satisfy the third equation, so the system has no solution and the lines have no common point. The direction vector $(2,-1,-3)$ of the line $p$ is not a multiple of the direction vector $(1,-1,-2)$ of the line $q$, hence the lines are not parallel. Therefore the lines are skew. □
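The same classification can be sketched in Python. This is an alternative route to the one in the solution: instead of solving the system, it uses the cross product of the direction vectors and a mixed product. The function name `relative_position` and the returned labels are illustrative choices.

```python
def sub(v, w):
    return tuple(a - b for a, b in zip(v, w))


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def cross(v, w):
    return (v[1] * w[2] - v[2] * w[1],
            v[2] * w[0] - v[0] * w[2],
            v[0] * w[1] - v[1] * w[0])


def relative_position(p0, u, q0, v):
    """Classify two lines p0 + t*u and q0 + s*v in R^3 (integer or exact data)."""
    n = cross(u, v)
    w = sub(q0, p0)
    if n == (0, 0, 0):
        # Parallel directions: the lines coincide iff q0 - p0 is a multiple of u.
        return "identical" if cross(w, u) == (0, 0, 0) else "parallel"
    # Non-parallel lines meet iff q0 - p0 lies in the span of u and v,
    # i.e. iff it is orthogonal to the normal n = u x v.
    return "intersecting" if dot(w, n) == 0 else "skew"


print(relative_position((1, 0, 3), (2, -1, -3), (1, 1, 3), (1, -1, -2)))
```

The comparison with zero is exact only for integers or rational numbers; with floating point data a tolerance would be needed.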
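4.A.5. Find all numbers $a \in \mathbb{R}$ so that the lines
$$p: [4,-4,8] + t\,(2,1,-4),\ t \in \mathbb{R}, \qquad q: [a,6,-5] + s\,(1,-3,3),\ s \in \mathbb{R}$$
intersect.
Solution. The lines intersect if and only if the system
$$\begin{aligned} 4 + 2t &= a + s, \\ -4 + t &= 6 - 3s, \\ 8 - 4t &= -5 + 3s \end{aligned}$$
has a solution. Express the system as a matrix (the first column corresponding to $t$, the second to $s$), and solve
$$\left(\begin{array}{cc|c} 2 & -1 & a-4 \\ 1 & 3 & 10 \\ -4 & -3 & -13 \end{array}\right) \sim \left(\begin{array}{cc|c} 1 & 3 & 10 \\ 0 & -7 & a-24 \\ 0 & 9 & 27 \end{array}\right).$$
The system has a solution if and only if the second row is a multiple of the third row. This property is satisfied only for $a = 3$. The point of intersection of the lines is $[6,-3,4]$. □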
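A short Python sketch of the same computation, under the observation that the second and third equations do not contain $a$: they determine $t$ and $s$ (here by Cramer's rule), and the first equation then fixes $a$. The variable names are illustrative.

```python
from fractions import Fraction

# Second and third coordinate equations, rewritten as
#    t + 3s =  10
#  -4t - 3s = -13
a11, a12, b1 = 1, 3, 10
a21, a22, b2 = -4, -3, -13

det = a11 * a22 - a12 * a21            # nonzero, so t and s are unique
t = Fraction(b1 * a22 - a12 * b2, det)  # Cramer's rule
s = Fraction(a11 * b2 - b1 * a21, det)

# The first equation 4 + 2t = a + s now determines the parameter a.
a = 4 + 2 * t - s

# The common point, computed from the line p.
intersection = (4 + 2 * t, -4 + t, 8 - 4 * t)
print(t, s, a, intersection)
```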
The intersection of any set of affine subspaces is either an affine subspace or is the empty set. This follows directly from the definitions.
The affine subspace $\langle M \rangle$ in $\mathcal{A}$ generated by a nonempty set $M \subset \mathcal{A}$ is the intersection of all affine subspaces which contain all points of $M$.
Affine hull and parametric description of a subspace
Affine subspaces can be described by their difference spaces after choosing a point $A_0 \in M$ in a generating set $M$. Indeed, $\langle M \rangle = \{A_0 + v;\ v \in Z(M) \subset Z(\mathcal{A})\}$. To generate the affine subspace, take the vector subspace $Z(M)$ in the difference space generated by all differences of points in $M$, and add this vector space to an arbitrary point in $M$. We talk about the affine hull of the set of points $M$ in $\mathcal{A}$.
On the other hand, whenever a subspace $U$ in the difference space $Z(\mathcal{A})$ and a fixed point $A \in \mathcal{A}$ are chosen, the subset $A + U$, created by all possible sums of $A$ and all vectors in $U$, is an affine subspace. This approach leads to the notion of parametrization of subspaces:
Let $\mathcal{Q} = A + Z(\mathcal{Q})$ be an affine subspace in $\mathcal{A}_n$. Let $(u_1, \dots, u_k)$ be a basis of $Z(\mathcal{Q}) \subset \mathbb{R}^n$. Then the expression of the subspace
$$\mathcal{Q} = \{A + t_1u_1 + \dots + t_ku_k;\ t_1, \dots, t_k \in \mathbb{R}\}$$
is called the parametric description of the subspace Q.
There is another way of prescribing affine spaces: If we choose affine coordinates, then the difference space may be described by a homogeneous system of linear equations in these coordinates. By inserting the coordinates of one point of the subspace Q into the system of equations, we obtain the right-hand side of the non-homogeneous system with the same matrix. The subspace Q is exactly the set of solutions of this system. The description of the subspace Q by a system of equations in given coordinates is called an implicit description of the subspace Q.
The following proposition says that we can prescribe all affine subspaces in this way. It shows the geometric nature of the solutions of systems of linear equations.
4.1.4. Theorem. Let (A0;u) be an affine coordinate system in an n-dimensional affine space A. In these coordinates, affine subspaces of dimension k in A are exactly the sets of solutions of solvable systems of n — k linearly independent equations in n variables.
Proof. Consider an arbitrary solvable system of $n - k$ linearly independent equations $\alpha_i(x) = b_i$, where $b_i \in \mathbb{R}$, $i = 1, \dots, n - k$. Suppose $A = (a_1, \dots, a_n)^T \in \mathbb{R}^n$ is a fixed solution of this (non-homogeneous) system. Suppose also that $U \subset \mathbb{R}^n$ is the vector space of all solutions of the homogenized system $\alpha_i(x) = 0$. Then the dimension of $U$ is $k$. The set of all solutions of the given system is of the form
$$\{B;\ B = A + (y_1, \dots, y_n)^T,\ y = (y_1, \dots, y_n)^T \in U\} \subset \mathbb{R}^n,$$
cf. 2.3.5. So the corresponding affine subspace is described parametrically in the initial coordinates $(A_0; u)$.
4.A.6. In $\mathbb{R}^3$, determine the relative position of the line $p$ defined implicitly by
$$x + y - z = 4, \qquad x - 2y + z = -3$$
and the plane $\varrho: y = 2x - 1$.
Solution. A normal vector to the plane $\varrho$ is $(2,-1,0)$ (consider $\varrho: 2x - y + 0z = 1$). Since
$$(1,1,-1) + (1,-2,1) = (2,-1,0),$$
the normal vector to the plane $\varrho$ is a linear combination of the normal vectors of the two planes defining $p$. Hence the direction vector of the line $p$ lies in the difference space of the plane $\varrho$, that is, the line is parallel to the plane. It remains to discover whether or not they intersect. The system of equations
$$x + y - z = 4, \qquad x - 2y + z = -3, \qquad 2x - y = 1$$
has infinitely many solutions, because the first two equations add to give the third one. So the line $p$ lies in the plane $\varrho$. □
The following exercise is a typical exercise on intersections of subspaces. The reader should be able to solve this. Otherwise we recommend not continuing with this book.
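4.A.7. Find the intersection of the subspaces $Q_1$ and $Q_2$, where
$$Q_1: [4,-5,1,-2] + t_1(3,5,4,2) + t_2(2,4,5,1) + t_3(0,3,1,2),$$
$$Q_2: [4,4,4,4] + s_1(0,-6,-2,-4) + s_2(-1,-5,-3,-3),$$
for $t_1, t_2, t_3, s_1, s_2 \in \mathbb{R}$.
Solution. The point $X = [x_1, x_2, x_3, x_4] \in \mathbb{R}^4$ lies in $Q_1$ if and only if
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 4 \\ -5 \\ 1 \\ -2 \end{pmatrix} + t_1 \begin{pmatrix} 3 \\ 5 \\ 4 \\ 2 \end{pmatrix} + t_2 \begin{pmatrix} 2 \\ 4 \\ 5 \\ 1 \end{pmatrix} + t_3 \begin{pmatrix} 0 \\ 3 \\ 1 \\ 2 \end{pmatrix}$$
for some numbers $t_1, t_2, t_3 \in \mathbb{R}$. The point $X = [x_1, x_2, x_3, x_4] \in \mathbb{R}^4$ lies in $Q_2$ if and only if
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \\ 4 \\ 4 \end{pmatrix} + s_1 \begin{pmatrix} 0 \\ -6 \\ -2 \\ -4 \end{pmatrix} + s_2 \begin{pmatrix} -1 \\ -5 \\ -3 \\ -3 \end{pmatrix}$$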
Conversely, consider an arbitrary affine subspace $\mathcal{Q} \subset \mathcal{A}_n$. Choose a point $B$ therein, and consider this point to be the origin of an affine coordinate system $(B; v)$ for the affine space $\mathcal{A}$. Since $\mathcal{Q} = B + Z(\mathcal{Q})$, it is necessary to describe the difference space of the subspace $\mathcal{Q}$ as a subspace of solutions of a homogeneous system of linear equations. Therefore, choose a basis $v$ of $Z(\mathcal{A})$ such that the first $k$ vectors form a basis of $Z(\mathcal{Q})$. In these coordinates, the vectors $v \in Z(\mathcal{Q})$ are given by equations
$$\alpha_j(v) = 0, \quad j = k + 1, \dots, n.$$
The $\alpha_i$ are linear forms from the dual basis to $v$. They are the functions which assign to a vector the corresponding coordinates in the basis $v$.
Hence the vector subspace $Z(\mathcal{Q})$ of dimension $k$ in the $n$-dimensional space $\mathbb{R}^n$ is given as the solution set of a homogeneous system of $n - k$ independent equations. The description of the chosen affine subspace in the newly chosen coordinate system $(B; v)$ is therefore given by a system of homogeneous linear equations.
It remains to consider the consequences of the transition from the former coordinate system $(A_0; u)$ to the new adapted system $(B; v)$. It follows from a general consideration about transformations of coordinates in the following paragraph that the final description of the subspace is again a system of linear equations. This time it is non-homogeneous in general. □
4.1.5. Coordinate transformations. Any two arbitrarily chosen affine coordinate systems $(A_0, u)$, $(B_0, v)$ differ in the bases of the difference spaces, and in that the origin of the latter one is translated by the vector $(B_0 - A_0)$. Hence the equations for the corresponding coordinate transformations can be read off from the rule for a transformation of a point $X \in \mathcal{A}$:
$$X = B_0 + x'_1v_1 + \dots + x'_nv_n = B_0 + (A_0 - B_0) + x_1u_1 + \dots + x_nu_n.$$
Let $y = (y_1, \dots, y_n)^T$ denote the column of coordinates of the vector $(A_0 - B_0)$ in the basis $v$. Let $M = (a_{ij})$ be the matrix expressing the basis $u$ in terms of the basis $v$. Then
$$\begin{aligned} x'_1 &= y_1 + a_{11}x_1 + \dots + a_{1n}x_n, \\ &\ \,\vdots \\ x'_n &= y_n + a_{n1}x_1 + \dots + a_{nn}x_n. \end{aligned}$$
In matrix notation,
$$x' = y + M \cdot x.$$
For example, consider the influence of such a change of coordinates on subsets described by systems of linear equations. Suppose the system in coordinates $(A_0; u)$ has the form
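for some $s_1, s_2 \in \mathbb{R}$. Hence if $X$ lies in $Q_1 \cap Q_2$ then the equation
$$\begin{pmatrix} 4 \\ -5 \\ 1 \\ -2 \end{pmatrix} + t_1 \begin{pmatrix} 3 \\ 5 \\ 4 \\ 2 \end{pmatrix} + t_2 \begin{pmatrix} 2 \\ 4 \\ 5 \\ 1 \end{pmatrix} + t_3 \begin{pmatrix} 0 \\ 3 \\ 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \\ 4 \\ 4 \end{pmatrix} + s_1 \begin{pmatrix} 0 \\ -6 \\ -2 \\ -4 \end{pmatrix} + s_2 \begin{pmatrix} -1 \\ -5 \\ -3 \\ -3 \end{pmatrix}$$
has a solution for $t_1, t_2, t_3, s_1, s_2$.
Move the vectors corresponding to $s_1$ and $s_2$ to the left-hand side and the constant vectors to the right-hand side. Write the equations in matrix form and reduce to echelon form. There follows
$$\left(\begin{array}{ccccc|c} 3 & 2 & 0 & 0 & 1 & 0 \\ 5 & 4 & 3 & 6 & 5 & 9 \\ 4 & 5 & 1 & 2 & 3 & 3 \\ 2 & 1 & 2 & 4 & 3 & 6 \end{array}\right) \sim \left(\begin{array}{ccccc|c} 3 & 2 & 0 & 0 & 1 & 0 \\ 0 & 2 & 9 & 18 & 10 & 27 \\ 0 & 7 & 3 & 6 & 5 & 9 \\ 0 & -1 & 6 & 12 & 7 & 18 \end{array}\right) \sim \dots \sim \left(\begin{array}{ccccc|c} 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 3 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{array}\right).$$
So $t_1 = t_2 = s_2 = 0$, and for $s_1 = t \in \mathbb{R}$ we have $t_3 = 3 - 2t$. Note that for the determination of $Q_1 \cap Q_2$, it is sufficient to know either $t_1, t_2, t_3$, or $s_1, s_2$. So
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \\ 4 \\ 4 \end{pmatrix} + s_1 \begin{pmatrix} 0 \\ -6 \\ -2 \\ -4 \end{pmatrix} + s_2 \begin{pmatrix} -1 \\ -5 \\ -3 \\ -3 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \\ 4 \\ 4 \end{pmatrix} + t \begin{pmatrix} 0 \\ -6 \\ -2 \\ -4 \end{pmatrix}, \quad t \in \mathbb{R}.$$
This can be checked using $t_1 = t_2 = 0$ and $t_3 = 3 - 2t$ in the description of $Q_1$. The solution is a line lying in both subspaces. It is $Q_1 \cap Q_2$. □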
4.A.8. Determine whether or not the points $[0,2,1]$, $[-1,2,0]$, $[-2,5,2]$ and $[0,5,4]$ in $\mathbb{R}^3$ all lie in the same plane.
Solution. Consider the vectors $[0,2,1] - [-1,2,0] = (1,0,1)$, $[0,2,1] - [-2,5,2] = (2,-3,-1)$ and $[0,2,1] - [0,5,4] = (0,-3,-3)$. They are linearly dependent since the matrix
$$\begin{pmatrix} 1 & 0 & 1 \\ 2 & -3 & -1 \\ 0 & -3 & -3 \end{pmatrix}$$
has rank 2 (the third row equals the second row minus twice the first one). Hence, the given points lie in a plane. □
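For four points in $\mathbb{R}^3$ the rank condition can be replaced by a single determinant: the points are coplanar exactly when the three difference vectors have zero determinant. A minimal Python sketch, with the illustrative helper names `det3` and `coplanar`:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def coplanar(p0, p1, p2, p3):
    """True if the four points of R^3 lie in one plane (exact for integers)."""
    rows = [tuple(x - y for x, y in zip(p0, p)) for p in (p1, p2, p3)]
    return det3(rows) == 0


print(coplanar((0, 2, 1), (-1, 2, 0), (-2, 5, 2), (0, 5, 4)))
```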
4.A.9. Into how many parts can three planes slice the space (R3)? Give an example of planes in a suitable position for every case.
$$S \cdot x = b,$$
where $S$ is the matrix of the system. Then
$$S \cdot x = S \cdot M^{-1} \cdot (y + M \cdot x) - S \cdot M^{-1} \cdot y = b.$$
Thus in the new coordinates $(B_0; v)$ considered above, the system has the form
$$(S \cdot M^{-1}) \cdot x' = b' = b + (S \cdot M^{-1}) \cdot y.$$
Therefore, if a subset is described by a system of linear equations in one affine frame, then it is so described in all other affine frames. This completes the proof of the previous proposition.
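The transformation rule can be tried out on a small example. In the following Python sketch the matrices $M$, $S$ and the vectors $y$, $b$, $x$ are made-up illustrative data for the plane (they do not come from the text), and the helpers `matvec`, `matmul`, `inverse2` are ad hoc. The sketch checks that a solution $x$ of $S \cdot x = b$ is mapped by $x' = y + M \cdot x$ to a solution of $(S \cdot M^{-1}) \cdot x' = b + (S \cdot M^{-1}) \cdot y$.

```python
from fractions import Fraction


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse2(m):
    """Inverse of a regular 2x2 matrix, computed exactly."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]


# Illustrative data (assumptions): change of basis M, shift of the origin y,
# and one linear equation S.x = b describing a line in the old coordinates.
M = [[2, 1], [1, 1]]
y = [1, -2]
S = [[1, 2]]
b = [3]

x = [1, 1]                                   # a solution of S.x = b
x_new = [yi + mi for yi, mi in zip(y, matvec(M, x))]

S_new = matmul(S, inverse2(M))               # matrix in the new coordinates
b_new = [bi + ci for bi, ci in zip(b, matvec(S_new, y))]

print(matvec(S, x) == b, matvec(S_new, x_new) == b_new)
```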
4.1.6. Examples of affine subspaces. (1) The one-dimensional (standard) affine space is the set of all points of a real straight line $\mathcal{A}_1$. Its difference space is the one-dimensional vector space $\mathbb{R}$. The supporting set is also $\mathbb{R}$. The affine coordinates are obtained by the choice of an origin and a scale (i.e. a basis in the vector space $\mathbb{R}$). All proper affine subspaces are 0-dimensional. They are formed by the individual points of the real straight line $\mathbb{R}$.
(2) The two-dimensional (standard) affine space is the set of all points in the plane $\mathcal{A}_2$ with the difference space $\mathbb{R}^2$. The supporting set is $\mathbb{R}^2$. The affine coordinates are obtained by a choice of an origin and two linearly independent vectors (directions and scales). The proper subspaces are then all points and straight lines in the plane (0-dimensional and 1-dimensional). The lines are prescribed by the choice of a point and one vector from the corresponding difference space. The vector is a generator of the direction as in the parametric definition of the straight line.
(3) The three-dimensional (standard) affine space is the set of all points in the space $\mathcal{A}_3$ with the difference space $\mathbb{R}^3$. The affine coordinates are obtained by the choice of an origin and three linearly independent vectors (directions and scales). The proper affine subspaces are then all points, straight lines and planes (0-dimensional, 1-dimensional and 2-dimensional).
(4) Suppose there is given a nonzero vector of coefficients $(a_1, \dots, a_n)$ and a scalar $b \in \mathbb{R}$. Consider the set of all solutions of one linear equation $a \cdot x = b$ for the unknown point $[x_1, \dots, x_n] \in \mathcal{A}_n$. This is an affine subspace of dimension $n - 1$. We say that the subspace is of codimension 1, and it is called a hyperplane in $\mathcal{A}_n$.
4.1.7. Affine combinations of points. We introduce an analogue of the linear combination of vectors. Let $A_0, \dots, A_k$ be points in the affine space $\mathcal{A}$. Their affine hull $\langle \{A_0, \dots, A_k\} \rangle$ can be written as
$$\{A_0 + t_1(A_1 - A_0) + \dots + t_k(A_k - A_0);\ t_1, \dots, t_k \in \mathbb{R}\}.$$
4.A.10. Determine whether or not the point $[2,1,0]$ lies within the convex hull of the points $[0,2,1]$, $[1,0,1]$, $[3,-2,-1]$, $[-1,0,1]$.
Solution. The point $[2,1,0]$ lies in the convex hull (see 4.1.9) if and only if
$$[2,1,0] = t_1[0,2,1] + t_2[1,0,1] + t_3[3,-2,-1] + t_4[-1,0,1]$$
has a solution with $t_1, t_2, t_3, t_4$ all non-negative and $t_1 + t_2 + t_3 + t_4 = 1$. Equivalently, $[2,1,0]$ lies in the convex hull if and only if
$$[2,1,0,1] = t_1[0,2,1,1] + t_2[1,0,1,1] + t_3[3,-2,-1,1] + t_4[-1,0,1,1]$$
has a solution with $t_1, t_2, t_3, t_4$ all non-negative. Solving these four equations gives $(t_1, t_2, t_3, t_4) = (1, 0, 1/2, -1/2)$, so the given point does not lie in the convex hull. □
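The four equations can be solved exactly in Python. The sketch below is a plain Gauss-Jordan elimination over rational numbers; the names `solve`, `barycentric` and `in_convex_hull` are illustrative. It assumes exactly $n + 1$ points in general position in $\mathbb{R}^n$, so that the coefficients are unique. For more points a linear programming routine would be needed instead.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a.x = b exactly by Gauss-Jordan elimination.

    Assumes a regular matrix; a singular one raises StopIteration.
    """
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                m[r] = [x - m[r][col] * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def barycentric(points, x):
    """Coefficients t_i with sum 1 such that x = sum t_i * points[i]."""
    dim = len(x)
    rows = [[p[i] for p in points] for i in range(dim)]
    rows.append([1] * len(points))       # the condition t_1 + ... = 1
    return solve(rows, list(x) + [1])


def in_convex_hull(points, x):
    return all(t >= 0 for t in barycentric(points, x))


points = [(0, 2, 1), (1, 0, 1), (3, -2, -1), (-1, 0, 1)]
coeffs = barycentric(points, (2, 1, 0))
inside = in_convex_hull(points, (2, 1, 0))
print(coeffs, inside)
```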
4.A.11. In $\mathbb{R}^3$, a tetrahedron has vertices $ABCD$, where
$$A = [4,0,2], \quad B = [-2,-3,1], \quad C = [1,-1,-3], \quad D = [2,4,-2].$$
a) Determine its volume.
b) Decide whether or not the point $X = [0,-3,0]$ lies inside the tetrahedron.
Solution. a) The volume of the tetrahedron is one sixth of the volume of the parallelepiped whose three edges from the point $A$ are $B - A = (-6,-3,-1)$, $C - A = (-3,-1,-5)$ and $D - A = (-2,4,-4)$. The volume of the parallelepiped is given by the absolute value of the determinant
$$\begin{vmatrix} -6 & -3 & -1 \\ -3 & -1 & -5 \\ -2 & 4 & -4 \end{vmatrix} = -124.$$
Thus, the volume of the tetrahedron is $124/6 = 62/3$.
b) Write $X$ as an affine combination of the vertices, by solving the system of four linear equations in four unknowns $a, b, c, d$ given by the equality $X = aA + bB + cC + dD$ together with the condition $a + b + c + d = 1$. The solution is $X = \frac{1}{4}A + \frac{1}{2}B + \frac{1}{2}C - \frac{1}{4}D$. Since the coefficient of $D$ is negative, $X$ does not lie in the convex hull of the points $A$, $B$, $C$ and $D$. Hence the given point does not lie inside the tetrahedron. □
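Both parts can be reproduced with $3 \times 3$ determinants only. The Python sketch below computes the volume from the determinant of the edge vectors and obtains the coefficients of the affine combination by Cramer's rule applied to $X - A = b(B - A) + c(C - A) + d(D - A)$. This is a different but equivalent route to the four-equation system, and `det3` is an illustrative helper name.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


A, B, C, D = (4, 0, 2), (-2, -3, 1), (1, -1, -3), (2, 4, -2)
X = (0, -3, 0)

# Edge vectors from A and the vector from A to X.
e1, e2, e3 = ([q - p for p, q in zip(A, V)] for V in (B, C, D))
w = [q - p for p, q in zip(A, X)]

full = det3([e1, e2, e3])
volume = Fraction(abs(full), 6)

# Cramer's rule: replace one edge vector by w at a time.
b = Fraction(det3([w, e2, e3]), full)
c = Fraction(det3([e1, w, e3]), full)
d = Fraction(det3([e1, e2, w]), full)
a = 1 - b - c - d

inside = all(t > 0 for t in (a, b, c, d))
print(volume, (a, b, c, d), inside)
```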
4.A.12.   Affine transformation of point coordinates
The point $X$ has coordinates $[2,2,3]$ in the affine basis $\{[1,2,3], (1,1,1), (1,-1,2), (2,1,1)\}$ (in $\mathbb{R}^3$). Determine its coordinates in the standard basis, i.e. in the basis $\{[0,0,0], (1,0,0), (0,1,0), (0,0,1)\}$.
In any affine coordinates the same set can be written as
$$\langle A_0, \dots, A_k \rangle = \Big\{t_0A_0 + t_1A_1 + \dots + t_kA_k;\ t_i \in \mathbb{R},\ \sum_{i=0}^{k} t_i = 1\Big\}.$$
Affine combinations of points
In general, by the formula $t_0A_0 + t_1A_1 + \dots + t_kA_k$ with coefficients satisfying $\sum_{i=0}^{k} t_i = 1$ is meant the point $A_0 + \sum_{i=1}^{k} t_i(A_i - A_0)$. Such points are called the affine combinations of the points $A_0, \dots, A_k$.
The points $A_0, \dots, A_k$ are in general position if they generate a $k$-dimensional affine subspace. This happens if and only if for each $A_i$, the vectors which arise as differences of this point $A_i$ and all the other points $A_j$ are linearly independent. Observe that an assignment of a sequence of $(\dim \mathcal{A}) + 1$ points in general position is equivalent to the definition of an affine frame with the origin in the first of them.
4.1.8. Simplexes. For points in an affine space, the affine combination is a similar construction to the linear combination for vectors in a vector space. Indeed, the affine subspace generated by points $A_0, \dots, A_k$ equals the set of all affine combinations of its generators. The notion "to lie on the line between two points" can be generalized. In the two-dimensional case, imagine the interior of a triangle. In general proceed as follows:
$k$-dimensional simplexes
Let $A_0, \dots, A_k$ be $k + 1$ points in general position in an affine space $\mathcal{A}$. The set $\Delta = \Delta(A_0, \dots, A_k)$ is defined as the set of all affine combinations of the points $A_i$ with nonnegative coefficients only. This is
$$\Delta = \Big\{t_0A_0 + t_1A_1 + \dots + t_kA_k;\ t_i \in [0,1] \subset \mathbb{R},\ \sum_{i=0}^{k} t_i = 1\Big\},$$
called a $k$-dimensional simplex generated by the points $A_i$.
A one-dimensional simplex is a line segment, a two-dimensional simplex is a triangle, while a zero-dimensional simplex is a point.
Notice that each $k$-dimensional simplex has exactly $k + 1$ faces defined by equations $t_i = 0$, $i = 0, \dots, k$. The faces are also simplexes, and their dimension is $k - 1$. Together they form the boundary of the simplex. For instance, the boundary of a triangle is formed by the three edges, and the boundary of each edge is formed by the two vertices.
The description of a subspace as a set of affine combinations of points in general position is equivalent to the parametric description. We work similarly with the parametric description of simplexes.
4.1.9. Convex sets. The subset $M$ of an affine space is called convex if and only if for any two points $A, B \in M$ the set contains the line segment $\Delta(A, B)$. Directly from the definition,
f(x1,x2) =
Solution. By definition, the coordinates $[2,2,3]$ in the given basis describe the point
$$[1,2,3] + 2\cdot(1,1,1) + 2\cdot(1,-1,2) + 3\cdot(2,1,1) = [11,5,12],$$
which gives the coordinates of $X$ in the standard basis. □
4.A.13. Affine transformation of mapping. Find the affine mapping / in the coordinate system with the basis u = {(1,1), (—1,1)} and origin [2,0], defined as
,0   l) (z2) + (l in the standard basis in R2.
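Solution. The change of basis matrix from the basis $u$ to the standard basis is
$$\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}.$$
The mapping in the basis $([2,0], u)$ is obtained by first transforming the coordinates in the basis $([2,0], u)$ to the standard basis, i.e. to the basis $([0,0], (1,0), (0,1))$. Then the mapping $f$ is applied in the standard basis. Finally the result is transformed back to the coordinates in the basis $([2,0], u)$. The transformation equations for changing the coordinates $y_1, y_2$ in the basis $([2,0], u)$ to the coordinates $x_1, x_2$ in the standard basis are
$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} + \begin{pmatrix} 2 \\ 0 \end{pmatrix}.$$
Hereby
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}^{-1} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} -1 \\ 1 \end{pmatrix}.$$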
Hence the desired mapping is
f(yi,V2)
1
2i
2
2
-1
+
xl x2
+
+
+
+
□
4.A.14. Let there be a standard coordinate system in R3 space. Agent K lives at the point S with coordinates [0,1,2]. The headquarters gave him a coordinate system with origin S and basis {(1,1,0), (-1,0,1), (0,1,2)}. Agent Bond lives at the point D with coordinates [1,1,1] and uses a coordinate
each convex set with $k + 1$ points in general position contains also the entire simplex defined by these points. Examples of convex sets are
(1) the empty set,
(2) affine subspaces,
(3) line segments, rays $p = \{P + t \cdot v;\ t \ge 0\}$,
(4) more generally $k$-dimensional half-spaces
$$\alpha = \{P + t_1 \cdot v_1 + \dots + t_k \cdot v_k;\ t_1, \dots, t_k \in \mathbb{R},\ t_k \ge 0\},$$
(5) angles in two-dimensional subspaces
$$\beta = \{P + t_1 \cdot v_1 + t_2 \cdot v_2;\ t_1 \ge 0,\ t_2 \ge 0\}.$$
The intersection of an arbitrary system of convex sets is a convex set. The intersection of all convex sets containing a given set $M$ is called the convex hull $\mathcal{K}(M)$ of the set $M$.
Theorem. The convex hull of any subset $M \subset \mathcal{A}$ is
$$\mathcal{K}(M) = \Big\{t_1A_1 + \dots + t_sA_s;\ \sum_{i=1}^{s} t_i = 1,\ t_i \ge 0,\ A_i \in M\Big\}.$$
Proof. Let $S$ denote the set of all affine combinations on the right-hand side of the equation. To check that $S$ is convex, choose two sets of parameters $t_i$, $i = 1, \dots, s_1$, and $t'_j$, $j = 1, \dots, s_2$, with the desired properties. Without loss of generality, assume that $s_1 = s_2 = s$ and that the same points from $M$ appear in both combinations (otherwise simply add summands with zero coefficients). Consider an arbitrary point on the line segment given by the vertices defined by the two combinations:
$$\varepsilon(t_1A_1 + \dots + t_sA_s) + (1 - \varepsilon)(t'_1A_1 + \dots + t'_sA_s), \quad 0 \le \varepsilon \le 1.$$
Obviously any point of this line segment lies in $S$.
It remains to show that the convex hull of the points $A_1, \dots, A_s$ cannot be smaller than $S$. The points $A_i$ themselves correspond to the choice of parameters $t_j = 0$ for all $j \ne i$ and $t_i = 1$. Assume that the claim holds for all sets with at most $s - 1$ points. Then the convex hull of the points $A_1, \dots, A_{s-1}$ is (according to the assumption) formed exactly by the combinations from the right side of the equation to be proved, where $t_s = 0$. Now consider a point $A = t_1A_1 + \dots + t_sA_s \in S$, $t_s < 1$, and the affine combinations
$$\varepsilon(t_1A_1 + \dots + t_{s-1}A_{s-1}) + (1 - \varepsilon(1 - t_s))A_s, \quad 0 \le \varepsilon \le \frac{1}{1 - t_s}.$$
It is a line segment with vertices given by parameters $\varepsilon = 0$ (the point $A_s$) and $\varepsilon = 1/(1 - t_s)$ (a point in the convex hull of $A_1, \dots, A_{s-1}$). The point $A$ is the point of this line segment with the parameter $\varepsilon = 1$, and thus $A$ lies in the convex hull of $A_1, \dots, A_s$. □
The convex hulls of finite sets are called convex polyhedrons.
We have a $k$-dimensional simplex if and only if the vertices $A_0, \dots, A_k$ defining the convex polyhedron are in general position. In the case of a simplex, the expression of any of its points as an affine combination of the defining vertices is unique.
system with basis {(0, 0,1), (-1,1,2), (1,0,1)}. Agent K has set an appointment with agent Bond in the old brickfield which is (according to K's coordinate system) at the point [1,1,0]. To where should Bond go (regarding his coordinate system)?
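Solution. The change of basis matrix from agent K's basis to Bond's basis (with the same origins) is
$$T = \begin{pmatrix} -4 & 2 & -1 \\ 1 & 0 & 1 \\ 2 & -1 & 1 \end{pmatrix}.$$
The vector with coordinates $(1,1,0)$ in K's basis thus has coordinates $T \cdot (1,1,0)^T = (-2,1,1)^T$ in Bond's basis. It remains to translate the origin: the vector $S - D = (-1,0,1)$ has coordinates $(2,0,-1)$ in Bond's basis, and adding it gives the result $[0,1,0]$. Indeed, both $[0,1,2] + (1,1,0) + (-1,0,1)$ and $[1,1,1] + (-1,1,2)$ equal $[0,2,3]$ in the standard coordinates. □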
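A direct way to check such a change of affine coordinates is to pass through the standard coordinates: first convert K's coordinates of the meeting point to standard ones, then solve a linear system for Bond's coordinates. The following Python sketch does this exactly; `solve`, `to_standard` and `from_standard` are illustrative helper names, and Bond's frame is assumed to have its origin at $D$.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a.x = b exactly by Gauss-Jordan elimination.

    Assumes a regular matrix; a singular one raises StopIteration.
    """
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                m[r] = [x - m[r][col] * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def to_standard(origin, basis, coords):
    """Standard coordinates of the point origin + sum coords[i] * basis[i]."""
    return [o + sum(c * v[i] for c, v in zip(coords, basis))
            for i, o in enumerate(origin)]


def from_standard(origin, basis, point):
    """Affine coordinates of a point in the frame (origin; basis)."""
    columns = [[v[i] for v in basis] for i in range(len(origin))]
    return solve(columns, [p - o for p, o in zip(point, origin)])


S, basis_k = (0, 1, 2), [(1, 1, 0), (-1, 0, 1), (0, 1, 2)]
D, basis_bond = (1, 1, 1), [(0, 0, 1), (-1, 1, 2), (1, 0, 1)]

meeting_standard = to_standard(S, basis_k, (1, 1, 0))
meeting_bond = from_standard(D, basis_bond, meeting_standard)
print(meeting_standard, meeting_bond)
```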
4.A.15. Find a transversal of the lines (that is, a line intersecting both given lines)
$$p: [1,1,1] + t\,(2,1,0), \qquad q: [2,2,0] + u\,(1,1,1),$$
so that the point $[1,0,0]$ lies on the transversal.
Solution. The transversal lies in the plane $\varrho$ defined by the point $[1,0,0]$ and the line $p$. Hence it lies in the plane
$$[1,1,1] + t\,(2,1,0) + s\,(0,1,1).$$
Let the point $Q$ be the intersection of this plane with the line $q$. $Q$ is obtained by solving the system
$$\begin{aligned} 1 + 2t &= 2 + u, \\ 1 + t + s &= 2 + u, \\ 1 + s &= u. \end{aligned}$$
The left-hand sides of the equations represent the three coordinates of an arbitrary point of the plane $\varrho$. The right-hand sides represent the coordinates of an arbitrary point on $q$ (the free variable is denoted $u$ in order to avoid ambiguity). Solving this system yields $s = 2$, $t = 2$, $u = 3$. Putting $u = 3$ into the equation of the line $q$ gives $Q = [5,5,3]$. The desired transversal is thus the line given by $Q$ and the point $[1,0,0]$. Its intersection with $p$ is the point $P = [7/3, 5/3, 1]$. □
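The construction can be sketched in Python as follows. The function name `transversal_through` and the helper `solve` are illustrative. The sketch assumes the generic situation: the given point does not lie on $p$, the line $q$ meets the plane in exactly one point, and the transversal is not parallel to $p$. The formula for $P$ comes from rewriting $Q - X_0 = (1 + s)(P_0 - X_0) + t\,u$.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a.x = b exactly by Gauss-Jordan elimination.

    Assumes a regular matrix; a singular one raises StopIteration.
    """
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                m[r] = [x - m[r][col] * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def transversal_through(x0, p0, u, q0, v):
    """Points P on p0 + t*u and Q on q0 + r*v collinear with x0 (in R^3)."""
    w = [a - b for a, b in zip(p0, x0)]   # second direction of the plane <p, x0>
    # Intersect the plane with q:  p0 + t*u + s*w = q0 + r*v.
    rows = [[u[i], w[i], -v[i]] for i in range(3)]
    t, s, r = solve(rows, [q0[i] - p0[i] for i in range(3)])
    Q = [q0[i] + r * v[i] for i in range(3)]
    # Q - x0 = (1 + s)*w + t*u, hence the line x0 Q meets p at parameter t/(1+s).
    P = [p0[i] + t / (1 + s) * u[i] for i in range(3)]
    return P, Q


P, Q = transversal_through((1, 0, 0), (1, 1, 1), (2, 1, 0), (2, 2, 0), (1, 1, 1))
print(P, Q)
```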
4.A.16. Find the common perpendicular of the two skew lines
$$p: [3,0,3] + (0,1,2)\,t,\ t \in \mathbb{R}, \qquad q: [0,-1,-2] + (1,2,3)\,s,\ s \in \mathbb{R}.$$
Specific examples are the convex polyhedrons defined by one point and a finite number of vectors. Let $u_1, \dots, u_k$ be arbitrary vectors in the difference space $\mathbb{R}^n$, and let $A \in \mathcal{A}_n$ be a point. A parallelepiped $\mathcal{P}_k(A; u_1, \dots, u_k) \subset \mathcal{A}_n$ is the set
$$\mathcal{P}_k(A; u_1, \dots, u_k) = \{A + c_1u_1 + \dots + c_ku_k;\ 0 \le c_i \le 1\}.$$
If the vectors $u_1, \dots, u_k$ are independent, we talk about a $k$-dimensional parallelepiped $\mathcal{P}_k(A; u_1, \dots, u_k) \subset \mathcal{A}_n$. It is clear from the definition that parallelepipeds are convex. They are the convex hulls of their vertices.
4.1.10. Examples of standard affine exercises. (1) To find a parametric description of an implicitly given subspace and vice versa:
Find a particular solution of the non-homogeneous system and a fundamental solution of the homogenized system. Then obtain (in the coordinates in which the equations have been set) the desired parametric description. In the opposite direction, write the parametric description in coordinates and then eliminate the free parameters $t_1, \dots, t_k$. This results in the equations defining the given subspace implicitly.
(2) To find the subspace generated by several subspaces $\mathcal{Q}_1, \dots, \mathcal{Q}_s$ (of different dimensions in general), for instance to find a plane in $\mathbb{R}^3$ given by a straight line and a point, or by three points, and to define this subspace implicitly or parametrically:
The resulting subspace $\mathcal{Q}$ is always determined by one fixed point $A_i$ in each subspace $\mathcal{Q}_i$ and by the sum of all difference spaces. For instance,
$$\mathcal{Q} = A_1 + (Z(\{A_1, \dots, A_s\}) + Z(\mathcal{Q}_1) + \dots + Z(\mathcal{Q}_s)).$$
If the subspaces are given implicitly, it is possible to convert them into parametric form first. Nevertheless, different methods are advantageous in some concrete situations. Notice that it is really necessary to use one point from each of the subspaces. For example, two parallel lines in a plane generate the whole plane, but they share the same one-dimensional difference space.
(3) To find the intersection of the subspaces $\mathcal{Q}_1, \dots, \mathcal{Q}_s$: If they are given in the implicit form, it is sufficient to unify all equations into one system, omitting any linearly dependent ones. If the resulting system has no solution, then the intersection is empty. Otherwise, an implicit description of the affine subspace is obtained. This is the intersection we are searching for.
If parametric forms are given, we may search directly for common points as solutions of the appropriate equations, similarly to the way we find the intersections of vector spaces. If the number of subspaces is greater than two, we must search for the intersection step by step.
If one of the subspaces is defined parametrically and the other implicitly, it suffices to substitute the parametrized coordinates and to solve the resulting system of equations.
(4) To find a crossbar between two skew lines $p$, $q$ in $\mathcal{A}_3$, passing through a given point or having a given direction:
Solution. The direction of the common perpendicular is given by the cross product of the two direction vectors, $(0,1,2) \times (1,2,3) = (-1,2,-1)$. So the direction of the common perpendicular is $(1,-2,1)$. Form a system of linear equations which expresses that the vector defined by two points, $P$ lying on $p$ and $Q$ lying on $q$, is parallel to the direction $(1,-2,1)$. We get the system $P - Q = k(1,-2,1)$, or
$$\underbrace{[3,0,3] + (0,1,2)\,t}_{P} - \underbrace{\big([0,-1,-2] + (1,2,3)\,s\big)}_{Q} = k\,(1,-2,1).$$
Treat this equality component-wise to give
$$\begin{aligned} 3 - s &= k, \\ 1 + t - 2s &= -2k, \\ 5 + 2t - 3s &= k, \end{aligned}$$
with the solution $t = 1$, $s = 2$, $k = 1$. Put $t = 1$ into the equation of the line $p$ to obtain the point $P = [3,1,5]$ on the common perpendicular. Put $s = 2$ into the equation of the line $q$ to obtain the point $Q = [2,3,4]$. The common perpendicular is the line joining these two points. □
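The feet of the common perpendicular can also be found without the cross product. The Python sketch below uses an equivalent formulation: the vector $P - Q$ must be orthogonal to both direction vectors, which gives two linear equations for $t$ and $s$, solved here by Cramer's rule. The function name `common_perpendicular` is an illustrative choice, and the lines are assumed not to be parallel.

```python
from fractions import Fraction


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def common_perpendicular(p0, u, q0, v):
    """Feet P on p0 + t*u and Q on q0 + s*v of the common perpendicular."""
    w = [a - b for a, b in zip(p0, q0)]
    # (w + t*u - s*v).u = 0 and (w + t*u - s*v).v = 0, i.e.
    #   t*(u.u) - s*(u.v) = -w.u
    #   t*(u.v) - s*(v.v) = -w.v
    a11, a12, b1 = dot(u, u), -dot(u, v), -dot(w, u)
    a21, a22, b2 = dot(u, v), -dot(v, v), -dot(w, v)
    det = a11 * a22 - a12 * a21          # nonzero for non-parallel lines
    t = Fraction(b1 * a22 - a12 * b2, det)
    s = Fraction(a11 * b2 - b1 * a21, det)
    P = [a + t * b for a, b in zip(p0, u)]
    Q = [a + s * b for a, b in zip(q0, v)]
    return P, Q


P, Q = common_perpendicular((3, 0, 3), (0, 1, 2), (0, -1, -2), (1, 2, 3))
print(P, Q)
```

The difference $P - Q$ is then a multiple of the cross product of the two directions, which ties this variant back to the method of the solution.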
B. Euclidean geometry
4.B.1. Determine the distance between the lines in $\mathbb{R}^3$
$$p: [1,-1,0] + t\,(-1,2,3) \quad \text{and} \quad q: [2,5,-1] + t\,(-1,-2,1).$$
Solution. The distance is defined as the distance between the orthogonal projections of arbitrary points of the respective lines onto the orthogonal complement of the vector subspace generated by their directions. The orthogonal complement is spanned by the cross product:
$$\langle (-1,2,3), (-1,-2,1) \rangle^{\perp} = \langle (-1,2,3) \times (-1,-2,1) \rangle = \langle (8,-2,4) \rangle = \langle (4,-1,2) \rangle.$$
A transversal is (for example) the segment joining $[1,-1,0]$ to $[2,5,-1]$. So the vector to be projected is $[1,-1,0] - [2,5,-1] = (-1,-6,1)$. The distance between the lines is therefore
$$\frac{|(-1,-6,1) \cdot (4,-1,2)|}{\|(4,-1,2)\|} = \frac{4}{\sqrt{21}}.$$
□
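The projection onto the cross product can be written as a short Python sketch. To keep the arithmetic exact, the function returns the squared distance as a rational number and the square root is taken only at the end. The name `skew_lines_distance_sq` is illustrative, and the lines are assumed not to be parallel (otherwise the cross product vanishes and the division fails).

```python
import math
from fractions import Fraction


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def cross(v, w):
    return (v[1] * w[2] - v[2] * w[1],
            v[2] * w[0] - v[0] * w[2],
            v[0] * w[1] - v[1] * w[0])


def skew_lines_distance_sq(p0, u, q0, v):
    """Squared distance of the lines p0 + t*u and q0 + s*v in R^3."""
    n = cross(u, v)                       # spans the orthogonal complement of <u, v>
    w = [a - b for a, b in zip(p0, q0)]   # any vector joining the two lines
    return Fraction(dot(w, n) ** 2, dot(n, n))


d2 = skew_lines_distance_sq((1, -1, 0), (-1, 2, 3), (2, 5, -1), (-1, -2, 1))
distance = math.sqrt(d2)
print(d2, distance)
```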
4.B.2. Find a point $A$ lying on the line
$$p: x + 2y + z - 1 = 0, \quad 3x - y + 4z - 29 = 0,$$
which is equidistant from both $B = [3,11,4]$ and $C = [-5,-13,-2]$.
By a crossbar we mean a straight line which has nonempty intersection with both skew lines. Thus the resulting crossbar $r$ is a one-dimensional affine subspace. If we are given one point $A \in r$, then the affine subspace generated by $p$ and $A$ is either a straight line (if $A \in p$) or a plane (if $A \notin p$). In the first case, there are an infinite number of solutions, one for each point of $q$. In the second case, it suffices to find the intersection $B$ of the plane $\langle p \cup A \rangle$ with $q$, and $r = \langle \{A, B\} \rangle$. There is no solution if the intersection is empty. If $q \subset \langle p \cup A \rangle$, there are an infinite number of solutions. If the intersection has one element, there is exactly one solution.
If a direction $u \in \mathbb{R}^n$ is given, then we consider the subspace $\mathcal{Q}$ generated by $p$ and the difference space $Z(p) + \langle u \rangle \subset \mathbb{R}^n$. Again, we obtain an infinite number of solutions if $q \subset \mathcal{Q}$. Otherwise we consider the intersection of $\mathcal{Q}$ with $q$ and we finish as before.
The solutions of other practical geometric problems are based mostly on the systematic use of the steps given above.
4.1.11. Remarks on linear programming. In the beginning of the third chapter, in paragraphs 3.1.1-3.1.7, we dealt with practical problems which are given by systems of linear inequalities. Each single inequality
$$a_1x_1 + \dots + a_nx_n \le b$$
defines a halfspace in the standard affine space $\mathbb{R}^n$. This is bounded by a hyperplane given by the corresponding equation (compare with the definition in paragraph 4.1.9(4)). Suppose we choose the parametric description of the hyperplane
$$\{P + t_1v_1 + \dots + t_{n-1}v_{n-1}\}$$
with vectors $v_1, \dots, v_{n-1}$ from the difference space. By completing these vectors by $v_n$ to a basis of the whole $\mathbb{R}^n$, the value of
$a_1x_1 + \dots + a_nx_n - b$ on the linear combination $t_1v_1 + \dots + t_{n-1}v_{n-1} + t_nv_n$ must be positive for all vectors with either a positive or a negative coefficient $t_n$.
At the same time, the set of all admissible vectors for the problem of the linear programming is always an intersection of a finite number of convex sets. Hence the set itself is either convex or empty.
If the intersection is both nonempty and bounded, then it is a convex polyhedron. As justified in 3.1.1 already, each linear form is either increasing or decreasing or constant along each parametrized straight line in the affine space. Thus if a given problem from linear programming is solvable and bounded, then it has the optimal solution at one of the vertices of the corresponding convex polyhedron. The reader should be able to imagine this claim in the case of two-dimensional or three-dimensional problems. Nevertheless, the straightforward explanation in these low dimensions holds for all finite-dimensional cases.
We have given a "geometric proof" of the existence part of the fundamental theorem 3.1.5. We have translated the
CHAPTER 4. ANALYTIC GEOMETRY
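Solution. First, express the line $p$ parametrically. Solve the system
$$\begin{aligned} x + 2y + z &= 1, \\ 3x - y + 4z &= 29. \end{aligned}$$
Rewrite the system as an augmented matrix and perform row operations:
$$\left(\begin{array}{ccc|c} 1 & 2 & 1 & 1 \\ 3 & -1 & 4 & 29 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1 & 2 & 1 & 1 \\ 0 & -7 & 1 & 26 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1 & 0 & 9/7 & 59/7 \\ 0 & 1 & -1/7 & -26/7 \end{array}\right).$$
The line $p$ is thus described by
$$p : \left[\tfrac{59}{7}, -\tfrac{26}{7}, 0\right] + t\left(-\tfrac{9}{7}, \tfrac{1}{7}, 1\right), \quad t \in \mathbb{R}.$$
It is convenient to avoid the fractions by introducing the substitution $t = 7s + 26$. Then $p$ is described by
$$p : [-25, 0, 26] + s\,(-9, 1, 7), \quad s \in \mathbb{R}.$$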
The point $A$ is obtained by requiring that the vectors
$$A - B = (-28 - 9s, -11 + s, 22 + 7s), \quad A - C = (-20 - 9s, 13 + s, 28 + 7s)$$
have the same length. Hence
$$\sqrt{(-28 - 9s)^2 + (-11 + s)^2 + (22 + 7s)^2} = \sqrt{(-20 - 9s)^2 + (13 + s)^2 + (28 + 7s)^2},$$
or rather
$$(-28 - 9s)^2 + (-11 + s)^2 + (22 + 7s)^2 = (-20 - 9s)^2 + (13 + s)^2 + (28 + 7s)^2.$$
The quadratic terms cancel and the equation reduces to $12s + 36 = 0$, which has the unique solution $s = -3$. Therefore
$$A = [-25, 0, 26] - 3\,(-9, 1, 7) = [2, -3, 5].$$
□
4.B.3. Michael has a stick of length 4. Can he touch the lines $p$ and $q$ simultaneously with this stick, given that the stick must pass through $[2, 1, 2]$?
$$p : [-1, 4, 1] + t\,(-1, 2, 0), \qquad q : [4, 4, -1] + s\,(1, 2, -4).$$
Solution. Compute the transversal of those lines passing through $[2, 1, 2]$. It is the segment joining $[1, 0, 1]$ to $[3, 2, 3]$. Its length is $\sqrt{12}$, which is less than 4. So Michael can touch the lines as required. □
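The transversal can also be checked numerically by the construction described in the theory column: take the plane generated by the given point and one line, and intersect it with the other line. The following is only a minimal sketch in Python with exact rational arithmetic; the helper names `cross`, `dot`, `sub` and `hit` are illustrative assumptions, not notation from the text, and the sketch assumes the second line is not parallel to the plane.

```python
from fractions import Fraction

def cross(u, v):
    """Cross product of two vectors in R^3."""
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def hit(M, a0, a_dir, b0, b_dir):
    """Point where the line b0 + s*b_dir meets the plane through M and the line a0 + t*a_dir.

    Assumes the line is not parallel to that plane (otherwise division by zero).
    """
    normal = cross(a_dir, sub(a0, M))
    s = Fraction(-dot(normal, sub(b0, M)), dot(normal, b_dir))
    return tuple(x + s*d for x, d in zip(b0, b_dir))

M = (2, 1, 2)                        # the point the stick must pass through
p0, u = (-1, 4, 1), (-1, 2, 0)       # line p
q0, v = (4, 4, -1), (1, 2, -4)       # line q

Q = hit(M, p0, u, q0, v)             # end of the stick on q
P = hit(M, q0, v, p0, u)             # end of the stick on p
length_sq = dot(sub(Q, P), sub(Q, P))
print(P, Q, length_sq, length_sq < 4**2)
```

Comparing squared lengths avoids the square root, so the whole check stays in exact arithmetic.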
initial problem into a finite problem of the given cost function. An example of a practical algorithm for finding the corresponding vertices of a convex polyhedron is given in the chapter about discrete mathematics.
4.1.12. Affine maps. A map $f : \mathcal{A} \to \mathcal{B}$ between affine spaces is called an affine map if there exists a linear map $\varphi : Z(\mathcal{A}) \to Z(\mathcal{B})$ between their difference spaces such that for all $A \in \mathcal{A}$, $v \in Z(\mathcal{A})$ the following holds:
$$f(A + v) = f(A) + \varphi(v).$$
The maps $f$ and $\varphi$ are determined uniquely by this property and by arbitrarily chosen images of $(\dim \mathcal{A} + 1)$ points in general position.
For an arbitrary affine combination of points $t_0A_0 + \dots + t_sA_s \in \mathcal{A}$ we obtain
$$\begin{aligned} f(t_0A_0 + \dots + t_sA_s) &= f(A_0 + t_1(A_1 - A_0) + \dots + t_s(A_s - A_0)) \\ &= f(A_0) + t_1\varphi(A_1 - A_0) + \dots + t_s\varphi(A_s - A_0) \\ &= t_0f(A_0) + t_1f(A_1) + \dots + t_sf(A_s). \end{aligned}$$
On the other hand, if a map preserves affine combinations, we may use the combinations of $n + 1$ fixed points $A_0, \dots, A_n$ forming an affine frame. After choosing successively the coefficients $t_0 = 0$ and $t_i = 1$, we define the map $\varphi$ between the difference spaces by the relation $\varphi(A_i - A_0) = f(A_i) - f(A_0)$. The previous computation can be read in the opposite direction, so we can check the validity and linearity of $\varphi$. The assumption that the first and the last rows are equal implies that the second and the third rows are equal. So we have an affine map with the corresponding linear map $\varphi$ between difference spaces, which we described in the chosen affine frame by this procedure. Therefore:
Theorem. Affine maps are exactly those maps which preserve the affine combinations of points.
It is sufficient to check the invariance of affine combinations for all pairs of points, since we can create an arbitrary affine combination from them. The affine combination of $k + 2$ points $A_0, \dots, A_{k+1}$ can be expressed as
$$r(t_0A_0 + \dots + t_kA_k) + sA_{k+1},$$
where $\sum_{i=0}^{k} t_i = 1$ and $r + s = 1$. We choose a point which is an affine combination of $k + 1$ points only. Then make its combination with the last one. In this way, any finite affine combination can be made step by step from the combination of pairs.
4.1.13. Ratio of collinear points. The affine combinations of pairs of points can also be expressed with the help of the ratio of points on a straight line. If a point $C$ is given as an affine combination of points $A$ and $B \ne C$, $C = rA + sB$, then we say that the number
$$\lambda = (C; A, B) = -\frac{s}{r}$$
4.B.4. In Euclidean space $\mathbb{R}^4$, determine the distance between the point $A = [2, -5, 1, 4]$ and the subspace defined by the equations
$$\begin{aligned} U : \quad 4x_1 - 2x_2 - 3x_3 - 2x_4 + 12 &= 0, \\ 2x_1 - x_2 - 2x_3 - 2x_4 + 9 &= 0. \end{aligned}$$
Solution. First find a point of the subspace $U$. For example,
$$B = [0, 3, 0, 3] \in U.$$
The distance between $A$ and $U$ equals the length of the orthogonal projection of the vector $A - B$ to the orthogonal complement of the direction of the subspace $U$. This orthogonal complement is the set of linear combinations of the normal vectors of the two hyperplanes defining $U$,
$$V := \{t\,(4, -2, -3, -2) + s\,(2, -1, -2, -2);\ t, s \in \mathbb{R}\}.$$
We need to find the orthogonal projection $P_{A-B}$ of the vector $A - B$ to $V$. It lies in $V$, and thus
$$P_{A-B} = a\,(4, -2, -3, -2) + b\,(2, -1, -2, -2)$$
for certain $a, b \in \mathbb{R}$. Clearly, $(A - B - P_{A-B}) \perp V$, thus
$$((A - B) - P_{A-B}) \perp (4, -2, -3, -2), \qquad ((A - B) - P_{A-B}) \perp (2, -1, -2, -2).$$
By substitution of $A - B = (2, -8, 1, 1)$ and $P_{A-B}$,
$$\begin{aligned} ((2, -8, 1, 1) - a(4, -2, -3, -2) - b(2, -1, -2, -2)) \cdot (4, -2, -3, -2) &= 0, \\ ((2, -8, 1, 1) - a(4, -2, -3, -2) - b(2, -1, -2, -2)) \cdot (2, -1, -2, -2) &= 0; \end{aligned}$$
so
$$\begin{aligned} (2, -8, 1, 1) \cdot (4, -2, -3, -2) - a(4, -2, -3, -2) \cdot (4, -2, -3, -2) - b(2, -1, -2, -2) \cdot (4, -2, -3, -2) &= 0, \\ (2, -8, 1, 1) \cdot (2, -1, -2, -2) - a(4, -2, -3, -2) \cdot (2, -1, -2, -2) - b(2, -1, -2, -2) \cdot (2, -1, -2, -2) &= 0. \end{aligned}$$
If we compute these dot products, we obtain the system
$$\begin{aligned} 19 - 33a - 20b &= 0, \\ 8 - 20a - 13b &= 0, \end{aligned}$$
is the ratio of the point $C$ with respect to the given points $A$ and $B$. Since we can express $C$ as
$$C = A + s(B - A) = B + r(A - B),$$
the ratio $\lambda$ is the ratio of the lengths of the oriented vectors $C - A$ and $C - B$. In particular, $\lambda = -1$ if and only if $C$ is at the centre of the line segment joining $A$ and $B$ (i.e. $r = s = \frac{1}{2}$ in the affine combination).
Hence the characterization of affine maps in terms of affine combinations has the following consequence:
Corollary. Affine maps are exactly those maps for which the ratios are invariant.
4.1.14. Changes of coordinates. Under the choice of an affine coordinate system $(A_0, u)$ on $\mathcal{A}$ and a system $(B_0, v)$ on $\mathcal{B}$, we obtain the coordinate expression of the affine map $f : \mathcal{A} \to \mathcal{B}$. It is sufficient to express the image $f(A_0)$ of the origin of the coordinate system on $\mathcal{A}$ in the coordinate system on $\mathcal{B}$. In other words, the vector $f(A_0) - B_0$ is expressed in the basis $v$ as a column of coordinates $y_0$. Everything else is then given by multiplying by the matrix of the map $\varphi$ in the chosen bases and by adding the outcome. Each affine map therefore has the following form in coordinates:
$$x \mapsto y_0 + Y \cdot x,$$
where $y_0$ is as above, and $Y$ is the matrix of the map $\varphi$.
As in the case of linear maps, the transformation of affine coordinates corresponds to the expression of the identity map in the chosen affine frames. The change of coordinate expression of an affine map caused by a change of the basis is computed by multiplying and adding matrices and vectors.
Let
$$x = w + M \cdot x'$$
describe a change of basis on the domain by a translation $w$ and a matrix $M$. Let
$$y' = z + N \cdot y$$
describe a change of basis on the range space by a translation $z$ and a matrix $N$. Then
$$\begin{aligned} y' = z + N \cdot y &= z + N \cdot (y_0 + Y \cdot x) = z + N \cdot (y_0 + Y \cdot (w + M \cdot x')) \\ &= (z + N \cdot y_0 + N \cdot Y \cdot w) + (N \cdot Y \cdot M) \cdot x'. \end{aligned}$$
Hence the affine map in the new bases is given by the translation vector $z + N \cdot y_0 + N \cdot Y \cdot w$ and the matrix $N \cdot Y \cdot M$.
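The transformation rule can be sanity-checked on small numbers. The sketch below is only an illustration in plain Python: the concrete values of $y_0$, $Y$, $w$, $M$, $z$, $N$ are made up for the test, and the helper names (`mat_vec`, `mat_mul`, `add`, `change_coordinates`) are assumptions, not notation from the text.

```python
def mat_vec(M, x):
    """Matrix times column vector."""
    return [sum(m*xi for m, xi in zip(row, x)) for row in M]

def mat_mul(A, B):
    """Matrix product A * B."""
    return [[sum(A[i][k]*B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def change_coordinates(y0, Y, w, M, z, N):
    """Translation and matrix of x' -> y', given y = y0 + Y x, x = w + M x', y' = z + N y."""
    NY = mat_mul(N, Y)
    translation = add(add(z, mat_vec(N, y0)), mat_vec(NY, w))
    return translation, mat_mul(NY, M)

# Hypothetical sample data, chosen only to exercise the formula.
y0, Y = [1, -1], [[1, 2], [0, 1]]
w, M = [3, 0], [[2, 0], [1, 1]]
z, N = [5, 5], [[0, 1], [1, 0]]
t_new, Y_new = change_coordinates(y0, Y, w, M, z, N)

x_new = [1, 2]
x_old = add(w, mat_vec(M, x_new))                         # back to the old coordinates
direct = add(z, mat_vec(N, add(y0, mat_vec(Y, x_old))))   # apply f, then change coordinates
via_formula = add(t_new, mat_vec(Y_new, x_new))
print(direct, via_formula)
```

Both routes give the same image, as the computation in the text predicts.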
4.1.15. Euclidean point spaces. So far, we have not needed the notions of distance and length for geometric considerations. But the length of vectors and the angle between vectors, as defined in the second chapter (see 2.3.18 and elsewhere), play a significant role in many practical problems.
with the only solution $a = 3$, $b = -4$. Hence
$$P_{A-B} = 3\,(4, -2, -3, -2) - 4\,(2, -1, -2, -2) = (4, -2, -1, 2),$$
where
$$\|P_{A-B}\| = \sqrt{4^2 + (-2)^2 + (-1)^2 + 2^2} = 5.$$
Hence the distance between $A$ and $U$ equals $\|P_{A-B}\| = 5$. □
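The same computation can be reproduced in a few lines. This is a minimal sketch, assuming the subspace is given by exactly two independent normal vectors, so that the normal equations form a 2x2 system solvable by Cramer's rule; the function name `project_onto_normals` is an illustrative assumption.

```python
from fractions import Fraction
from math import sqrt

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def project_onto_normals(w, n1, n2):
    """Coefficients (a, b) and the projection a*n1 + b*n2 of w onto span(n1, n2).

    Solves the 2x2 normal equations by Cramer's rule; n1, n2 must be independent.
    """
    g11, g12, g22 = dot(n1, n1), dot(n1, n2), dot(n2, n2)
    r1, r2 = dot(w, n1), dot(w, n2)
    det = g11*g22 - g12*g12
    a = Fraction(r1*g22 - r2*g12, det)
    b = Fraction(g11*r2 - g12*r1, det)
    return a, b, tuple(a*x + b*y for x, y in zip(n1, n2))

A, B = (2, -5, 1, 4), (0, 3, 0, 3)
n1, n2 = (4, -2, -3, -2), (2, -1, -2, -2)
w = tuple(x - y for x, y in zip(A, B))       # the vector A - B
a, b, proj = project_onto_normals(w, n1, n2)
distance = sqrt(dot(proj, proj))
print(a, b, proj, distance)
```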
4.B.5. In the vector space $\mathbb{R}^4$, compute the distance $v$ between the point $[0, 0, 6, 0]$ and the vector subspace
$$U : [0, 0, 0, 0] + t_1(1, 0, 1, 1) + t_2(2, 1, 1, 0) + t_3(1, -1, 2, 3), \quad t_1, t_2, t_3 \in \mathbb{R}.$$
Solution. We solve the problem by the least squares method. Write the generating vectors of $U$ as the columns of the matrix
$$A = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 1 & -1 \\ 1 & 1 & 2 \\ 1 & 0 & 3 \end{pmatrix}.$$
Replace the point $[0, 0, 6, 0]$ by the corresponding vector $b = (0, 0, 6, 0)^T$. Now solve $A \cdot x = b$, that is, the linear system
$$\begin{aligned} x_1 + 2x_2 + x_3 &= 0, \\ x_2 - x_3 &= 0, \\ x_1 + x_2 + 2x_3 &= 6, \\ x_1 + 3x_3 &= 0, \end{aligned}$$
by the least squares method. (Note that the system does not have a solution; the distance would be 0 otherwise.) Multiply $A \cdot x = b$ by the matrix $A^T$ from the left-hand side. Then the augmented matrix of the system $A^T \cdot A \cdot x = A^T \cdot b$ is
$$\left(\begin{array}{ccc|c} 3 & 3 & 6 & 6 \\ 3 & 6 & 3 & 6 \\ 6 & 3 & 15 & 12 \end{array}\right).$$
By elementary row operations, transform the matrix to the normal form and continue with backward elimination:
$$\left(\begin{array}{ccc|c} 1 & 1 & 2 & 2 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right) \sim \left(\begin{array}{ccc|c} 1 & 0 & 3 & 2 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right).$$
Euclidean spaces
The standard Euclidean point space $\mathcal{E}_n$ is the affine space $\mathcal{A}_n$ whose difference space is the standard Euclidean space $\mathbb{R}^n$ with the scalar product
$$\langle x, y \rangle = y^T \cdot x.$$
The Cartesian coordinate system is the affine coordinate system $(A_0; u)$ with an orthonormal basis $u$.
The Euclidean distance between two points $A, B \in \mathcal{E}_n$ is defined as the length $\|B - A\|$ of the vector $B - A$. This is denoted by $\rho(A, B)$.
Euclidean subspaces in $\mathcal{E}_n$ are affine subspaces, where the corresponding difference spaces are considered with restricted scalar products.
By a Euclidean point space $\mathcal{E}$ of dimension $n$ is meant an affine space whose difference space is a real $n$-dimensional Euclidean vector space. The notion of a Cartesian coordinate system has an obvious meaning. Since each choice of such a coordinate system identifies $\mathcal{E}$ with the standard space $\mathcal{E}_n$, we deal with the standard Euclidean spaces and their subspaces, with no loss of generality.
From the geometric point of view, simple properties of the scalar product like the triangle inequality, the Cauchy inequality, and Bessel's inequality, derived in the previous chapter (see 3.4.3), have useful consequences:
4.1.16. Theorem. For points $A, B, C \in \mathcal{E}_n$ the following holds:
(1) $\rho(A, B) = \rho(B, A)$.
(2) $\rho(A, B) = 0$ if and only if $A = B$.
(3) $\rho(A, B) + \rho(B, C) \ge \rho(A, C)$.
(4) In each Cartesian coordinate system $(A_0; e)$, the distance between the points $A = A_0 + a_1e_1 + \dots + a_ne_n$, $B = A_0 + b_1e_1 + \dots + b_ne_n$ is $\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$.
(5) Given a point $A$ and a subspace $\mathcal{Q}$ in $\mathcal{E}_n$, there exists a point $P \in \mathcal{Q}$ which minimizes the distance between $A$ and points in $\mathcal{Q}$. The distance between $A$ and $P$ equals the length of the orthogonal projection of the vector $A - B$ into $Z(\mathcal{Q})^\perp$ for an arbitrary $B \in \mathcal{Q}$.
(6) More generally, for subspaces $\mathcal{Q}$ and $\mathcal{R}$ in $\mathcal{E}_n$ there exist points $P \in \mathcal{Q}$ and $Q \in \mathcal{R}$ which minimize the distance between points $B \in \mathcal{Q}$ and $A \in \mathcal{R}$. The distance between the points $P$ and $Q$ is the length of the orthogonal projection of the vector $A - B$ into $(Z(\mathcal{R}) + Z(\mathcal{Q}))^\perp$ for arbitrary points $B \in \mathcal{Q}$ and $A \in \mathcal{R}$.
Proof. The first three properties follow directly from the properties of the length of vectors in spaces with a scalar product. The fourth follows from the expression of the scalar product in an orthonormal basis.
The solution is
$$x = (2 - 3t, t, t)^T, \quad t \in \mathbb{R}.$$
Note that the existence of infinitely many solutions is caused by the third vector generating $U$, which is redundant because
$$3\,(1, 0, 1, 1) - (2, 1, 1, 0) = (1, -1, 2, 3).$$
An arbitrary ($t \in \mathbb{R}$) linear combination
$$(2 - 3t)\,(1, 0, 1, 1) + t\,(2, 1, 1, 0) + t\,(1, -1, 2, 3) = (2, 0, 2, 2)$$
corresponds to the point $[2, 0, 2, 2]$ in the subspace $U$, which is the nearest point to $[0, 0, 6, 0]$. The required distance is therefore
$$v = \|[2, 0, 2, 2] - [0, 0, 6, 0]\| = \sqrt{2^2 + 0 + (-4)^2 + 2^2} = 2\sqrt{6}.$$
□
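The nearest point can also be obtained without solving the normal equations, by projecting onto an orthogonal basis of $U$. The following sketch swaps in Gram-Schmidt orthogonalization for the least squares elimination used above (an alternative technique, not the method of the solution); redundant generators are simply skipped. The function name `project_onto_span` is an assumption made for illustration.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def project_onto_span(generators, b):
    """Orthogonal projection of b onto span(generators) and the dimension of the span.

    Builds an orthogonal (not normalized) basis by Gram-Schmidt in exact
    arithmetic; generators that depend on the previous ones are skipped.
    """
    basis = []
    for g in generators:
        r = [Fraction(x) for x in g]
        for e in basis:
            c = dot(r, e) / dot(e, e)
            r = [ri - c*ei for ri, ei in zip(r, e)]
        if any(r):                      # nonzero remainder: a new direction
            basis.append(r)
    proj = [Fraction(0)] * len(b)
    for e in basis:
        c = dot(b, e) / dot(e, e)
        proj = [pi + c*ei for pi, ei in zip(proj, e)]
    return proj, len(basis)

generators = [(1, 0, 1, 1), (2, 1, 1, 0), (1, -1, 2, 3)]
b = (0, 0, 6, 0)
proj, rank = project_onto_span(generators, b)
dist_sq = sum((pi - bi)**2 for pi, bi in zip(proj, b))
print(proj, rank, dist_sq)
```

The reported dimension 2 reflects the redundancy of the third generator noted in the solution.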
4.B.6. Compute the volume of the parallelepiped in $\mathbb{R}^3$ with base in the plane $z = 0$ and with edges given by the pairs of vertices $[0, 0, 0], [-2, 3, 0]$; $[0, 0, 0], [4, 1, 0]$ and $[0, 0, 0], [5, 7, 3]$.
Solution. The parallelepiped is given by the vectors $(4, 1, 0)$, $(-2, 3, 0)$, $(5, 7, 3)$. Its volume is the determinant
$$\begin{vmatrix} 4 & -2 & 5 \\ 1 & 3 & 7 \\ 0 & 0 & 3 \end{vmatrix} = 3 \begin{vmatrix} 4 & -2 \\ 1 & 3 \end{vmatrix} = 3 \cdot 14 = 42.$$
Note that if the order of the vectors is changed, we get the result $\pm 42$, because the determinant gives the oriented volume of the parallelepiped. Note further that the volume would not change if the third vector was $(a, b, 3)$ for arbitrary $a, b \in \mathbb{R}$. The volume depends only on the orthogonal distance between the planes of the upper and lower bases and on the area of the base,
$$\begin{vmatrix} 4 & -2 \\ 1 & 3 \end{vmatrix} = 14.$$
□
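Both remarks (the sign change under reordering and the independence of the first two coordinates of the third vector) are easy to confirm numerically. A minimal sketch, with the helper name `det3` and the replacement vector $(-1, 10, 3)$ chosen only for illustration:

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix with columns a, b, c (expansion along the first row)."""
    return (a[0]*(b[1]*c[2] - b[2]*c[1])
            - b[0]*(a[1]*c[2] - a[2]*c[1])
            + c[0]*(a[1]*b[2] - a[2]*b[1]))

u, v, w = (4, 1, 0), (-2, 3, 0), (5, 7, 3)
volume = det3(u, v, w)
swapped = det3(v, u, w)              # exchanging two vectors flips the orientation
sheared = det3(u, v, (-1, 10, 3))    # hypothetical third vector with the same height 3
print(volume, swapped, sheared)
```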
4.B.7. Let the points $[0, 0, 1]$, $[2, 1, 1]$, $[3, 3, 1]$, $[1, 2, 1]$ define a parallelogram. Determine the point $X$ lying on the line $p : [0, 0, 1] + (1, 1, 1)\,t$ so that the parallelepiped defined by the given parallelogram and the point $X$ has volume 1.
Solution. Form a determinant which gives the volume of a parallelepiped with $X$ moving along the line $p$:
$$\begin{vmatrix} t & t & t \\ 2 & 1 & 0 \\ 1 & 2 & 0 \end{vmatrix} = 3t.$$
The oriented volume is $3t$, which gives $t = 1/3$; since the volume itself is $|3t|$, the value $t = -1/3$ works as well. □
Consider the relation for the minimal distance $\rho(A, B)$ for $B \in \mathcal{Q}$. The vector $A - B$ decomposes uniquely as $A - B = u_1 + u_2$, where $u_1 \in Z(\mathcal{Q})$, $u_2 \in Z(\mathcal{Q})^\perp$. The component $u_2$ does not depend on the choice of $B \in \mathcal{Q}$. This is because any change of $B$ amounts to adding a vector from $Z(\mathcal{Q})$.
Choose $P = A + (-u_2) = B + u_1 \in \mathcal{Q}$. Then
$$\|A - B\|^2 = \|u_1\|^2 + \|u_2\|^2 \ge \|u_2\|^2 = \|A - P\|^2.$$
Hence the minimal distance is obtained for the point $P$. Its value is $\|u_2\|$.
The general result is obtained in a similar way. For the choice of arbitrary points $A \in \mathcal{R}$ and $B \in \mathcal{Q}$, their difference is given as a sum of vectors $u_1 \in Z(\mathcal{R}) + Z(\mathcal{Q})$ and $u_2 \in (Z(\mathcal{R}) + Z(\mathcal{Q}))^\perp$. The component $u_2$ does not depend on the choice of the points. By adding suitable vectors from the difference spaces of $\mathcal{R}$ and $\mathcal{Q}$, points $A'$ and $B'$ are obtained so that the distance between them is $\|u_2\|$. □
We consider more elementary problems in affine geometry.
4.1.17. Examples of standard problems. (1) To find the distance from the point $A \in \mathcal{E}_n$ to the subspace $\mathcal{Q} \subset \mathcal{E}_n$:
A method of solving such a problem is given in proposition 4.1.16.
(2) In $\mathcal{E}_2$, to construct the straight line $q$ through a given point $A$ which forms a given angle with a given line $p$:
Recall that we have worked with angles between vectors in plane geometry already (see e.g. 2.3.21). Find a vector $u \in \mathbb{R}^2$ lying in the difference space of the line $p$. Then choose a vector $v$ having the prescribed angle with $u$. The desired line is given by the point $A$ and the difference space $\langle v \rangle$. The problem has either one or two solutions.
(3) To find a line through a given point, perpendicular to a given line:
The procedure is introduced in the proof of the last but one item of proposition 4.1.16.
(4) In $\mathcal{E}_3$, to determine the distance between two lines $p$, $q$:
Choose any point from each of the lines, $A \in p$, $B \in q$. The component of the vector $A - B$ lying in the orthogonal complement $(Z(p) + Z(q))^\perp$ has length equal to the distance between $p$ and $q$.
(5) In $\mathcal{E}_3$, to find the axis of two skew lines $p$ and $q$:
By the axis we mean the crossbar which attains the minimal possible distance between the given skew lines in terms of the points of intersection. The procedure can be derived from the proof of proposition 4.1.16 (the last item). Let $\eta$ be the subspace generated by a single point $A \in p$ and the sum $Z(p) + (Z(p) + Z(q))^\perp$. Provided that the lines $p$ and $q$ are not parallel, $\eta$ is a plane. Then the intersection $\eta \cap q$ together with the difference space $(Z(p) + Z(q))^\perp$ gives a parametric description of the desired axis. If the lines are parallel, the problem has an infinite number of solutions.
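Item (4) can be sketched in code for two non-parallel lines in $\mathcal{E}_3$: the orthogonal complement $(Z(p) + Z(q))^\perp$ is then spanned by the cross product of the direction vectors, and the distance is the length of the component of $A - B$ along it. The sample lines below are an assumption made for illustration (they are the lines of exercise 4.B.3), and the function name `skew_distance_sq` is hypothetical.

```python
from fractions import Fraction

def cross(u, v):
    """Cross product of two vectors in R^3."""
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def skew_distance_sq(A, u, B, v):
    """Squared distance between the lines A + t*u and B + s*v (non-parallel case only)."""
    n = cross(u, v)                  # spans (Z(p) + Z(q))^perp when u, v are independent
    nn = dot(n, n)
    if nn == 0:
        raise ValueError("parallel lines are not handled by this sketch")
    w = tuple(a - b for a, b in zip(A, B))
    # The component of w along n has squared length (w . n)^2 / (n . n).
    return Fraction(dot(w, n)**2, nn)

d_sq = skew_distance_sq((-1, 4, 1), (-1, 2, 0), (4, 4, -1), (1, 2, -4))
print(d_sq)
```

For these lines the squared distance is smaller than 12, the squared length of the transversal through $[2, 1, 2]$ found in 4.B.3, as it must be for the shortest crossbar.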
4.B.8. Let $ABCDEFGH$ be a cube (with common notation, i.e. the vectors $E - A$, $F - B$, $G - C$, $H - D$ are orthogonal to the plane defined by the vertices $A, B, C, D$) in Euclidean space $\mathbb{R}^3$. Compute the angle $\varphi$ between the vectors $F - A$ and $H - A$.
Solution. This problem can be solved using the formula for the angle between vectors. Alternatively, notice that the vertices $A, F, H$ are the vertices of a triangle with all sides of the same length. Hence it is an equilateral triangle. Therefore
$\varphi = \pi/3$. □
4.B.9. Let $S$ be the midpoint of the edge $AB$ of the cube $ABCDEFGH$ (with common labelling). Compute the cosine of the angle between the lines $ES$ and $BG$.
Solution. Dilation (homothety) is a mapping which preserves angles. So without loss of generality, the cube edge has length 1. The coordinate system can be placed so that $A$ is at the origin, $B = [1, 0, 0]$ and $E = [0, 0, 1]$. It follows that $S = [1/2, 0, 0]$, $G = [1, 1, 1]$, $ES = (1/2, 0, -1)$ and $BG = (0, 1, 1)$. The desired cosine of the angle $\varphi$ is then
$$\cos\varphi = \frac{|(1/2, 0, -1) \cdot (0, 1, 1)|}{\|(1/2, 0, -1)\|\,\|(0, 1, 1)\|} = \frac{\sqrt{2}}{\sqrt{5}}.$$
□
4.B.10. Compute the angle between the line $p$ given by the implicit equations
$$\begin{aligned} x + 3y + z &= 0, \\ -x - y + z &= 0 \end{aligned}$$
and the plane $\varrho : x + y + 2z + 1 = 0$.
Solution. The normal vector of the plane $\varrho$ is $(1, 1, 2)$. Keep the first equation of the line $p$ and replace the second one by the sum of both, to obtain
$$\begin{aligned} x + 3y + z &= 0, \\ 2y + 2z &= 0. \end{aligned}$$
From this system $y = -z$ and $x = 2z$. The vector $(2, -1, 1)$ is therefore the direction vector of $p$. In other words, $p$ passes through the origin, and
$$p : [0, 0, 0] + t\,(2, -1, 1), \quad t \in \mathbb{R}.$$
For the angle $\varphi$ between the vectors $(1, 1, 2)$, $(2, -1, 1)$,
$$\cos\varphi = \frac{2 - 1 + 2}{\sqrt{6}\,\sqrt{6}} = \frac{1}{2}.$$
Hence $\varphi = 60°$. However, this is the angle between the direction vector of $p$ and the normal vector of $\varrho$. The desired angle is the complement of this angle, so the solution is
$30° = 90° - 60°$. □
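The computation can be repeated in floating point. In this sketch the direction of the line is taken as the cross product of the normals of the two planes defining it (a shortcut that is not used in the solution above), and the angle with the plane is obtained directly from the arcsine, which is the complement of the angle with the normal. The function name `line_plane_angle_deg` is an assumption.

```python
from math import asin, degrees, sqrt

def cross(u, v):
    """Cross product of two vectors in R^3."""
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def line_plane_angle_deg(direction, normal):
    """Angle (in degrees) between a line with the given direction and a plane with the given normal."""
    s = abs(dot(direction, normal)) / sqrt(dot(direction, direction) * dot(normal, normal))
    return degrees(asin(min(1.0, s)))    # clamp guards against rounding slightly above 1

# The line is the intersection of x + 3y + z = 0 and -x - y + z = 0.
direction = cross((1, 3, 1), (-1, -1, 1))
angle = line_plane_angle_deg(direction, (1, 1, 2))
print(direction, angle)
```

The computed direction is a multiple of $(2, -1, 1)$, and the angle agrees with the result above up to rounding.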
4.1.18. Angles. Various geometric notions like angles, orientation, volume etc. in the point spaces $\mathcal{E}_n$ are defined in terms of suitable notions from Euclidean vector spaces. The angle between two vectors is defined at the end of the third part of the second chapter, see 2.3.21.
From the Cauchy inequality, it follows that $0 \le \frac{|u \cdot v|}{\|u\|\|v\|} \le 1$. So it makes sense to define the angle $\varphi(u, v)$ between vectors $u, v \in V$ in a real vector space with a scalar product by the equation
$$\cos\varphi(u, v) = \frac{u \cdot v}{\|u\|\|v\|}, \qquad 0 \le \varphi(u, v) \le \pi.$$
This is completely in accordance with the situation in the two-dimensional Euclidean space $\mathbb{R}^2$ and with the philosophy that a notion related to two vectors is an issue of plane geometry. In the Euclidean plane, we also use the trigonometric functions $\cos$ and $\sin$ defined by purely geometric considerations. Therefore, the angle between two vectors in higher-dimensional spaces is measured in the plane which is generated by these two vectors (or it is zero).
In an arbitrary real vector space with a scalar product, it follows that
$$\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2(u \cdot v) = \|u\|^2 + \|v\|^2 - 2\|u\|\|v\|\cos\varphi(u, v).$$
This is the well known cosine rule from plane geometry.
The following relation holds for each orthonormal basis $e = (e_1, \dots, e_n)$ of the difference space $V$ and a non-zero vector $u \in V$:
$$\|u\|^2 = \sum_{i} |u \cdot e_i|^2.$$
Dividing this equation by the number $\|u\|^2$, we obtain
$$1 = \sum_{i} (\cos\varphi(u, e_i))^2,$$
which is the law of direction cosines $\varphi(u, e_i)$ of the vector $u$.
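As a quick numerical illustration of the law of direction cosines, the following sketch works in the standard orthonormal basis of $\mathbb{R}^3$, where $u \cdot e_i$ is simply the $i$-th coordinate of $u$. The sample vector and the function name `direction_cosines` are assumptions chosen for the example.

```python
from math import sqrt

def direction_cosines(u):
    """Cosines of the angles between u and the standard basis vectors e_i."""
    norm = sqrt(sum(x*x for x in u))
    return [x / norm for x in u]

cosines = direction_cosines((2, -1, 2))      # hypothetical sample vector of length 3
total = sum(c*c for c in cosines)
print(cosines, total)
```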
Now we can derive definitions of angles between general subspaces in a Euclidean vector space from the definition of the angle between vectors. At the same time, it must be decided how to deal with cases where the subspaces have a non-trivial intersection. For the angle between two lines, use the smaller of the two possible angles. In the case of two nonparallel planes in $\mathbb{R}^3$ we do not say that the angle is zero, although they intersect and have one direction in common:
4.B.11. In the real plane, find a line which passes through the point $[-3, 0]$, so that $60°$ is the angle between this line and the line
$$p : \sqrt{3}x + 3y + 5 = 0.$$
Solution. The given line has slope $-\frac{\sqrt{3}}{3}$. It is therefore at angle $-30°$ from the positive $x$ axis. Thus the required line is either at angle $-90°$ or at angle $30°$ from the positive $x$ axis. The former determines the vertical line $x = -3$. The latter determines the line with slope $\frac{1}{\sqrt{3}}$ through $[-3, 0]$, which has the equation
$$y\sqrt{3} = x + 3.$$ □
Solution. (Alternative) Notice that there are two such lines. The general equation of a line in the plane has the form
$$ax + by + c = 0.$$
Choose the parameters so that $a^2 + b^2 = 1$.
We find numbers $a, b, c \in \mathbb{R}$ so that all the conditions are satisfied. Since the line passes through $[-3, 0]$, $c = 3a$. The condition that the angle between the lines equals $60°$ then gives
$$\frac{1}{2} = \cos 60° = \frac{|\sqrt{3}a + 3b|}{\sqrt{12}}, \quad \text{i.e.} \quad \sqrt{3} = |\sqrt{3}a + 3b|.$$
Angles between subspaces
4.1.19. Definition. Consider finite-dimensional subspaces $U_1$, $U_2$ in a Euclidean vector space $V$ of arbitrary dimension.
The angle between the vector subspaces $U_1$, $U_2$ is the real number $\alpha = \varphi(U_1, U_2) \in [0, \frac{\pi}{2}]$ satisfying:
(1) If $\dim U_1 = \dim U_2 = 1$, $U_1 = \langle u \rangle$, $U_2 = \langle v \rangle$, then
$$\cos\alpha = \frac{|u \cdot v|}{\|u\|\|v\|}.$$
(2) If the dimensions of $U_1$, $U_2$ are positive, and if $U_1 \cap U_2 = \{0\}$, then the angle is the minimum of all angles between the one-dimensional subspaces
$$\alpha = \min\{\varphi(\langle u \rangle, \langle v \rangle);\ 0 \ne u \in U_1,\ 0 \ne v \in U_2\}.$$
Such a minimum always exists.
(3) If $U_1 \subset U_2$ or $U_2 \subset U_1$ (in particular if one of them is the zero subspace), then $\alpha = 0$.
(4) If $U_1 \cap U_2 \ne \{0\}$ and if $U_1 \ne U_1 \cap U_2 \ne U_2$, then
$$\alpha = \varphi(U_1 \cap (U_1 \cap U_2)^\perp,\ U_2 \cap (U_1 \cap U_2)^\perp).$$
The angle between affine subspaces $\mathcal{Q}_1$, $\mathcal{Q}_2$ in a Euclidean point space $\mathcal{E}_n$ is defined as the angle between their difference spaces $Z(\mathcal{Q}_1)$, $Z(\mathcal{Q}_2)$.
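Performing further operations gives
$$\pm 1 = a + \sqrt{3}b$$
and, after squaring,
$$1 = a^2 + 3b^2 + 2\sqrt{3}ab.$$
If we use $a^2 + b^2 = 1$, we get
$$0 = 2b^2 + 2\sqrt{3}ab, \quad \text{i.e.} \quad 0 = b\,(b + \sqrt{3}a).$$
Together (remember that $c = 3a$ and $a^2 + b^2 = 1$)
$$a = \pm 1,\ b = 0,\ c = \pm 3; \qquad a = \pm\frac{1}{2},\ b = \mp\frac{\sqrt{3}}{2},\ c = \pm\frac{3}{2}.$$
We can easily check that the lines determined by those coefficients,
$$x + 3 = 0, \qquad \frac{1}{2}x - \frac{\sqrt{3}}{2}y + \frac{3}{2} = 0,$$
satisfy all the conditions.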
□
4.B.12. Determine the equations of all planes which contain the line $p : [1, 0, 0] + t\,(1, 1, 0)$ and form the angle $60°$ with the plane $x + y + z - 1 = 0$. ○
4.B.13. Determine the angle between the planes
$$\sigma : [1, 0, 2] + (1, -1, 1)\,t + (0, 1, -2)\,s, \qquad \rho : [3, 3, 3] + (1, -2, 0)\,t + (0, 1, 1)\,s.$$
Solution. The line of intersection of the planes has direction vector $(1, -1, 1)$. The plane orthogonal to this vector
Notice that the angle is always well defined. In the last case,
$$(U_1 \cap (U_1 \cap U_2)^\perp) \cap (U_2 \cap (U_1 \cap U_2)^\perp) = \{0\},$$
so we can determine the angle according to item (2). Notice also that in the case $U_1 \cap U_2 = \{0\}$, the subspaces $U_1$ and $U_2$ are perpendicular in terms of the former definitions if and only if the angle between them is $\pi/2$. However, if the intersection is nontrivial, then they cannot be perpendicular in the former sense.
In order to show the validity of the definition, it remains to show that the vectors $u \in U_1$, $v \in U_2$ minimizing the expression for the angle always exist. First a special case:
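4.1.20. Lemma. Let $v$ be a vector in a Euclidean space $V$ and $U \subset V$ an arbitrary subspace. Denote by $v_1 \in U$, $v_2 \in U^\perp$ the (uniquely determined) components of the vector $v$, i.e. $v = v_1 + v_2$.
Then the angle $\varphi$ between the subspace generated by $v$ and the subspace $U$ satisfies
$$\cos\varphi(\langle v \rangle, U) = \cos\varphi(\langle v \rangle, \langle v_1 \rangle) = \frac{\|v_1\|}{\|v\|}.$$
Proof. By the Cauchy inequality,
$$\frac{|u \cdot v|}{\|u\|\|v\|} = \frac{|u \cdot (v_1 + v_2)|}{\|u\|\|v\|} = \frac{|u \cdot v_1|}{\|u\|\|v\|} \le \frac{\|u\|\|v_1\|}{\|u\|\|v\|} = \frac{\|v_1\|}{\|v\|} = \frac{|v_1 \cdot v|}{\|v_1\|\|v\|}$$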
intersects the given planes in the lines generated by the vectors $(1, 0, -1)$ and $(0, 1, 1)$. The angle between these one-dimensional subspaces is $60°$. □
4.B.14. A cube $ABCDA'B'C'D'$ is given in standard notation. That is, $ABCD$ and $A'B'C'D'$ are faces and $AA'$, $BB'$ are edges. Compute the angle $\varphi$ between $AB'$ and $AD'$.
Solution. It can be assumed that the cube is of side 1 and placed in $\mathbb{R}^3$ in such a way that the vertices $A, B, C, D$ have coordinates $[0, 0, 0]$, $[1, 0, 0]$, $[1, 1, 0]$, $[0, 1, 0]$ respectively, and the vertices $A', B', C', D'$ have coordinates $[0, 0, 1]$, $[1, 0, 1]$, $[1, 1, 1]$, $[0, 1, 1]$ respectively. Thus $AB' = B' - A = (1, 0, 1)$, $AD' = D' - A = (0, 1, 1)$. So
$$\cos\varphi = \frac{(1, 0, 1) \cdot (0, 1, 1)}{\|(1, 0, 1)\|\,\|(0, 1, 1)\|} = \frac{1}{2},$$
hence $\varphi = 60°$.
□
For further exercises on angles, see .
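4.B.15. Prove that for every $n \in \mathbb{N}$ and for all positive $x_1, x_2, \dots, x_n \in \mathbb{R}$
$$n^2 \le \left(\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}\right) \cdot (x_1 + x_2 + \dots + x_n).$$
For what arguments does equality hold?
Solution. It is sufficient to consider the Cauchy inequality
$$|u \cdot v| \le \|u\|\|v\|$$
in the Euclidean space $\mathbb{R}^n$ for the vectors $u = \left(\frac{1}{\sqrt{x_1}}, \dots, \frac{1}{\sqrt{x_n}}\right)$, $v = (\sqrt{x_1}, \dots, \sqrt{x_n})$. We get
$$(1) \quad n \le \sqrt{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}} \cdot \sqrt{x_1 + x_2 + \dots + x_n}.$$
We obtain the desired inequality by squaring (1). The Cauchy inequality attains equality when the vector $u$ is a multiple of $v$, that is, when $x_1 = x_2 = \dots = x_n$. □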
4.B.16. Vectors $u = (u_1, u_2, u_3)$ and $v = (v_1, v_2, v_3)$ are given. Find a third unit vector such that the parallelepiped defined by these three vectors has the greatest possible volume.
Solution. Denote the desired vector by $t = (t_1, t_2, t_3)$. By Proposition ?? the volume of the parallelepiped $\mathcal{P}_3(0; u, v, t)$
for all vectors $u \in U$. This implies that
$$\cos\varphi(\langle v \rangle, \langle u \rangle) \le \cos\varphi(\langle v \rangle, \langle v_1 \rangle) = \frac{\|v_1\|}{\|v\|}.$$
Thus the computed vector $v_1$ represents the largest possible value of the cosine of the angles between all choices of vectors in $U$. The cosine function is decreasing on the interval $[0, \frac{\pi}{2}]$. Hence the smallest possible angle is obtained in this way, and so the claim is proved. □
4.1.21. Calculating angles. The procedure in the previous lemma can be understood as follows. Choose the orthogonal projection of the one-dimensional subspace generated by $v$ into the subspace $U$, and consider the ratio between the lengths of $v$ and its image. A similar procedure is used in higher dimensions. The problem is to recognize the directions whose projections give the desired (minimal) angle. This is clear in the previous example if we project the larger space $U$ into the one-dimensional $\langle v \rangle$ first, and then orthogonally back to $U$. The desired angle corresponds to the direction of the eigenvector of this map. The eigenvalue is the square of the cosine of the angle.
Let $U_1$, $U_2$ be two arbitrary subspaces in a Euclidean vector space $V$, $U_1 \cap U_2 = \{0\}$. Choose orthonormal bases $e$ and $e'$ of the whole space $V$ such that $U_1 = \langle e_1, \dots, e_k \rangle$, $U_2 = \langle e'_1, \dots, e'_l \rangle$.
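Consider the orthogonal projection $\varphi$ of the space $V$ on $U_2$. Its restriction to $U_1$ will be denoted by $\varphi : U_1 \to U_2$ as before. Similarly, let $\psi : U_2 \to U_1$ be the map which arises from the orthogonal projection on $U_1$. In the bases $(e_1, \dots, e_k)$ and $(e'_1, \dots, e'_l)$, these maps have matrices
$$A = \begin{pmatrix} e_1 \cdot e'_1 & \dots & e_k \cdot e'_1 \\ \vdots & & \vdots \\ e_1 \cdot e'_l & \dots & e_k \cdot e'_l \end{pmatrix}, \qquad B = \begin{pmatrix} e'_1 \cdot e_1 & \dots & e'_l \cdot e_1 \\ \vdots & & \vdots \\ e'_1 \cdot e_k & \dots & e'_l \cdot e_k \end{pmatrix}.$$
Since $e_i \cdot e'_j = e'_j \cdot e_i$ holds for all indices $i$, $j$, consequently $B = A^T$.
The composition of maps $\psi \circ \varphi : U_1 \to U_1$ therefore has a symmetric positive semidefinite matrix $A^TA$, and $\psi$ is adjoint to $\varphi$. Each such map has only nonnegative real eigenvalues. It has a diagonal matrix with these eigenvalues on the diagonal in a suitable orthonormal basis, see 3.4.7 and 3.4.9.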
Now we can derive a general procedure for computing the angle $\alpha = \varphi(U_1, U_2)$.
Theorem. In the previous notation, let $\lambda$ be the largest eigenvalue of the matrix $A^TA$. Then $(\cos\alpha)^2 = \lambda$.
Proof. Let $u \in U_1$ be an eigenvector of the map $\psi \circ \varphi$ corresponding to the eigenvalue $\lambda$, $\|u\| = 1$. Consider all eigenvalues $\lambda_1, \dots, \lambda_k$ (including multiplicities), and let $(u_1, \dots, u_k)$ be a corresponding orthonormal basis of $U_1$ consisting of eigenvectors. Assume that $\lambda = \lambda_1$, $u = u_1$. We need to show that the angle between an arbitrary $v \in U_1$ and $U_2$ is at least as large as the angle between $u$ and $U_2$. Equivalently, that the cosine of the corresponding angle
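is the absolute value of the determinant
$$\left|\,\begin{vmatrix} u_1 & v_1 & t_1 \\ u_2 & v_2 & t_2 \\ u_3 & v_3 & t_3 \end{vmatrix}\,\right| = \left|\,\begin{vmatrix} t_1 & t_2 & t_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}\,\right| = |t \cdot (u \times v)| \le \|t\|\,\|u \times v\| = \|u \times v\|.$$
The inequality follows from the Cauchy inequality. It becomes an equality if and only if $t = c\,(u \times v)$, $c \in \mathbb{R}$. The volume can therefore be at most equal to the area of the parallelogram defined by the vectors $u$, $v$ (i.e. the length of the vector $u \times v$). Equality holds if and only if
$$t = \pm\frac{u \times v}{\|u \times v\|}.$$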
□
4.B.17. Find the foot of the line passing through the point [0, 0,7] and perpendicular to the plane
p: [0, 5, 3] + (1, 2, l)t + (-2,1,
4.B.18. In Euclidean space $\mathbb{R}^5$ determine the distance between the planes
$$\begin{aligned} \varrho_1 &: [7, 2, 7, -1, 1] + t_1(1, 0, -1, 0, 0) + s_1(0, 1, 0, 0, -1), \\ \varrho_2 &: [2, 4, 7, -4, 2] + t_2(1, 1, 1, 0, 1) + s_2(0, -2, 0, 0, 3), \end{aligned}$$
where $t_1, s_1, t_2, s_2 \in \mathbb{R}$.
Solution. First compute the orthogonal complement of the sum of the directions of the planes. Form a matrix with the direction vectors of the planes as rows and transform it into normal form:
$$\begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & 0 & 0 & -1 \\ 1 & 1 & 1 & 0 & 1 \\ 0 & -2 & 0 & 0 & 3 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$
The matrix has rank 4, so the orthogonal complement is one-dimensional. The vector $(0, 0, 0, 1, 0)$ lies in it, so the orthogonal complement is $\langle (0, 0, 0, 1, 0) \rangle$. The distance between the planes is the length of the orthogonal projection of the vector $A_1 - A_2$ into the subspace $\langle (0, 0, 0, 1, 0) \rangle$ for arbitrary points $A_1 \in \varrho_1$, $A_2 \in \varrho_2$. Choose e.g. $A_1 = [7, 2, 7, -1, 1]$, $A_2 = [2, 4, 7, -4, 2]$. The orthogonal projection of $A_1 - A_2 = (5, -2, 0, 3, -1)$ to $\langle (0, 0, 0, 1, 0) \rangle$ is $(0, 0, 0, 3, 0)$. The length of $(0, 0, 0, 3, 0)$ gives the desired distance 3. □
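cannot be greater. By the previous lemma, it is sufficient to discuss the angle between $u$ and $\varphi(u) \in U_2$. Choose $v \in U_1$, $v = a_1u_1 + \dots + a_ku_k$, $\sum_{i=1}^{k} a_i^2 = \|v\|^2 = 1$. Then
$$\|\varphi(v)\|^2 = \varphi(v) \cdot \varphi(v) = (\psi \circ \varphi(v)) \cdot v \le \|\psi \circ \varphi(v)\|\,\|v\| = \|\psi \circ \varphi(v)\|.$$
Moreover, the previous lemma gives a formula for computing the angle $\alpha$ between the vector $v$ and the subspace $U_2$:
$$\cos\alpha = \frac{\|\varphi(v)\|}{\|v\|} = \|\varphi(v)\|.$$
Since $\lambda_1$ is the largest eigenvalue, and the sum of the squares of the coordinates $a_i$ is one,
$$(\cos\alpha)^2 = \|\varphi(v)\|^2 \le \|\psi \circ \varphi(v)\| = \sqrt{\sum_{i=1}^{k} (\lambda_ia_i)^2} \le \sqrt{\lambda_1^2 \sum_{i=1}^{k} a_i^2} = \lambda_1.$$
If $v = u$, we have $\|\varphi(u)\|^2 = \lambda_1\|u\|^2 = \lambda_1$, and thus the angle has the minimal value for this vector.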
□
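The theorem translates directly into a small computation. The sketch below is restricted, as a simplifying assumption, to the case $\dim U_1 = 2$ with $U_1 \cap U_2 = \{0\}$, so that $A^TA$ is a 2x2 matrix whose largest eigenvalue follows from the trace and the determinant. The sample subspaces of $\mathbb{R}^4$ and the function names (`orthonormalize`, `subspace_angle`) are illustrative and not taken from the text.

```python
from math import acos, pi, sqrt

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def orthonormalize(vectors):
    """Gram-Schmidt: an orthonormal basis of the span of `vectors`."""
    basis = []
    for vec in vectors:
        r = [float(x) for x in vec]
        for e in basis:
            c = dot(r, e)
            r = [ri - c*ei for ri, ei in zip(r, e)]
        norm = sqrt(dot(r, r))
        if norm > 1e-12:
            basis.append([ri / norm for ri in r])
    return basis

def subspace_angle(U1, U2):
    """Angle between span(U1) and span(U2), assuming dim span(U1) = 2 and trivial intersection."""
    e, f = orthonormalize(U1), orthonormalize(U2)
    if len(e) != 2:
        raise ValueError("this sketch handles only dim U1 = 2")
    # Entry (a, b) of A^T A, where A[i][j] = f_i . e_j is the matrix of the projection U1 -> U2.
    m = [[sum(dot(fi, e[a]) * dot(fi, e[b]) for fi in f) for b in range(2)]
         for a in range(2)]
    trace = m[0][0] + m[1][1]
    det = m[0][0]*m[1][1] - m[0][1]*m[1][0]
    lam = (trace + sqrt(max(0.0, trace*trace - 4*det))) / 2    # largest eigenvalue
    return acos(min(1.0, sqrt(lam)))

U1 = [(1, 0, 0, 0), (0, 1, 0, 0)]
U2 = [(1, 0, 1, 0), (0, 1, 0, 2)]
alpha = subspace_angle(U1, U2)
print(alpha, pi / 4)
```

For this choice $A^TA$ is diagonal with entries $1/2$ and $1/5$, so the largest eigenvalue is $1/2$ and the angle is $\pi/4$ up to rounding.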
4.1.22. Calculating volume. An indication of how to calculate volumes in plane geometry is given at the end of the fifth part of the first chapter (see 1.5.11). There the notion of orientation played a fundamental role. We can imagine orientation as the decision whether to look at the plane $\mathbb{R}^2$ from above or from below. The distinction lies in the order of selecting the standard basis vectors $e_1$ and $e_2$ on the unit circle. We proceed in the same way in general:
Orientation of a vector space
Two bases $u$ and $v$ of a real vector space $V$ are said to determine the same orientation if the transformation matrix between them has a positive determinant. By the orientation of a vector space $V$ is meant an equivalence class of bases with respect to the equivalence defined above by the sign of the determinant. Bases belonging to the chosen class are called compatible with the chosen orientation.
It follows that there exist exactly two orientations on every vector space. From each compatible basis we obtain a non-compatible one by a transformation matrix with a negative determinant.
A vector space with a chosen orientation is called an oriented vector space.
The oriented Euclidean (point) space is a Euclidean point space whose difference space is oriented. In the sequel we consider the standard Euclidean space $\mathcal{E}_n$ together with the orientation given by the standard basis of $\mathbb{R}^n$.
Let $u_1, \dots, u_k$ be arbitrary vectors in the difference space $\mathbb{R}^n$, and $A \in \mathcal{E}_n$ a point. As an example of a convex set, the parallelepiped $\mathcal{P}_k(A; u_1, \dots, u_k) \subset \mathcal{E}_n$ is given by
$$\mathcal{P}_k(A; u_1, \dots, u_k) = \{A + c_1u_1 + \dots + c_ku_k;\ 0 \le c_i \le 1\}.$$
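4.B.19. In the Euclidean space $\mathbb{R}^5$ determine the distance between the planes
$$\sigma_1 : [0,1,2,0,0] + p_1(2,1,0,0,1) + q_1(-2,0,1,1,0),$$
$$\sigma_2 : [3,-1,7,7,3] + p_2(2,2,4,0,3) + q_2(2,0,0,-2,-1),$$
where $p_1, q_1, p_2, q_2 \in \mathbb{R}$.
Solution. The sum of the directions of $\sigma_1$ and $\sigma_2$ is generated by their direction vectors. Denote them by
$$u_1 = (2,1,0,0,1),\quad u_2 = (-2,0,1,1,0),\quad v_1 = (2,2,4,0,3),\quad v_2 = (2,0,0,-2,-1).$$
Find points $X_1 \in \sigma_1$, $X_2 \in \sigma_2$ whose distance equals the distance between $\sigma_1$ and $\sigma_2$. This requires
$$X_1 - X_2 = [0,1,2,0,0] - [3,-1,7,7,3] + p_1u_1 + q_1u_2 - p_2v_1 - q_2v_2$$
and
$$\langle X_1 - X_2, u_1\rangle = 0,\quad \langle X_1 - X_2, u_2\rangle = 0,\quad \langle X_1 - X_2, v_1\rangle = 0,\quad \langle X_1 - X_2, v_2\rangle = 0.$$
Hence
$$\begin{aligned}
\langle(-3,2,-5,-7,-3),u_1\rangle + p_1\langle u_1,u_1\rangle + q_1\langle u_2,u_1\rangle - p_2\langle v_1,u_1\rangle - q_2\langle v_2,u_1\rangle &= 0,\\
\langle(-3,2,-5,-7,-3),u_2\rangle + p_1\langle u_1,u_2\rangle + q_1\langle u_2,u_2\rangle - p_2\langle v_1,u_2\rangle - q_2\langle v_2,u_2\rangle &= 0,\\
\langle(-3,2,-5,-7,-3),v_1\rangle + p_1\langle u_1,v_1\rangle + q_1\langle u_2,v_1\rangle - p_2\langle v_1,v_1\rangle - q_2\langle v_2,v_1\rangle &= 0,\\
\langle(-3,2,-5,-7,-3),v_2\rangle + p_1\langle u_1,v_2\rangle + q_1\langle u_2,v_2\rangle - p_2\langle v_1,v_2\rangle - q_2\langle v_2,v_2\rangle &= 0.
\end{aligned}$$
By computing the dot products, we obtain the system of linear equations
$$\begin{aligned}
6p_1 - 4q_1 - 9p_2 - 3q_2 &= 7,\\
-4p_1 + 6q_1 + 6q_2 &= 6,\\
9p_1 - 33p_2 - q_2 &= 31,\\
3p_1 - 6q_1 - p_2 - 9q_2 &= -11.
\end{aligned}$$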
Solve it by forming the augmented matrix and performing elementary row operations:
$$\left(\begin{array}{rrrr|r}
6 & -4 & -9 & -3 & 7\\
-4 & 6 & 0 & 6 & 6\\
9 & 0 & -33 & -1 & 31\\
3 & -6 & -1 & -9 & -11
\end{array}\right)$$
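If the vectors $u_1,\dots,u_k$ are linearly independent, we have a $k$-dimensional parallelepiped $\mathcal{P}_k(A; u_1,\dots,u_k) \subset \mathcal{E}_n$. For given vectors $u_1,\dots,u_k$ there are also the parallelepipeds of lower dimension
$$\mathcal{P}_1(A;u_1),\ \dots,\ \mathcal{P}_k(A;u_1,\dots,u_k)$$
in the Euclidean subspaces $A + \langle u_1\rangle, \dots, A + \langle u_1,\dots,u_k\rangle$ at our disposal.
If $u_1,\dots,u_k$ are linearly dependent, the volume is given by
$$\operatorname{Vol}\mathcal{P}_k = 0.$$
Otherwise consider, as in the Gram-Schmidt orthogonalization, the decomposition
$$\langle u_1,\dots,u_k\rangle = \langle u_1,\dots,u_{k-1}\rangle \oplus \bigl(\langle u_1,\dots,u_{k-1}\rangle^{\perp} \cap \langle u_1,\dots,u_k\rangle\bigr).$$
In this decomposition, $u_k$ is uniquely expressed as
$$u_k = u_k' + e_k,$$
where $u_k' \in \langle u_1,\dots,u_{k-1}\rangle$ and $e_k \perp \langle u_1,\dots,u_{k-1}\rangle$.
The absolute value of the volume of a parallelepiped is defined inductively as the product of the volume of the "base" and the "altitude":
$$|\operatorname{Vol}|\,\mathcal{P}_1(A;u_1) = \|u_1\|,\qquad |\operatorname{Vol}|\,\mathcal{P}_k(A;u_1,\dots,u_k) = \|e_k\|\,|\operatorname{Vol}|\,\mathcal{P}_{k-1}(A;u_1,\dots,u_{k-1}).$$
If $u_1,\dots,u_n$ is a basis compatible with the orientation of V, the (oriented) volume of the parallelepiped is defined by
$$\operatorname{Vol}\mathcal{P}_n(A;u_1,\dots,u_n) = |\operatorname{Vol}|\,\mathcal{P}_n(A;u_1,\dots,u_n).$$
In the case of a non-compatible basis we set
$$\operatorname{Vol}\mathcal{P}_n(A;u_1,\dots,u_n) = -|\operatorname{Vol}|\,\mathcal{P}_n(A;u_1,\dots,u_n).$$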
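Theorem. Let $Q \subset \mathcal{E}_n$ be a Euclidean subspace, and let $(e_1,\dots,e_k)$ be an orthonormal basis of its difference space $Z(Q)$. For arbitrary vectors $u_1,\dots,u_k \in Z(Q)$ and $A \in Q$ the following holds:
$$(1)\qquad \operatorname{Vol}\mathcal{P}_k(A;u_1,\dots,u_k) = \begin{vmatrix} u_1\cdot e_1 & \dots & u_k\cdot e_1\\ \vdots & & \vdots\\ u_1\cdot e_k & \dots & u_k\cdot e_k \end{vmatrix},$$
$$(2)\qquad \bigl(\operatorname{Vol}\mathcal{P}_k(A;u_1,\dots,u_k)\bigr)^2 = \begin{vmatrix} u_1\cdot u_1 & \dots & u_k\cdot u_1\\ \vdots & & \vdots\\ u_1\cdot u_k & \dots & u_k\cdot u_k \end{vmatrix}.$$
Proof. The matrix
$$A = \begin{pmatrix} u_1\cdot e_1 & \dots & u_k\cdot e_1\\ \vdots & & \vdots\\ u_1\cdot e_k & \dots & u_k\cdot e_k \end{pmatrix}$$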
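has the coordinates of the vectors $u_1,\dots,u_k$ in the chosen basis in its columns, and
Elementary row operations reduce the augmented matrix of the system for $p_1, q_1, p_2, q_2$ to
$$\left(\begin{array}{rrrr|r}
1 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & -1\\
0 & 0 & 1 & 0 & -1\\
0 & 0 & 0 & 1 & 2
\end{array}\right).$$
The solution is $(p_1, q_1, p_2, q_2) = (0,-1,-1,2)$. Consequently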
$$X_1 - X_2 = (-3,2,-5,-7,-3) - u_2 + v_1 - 2v_2 = (-3,4,-2,-4,2).$$
The length of the vector $(-3,4,-2,-4,2)$ equals the distance between the planes $\sigma_1$, $\sigma_2$, which is therefore
$$\sqrt{(-3)^2 + 4^2 + (-2)^2 + (-4)^2 + 2^2} = 7.$$
We solved this problem by a method different from that of the previous problem. Both methods can be used in both cases. Try the former method for the planes $\sigma_1$, $\sigma_2$. Find the orthogonal complement of the vector subspace generated by
$$(2,1,0,0,1),\ (-2,0,1,1,0),\ (2,2,4,0,3),\ (2,0,0,-2,-1).$$
We get
$$\begin{pmatrix} 2 & 1 & 0 & 0 & 1\\ -2 & 0 & 1 & 1 & 0\\ 2 & 2 & 4 & 0 & 3\\ 2 & 0 & 0 & -2 & -1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 & 0 & 3/2\\ 0 & 1 & 0 & 0 & -2\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 2 \end{pmatrix}.$$
The orthogonal complement is $\langle(-3/2,2,-1,-2,1)\rangle$, or rather $\langle(3,-4,2,4,-2)\rangle$. Note that the distance between $\sigma_1$ and $\sigma_2$ equals the length of the orthogonal projection of the vector (the difference of an arbitrary point in $\sigma_2$ and an arbitrary point in $\sigma_1$)
$$u = (3,-2,5,7,3) = [3,-1,7,7,3] - [0,1,2,0,0]$$
onto this orthogonal complement. Denote the orthogonal projection of $u$ by $p_u$ and choose $v = (3,-4,2,4,-2)$. Clearly $p_u = a\cdot v$ for some $a \in \mathbb{R}$ and
$$\langle u - p_u, v\rangle = 0,\quad\text{i.e.}\quad \langle u,v\rangle - a\langle v,v\rangle = 0.$$
Computing gives $49 - a\cdot 49 = 0$. Therefore $p_u = 1\cdot v = v$ and the distance between the planes $\sigma_1$ and $\sigma_2$ equals
$$\|p_u\| = \sqrt{3^2 + (-4)^2 + 2^2 + 4^2 + (-2)^2} = 7.$$
The method of computing the distance using the orthogonal complement of the sum of the direction spaces proves to be a "faster way to the solution". The same holds for the planes $\varrho_1$ and $\varrho_2$. The second method, however, reveals the points at which the distance is attained (a pair of points in which the planes are the
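$$|A|^2 = |A|\,|A| = |A^T|\,|A| = |A^TA| = \begin{vmatrix} u_1\cdot u_1 & \dots & u_k\cdot u_1\\ \vdots & & \vdots\\ u_1\cdot u_k & \dots & u_k\cdot u_k \end{vmatrix}.$$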
Hence if (1) holds, then also (2) holds.
Directly from the definition, the unoriented volume equals the product
$$|\operatorname{Vol}|\,\mathcal{P}_k(A;u_1,\dots,u_k) = \|v_1\|\,\|v_2\|\cdots\|v_k\|,$$
where $v_1 = u_1$, $v_2 = u_2 + a^2_1v_1$, ..., $v_k = u_k + a^k_1v_1 + \dots + a^k_{k-1}v_{k-1}$ is the result of the Gram-Schmidt orthogonalization. Thus
$$\bigl(\operatorname{Vol}\mathcal{P}_k(A;u_1,\dots,u_k)\bigr)^2 = \begin{vmatrix} v_1\cdot v_1 & \dots & 0\\ \vdots & \ddots & \vdots\\ 0 & \dots & v_k\cdot v_k \end{vmatrix} = \begin{vmatrix} v_1\cdot v_1 & \dots & v_1\cdot v_k\\ \vdots & & \vdots\\ v_k\cdot v_1 & \dots & v_k\cdot v_k \end{vmatrix}.$$
Denote by $B$ the matrix whose columns are formed by the coordinates of the vectors $v_1,\dots,v_k$ in the orthonormal basis $e$. Since $v_1,\dots,v_k$ arise from $u_1,\dots,u_k$ by a linear transformation with an upper-triangular matrix $C$ with ones on the diagonal, $B = AC$ and $|B| = |A|\,|C| = |A|$. The determinant on the right-hand side above is $|B^TB| = |B|^2$, hence $(\operatorname{Vol}\mathcal{P}_k)^2 = |B|^2 = |A|^2$, and thus $\operatorname{Vol}\mathcal{P}_k(A;u_1,\dots,u_k) = \pm|A|$. The resulting volume is zero if the vectors $u_1,\dots,u_k$ are linearly dependent. Provided they are independent, the sign of the determinant is positive if and only if the basis $u_1,\dots,u_k$ defines the same orientation as the basis $e$. □
Consider a parallelepiped in a $k$-dimensional space which is spanned by $k$ vectors. Write their coordinates (in an orthonormal basis) into the columns of a matrix. Then the oriented volume of the parallelepiped is the determinant of this matrix.
The determinant in formula (2) above is called the Gram determinant. It is independent of the choice of basis, and therefore it is useful when $k$ is lower than the dimension of the whole space.
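The Gram determinant can be evaluated directly from scalar products, without choosing a basis of the subspace spanned by the vectors. The following sketch is only an illustration; the helper names `det`, `dot`, `gram_det` and `unoriented_volume` are assumptions of this example, and the first pair of vectors is borrowed from problem 4.B.19.

```python
import math


def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )


def dot(x, y):
    """Standard scalar product of two coordinate vectors."""
    return sum(a * b for a, b in zip(x, y))


def gram_det(vectors):
    """Determinant of the matrix of all scalar products u_i . u_j."""
    return det([[dot(u, v) for v in vectors] for u in vectors])


def unoriented_volume(vectors):
    """Unoriented volume of the parallelepiped spanned by the given vectors."""
    return math.sqrt(gram_det(vectors))


# Two vectors in R^5 span a parallelogram; its area needs no 5 x 5 determinant.
u1, u2 = [2, 1, 0, 0, 1], [-2, 0, 1, 1, 0]
print(gram_det([u1, u2]), unoriented_volume([u1, u2]))

# For n vectors in R^n the Gram determinant is the square of the ordinary determinant.
m = [[1, 0, 0], [1, 2, 0], [1, 1, 3]]
print(gram_det(m), det(m) ** 2)
```

For linearly dependent vectors the Gram determinant vanishes, in agreement with the definition of the volume.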
We formulate the following important geometric consequence:
4.1.23. Corollary. For each linear map $\varphi : V \to V$ on a Euclidean space V, $\det\varphi$ equals the (oriented) volume of the image of the parallelepiped determined by the vectors of an orthonormal basis. More generally, the image of the parallelepiped $\mathcal{P}$ determined by arbitrary $\dim V$ vectors has a volume equal to the $\det\varphi$-multiple of the former volume.
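closest). Find such points in the case of the planes $\varrho_1$, $\varrho_2$. Denote
$$u_1 = (1,0,-1,0,0),\quad u_2 = (0,1,0,0,-1),\quad v_1 = (1,1,1,0,1),\quad v_2 = (0,-2,0,0,3).$$
The points $X_1 \in \varrho_1$, $X_2 \in \varrho_2$ which are the "closest" (as commented above) are of the form
$$X_1 = [7,2,7,-1,1] + t_1u_1 + s_1u_2,$$
$$X_2 = [2,4,7,-4,2] + t_2v_1 + s_2v_2,$$
so
$$\begin{aligned}
X_1 - X_2 &= [7,2,7,-1,1] - [2,4,7,-4,2] + t_1u_1 + s_1u_2 - t_2v_1 - s_2v_2\\
&= (5,-2,0,3,-1) + t_1u_1 + s_1u_2 - t_2v_1 - s_2v_2.
\end{aligned}$$
The conditions
$$\langle X_1 - X_2, u_1\rangle = 0,\quad \langle X_1 - X_2, u_2\rangle = 0,\quad \langle X_1 - X_2, v_1\rangle = 0,\quad \langle X_1 - X_2, v_2\rangle = 0$$
then lead to the system of linear equations
$$\begin{aligned}
2t_1 &= -5,\\
2s_1 + 5s_2 &= 1,\\
-4t_2 - s_2 &= -2,\\
-5s_1 - t_2 - 13s_2 &= -1
\end{aligned}$$
with the unique solution $t_1 = -5/2$, $s_1 = 41/2$, $t_2 = 5/2$, $s_2 = -8$. We obtain
$$X_1 = [7,2,7,-1,1] - \tfrac{5}{2}u_1 + \tfrac{41}{2}u_2 = \left[\tfrac{9}{2}, \tfrac{45}{2}, \tfrac{19}{2}, -1, -\tfrac{39}{2}\right],$$
$$X_2 = [2,4,7,-4,2] + \tfrac{5}{2}v_1 - 8v_2 = \left[\tfrac{9}{2}, \tfrac{45}{2}, \tfrac{19}{2}, -4, -\tfrac{39}{2}\right].$$
The distance between the points $X_1$, $X_2$ equals the distance between the planes $\varrho_1$, $\varrho_2$; both are given by
$$\|X_1 - X_2\| = \|(0,0,0,3,0)\| = 3.$$ □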
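Both distances computed above can be checked with exact arithmetic. The sketch below solves the same normal equations, assuming that all direction vectors together are linearly independent (as they are in both examples, otherwise the elimination step fails); the helper names `dot`, `solve` and `distance_squared` are illustrative.

```python
from fractions import Fraction


def dot(x, y):
    """Standard scalar product of two coordinate vectors."""
    return sum(a * b for a, b in zip(x, y))


def solve(a, b):
    """Gauss-Jordan elimination over exact fractions; assumes a unique solution exists."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(a, b)]
    for col in range(n):
        # Raises StopIteration if the matrix is singular.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def distance_squared(point1, dirs1, point2, dirs2):
    """Squared distance of two affine subspaces given by a point and direction vectors."""
    w = dirs1 + dirs2
    d = [Fraction(p - q) for p, q in zip(point1, point2)]
    gram = [[dot(x, y) for y in w] for x in w]
    coeffs = solve(gram, [dot(d, x) for x in w])
    # Subtract from d its orthogonal projection onto the sum of the directions.
    residual = [di - sum(c * x[k] for c, x in zip(coeffs, w)) for k, di in enumerate(d)]
    return dot(residual, residual)


sigma = distance_squared([0, 1, 2, 0, 0], [[2, 1, 0, 0, 1], [-2, 0, 1, 1, 0]],
                         [3, -1, 7, 7, 3], [[2, 2, 4, 0, 3], [2, 0, 0, -2, -1]])
rho = distance_squared([7, 2, 7, -1, 1], [[1, 0, -1, 0, 0], [0, 1, 0, 0, -1]],
                       [2, 4, 7, -4, 2], [[1, 1, 1, 0, 1], [0, -2, 0, 0, 3]])
print(sigma, rho)  # squared distances 49 and 9, i.e. distances 7 and 3
```

Working with squared distances keeps the whole computation in rational numbers, so the comparison with the hand computation is exact.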
4.B.20. Find the intersection of the plane $\varrho$ with the plane passing through the point $A = [1,2,3,4] \in \mathbb{R}^4$ and orthogonal to $\varrho$, where
$$\varrho : [1,0,1,0] + (1,2,-1,-2)s + (1,0,0,1)t,\quad s,t \in \mathbb{R}.$$
Solution. Find the plane orthogonal to $\varrho$. Its direction is orthogonal to the direction of $\varrho$, so for the vectors $(a,b,c,d)$ in its direction we get the system of linear equations
$$\begin{aligned}
(a,b,c,d)\cdot(1,2,-1,-2) = 0,&\quad\text{i.e.}\quad a + 2b - c - 2d = 0,\\
(a,b,c,d)\cdot(1,0,0,1) = 0,&\quad\text{i.e.}\quad a + d = 0.
\end{aligned}$$
4.1.24. Outer product and cross product of vectors. The previous considerations are closely related to the tensor product of vectors. We do not go further into this technically more complicated topic. But we do mention the outer product of $n = \dim V$ vectors $u_1,\dots,u_n \in V$.
Let $(u_{1j},\dots,u_{nj})^T$ be the coordinate expressions of the vectors $u_j$ in a chosen orthonormal basis of V compatible with the orientation. Let $M$ be the matrix with elements $(u_{ij})$. Then the determinant $|M|$ does not depend on the choice of such a basis. Its value is called the outer product of the vectors $u_1,\dots,u_n$, and is denoted by $[u_1,\dots,u_n]$. Hence the outer product is the oriented volume of the corresponding parallelepiped, see 4.1.22.
Several useful properties of the outer product follow directly from the definition:
(1) The map $(u_1,\dots,u_n) \mapsto [u_1,\dots,u_n]$ is an antisymmetric $n$-linear map. It is linear in each argument, and the interchange of any two arguments causes a change of sign.
(2) The outer product is zero if and only if the vectors $u_1,\dots,u_n$ are linearly dependent.
(3) The vectors $u_1,\dots,u_n$ form a positive basis if and only if the outer product is positive.
Consider a Euclidean vector space V of dimension $n \ge 2$ and vectors $u_1,\dots,u_{n-1} \in V$. If these $n-1$ vectors are substituted into the first $n-1$ arguments of the $n$-linear map defined by the volume determinant as above, then there is one argument left over. This defines a linear form on V. Since the scalar product is available, each linear form corresponds to exactly one vector. This vector $v \in V$ is called the cross product of the vectors $u_1,\dots,u_{n-1}$. For each vector $w \in V$,
$$\langle v,w\rangle = [u_1,\dots,u_{n-1},w].$$
We denote the cross product by $v = u_1 \times \dots \times u_{n-1}$.
If the coordinates of the vectors in an orthonormal basis are $v = (y_1,\dots,y_n)^T$, $w = (x_1,\dots,x_n)^T$ and $u_j = (u_{1j},\dots,u_{nj})^T$, then the definition can be expressed as
$$y_1x_1 + \dots + y_nx_n = \begin{vmatrix} u_{11} & \dots & u_{1(n-1)} & x_1\\ \vdots & & \vdots & \vdots\\ u_{n1} & \dots & u_{n(n-1)} & x_n \end{vmatrix}.$$
Hence the vector $v$ is determined uniquely. Its coordinates are calculated by the formal expansion of this determinant along the last column. The following properties of the cross product are direct consequences of the definition:
Theorem. For the cross product $v = u_1 \times \dots \times u_{n-1}$:
(1) $v \in \langle u_1,\dots,u_{n-1}\rangle^{\perp}$,
(2) $v$ is nonzero if and only if the vectors $u_1,\dots,u_{n-1}$ are linearly independent,
(3) the length $\|v\|$ of the cross product equals the absolute value of the volume of the parallelepiped $\mathcal{P}(0;u_1,\dots,u_{n-1})$,
(4) if $v$ is nonzero, then $(u_1,\dots,u_{n-1},v)$ is a compatible basis of the oriented Euclidean space V.
The solution is the two-dimensional vector space $\langle(0,1,2,0), (-1,0,-3,1)\rangle$. The plane $\tau$ orthogonal to $\varrho$ and passing through $A$ has the parametric equation
$$\tau : [1,2,3,4] + (0,1,2,0)u + (-1,0,-3,1)v,\quad u,v \in \mathbb{R}.$$
We can obtain the intersection of the planes from both parametric equations. It is given by the system of linear equations
$$\begin{aligned}
1 + s + t &= 1 - v,\\
2s &= 2 + u,\\
1 - s &= 3 + 2u - 3v,\\
-2s + t &= 4 + v,
\end{aligned}$$
which has the unique solution (it must be so, since the columns of the matrix of the system are linearly independent) $s = -8/19$, $t = 34/19$, $u = -54/19$, $v = -26/19$. Substituting the parameter values $s$ and $t$ into the parametric form of the plane $\varrho$, we obtain the intersection $[45/19, -16/19, 27/19, 50/19]$. (Needless to say, the same point is obtained by substituting the values $u$ and $v$ into $\tau$.) □
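4.B.21. Find a line passing through the point $[1,2] \in \mathbb{R}^2$ such that the angle between this line and the line
$$p : [0,1] + t(1,1)$$
is 30°.
Solution. The angle between two lines is the angle between their direction vectors. It is sufficient to find the direction vector $v$ of the line. One way to do so is to rotate the direction vector of $p$ by 30°. The rotation matrix for the angle 30° is
$$\begin{pmatrix} \cos 30^\circ & -\sin 30^\circ\\ \sin 30^\circ & \cos 30^\circ \end{pmatrix} = \begin{pmatrix} \frac{\sqrt3}{2} & -\frac12\\ \frac12 & \frac{\sqrt3}{2} \end{pmatrix}.$$
The desired vector $v$ is therefore
$$v = \begin{pmatrix} \frac{\sqrt3}{2} & -\frac12\\ \frac12 & \frac{\sqrt3}{2} \end{pmatrix}\begin{pmatrix} 1\\ 1 \end{pmatrix} = \begin{pmatrix} \frac{\sqrt3}{2} - \frac12\\ \frac12 + \frac{\sqrt3}{2} \end{pmatrix}.$$
We could perform the backward rotation as well. The line (one of two possible) has the parametric equation
$$[1,2] + t\left(\tfrac{\sqrt3}{2} - \tfrac12,\ \tfrac12 + \tfrac{\sqrt3}{2}\right),\quad t \in \mathbb{R}.$$
□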
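The rotation used in 4.B.21 is easy to check numerically. This is a small sketch in floating point arithmetic; the function names `rotate` and `angle_deg` are illustrative only.

```python
import math


def rotate(vector, degrees):
    """Rotate a plane vector counterclockwise by the given angle in degrees."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    x, y = vector
    return (c * x - s * y, s * x + c * y)


def angle_deg(u, v):
    """Angle between two nonzero plane vectors in degrees."""
    cosine = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(cosine))


direction_p = (1, 1)
v = rotate(direction_p, 30)
print(v, angle_deg(direction_p, v))
```

Rotating by -30° instead gives the direction of the second line satisfying the condition.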
4.B.22. A regular octahedron has eight faces, all of them equilateral triangles. Determine $\cos\alpha$, where $\alpha$ is the angle between two adjacent faces of a regular octahedron. Solution. A regular octahedron is symmetric, therefore it does not matter which two adjacent faces are selected. By suitable
Proof. The first claim follows directly from the defining formula for $v$. Substituting an arbitrary vector $u_j$ for $w$ gives the scalar product $v\cdot u_j$ on the left and the determinant with two equal columns on the right. The rank of the matrix with the $n-1$ columns $u_j$ is given by the maximal size of a non-zero minor. The minors which define the coordinates of the cross product are of degree $n-1$, and thus claim (2) is proved.
If the vectors $u_1,\dots,u_{n-1}$ are linearly dependent, then (3) also holds. Suppose the vectors are linearly independent. Let $v$ be their cross product, and choose an orthonormal basis $(e_1,\dots,e_{n-1})$ of the space $\langle u_1,\dots,u_{n-1}\rangle$. It follows from what is already proved that there exists a multiple $(1/\alpha)v$, $0 \ne \alpha \in \mathbb{R}$, such that $(e_1,\dots,e_{n-1},(1/\alpha)v)$ is an orthonormal basis of V. The coordinates of the vectors in this basis are
$$u_j = (u_{1j},\dots,u_{(n-1)j},0)^T,\qquad v = (0,\dots,0,\alpha)^T.$$
So the outer product $[u_1,\dots,u_{n-1},v]$ equals (see the definition of the cross product)
$$[u_1,\dots,u_{n-1},v] = \begin{vmatrix} u_{11} & \dots & u_{1(n-1)} & 0\\ \vdots & & \vdots & \vdots\\ u_{(n-1)1} & \dots & u_{(n-1)(n-1)} & 0\\ 0 & \dots & 0 & \alpha \end{vmatrix} = \langle v,v\rangle = \alpha^2.$$
By expanding the determinant along the last column,
$$\alpha^2 = \alpha \operatorname{Vol}\mathcal{P}(0;u_1,\dots,u_{n-1}).$$
Both remaining claims follow from this equality. □
In technical applications in R3, the cross product is often used. It assigns a vector to any pair of vectors.
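The defining formula of the cross product translates directly into code: the coordinates of $v$ are the cofactors of the last column of the determinant. The sketch below assumes coordinates in a positively oriented orthonormal basis; the names `det`, `dot` and `cross` are illustrative.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )


def dot(x, y):
    """Standard scalar product of two coordinate vectors."""
    return sum(a * b for a, b in zip(x, y))


def cross(vectors):
    """Cross product of n - 1 vectors in R^n via cofactors of the last column."""
    n = len(vectors) + 1
    coords = []
    for i in range(n):
        # Delete the i-th row of the matrix whose columns are the given vectors.
        minor = [[u[r] for u in vectors] for r in range(n) if r != i]
        coords.append((-1) ** (i + n - 1) * det(minor))
    return coords


print(cross([[1, 2, 3], [4, 5, 6]]))  # the usual cross product in R^3

us = [[1, 2, 0, 1], [0, 1, 3, 1], [2, 0, 1, 1]]
v = cross(us)
print([dot(v, u) for u in us])        # v is orthogonal to all the u_j
print(det(us + [v]) == dot(v, v))     # [u_1, u_2, u_3, v] = <v, v>
```

The last line illustrates the identity used in the proof above: the outer product of $u_1,\dots,u_{n-1},v$ equals $\langle v,v\rangle$, which is positive for independent vectors.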
4.1.25. Affine and Euclidean properties. Now we can consider which properties are related to the affine structure of the space and which properties we really need in the difference space.
All Euclidean transformations (bijective affine maps which preserve the distance between points) also preserve all the objects we have studied. Moreover, they preserve unoriented angles, unoriented volumes, angles between subspaces, etc. If we want them to preserve also oriented angles, cross products and oriented volumes, then we must also assume that the transformations preserve the orientation.
We ask: Which concepts of Euclidean geometry are preserved under affine transformations?
Recall first that an affine transformation on an $n$-dimensional space $\mathcal{A}$ is uniquely defined by the images of $n+1$ points in general position, that is, by the image of one $n$-dimensional simplex. In the plane, this means choosing the image of any nondegenerate triangle. Preserved properties are those related to subspaces. In particular, incidence properties of the type "a line passing through a point" or "a plane contains a line" etc. are preserved.
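scaling, the octahedron has edge length 1 and is placed in the standard Cartesian coordinate system of $\mathbb{R}^3$ so that its centroid is at $[0,0,0]$. Its vertices are then located at the points $A = [\frac{\sqrt2}{2},0,0]$, $B = [0,\frac{\sqrt2}{2},0]$, $C = [-\frac{\sqrt2}{2},0,0]$, $D = [0,-\frac{\sqrt2}{2},0]$, $E = [0,0,-\frac{\sqrt2}{2}]$ and $F = [0,0,\frac{\sqrt2}{2}]$.
We compute the angle between the faces CDF and BCF. We need to find vectors orthogonal to their intersection and lying within the respective faces, which means orthogonal to CF. They are the altitudes from D and B to the edge CF in the triangles CDF and BCF, respectively. The altitudes in an equilateral triangle coincide with the medians, so they are the segments SD and SB, where S is the midpoint of CF. Because the coordinates of the points C and F are known, S has the coordinates $[-\frac{\sqrt2}{4},0,\frac{\sqrt2}{4}]$ and the vectors are $SD = (\frac{\sqrt2}{4},-\frac{\sqrt2}{2},-\frac{\sqrt2}{4})$ and $SB = (\frac{\sqrt2}{4},\frac{\sqrt2}{2},-\frac{\sqrt2}{4})$. Together,
$$\cos\alpha = \frac{\left(\frac{\sqrt2}{4},-\frac{\sqrt2}{2},-\frac{\sqrt2}{4}\right)\cdot\left(\frac{\sqrt2}{4},\frac{\sqrt2}{2},-\frac{\sqrt2}{4}\right)}{\left\|\left(\frac{\sqrt2}{4},-\frac{\sqrt2}{2},-\frac{\sqrt2}{4}\right)\right\|\,\left\|\left(\frac{\sqrt2}{4},\frac{\sqrt2}{2},-\frac{\sqrt2}{4}\right)\right\|} = -\frac13.$$
Therefore $\alpha \doteq 109.47^\circ$.
□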
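The angle between the two faces can be checked numerically from the coordinates of the vertices. This is a floating point sketch for the octahedron with edge length 1 centred at the origin; the variable names are illustrative.

```python
import math

r = math.sqrt(2) / 2
B, C, D, F = (0, r, 0), (-r, 0, 0), (0, -r, 0), (0, 0, r)

# Midpoint S of the common edge CF and the two altitudes SD, SB.
S = tuple((c + f) / 2 for c, f in zip(C, F))
SD = tuple(d - s for d, s in zip(D, S))
SB = tuple(b - s for b, s in zip(B, S))

dot = sum(x * y for x, y in zip(SD, SB))
norm_sd = math.sqrt(sum(x * x for x in SD))
norm_sb = math.sqrt(sum(x * x for x in SB))

cos_alpha = dot / (norm_sd * norm_sb)
alpha_deg = math.degrees(math.acos(cos_alpha))
print(cos_alpha, alpha_deg)  # cosine -1/3, angle about 109.47 degrees
```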
4.B.23. In the Euclidean space $\mathbb{R}^5$ determine the angle $\varphi$ between the subspaces U, V, where
(a) $U : [3,5,1,7,2] + t(1,0,2,-2,1)$, $t \in \mathbb{R}$,
$V : [0,1,0,0,0] + s(2,0,-2,1,-1)$, $s \in \mathbb{R}$;
(b) $U : [4,1,1,0,1] + t(2,0,0,2,1)$, $t \in \mathbb{R}$,
$V : x_1 + x_2 + x_3 + x_5 = 7$;
(c) $U : 2x_1 - x_2 + 2x_3 + x_5 = 3$,
$V : x_1 + 2x_2 + 2x_3 + x_5 = -1$;
(d) $U : [0,1,1,0,0] + t(0,0,0,1,-1)$, $t \in \mathbb{R}$,
$V : [1,0,1,1,1] + r(1,-1,2,1,0) + s(0,1,3,2,0) + p(1,0,0,1,0) + q(1,3,1,0,0)$, $r,s,p,q \in \mathbb{R}$;
(e) $U : [0,2,5,0,0] + t(2,1,3,5,3) + s(0,3,1,4,-2) + r(1,2,4,0,3)$, $t,s,r \in \mathbb{R}$,
$V : [0,0,0,0,0] + p(-1,1,1,-5,0) + q(1,5,1,13,-4)$, $p,q \in \mathbb{R}$;
(f) $U : [1,1,1,1,1] + t(1,0,1,1,1) + s(1,0,0,1,1)$, $t,s \in \mathbb{R}$,
$V : [1,1,1,1,1] + p(1,1,1,1,1) + q(1,1,0,1,1) + r(1,1,0,1,0)$, $p,q,r \in \mathbb{R}$.
Solution. Recall that the angle between affine subspaces is the same as the angle between the vector spaces associated to
Moreover, the collinearity of vectors is preserved. For every two collinear vectors, the ratio of their lengths is preserved independently of the scalar product defining the length. Similarly, the ratio of the volumes of two n-dimensional parallelepipeds is preserved under the transformations, since the determinant of the corresponding matrix changes by the same multiple.
These affine properties can be used in the plane to prove geometric statements. For instance, to prove the fact that the medians of a triangle intersect in a single point, and in one third of their lengths, it is sufficient to verify this only in the case of an isosceles right-angled triangle or only in the case of an equilateral triangle. Then this property holds for all triangles. Think about this argument!
2. Geometry of quadratic forms
After straight lines, the simplest objects in the analytic geometry of the plane are the conic sections. These are given by quadratic equations in Cartesian coordinates. A conic is distinguished as a circle, ellipse, parabola or hyperbola by examining the coefficients. There are two degenerate cases, namely a pair of lines or a point. We cannot distinguish a circle from an ellipse in affine geometry, therefore we begin with Euclidean geometry.
4.2.1. Quadrics in $\mathcal{E}_n$. In analogy with the equations of conic sections in the plane, we start with objects in Euclidean point spaces. These are defined in a given orthonormal basis by quadratic equations, and are known as quadrics.
Choose a fixed Cartesian coordinate system in $\mathcal{E}_n$. This is a point and an orthonormal basis of the difference space. Consider a general quadratic equation for the coordinates $(x_1,\dots,x_n)^T$ of a point $A \in \mathcal{E}_n$:
$$\sum_{i,j=1}^n a_{ij}x_ix_j + \sum_{i=1}^n 2a_ix_i + a = 0, \tag{1}$$
where it may be assumed without loss of generality that $a_{ij} = a_{ji}$. This equation can be written as
$$f(u) + g(u) + a = 0$$
for a quadratic form $f$ (i.e. the restriction of a symmetric bilinear form $F$ to pairs of equal arguments), a linear form $g$, and a scalar $a \in \mathbb{R}$. We assume that at least one coefficient $a_{ij}$ is nonzero. Otherwise the equation is linear and describes a Euclidean subspace.
Notice that every Euclidean (or affine) coordinate transformation transforms the equation (1) into the same form with a quadratic, linear and constant part.
4.2.2. Quadratic forms. Begin the discussion of equation (1) with its quadratic part, i.e. with the symmetric bilinear form $F : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$. Similarly, think of a general symmetric bilinear form on an arbitrary vector space.
them. Therefore the translation caused by the point addition can be omitted.
Case (a). Since U and V are one-dimensional spaces, the angle $\varphi \in [0,\pi/2]$ is given by the formula
$$\cos\varphi = \frac{|(1,0,2,-2,1)\cdot(2,0,-2,1,-1)|}{\|(1,0,2,-2,1)\|\cdot\|(2,0,-2,1,-1)\|} = \frac{5}{\sqrt{10}\cdot\sqrt{10}}.$$
Therefore $\cos\varphi = 1/2$ and $\varphi = \pi/3$.
Case (b). The subspace U has the direction vector $(2,0,0,2,1)$ and the hyperplane V has the normal vector $(1,1,1,0,1)$. The angle $\psi = \pi/3$ between them is derived from the formula
$$\cos\psi = \frac{(2,0,0,2,1)\cdot(1,1,1,0,1)}{\|(2,0,0,2,1)\|\cdot\|(1,1,1,0,1)\|} = \frac{3}{3\cdot 2} = \frac12.$$
Notice that $\varphi = \pi/2 - \psi = \pi/6$, because $\varphi$ is the complement of $\psi$.
Case (c). The hyperplanes U and V are defined by the normal vectors $u = (2,-1,2,0,1)$ and $v = (1,2,2,0,1)$. The angle $\varphi$ equals the angle between these normal vectors. Therefore (see (a))
$$\cos\varphi = \frac{(2,-1,2,0,1)\cdot(1,2,2,0,1)}{\|(2,-1,2,0,1)\|\cdot\|(1,2,2,0,1)\|} = \frac12,\quad\text{i.e.}\quad \varphi = \frac{\pi}{3}.$$
Case (d). Denote
$$u = (0,0,0,1,-1),\quad v_1 = (1,-1,2,1,0),\quad v_2 = (0,1,3,2,0),\quad v_3 = (1,0,0,1,0),\quad v_4 = (1,3,1,0,0)$$
and denote by $p_u$ the orthogonal projection of $u$ into the vector subspace associated with V (the subspace generated by $v_1, v_2, v_3, v_4$). Now
$$p_u = av_1 + bv_2 + cv_3 + dv_4\quad\text{for some } a,b,c,d \in \mathbb{R}$$
and
$$\langle p_u - u, v_1\rangle = 0,\quad \langle p_u - u, v_2\rangle = 0,\quad \langle p_u - u, v_3\rangle = 0,\quad \langle p_u - u, v_4\rangle = 0.$$
Substituting for $p_u$ gives the system of linear equations
$$\begin{aligned}
7a + 7b + 2c &= 1,\\
7a + 14b + 2c + 6d &= 2,\\
2a + 2b + 2c + d &= 1,\\
6b + c + 11d &= 0.
\end{aligned}$$
The solution is $(a,b,c,d) = (-8/19, 7/19, 13/19, -5/19)$, and so
$$p_u = -\tfrac{8}{19}v_1 + \tfrac{7}{19}v_2 + \tfrac{13}{19}v_3 - \tfrac{5}{19}v_4 = (0,0,0,1,0).$$
The angle is given by the formula
$$\cos\varphi = \frac{\|p_u\|}{\|u\|}, \tag{1}$$
which yields
$$\cos\varphi = \frac{\|(0,0,0,1,0)\|}{\|(0,0,0,1,-1)\|} = \frac{1}{\sqrt2} = \frac{\sqrt2}{2}.$$
Hence $\varphi = \pi/4$.
Case (e). Determine the intersection of the vector subspaces associated with the given affine subspaces. The
For an arbitrary basis $e_1,\dots,e_n$ of this vector space, the value $f(x)$ on the vector $x = x_1e_1 + \dots + x_ne_n$ is given by the equation
$$f(x) = F(x,x) = \sum_{i,j} x_ix_jF(e_i,e_j) = x^T\cdot A\cdot x,$$
where $A = (a_{ij})$ is the symmetric matrix with elements $a_{ij} = F(e_i,e_j)$. We call such maps $f$ quadratic forms, and the formula above for the value of the form in terms of the chosen coordinates is called the analytic formula for the form.
In general, by a quadratic form is meant the restriction $f(x)$ of a symmetric bilinear form $F(x,y)$ to arguments of the type $(x,x)$. Evidently, the whole bilinear form $F$ can be reconstructed from the values $f(x)$, since
$$f(x+y) = F(x+y,x+y) = f(x) + f(y) + 2F(x,y).$$
If we change the basis to a different basis $e_1',\dots,e_n'$, we get different coordinates $x = S\cdot x'$ for the same vector (here $S$ is the corresponding transformation matrix), and so
$$f(x) = (S\cdot x')^T\cdot A\cdot(S\cdot x') = (x')^T\cdot(S^T\cdot A\cdot S)\cdot x'.$$
Assume now that the vector space is equipped with a scalar product. Then the previous computation can be formulated as follows. The matrix of the bilinear form $F$, which is the same as the matrix of $f$, transforms under a change of coordinates in such a way that for orthogonal changes it coincides with the transformation of the matrix of a linear map (indeed, then $S^{-1} = S^T$). This result can be interpreted as the following observation:
Proposition. Let V be a real vector space with a scalar product. Then the formula
$$\varphi \mapsto F,\qquad F(u,u) = \langle\varphi(u),u\rangle$$
defines a bijection between symmetric linear maps and quadratic forms on V.
Proof. Each bilinear form with a fixed second argument becomes a linear form $\alpha_u(\ ) = F(\ ,u)$. In the presence of a scalar product, it is given by the formula $\alpha_u(v) = v\cdot w$ for a suitable vector $w$. Put $\varphi(u) = w$. Directly from the coordinate expression displayed above, $\varphi$ is a linear map with matrix $A$. Hence it is selfadjoint.
On the other hand, each symmetric map $\varphi$ defines a symmetric bilinear form $F$ by the formula $F(u,v) = \langle\varphi(u),v\rangle = \langle u,\varphi(v)\rangle$, and thus also a quadratic form. □
It is immediate that for each quadratic form $f$ there exists an orthonormal basis of the difference space in which $f$ has a diagonal matrix. The values on the diagonal are determined uniquely up to their order.
Due to the identification of quadratic forms with linear maps, the rank of the quadratic form can be defined as the rank of its matrix in any basis. The rank equals the dimension of the image of the corresponding map $\varphi$.
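The analytic formula, the reconstruction of $F$ from $f$ and the transformation rule $S^T\cdot A\cdot S$ can all be tested on a concrete matrix. The sketch below uses exact rational arithmetic; the matrix is the one from problem 4.C.1, while the matrix `S`, the sample vectors and all function names are assumptions made for this illustration.

```python
from fractions import Fraction


def quad(a, x):
    """Value f(x) = x^T A x of the quadratic form with symmetric matrix a."""
    n = len(a)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


def bilinear_from_quad(a, x, y):
    """Recover F(x, y) from the values of f: F(x, y) = (f(x+y) - f(x) - f(y)) / 2."""
    s = [xi + yi for xi, yi in zip(x, y)]
    return Fraction(quad(a, s) - quad(a, x) - quad(a, y), 2)


def congruent(s, a):
    """Matrix S^T A S of the same form in the new coordinates x = S x'."""
    n = len(a)
    return [[sum(s[k][i] * a[k][l] * s[l][j] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]


def apply(s, x):
    """Matrix-vector product S x."""
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in s]


A = [[3, 1, 0], [1, 1, 2], [0, 2, 6]]
x, y = [1, 2, 3], [-1, 0, 2]
print(quad(A, x), bilinear_from_quad(A, x, y))

S = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]   # an arbitrary invertible matrix
x_new = [2, -1, 3]                       # coordinates x' in the new basis
print(quad(congruent(S, A), x_new) == quad(A, apply(S, x_new)))
```

Note that the matrix of the form transforms by $S^T\cdot A\cdot S$ and not by $S^{-1}\cdot A\cdot S$; the two rules agree exactly when $S$ is orthogonal.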
vector $(x_1,x_2,x_3,x_4,x_5)$ is in the vector subspace associated with U if and only if
$$(x_1,x_2,x_3,x_4,x_5) = t(2,1,3,5,3) + s(0,3,1,4,-2) + r(1,2,4,0,3)$$
for some $t,s,r \in \mathbb{R}$. Similarly, $(x_1,x_2,x_3,x_4,x_5)$ is in the vector subspace associated with V if and only if
$$(x_1,x_2,x_3,x_4,x_5) = p(-1,1,1,-5,0) + q(1,5,1,13,-4)$$
for some $p,q \in \mathbb{R}$. We look for $t,s,r,p,q \in \mathbb{R}$ such that
$$t(2,1,3,5,3) + s(0,3,1,4,-2) + r(1,2,4,0,3) = p(-1,1,1,-5,0) + q(1,5,1,13,-4).$$
This is a homogeneous system of linear equations. It is solved in matrix form (the order of the variables is $t, s, r, p, q$):
$$\begin{pmatrix} 2 & 0 & 1 & 1 & -1\\ 1 & 3 & 2 & -1 & -5\\ 3 & 1 & 4 & -1 & -1\\ 5 & 4 & 0 & 5 & -13\\ 3 & -2 & 3 & 0 & 4 \end{pmatrix} \sim \begin{pmatrix} 1 & 3 & 2 & -1 & -5\\ 0 & 2 & 1 & -1 & -3\\ 0 & 0 & 1 & -1 & 1\\ 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
The vectors defining V are linear combinations of the vectors defining U. So $Z(V)$ is a subset of $Z(U)$, and hence $\varphi = 0$.
Case (f). Find the intersection of the vector spaces $Z(U)$ and $Z(V)$. Search for numbers $t,s,p,q,r \in \mathbb{R}$ such that
$$t(1,0,1,1,1) + s(1,0,0,1,1) = p(1,1,1,1,1) + q(1,1,0,1,1) + r(1,1,0,1,0).$$
The solution is $(t,s,p,q,r) = (-a,a,-a,a,0)$, $a \in \mathbb{R}$. The intersection $Z(U) \cap Z(V)$ consists of the vectors
$$(0,0,-a,0,0) = -a(1,0,1,1,1) + a(1,0,0,1,1) = -a(1,1,1,1,1) + a(1,1,0,1,1) + 0\,(1,1,0,1,0),$$
where $a \in \mathbb{R}$. Thus $Z(U) \cap Z(V)$ is generated by $(0,0,1,0,0)$ and its orthogonal complement $(Z(U) \cap Z(V))^{\perp}$ is generated by the vectors
$$(1,0,0,0,0),\ (0,1,0,0,0),\ (0,0,0,1,0),\ (0,0,0,0,1).$$
We obtain
$$Z(U) \cap Z(V) \ne \{0\},\quad Z(U) \cap Z(V) \ne Z(U),\quad Z(U) \cap Z(V) \ne Z(V).$$
The angle $\varphi$ is defined as the angle between the subspaces
$$Z(U) \cap (Z(U) \cap Z(V))^{\perp}\quad\text{and}\quad Z(V) \cap (Z(U) \cap Z(V))^{\perp}.$$
4.2.3. Classification of quadrics. We return to the equation (1). The above results enable us to rewrite this equation as
$$\sum_{i=1}^n \lambda_ix_i^2 + \sum_{i=1}^n b_ix_i + b = 0.$$
Hence we may assume that the quadric is given in this form.
In the next step, we "complete the square" for the coordinates $x_i$ with $\lambda_i \ne 0$, which "absorbs" the squares together with the linear terms in the same variable. So only those linear terms are left which correspond to variables for which the coefficient of the quadratic term is zero. We obtain
$$\sum_{i=1}^n \lambda_i(x_i - p_i)^2 + \sum_j b_jx_j + c = 0,$$
where the summation over $j$ runs only over those $j$ satisfying $\lambda_j = 0$.
This corresponds to a translation of the origin by the vector with coordinates $p_i$. With such a choice of the basis of the difference space, the quadratic part has the desired diagonal form. In the identification of quadratic forms with linear maps derived above, this means that $\varphi$ is diagonal on the orthogonal complement of its kernel. If there are also some linear terms, the orthonormal basis of the difference space can be adjusted on the kernel of $\varphi$ such that the corresponding linear form is a multiple of the first term of the dual basis. Hence the final formula
$$\sum_{i=1}^k \lambda_iy_i^2 + by_{k+1} + c = 0,$$
where $k$ is the rank of the matrix of the quadratic form $f$. If $b \ne 0$, it can be arranged by a further change of the origin that the constant $c$ in the equation is zero.
Hence the linear term may (but does not have to) appear only in the case that the rank of $f$ is less than $n$, and the constant $c \in \mathbb{R}$ may be nonzero only if $b = 0$. The resulting equations are called the canonical analytic formulas for quadrics.
4.2.4. The case of $\mathcal{E}_2$. As an example of the previous procedure, we discuss the simplest case of a nontrivial dimension, namely dimension two. The original equation has the form
$$a_{11}x^2 + a_{22}y^2 + 2a_{12}xy + a_1x + a_2y + a = 0.$$
By a suitable choice of a basis of the difference space, and the subsequent completion of the square, it is written in the form (using the same notation $x$, $y$ for the new coordinates):
$$a_{11}x^2 + a_{22}y^2 + a_1x + a_2y + a = 0,$$
where $a_i$ is nonzero only in the case that $a_{ii}$ is zero. By the last step of the general procedure, exactly one of the following equations is involved:
It is now established that
$$Z(U) \cap (Z(U) \cap Z(V))^{\perp} = \langle(1,0,0,1,1)\rangle,$$
$$Z(V) \cap (Z(U) \cap Z(V))^{\perp} = \langle(1,1,0,1,1), (1,1,0,1,0)\rangle.$$
For this it is enough to express $Z(U)$ as the span of the vectors $(0,0,1,0,0)$, $(1,0,0,1,1)$ and $Z(V)$ as the span of the vectors $(0,0,1,0,0)$, $(1,1,0,1,1)$, $(1,1,0,1,0)$. Since the dimension of $Z(U) \cap (Z(U) \cap Z(V))^{\perp}$ is 1, we can use the formula (1), where $u = (1,0,0,1,1)$ and $p_u$ is the orthogonal projection of $u$ into $Z(V) \cap (Z(U) \cap Z(V))^{\perp}$. Then
$$p_u = a(1,1,0,1,1) + b(1,1,0,1,0)$$
and
$$\langle p_u - u, (1,1,0,1,1)\rangle = 0,\quad \langle p_u - u, (1,1,0,1,0)\rangle = 0,$$
which leads to the system of linear equations
$$4a + 3b = 3,\qquad 3a + 3b = 2$$
with the unique solution $a = 1$, $b = -1/3$. Thus
$$p_u = \left(\tfrac23, \tfrac23, 0, \tfrac23, 1\right).$$
From (1) it follows that
$$\cos\varphi = \frac{\|p_u\|}{\|u\|} = \frac{\sqrt7}{3},\qquad \varphi \doteq 0.49\ (\doteq 28^\circ).$$
□
C. Geometry of quadratic forms
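4.C.1. Determine the polar basis of the form $f : \mathbb{R}^3 \to \mathbb{R}$,
$$f(x_1,x_2,x_3) = 3x_1^2 + 2x_1x_2 + x_2^2 + 4x_2x_3 + 6x_3^2.$$
Solution. Its matrix is
$$A = \begin{pmatrix} 3 & 1 & 0\\ 1 & 1 & 2\\ 0 & 2 & 6 \end{pmatrix}.$$
According to step (1) of the Lagrange algorithm (see Theorem 4.2.5), perform the following operations:
$$\begin{aligned}
f(x_1,x_2,x_3) &= \tfrac13(3x_1 + x_2)^2 + \tfrac23x_2^2 + 4x_2x_3 + 6x_3^2\\
&= \tfrac13y_1^2 + \tfrac32\left(\tfrac23y_2 + 2y_3\right)^2\\
&= \tfrac13z_1^2 + \tfrac32z_2^2.
\end{aligned}$$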
The form has rank 2 and the matrix changing the basis to the polar basis is obtained by combining the following transformations: $z_3 = y_3 = x_3$, $z_2 = \frac23y_2 + 2y_3 = \frac23x_2 + 2x_3$ and $z_1 = y_1 = 3x_1 + x_2$, so the change of basis matrix is
$$T = \begin{pmatrix} 3 & 1 & 0\\ 0 & \frac23 & 2\\ 0 & 0 & 1 \end{pmatrix}.$$
$$\begin{array}{ll}
0 = x^2/a^2 + y^2/b^2 + 1 & \text{empty set}\\
0 = x^2/a^2 + y^2/b^2 - 1 & \text{ellipse}\\
0 = x^2/a^2 - y^2/b^2 - 1 & \text{hyperbola}\\
0 = x^2/a^2 - 2py & \text{parabola}\\
0 = x^2/a^2 + y^2/b^2 & \text{point}\\
0 = x^2/a^2 - y^2/b^2 & \text{2 concurrent lines}\\
0 = x^2 - a^2 & \text{2 parallel lines}\\
0 = x^2 & \text{2 identical lines}\\
0 = x^2 + a^2 & \text{empty set}
\end{array}$$
The origin of the Cartesian coordinates is the center of the studied conic. The new orthonormal basis of the difference space gives the directions of the semiaxes. The final coefficients $a$, $b$ then give the lengths of the semiaxes in the nondegenerate directions.
4.2.5. Affine point of view. In the previous two paragraphs, we searched for essential properties and standardized analytical descriptions of objects defined in Euclidean spaces by quadratic equations. We sought the simplest equations which can be obtained by a suitable choice of coordinates. A geometric formulation of the result is that for two quadrics, given in different Cartesian coordinates, there exists a Euclidean transformation of $\mathcal{E}_n$ (that is, an affine bijective map preserving lengths) mapping one onto the other if and only if the above algorithm leads to the same analytic formulas, up to the order of coordinates. Moreover, the Cartesian coordinates in which the objects are given by the resulting canonical formulas can be obtained directly. Hence the explicit expression of the corresponding coordinate transformation is also obtained. It is always a composition of a translation, a rotation and a reflection with respect to a hyperplane.
Of course, we may ask to what extent we can do the same in affine spaces, where we can choose any coordinate system. For example, in the plane we cannot distinguish the circle from the ellipse. On the other hand, we can distinguish the ellipse from the hyperbola, and likewise all the other types of conics from each other. In particular, all hyperbolas merge into one, etc. We postpone the discussion of this issue to the third part of this chapter, except for the case of quadratic forms.
Consider a quadratic form $f$ on a vector space V and its analytic formula $f(u) = x^TAx$ with respect to a chosen basis on V. Then for the vector $u = x_1u_1 + \dots + x_nu_n$, the form $f$ can be written as
$$f(u) = \sum_{i,j} a_{ij}x_ix_j,\qquad a_{ij} = F(u_i,u_j),$$
where $F$ is the symmetric bilinear form defining $f$.
It is already shown that $A$ is diagonal for a suitable choice of basis, in other words, that $F(u_i,u_j) = 0$ for $i \ne j$. Each such basis is called a polar basis of the quadratic form $f$. A scalar product can always be chosen for such a purpose. Nevertheless, without the use of the scalar product, there is a much simpler algorithm for finding a polar basis among all other bases. At the same
We computed the polar coordinates, expressed them in the standard basis and wrote them as the rows of the matrix $T$ (the columns of this matrix are the vectors of the standard basis expressed in the polar basis). The coordinates of the polar basis vectors are the columns of the matrix
$$T^{-1} = \begin{pmatrix} \frac13 & -\frac12 & 1\\ 0 & \frac32 & -3\\ 0 & 0 & 1 \end{pmatrix}.$$
The polar basis is therefore $\left((\frac13,0,0),\ (-\frac12,\frac32,0),\ (1,-3,1)\right)$.
□
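4.C.2. Determine the polar basis of the form $f : \mathbb{R}^3 \to \mathbb{R}$,
$$f(x_1,x_2,x_3) = 2x_1x_3 + x_2^2.$$
Solution. The matrix of the form is
$$A = \begin{pmatrix} 0 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{pmatrix}.$$
Change the order of the variables: $y_1 = x_2$, $y_2 = x_1$, $y_3 = x_3$. It is then trivial to apply step (1) of the Lagrange algorithm (there are no mixed terms containing $y_1$). However, for the next step, case (4) sets in. Introduce the transformation $z_1 = y_1$, $z_2 = y_2$, $z_3 = y_3 - y_2$. Then
$$f(x_1,x_2,x_3) = z_1^2 + 2z_2(z_3 + z_2) = z_1^2 + \tfrac12(2z_2 + z_3)^2 - \tfrac12z_3^2.$$
Together, $z_1 = y_1 = x_2$, $z_2 = y_2 = x_1$, $z_3 = y_3 - y_2 = x_3 - x_1$, so the form is diagonal in the coordinates $z_1 = x_2$, $2z_2 + z_3 = x_1 + x_3$, $z_3 = x_3 - x_1$. The matrix $T$ for the change to the polar basis is
$$T = \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 1\\ -1 & 0 & 1 \end{pmatrix}\quad\text{and}\quad T^{-1} = \begin{pmatrix} 0 & \frac12 & -\frac12\\ 1 & 0 & 0\\ 0 & \frac12 & \frac12 \end{pmatrix}.$$
The polar basis is therefore $\left((0,1,0),\ (\frac12,0,\frac12),\ (-\frac12,0,\frac12)\right)$. □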
4.C.3. Find the polar basis of the quadratic form / : R3 R, which in the standard basis is defined as
time, there is relevant information to be found about the affine properties of the quadratic form.
Theorem. Let $V$ be a real vector space of dimension $n$, and let $f : V \to \mathbb{R}$ be a quadratic form. Then there exists a polar basis for $f$ on $V$.
Proof. (1) Let $A$ be the matrix of $f$ in a basis $u = (u_1, \dots, u_n)$ on $V$, and assume $a_{11} \neq 0$. Then we may write
$$f(x_1,\dots,x_n) = a_{11}x_1^2 + 2a_{12}x_1x_2 + \dots + a_{22}x_2^2 + \dots = a_{11}^{-1}(a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n)^2 + \text{terms not containing } x_1.$$
Hence we can transform the coordinates (i.e. change the basis) such that in the new coordinates
$$x_1' = a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n, \quad x_2' = x_2, \quad \dots, \quad x_n' = x_n.$$
This corresponds to the new basis
$$v_1 = a_{11}^{-1}u_1, \quad v_2 = u_2 - a_{11}^{-1}a_{12}u_1, \quad \dots, \quad v_n = u_n - a_{11}^{-1}a_{1n}u_1.$$
(As an exercise, compute the transformation matrix.) In the new basis the corresponding symmetric bilinear form satisfies $g(v_1, v_i) = 0$ for all $i > 1$ (compute it!). Thus $f$ has the form $a_{11}^{-1}x_1'^2 + h$ in the new coordinates, where $h$ is a quadratic form independent of the variable $x_1'$.
It is often easiest to choose $v_1 = u_1$ in the new basis. Then $f = f_1 + h$, where $f_1$ depends only on $x_1'$ while $x_1'$ does not appear in $h$, but $g(v_1,v_1) = a_{11}$.
(2) Assume that after step (1), $h$ is a quadratic form of rank less than $n$ with a nonzero coefficient of $x_2'^2$. Then the same procedure can be repeated to obtain the expression $f = f_1 + f_2 + h$, where $h$ contains only the variables with index greater than two. Proceed in this way until a diagonal form is obtained after $n - 1$ steps, or until in (say) the $i$-th step the element $a_{ii}$ is zero.
(3) If the last possibility occurs, and there exists some other element $a_{jj} \neq 0$ with $j > i$, then it suffices to exchange the $i$-th and the $j$-th vector of the basis. Then continue according to the previous procedure.
(4) Assume that $a_{jj} = 0$ for all $j \geq i$. If there is no element $a_{jk} \neq 0$ with $j \geq i$, $k \geq i$, then we are finished, since then the matrix is diagonal. If $a_{jk} \neq 0$, then we use the transformation $v_j = u_j + u_k$ and keep the other vectors of the basis unchanged (i.e. $x_k' = x_k - x_j$, the other coordinates remain the same). Then $h(v_j, v_j) = h(u_j, u_j) + h(u_k, u_k) + 2h(u_k,u_j) = 2a_{jk} \neq 0$ and we can continue as in case (1).
□
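The four steps of the proof can be turned into a short program. The following is a minimal sketch, not the book's own implementation: it assumes the form is given by its symmetric matrix, works in exact rational arithmetic, and applies each basis change simultaneously to rows and columns. The name `polar_basis` and the helper functions are hypothetical.

```python
from fractions import Fraction


def polar_basis(A):
    """Return (P, D) with D = P^T A P diagonal; the columns of P form a polar basis."""
    n = len(A)
    M = [[Fraction(x) for x in row] for row in A]
    P = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

    def add_vector(src, dst, c):
        # Basis change v_dst = u_dst + c * u_src (column and row operation on M).
        for r in range(n):
            M[r][dst] += c * M[r][src]
        for k in range(n):
            M[dst][k] += c * M[src][k]
        for r in range(n):
            P[r][dst] += c * P[r][src]

    def swap_vectors(i, j):
        # Exchange the i-th and j-th basis vectors.
        for r in range(n):
            M[r][i], M[r][j] = M[r][j], M[r][i]
        M[i], M[j] = M[j], M[i]
        for r in range(n):
            P[r][i], P[r][j] = P[r][j], P[r][i]

    for i in range(n):
        if M[i][i] == 0:
            j = next((j for j in range(i + 1, n) if M[j][j] != 0), None)
            if j is not None:
                swap_vectors(i, j)                      # step (3)
            else:
                pair = next(((j, k) for j in range(i, n)
                             for k in range(j + 1, n) if M[j][k] != 0), None)
                if pair is None:
                    break                               # remaining block is zero
                j, k = pair
                add_vector(k, j, Fraction(1))           # step (4): v_j = u_j + u_k
                if j != i:
                    swap_vectors(i, j)
        for j in range(i + 1, n):                       # step (1): complete the square
            add_vector(i, j, -M[i][j] / M[i][i])
    return P, M


# The form 2*x1*x3 + x2^2 from exercise 4.C.2.
P, D = polar_basis([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
print([D[i][i] for i in range(3)])
```

The diagonal entries depend on the choices made along the way; only the numbers of positive and negative entries are determined by the form.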
$$f(x_1, x_2, x_3) = 2x_1x_2 + x_2x_3.$$
4.2.6. Affine classification of quadratic forms. The vectors of a polar basis can be rescaled so that the coefficients of the squares of the variables are only the scalars 1, -1 and 0. Moreover, the following law of inertia says that the number of ones and minus ones does not depend on the choices made in the course of the algorithm. These numbers are called the signature of a quadratic form. As before, there is a complete description of quadratic forms in the sense that
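Solution. By an application of the Lagrange algorithm:
$$f(x_1,x_2,x_3) = 2x_1x_2 + x_2x_3.$$
The substitution $y_2 = x_2 - x_1$ gives
$$f = 2x_1(x_1 + y_2) + (x_1 + y_2)x_3 = 2x_1^2 + 2x_1y_2 + x_1x_3 + y_2x_3 = \tfrac12\bigl(2x_1 + y_2 + \tfrac12 x_3\bigr)^2 - \tfrac12 y_2^2 - \tfrac18 x_3^2 + \tfrac12 y_2x_3.$$
The substitution $y_1 = 2x_1 + y_2 + \tfrac12 x_3$ gives
$$f = \tfrac12 y_1^2 - \tfrac12 y_2^2 - \tfrac18 x_3^2 + \tfrac12 y_2x_3 = \tfrac12 y_1^2 - 2\bigl(\tfrac12 y_2 - \tfrac14 x_3\bigr)^2,$$
and the substitution $y_3 = \tfrac12 y_2 - \tfrac14 x_3$ finally yields
$$f = \tfrac12 y_1^2 - 2y_3^2.$$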
With the coordinates $y_1, y_3, x_3$, the quadratic form has a diagonal shape, which means that the basis associated with those coordinates is a polar basis of the form. To express this basis, we need the matrix which changes the basis from polar to standard. By the definition of the change of basis matrix, its columns are the polar basis vectors. Either we express the old variables $(x_1, x_2, x_3)$ by the new variables $(y_1, y_3, x_3)$, or equivalently we express the new ones by the old ones (which is easier). In the latter case, we need to compute the inverse matrix.
We have $y_1 = 2x_1 + y_2 + \tfrac12 x_3 = 2x_1 + (x_2 - x_1) + \tfrac12 x_3 = x_1 + x_2 + \tfrac12 x_3$ and $y_3 = \tfrac12 y_2 - \tfrac14 x_3 = -\tfrac12 x_1 + \tfrac12 x_2 - \tfrac14 x_3$. The matrix expressing the new coordinates in terms of the standard ones is
$$T = \begin{pmatrix} 1 & 1 & \tfrac12 \\ -\tfrac12 & \tfrac12 & -\tfrac14 \\ 0 & 0 & 1 \end{pmatrix}.$$
The inverse matrix is
$$T^{-1} = \begin{pmatrix} \tfrac12 & -1 & -\tfrac12 \\ \tfrac12 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
Hence one of the polar bases of the given quadratic form is (see the columns of the matrix $T^{-1}$)
$\{(1/2, 1/2, 0), (-1, 1, 0), (-1/2, 0, 1)\}$. □
4.C.4. Determine the type of conic section defined by
$$3x_1^2 - 3x_1x_2 + x_2 - 1 = 0.$$
two such forms may be transformed each one into the other by an affine transformation if and only if they have the same signature.
Theorem. For each nonzero quadratic form $f$ of rank $r$ on a real vector space $V$ there exist a natural number $p$ and $r$ independent linear forms $\varphi_1, \dots, \varphi_r \in V^*$ such that $0 \leq p \leq r$ and
$$f(u) = (\varphi_1(u))^2 + \dots + (\varphi_p(u))^2 - (\varphi_{p+1}(u))^2 - \dots - (\varphi_r(u))^2.$$
Otherwise put, there exists a polar basis, in which $f$ has the analytic formula
$$f(x_1, \dots, x_n) = x_1^2 + \dots + x_p^2 - x_{p+1}^2 - \dots - x_r^2.$$
The number p of positive diagonal coefficients in the matrix of the given quadratic form (and thus the number r — p of negative coefficients) does not depend on the choice of polar basis.
Two symmetric matrices A, B of dimension n are matrices of the same quadratic form in different bases if and only if they have the same rank and the same number of positive coefficients in the polar basis.
Proof. By completing the square, $f(x_1, \dots, x_n) = \lambda_1x_1^2 + \dots + \lambda_rx_r^2$, $\lambda_i \neq 0$, in a suitable basis on $V$. Assume moreover that the first $p$ coefficients $\lambda_i$ are positive. Then the transformation $y_1 = \sqrt{\lambda_1}\,x_1$, ..., $y_p = \sqrt{\lambda_p}\,x_p$, $y_{p+1} = \sqrt{-\lambda_{p+1}}\,x_{p+1}$, ..., $y_r = \sqrt{-\lambda_r}\,x_r$, $y_{r+1} = x_{r+1}$, ..., $y_n = x_n$ yields the desired formula. The forms $\varphi_i$ are exactly the forms from the dual basis in $V^*$ to the obtained polar basis.
It remains to prove that $p$ does not depend on the procedure. Assume that there is a formula for the same form $f$ in the polar bases $u$, $v$, i.e.
$$f(x_1, \dots, x_n) = x_1^2 + \dots + x_p^2 - x_{p+1}^2 - \dots - x_r^2,$$
$$f(y_1, \dots, y_n) = y_1^2 + \dots + y_q^2 - y_{q+1}^2 - \dots - y_r^2.$$
Denote the subspace generated by the first $p$ vectors of the first basis by $P = \langle u_1, \dots, u_p\rangle$, and similarly $Q = \langle v_{q+1}, \dots, v_n\rangle$. Then $f(u) > 0$ for each nonzero $u \in P$, while $f(v) \leq 0$ for each $v \in Q$. Hence necessarily $P \cap Q = \{0\}$, and therefore $\dim P + \dim Q \leq n$. Hence $p + (n - q) \leq n$, so that $p \leq q$. By interchanging the roles of the two bases, $q \leq p$, and so $p = q$.
Thus $p$ is independent of the choice of the polar basis. Consequently for two matrices with the same rank and the same number of positive coefficients in the diagonal form of the corresponding quadratic form, the analytic formulas are the same. □
While discussing symmetric maps we talked about definite and semidefinite maps. The same discussion has an obvious meaning also for symmetric bilinear forms and quadratic forms. A quadratic form $f$ on a real vector space $V$ is called
(1) positive definite if $f(u) > 0$ for all vectors $u \neq 0$,
(2) positive semidefinite if $f(u) \geq 0$ for all vectors $u \in V$,
(3) negative definite if $f(u) < 0$ for all vectors $u \neq 0$,
(4) negative semidefinite if $f(u) \leq 0$ for all vectors $u \in V$,
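Solution. Complete the squares:
$$3x_1^2 - 3x_1x_2 + x_2 - 1 = \tfrac13\bigl(3x_1 - \tfrac32 x_2\bigr)^2 - \tfrac34 x_2^2 + x_2 - 1 = \tfrac13 y_1^2 - \tfrac13\bigl(\tfrac32 x_2 - 1\bigr)^2 + \tfrac13 - 1 = \tfrac13 y_1^2 - \tfrac13 y_2^2 - \tfrac23,$$
where $y_1 = 3x_1 - \tfrac32 x_2$ and $y_2 = \tfrac32 x_2 - 1$.
According to the list 4.2.4, the given conic section is a hyperbola. □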
4.C.5. By completing the squares, express the quadric
$$-x^2 + 3y^2 + z^2 + 6xy - 4z = 0$$
in such a way that one can determine its type from it.
Solution. Complete the square. Deal first with all terms involving $x$. Obtain the equation
$$-(x - 3y)^2 + 9y^2 + 3y^2 + z^2 - 4z = 0.$$
There are no "unwanted" terms containing $y$, so repeat the procedure for $z$. This gives
$$-(x - 3y)^2 + 12y^2 + (z - 2)^2 - 4 = 0.$$
Conclude that there is a transformation of variables that leads to the equation (we can divide by 4 if desired)
$$-\bar x^2 + \bar y^2 + \bar z^2 - 1 = 0.$$
□
We can tell the type of the conic section without transforming its equation to the form listed in 4.2.4. Every conic section can be expressed as
$$a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33} = 0.$$
The determinants
$$\Delta = \det A = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{vmatrix} \quad\text{and}\quad \delta = \begin{vmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{vmatrix}$$
are invariants of conic sections, which means that they are not changed by Euclidean transformations (rotation and translation). Furthermore, the different types of conic sections have different signs of those determinants.
• $\Delta \neq 0$ for non-degenerate conic sections: ellipse for $\delta > 0$, hyperbola for $\delta < 0$ and parabola for $\delta = 0$. For a real ellipse (not imaginary), it is necessary that $(a_{11} + a_{22})\Delta < 0$.
• $\Delta = 0$ for degenerate conic sections, or pairs of lines.
The signs (or zero-value) of the determinants are really invariant with respect to the coordinate transformation. Denote $X = (x, y, 1)^T$
(5) indefinite if $f(u) > 0$ and $f(v) < 0$ for two vectors $u, v \in V$.
The same names are used for symmetric matrices corresponding to quadratic forms. By the signature of a symmetric matrix is meant the signature of the corresponding quadratic form.
4.2.7. Theorem (Sylvester criterion). A symmetric real matrix A is positive definite if and only if all its leading principal minors are positive.
A symmetric real matrix $A$ is negative definite if and only if $(-1)^i|A_i| > 0$ for all leading principal submatrices $A_i$.
Proof. Analyse in detail the form of the transformations used in completing the square for constructing the polar basis. The transformation used in the first step always has an upper triangular matrix $T$. By rescaling, see proposition 4.2.5, the matrix has ones on the diagonal:
$$T = \begin{pmatrix} 1 & \frac{a_{12}}{a_{11}} & \cdots & \frac{a_{1n}}{a_{11}} \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$
Such a matrix of the transformation from basis $u$ to basis $v$ has several useful properties. In particular, its leading principal submatrices $T_k$ formed by the first $k$ rows and columns are the transformation matrices of the subspace $P_k = \langle u_1, \dots, u_k\rangle$ from the basis $(u_1, \dots, u_k)$ to the basis $(v_1, \dots, v_k)$. The leading principal submatrices $A_k$ of the matrix $A$ of the form $f$ are matrices of restrictions of the form $f$ to $P_k$. Therefore, the matrices $A_k$ and $A'_k$ of restrictions to $P_k$ in the bases $u$ and $v$ respectively satisfy $A_k = T_k^T A'_k T_k$, where $T$ is the transformation matrix from $u$ to $v$. The inverse matrix to an upper triangular matrix with ones on the diagonal is again an upper triangular matrix with ones on the diagonal. Hence we may similarly express $A'_k$ in terms of $A_k$. Thus the determinants of the matrices $A_k$ and $A'_k$ are equal by the Cauchy formula.
Let $f$ be a quadratic form on $V$, $\dim V = n$. Let $u$ be a basis of $V$ such that the items (3) and (4) from the Lagrange algorithm are not needed while finding the polar basis. Then the analytic formula
$$f(x_1, \dots, x_n) = \lambda_1x_1^2 + \lambda_2x_2^2 + \dots + \lambda_rx_r^2$$
is obtained, where $r$ is the rank of the form $f$, $\lambda_1, \dots, \lambda_r \neq 0$, and for the leading principal submatrices of the (former) matrix $A$ of the quadratic form $f$, $|A_k| = \lambda_1\lambda_2 \cdots \lambda_k$, $k \leq r$.
In this procedure, each sequential transformation contains zeros under the diagonal in the next column. Consequently if the leading principal minors are nonzero, then the next diagonal term in A is nonzero. This proves the Jacobi theorem:
Corollary. Let $f$ be a quadratic form of rank $r$ on a vector space $V$ with matrix $A$ for the basis $u$. Steps other than completing the square are not required if and only if the leading principal submatrices of $A$ satisfy $|A_1| \neq 0, \dots, |A_r| \neq 0$.
and denote $A$ as the matrix of the quadratic form. Then the corresponding conic section has equation $X^TAX = 0$. The standard form is obtained by rotation and translation, that is, by a transformation to new coordinates $x'$, $y'$ satisfying
$$x = x'\cos\alpha - y'\sin\alpha + c_1, \qquad y = x'\sin\alpha + y'\cos\alpha + c_2,$$
or, in matrix form, for the new coordinates $X' = (x', y', 1)^T$,
$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\alpha & -\sin\alpha & c_1 \\ \sin\alpha & \cos\alpha & c_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}. \tag{1}$$
Put $X = MX'$ into the conic section equation to obtain the equation in new coordinates
$$X^TAX = 0, \qquad (MX')^TA(MX') = 0, \qquad X'^TM^TAMX' = 0.$$
Then there exists a polar basis in which $f$ has the analytic formula
$$f(x_1, \dots, x_n) = |A_1|x_1^2 + \frac{|A_2|}{|A_1|}x_2^2 + \dots + \frac{|A_r|}{|A_{r-1}|}x_r^2.$$
Hence if all leading principal minors are positive, then / is positive definite by the Jacobi theorem.
On the other hand, suppose that the form $f$ is positive definite. Then $A = P^TEP = P^TP$ for a suitable regular matrix $P$. Hence $|A| = |P|^2 > 0$. Let $u$ be a chosen basis in which the form $f$ has matrix $A$. The restrictions of $f$ to the subspaces $V_k = \langle u_1, \dots, u_k\rangle$ are again positive definite forms $f_k$, and the corresponding matrices in the bases $u_1, \dots, u_k$ are the leading principal submatrices $A_k$. Thus $|A_k| > 0$ by the previous part of the proof.
The claim about negative definite forms follows by observing that A is positive definite if and only if —A is negative definite. □
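The criterion translates directly into a test on the leading principal minors. The following is a minimal sketch, assuming a symmetric matrix with integer or rational entries; the function names are hypothetical, and the determinant is computed by exact Gaussian elimination.

```python
from fractions import Fraction


def det(M):
    """Determinant by exact Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    n = len(M)
    d = Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d                      # a row swap changes the sign
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return d


def leading_minors(A):
    """Determinants |A_1|, ..., |A_n| of the leading principal submatrices."""
    return [det([row[:k] for row in A[:k]]) for k in range(1, len(A) + 1)]


def is_positive_definite(A):
    return all(m > 0 for m in leading_minors(A))


def is_negative_definite(A):
    # (-1)^k |A_k| > 0 for all k
    return all((-1) ** k * m > 0
               for k, m in enumerate(leading_minors(A), start=1))


A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
print(leading_minors(A), is_positive_definite(A))
```

Note that the criterion says nothing about semidefinite forms: a zero leading minor only means that the matrix is not definite.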
Denote by $A'$ the matrix of the quadratic form in the new coordinates. Then $A' = M^TAM$, where the matrix
$$M = \begin{pmatrix} \cos\alpha & -\sin\alpha & c_1 \\ \sin\alpha & \cos\alpha & c_2 \\ 0 & 0 & 1 \end{pmatrix}$$
has unit determinant, so
$$\det A' = \det M^T \det A \det M = \det A = \Delta.$$
Necessarily, the determinant $A_{33}$, which is the algebraic complement of $a_{33}$, is invariant with respect to the coordinate transformation. For rotation only, $\det A' = \det M^T \det A \det M$. In this case the matrix is
$$M = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
and $\det A'_{33} = \det A_{33} = \delta$. For translation only,
$$M = \begin{pmatrix} 1 & 0 & c_1 \\ 0 & 1 & c_2 \\ 0 & 0 & 1 \end{pmatrix}$$
and this subdeterminant remains unchanged.
3. Projective geometry
In many elementary texts on analytic geometry, the authors finish with the affine and Euclidean objects described above. The affine and Euclidean geometries are sufficient for many practical problems, but not for all problems. For instance in processing an image from a camera, angles are not preserved and parallel lines may (but do not have to) intersect.
Moreover, it is often difficult to distinguish very small angles from zero angles, and thus it would be convenient to have tools which do not need such distinguishing.
The basic idea of projective geometry is to extend affine spaces by points at infinity. This permits an easy way to deal with linear objects such as points, lines, planes, projections, etc.
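4.C.6. Determine the type of conic section
$$2x^2 - 2xy + 3y^2 - x + y - 1 = 0.$$
Solution. The determinant
$$\Delta = \begin{vmatrix} 2 & -1 & -\tfrac12 \\ -1 & 3 & \tfrac12 \\ -\tfrac12 & \tfrac12 & -1 \end{vmatrix} = -\tfrac{23}{4} \neq 0,$$
hence it is a non-degenerate conic section. Moreover
$$\delta = \begin{vmatrix} 2 & -1 \\ -1 & 3 \end{vmatrix} = 5 > 0,$$
therefore it is an ellipse. Furthermore $(a_{11} + a_{22})\Delta = (2 + 3) \cdot \bigl(-\tfrac{23}{4}\bigr) < 0$, so it is a real ellipse. □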
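The classification by the invariants $\Delta$ and $\delta$ is easy to automate. The following is a minimal sketch, assuming the coefficients are passed in the convention $a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33} = 0$ used above; the function name `classify_conic` and the returned labels are hypothetical.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def classify_conic(a11, a12, a22, a13, a23, a33):
    """Classify a11 x^2 + 2 a12 xy + a22 y^2 + 2 a13 x + 2 a23 y + a33 = 0."""
    a11, a12, a22, a13, a23, a33 = map(Fraction, (a11, a12, a22, a13, a23, a33))
    Delta = det3([[a11, a12, a13], [a12, a22, a23], [a13, a23, a33]])
    delta = a11 * a22 - a12 * a12
    if Delta == 0:
        return "degenerate"
    if delta > 0:
        return "real ellipse" if (a11 + a22) * Delta < 0 else "imaginary ellipse"
    if delta < 0:
        return "hyperbola"
    return "parabola"


# 2x^2 - 2xy + 3y^2 - x + y - 1 = 0: note the halved mixed and linear coefficients.
print(classify_conic(2, -1, 3, Fraction(-1, 2), Fraction(1, 2), -1))
```

Exact fractions are used so that the tests $\Delta = 0$ and $\delta = 0$ are not disturbed by rounding.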
4.3.1. Projective extension of the affine plane. We begin with the simplest interesting case, namely geometry in a plane. If we imagine the points in the plane $\mathcal{A}_2$ as the plane $z = 1$ in $\mathbb{R}^3$, then each point $P$ in the affine plane is represented by a vector $u = (x, y, 1) \in \mathbb{R}^3$. So it is represented also by a one-dimensional subspace $\langle u\rangle \subset \mathbb{R}^3$. On the other hand, almost every one-dimensional subspace in $\mathbb{R}^3$ intersects the plane in exactly one point $P$. The vectors of such a subspace are given by coordinates $(x, y, z)$ uniquely up to a common scalar multiple. Only the subspaces corresponding to vectors $(x, y, 0)$ do not intersect the plane.
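4.C.7. Determine the type of conic section $x^2 - 4xy - 5y^2 + 2x + 4y + 3 = 0$.
Solution. The determinant
$$\Delta = \begin{vmatrix} 1 & -2 & 1 \\ -2 & -5 & 2 \\ 1 & 2 & 3 \end{vmatrix} = -34 \neq 0,$$
furthermore
$$\delta = \begin{vmatrix} 1 & -2 \\ -2 & -5 \end{vmatrix} = -9 < 0,$$
it is therefore a hyperbola. □
Projective plane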
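4.C.8. Determine the equation and type of conic section passing through the points
$$[-2,-4], \quad [8,-4], \quad [0,-2], \quad [0,-6], \quad [6,-2].$$
Solution. Input the coordinates of the points into the general conic section equation
$$a_{11}x^2 + a_{22}y^2 + 2a_{12}xy + a_1x + a_2y + a = 0.$$
There follows the linear equation system
$$\begin{aligned}
4a_{11} + 16a_{22} + 16a_{12} - 2a_1 - 4a_2 + a &= 0,\\
64a_{11} + 16a_{22} - 64a_{12} + 8a_1 - 4a_2 + a &= 0,\\
4a_{22} - 2a_2 + a &= 0,\\
36a_{22} - 6a_2 + a &= 0,\\
36a_{11} + 4a_{22} - 24a_{12} + 6a_1 - 2a_2 + a &= 0.
\end{aligned}$$
In matrix form we perform the operations
$$\begin{pmatrix} 4 & 16 & 16 & -2 & -4 & 1 \\ 64 & 16 & -64 & 8 & -4 & 1 \\ 0 & 4 & 0 & 0 & -2 & 1 \\ 0 & 36 & 0 & 0 & -6 & 1 \\ 36 & 4 & -24 & 6 & -2 & 1 \end{pmatrix} \sim \begin{pmatrix} 4 & 16 & 16 & -2 & -4 & 1 \\ 0 & 4 & 0 & 0 & -2 & 1 \\ 0 & 0 & 64 & -8 & 12 & -9 \\ 0 & 0 & 0 & 24 & -36 & 27 \\ 0 & 0 & 0 & 0 & 3 & -2 \end{pmatrix} \sim \begin{pmatrix} 48 & 0 & 0 & 0 & 0 & -1 \\ 0 & 12 & 0 & 0 & 0 & -1 \\ 0 & 0 & 64 & 0 & 0 & 0 \\ 0 & 0 & 0 & 24 & 0 & 3 \\ 0 & 0 & 0 & 0 & 3 & -2 \end{pmatrix}.$$
Then, choosing $a = 48$,
$$a_{11} = 1, \quad a_{22} = 4, \quad a_{12} = 0, \quad a_1 = -6, \quad a_2 = 32.$$
The conic section has equation
$$x^2 + 4y^2 - 6x + 32y + 48 = 0.$$
Complete the terms $x^2 - 6x$, $4y^2 + 32y$ to squares. The result is
$$(x - 3)^2 + 4(y + 4)^2 - 25 = 0,$$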
Definition. The projective plane $\mathcal{P}_2$ is the set of all one-dimensional subspaces in $\mathbb{R}^3$. The homogeneous coordinates of a point $P = (x : y : z)$ in the projective plane are triples of real numbers given up to a common scalar multiple, at least one of which must be nonzero. A straight line in the projective plane is defined as a set of one-dimensional subspaces (i.e. points in $\mathcal{P}_2$) which generate a two-dimensional subspace (i.e. a plane) in $\mathbb{R}^3$.
For a concrete example, consider two parallel lines in the affine plane $\mathbb{R}^2$
$$L_1 : y - x - 1 = 0, \qquad L_2 : y - x + 1 = 0.$$
If the points of the lines $L_1$ and $L_2$ are finite points in the projective space $\mathcal{P}_2$, then their homogeneous coordinates $(x : y : z)$ satisfy the equations
$$L_1 : y - x - z = 0, \qquad L_2 : y - x + z = 0.$$
The intersection $L_1 \cap L_2$ is the point $(1 : 1 : 0) \in \mathcal{P}_2$ in this context. It is the point at infinity corresponding to the common direction vector of the lines.
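In homogeneous coordinates the intersection of two lines can be computed without any case distinction. The following is a minimal sketch, assuming a line $ax + by + cz = 0$ is stored as its coefficient triple $(a, b, c)$; the common point is then any vector orthogonal to both triples, for instance their cross product. The function names are hypothetical.

```python
def cross(p, q):
    """Cross product of two vectors in R^3."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def meet(line1, line2):
    """Homogeneous coordinates of the common point of two distinct lines."""
    return cross(line1, line2)


def is_at_infinity(point):
    """A point (x : y : z) is infinite when its last coordinate vanishes."""
    return point[2] == 0


L1 = (-1, 1, -1)   # -x + y - z = 0
L2 = (-1, 1, 1)    # -x + y + z = 0
P = meet(L1, L2)
print(P, is_at_infinity(P))   # (2, 2, 0), i.e. the point (1 : 1 : 0)
```

The same cross product, applied to the homogeneous coordinates of two distinct points, gives the coefficients of the line joining them.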
4.3.2. Affine coordinates in the projective plane. If we begin with the projective plane and want to see the affine plane as its "finite" part, then instead of the plane $z = 1$ we may take another plane $\alpha$ in $\mathbb{R}^3$ which does not pass through the origin $0 \in \mathbb{R}^3$. Then the finite points are those one-dimensional subspaces which have a nonempty intersection with the plane $\alpha$.
Consider the two parallel lines from the previous paragraph. Put y = 1 to obtain
$$L_1' : 1 - x - z = 0, \qquad L_2' : 1 - x + z = 0.$$
The "infinite" points of the former affine plane are given by $z = 0$. The lines $L_1'$ and $L_2'$ intersect at the point $(1, 1, 0)$. This corresponds to the geometric concept that two parallel lines $L_1$, $L_2$ in the affine plane meet at infinity, at the point $(1 : 1 : 0)$.
4.3.3. Projective spaces and transformations. In a natural way one can generalize the procedure in the affine plane for each finite dimension.
By choosing an arbitrary affine hyperplane $\mathcal{A}_n$ in the vector space $\mathbb{R}^{n+1}$ which does not pass through the origin, we may identify the points $P \in \mathcal{A}_n$ with the one-dimensional subspaces generated by these points. The remaining one-dimensional subspaces determine a hyperplane parallel to $\mathcal{A}_n$. They are called infinite points in the projective extension $\mathcal{P}_n$ of the affine space $\mathcal{A}_n$.
The set of infinite points in $\mathcal{P}_n$ is always a projective space of dimension one less. An affine straight line has only one infinite point in its projective extension (both ends of the line "intersect" at infinity and thus the projective line looks like a circle). The projective plane has a projective line of
or rather
$$\frac{(x-3)^2}{25} + \frac{(y+4)^2}{25/4} - 1 = 0.$$
The conic section is an ellipse with centre at $[3, -4]$. □
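The elimination in 4.C.8 can be reproduced mechanically. The following is a minimal sketch, assuming five points that determine a unique conic; it builds the 5 × 6 system for the coefficients $(a_{11}, a_{22}, a_{12}, a_1, a_2, a)$ of $a_{11}x^2 + a_{22}y^2 + 2a_{12}xy + a_1x + a_2y + a = 0$, reduces it exactly, and returns the solution with the free coefficient set to 1. The function name `conic_through` is hypothetical.

```python
from fractions import Fraction


def conic_through(points):
    """Coefficients (a11, a22, a12, a1, a2, a) of the conic through five points."""
    rows = [[Fraction(v) for v in (x * x, y * y, 2 * x * y, x, y, 1)]
            for x, y in points]
    ncols = 6
    pivots = []
    r = 0
    for c in range(ncols):
        # Gauss-Jordan elimination to reduced row echelon form.
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        piv = rows[r][c]
        rows[r] = [v / piv for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    free = [c for c in range(ncols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("the points do not determine a unique conic")
    sol = [Fraction(0)] * ncols
    sol[free[0]] = Fraction(1)
    for i, c in enumerate(pivots):
        sol[c] = -rows[i][free[0]]
    return sol


pts = [(-2, -4), (8, -4), (0, -2), (0, -6), (6, -2)]
print([48 * c for c in conic_through(pts)])
```

The coefficients are determined only up to a common nonzero multiple; multiplying by 48 recovers the normalization chosen in the solution above.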
4.C.9. Other characteristics and concepts of conic sections. The axis of a conic section is a line of reflection symmetry for the conic section. From the canonical form of a conic section in polar basis (4.2.4) it can be shown that an ellipse and a hyperbola both have two axes (x = 0 and y = 0). A parabola has one axis (x = 0). The intersection of a conic section and its axis is called a conic section vertex.
The numbers a, b from the canonical form of a conic section (which express the distance between vertices and the origin) are called the length of semi-axes. In the case of an ellipse and hyperbola, the axes intersect at the origin. This is a point of central symmetry for the conic section, called the centre of the conic section.
For practical problems involving conic sections, it is often easiest to describe them in parametric form. Often, this avoids contending with messy square roots.
Every point $P$ on the parabola $y^2 = 4ax$, $a > 0$, can be described by $P = (x, y) = (at^2, 2at)$, for real $t$. The standard parametric form for the parabola is the pair of equations
$$x = at^2, \qquad y = 2at.$$
(Note that the roles of $x$ and $y$ are interchanged, so that the axis of symmetry is the line $y = 0$.) The tangent line at $(at^2, 2at)$ has slope $\frac1t$ and equation $t(y - 2at) = x - at^2$. The point $F = (a, 0)$ on the axis is called the focus of the parabola, and the line $x = -a$ is called the directrix. Each point on the parabola is equidistant from the focus and the directrix. This property can be used to define a parabola.
Every point $P$ on the ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ can be described by $P = (x, y) = (a\cos\theta, b\sin\theta)$, where $0 < b \leq a$. The standard parametric form for the ellipse is the pair of equations
$$x = a\cos\theta, \qquad y = b\sin\theta.$$
The tangent line at $P$ has slope $-\frac{b\cos\theta}{a\sin\theta}$ and consequently has equation $(a\sin\theta)(y - b\sin\theta) = -(b\cos\theta)(x - a\cos\theta)$. The non-negative number $e$ defined by $b^2 = a^2(1 - e^2)$ is called the eccentricity of the ellipse. If $e = 0$, the ellipse becomes a circle of radius $a = b$. Otherwise $0 < e < 1$. The two points $F_1 = (ae, 0)$ and $F_2 = (-ae, 0)$ are the foci of the ellipse, and the lines $x = \pm a/e$ are the directrices.
infinite points, the three-dimensional projective space has a projective plane of infinite points etc.
More generally, we can define the projectivization of a vector space. For an arbitrary vector space $V$ of dimension $n + 1$, we define
$$\mathcal{P}(V) = \{P \subset V;\ P \text{ is a vector subspace of } V,\ \dim P = 1\}.$$
By choosing a basis $u$ in $V$ we obtain homogeneous coordinates on $\mathcal{P}(V)$. For $P \in \mathcal{P}(V)$ we use a nonzero vector $u \in P$ and the coordinates of this vector in the chosen basis. The points of the projective space $\mathcal{P}(V)$ are called geometric points. Their generators in $V$ are called arithmetic representatives.
In the chosen projective coordinates, we fix one of them to be one. Thus we exclude all points of the projective extension which have this coordinate equal to zero. We have an embedding of the $n$-dimensional affine space $\mathcal{A}_n \subset \mathcal{P}(V)$. This is precisely the construction used in the example on the projective plane.
4.3.4. Perspective projection. The advantages of projective geometry show up well in the case of the perspective projection $\mathbb{R}^3 \to \mathbb{R}^2$. Imagine that an observer sitting in the origin observes "one half of the world", that is, the points $(X, Y, Z) \in \mathbb{R}^3$ with $Z > 0$. The observer sees the image "projected" on the screen given by the plane $Z = f > 0$.
Thus a point $(X, Y, Z)$ in the "real world" projects to a point $(x, y)$ on the screen as follows:
$$x = -f\,\frac{X}{Z}, \qquad y = -f\,\frac{Y}{Z}.$$
This is a nonlinear formula. The accuracy of calculations is problematic when $Z$ is small.
By extending this transformation to a map $\mathcal{P}_3 \to \mathcal{P}_2$, we have $(X : Y : Z : W) \mapsto (x : y : z) = (-fX : -fY : Z)$. That is, a map described by a simple linear formula
$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} -f & 0 & 0 & 0 \\ 0 & -f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix}.$$
This simple expression defines the perspective projection for finite points in $\mathbb{R}^3 \subset \mathcal{P}_3$, which we substitute as points with $W = 1$. In this way we eliminate problems with points whose image runs to infinity. Indeed, if the $Z$-coordinate of a real point is close to zero, then the value of the third homogeneous coordinate of the image is close to zero, i.e. it corresponds to a point close to infinity.
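The linear formula is convenient in computations because the division by $Z$ is postponed to the very end. The following is a minimal sketch of this idea, assuming exact integer or rational input; the function names `project` and `to_affine` are hypothetical.

```python
from fractions import Fraction


def project(point, f):
    """Map homogeneous (X : Y : Z : W) to (x : y : z) = (-fX : -fY : Z)."""
    matrix = [[-f, 0, 0, 0],
              [0, -f, 0, 0],
              [0, 0, 1, 0]]
    return tuple(sum(matrix[i][j] * point[j] for j in range(4)) for i in range(3))


def to_affine(h):
    """Screen coordinates of a finite image point, or None for a point at infinity."""
    x, y, z = h
    if z == 0:
        return None
    return (Fraction(x) / z, Fraction(y) / z)


image = project((2, 4, 2, 1), f=1)
print(image, to_affine(image))
print(to_affine(project((1, 1, 0, 1), f=1)))   # Z = 0: the image is at infinity
```

Only the final conversion to screen coordinates is nonlinear, and it is the only place where a point with $Z = 0$ needs special treatment.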
4.3.5. Affine and projective transformations. Each injective linear map $\varphi : V_1 \to V_2$ between vector spaces maps one-dimensional subspaces to one-dimensional subspaces. Therefore, we have a map on projectivizations $T : \mathcal{P}(V_1) \to \mathcal{P}(V_2)$. Such maps are called projective maps. In the literature, the notion collineation is used if this map is invertible.
Every point $P$ on the right-hand branch of the hyperbola $\frac{x^2}{a^2} - \frac{y^2}{b^2} = 1$, $0 < a$, $0 < b$, can be described by $P = (x, y) = (a\cosh\theta, b\sinh\theta)$. The standard parametric form for the hyperbola is the pair of equations
$$x = a\cosh\theta, \qquad y = b\sinh\theta.$$
The tangent line at $P$ has slope $\frac{b\cosh\theta}{a\sinh\theta}$ and consequently has equation $(a\sinh\theta)(y - b\sinh\theta) = (b\cosh\theta)(x - a\cosh\theta)$. The positive number $e$ defined by $b^2 = a^2(e^2 - 1)$ is called the eccentricity of the hyperbola. Necessarily, $e > 1$. The two points $F_1 = (ae, 0)$ and $F_2 = (-ae, 0)$ are the foci of the hyperbola, and the lines $x = \pm a/e$ are the directrices. A hyperbola has two asymptotes. In standard form, their equations are $y = \pm(b/a)x$.
4.C.10. Existence of foci. For an ellipse with lengths of semi-axes a > b, show that the sum of the distances from any point on the ellipse to the two foci is constant, namely 2a.
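Solution. If $P = (a\cos\theta, b\sin\theta)$ and $F_1 = (ae, 0)$, then
$$\begin{aligned}
|PF_1|^2 &= (a\cos\theta - ae)^2 + b^2\sin^2\theta \\
&= a^2\cos^2\theta - 2a^2e\cos\theta + a^2e^2 + a^2(1 - e^2)\sin^2\theta \\
&= a^2\bigl[1 - 2e\cos\theta + e^2 - e^2(1 - \cos^2\theta)\bigr] = a^2(1 - e\cos\theta)^2.
\end{aligned}$$
So $|PF_1| = a(1 - e\cos\theta)$. Similarly $|PF_2| = a(1 + e\cos\theta)$. Hence $|PF_1| + |PF_2| = 2a$. □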
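The identity is easy to check numerically. The following is a minimal sketch in floating point arithmetic, assuming $0 < b \leq a$; the function name `focal_distance_sum` and the sample values are only illustrative.

```python
import math


def focal_distance_sum(a, b, theta):
    """|PF1| + |PF2| for P = (a cos theta, b sin theta) on the ellipse."""
    e = math.sqrt(1 - (b * b) / (a * a))       # eccentricity from b^2 = a^2 (1 - e^2)
    x, y = a * math.cos(theta), b * math.sin(theta)
    d1 = math.hypot(x - a * e, y)              # distance to F1 = (ae, 0)
    d2 = math.hypot(x + a * e, y)              # distance to F2 = (-ae, 0)
    return d1 + d2


for theta in (0.0, 0.7, 1.9, 3.1, 5.0):
    print(theta, focal_distance_sum(5, 3, theta))   # each value is close to 2a = 10
```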
Solution. (Alternative). Consider the points $X = [x, y]$ which satisfy the property $|F_1X| + |F_2X| = 2a$. Coordinate-wise, this implies the equation
$$\sqrt{(x + ae)^2 + y^2} + \sqrt{(x - ae)^2 + y^2} = 2a.$$
Rewrite this as
$$\sqrt{(x + ae)^2 + y^2} = 2a - \sqrt{(x - ae)^2 + y^2}.$$
Square, simplify and square again to get
$$(1 - e^2)x^2 + y^2 = a^2(1 - e^2).$$
Substitute $b^2 = a^2(1 - e^2)$ and divide by $b^2$ to obtain
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,$$
which is the ellipse in standard form. □
Remark.
Otherwise put, the projective map is a map between projective spaces such that in each system of homogeneous coordinates on the domain and image, it is given by the multiplication by a matrix. More generally if the auxiliary linear map is not injective, then we need to define the projective map only outside of its kernel, that is, on points whose homogeneous coordinates do not map to zero.
Since injective maps $V \to V$ of a vector space to itself are invertible, all projective maps of the projective space $\mathcal{P}_n$ to itself are invertible. They are also called regular collineations or projective transformations. In homogeneous coordinates, they correspond to invertible matrices of dimension $n + 1$. Two such matrices define the same projective transformation if and only if one is a (nonzero) multiple of the other.
If we choose the first coordinate as the one whose vanishing defines infinite points, then the transformations preserving infinite points are given by matrices whose first row vanishes up to its first element. If we wish to switch to affine coordinates of finite points (i.e. the first coordinate is fixed at one), the first element in the first row also equals one. Hence the matrices of collineations preserving finite points of the affine space have the form
$$\begin{pmatrix} 1 & 0 & \cdots & 0 \\ b_1 & a_{11} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ b_n & a_{n1} & \cdots & a_{nn} \end{pmatrix},$$
where $b = (b_1, \dots, b_n)^T \in \mathbb{R}^n$ and $A = (a_{ij})$ is an invertible matrix of dimension $n$. The action of such a matrix on the vector $(1, x_1, \dots, x_n)$ is exactly a general affine transformation, where $b$ is the translation and $A$ is its linear part. Thus the affine maps are exactly those collineations which preserve the hyperplane of points at infinity.
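The block form can be checked on a small example. The following is a minimal sketch, assuming the convention above in which the first homogeneous coordinate distinguishes finite points (value 1) from infinite ones (value 0); the function names are hypothetical.

```python
def affine_as_collineation(A, b):
    """(n+1) x (n+1) matrix with first row (1, 0, ..., 0) and rows (b_i, a_i1, ..., a_in)."""
    n = len(A)
    return [[1] + [0] * n] + [[b[i]] + list(A[i]) for i in range(n)]


def apply(M, v):
    """Multiply the matrix M by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Rotation by 90 degrees followed by the translation by (2, 3).
M = affine_as_collineation([[0, -1], [1, 0]], [2, 3])
print(apply(M, [1, 1, 0]))   # finite point (1, 0) goes to the finite point (2, 4)
print(apply(M, [0, 1, 0]))   # infinite point stays infinite: first coordinate is 0
```

On infinite points only the linear part $A$ acts; the translation $b$ has no effect on directions.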
4.3.6. Determining collineations. In order to define an affine map, it is necessary and sufficient to define an image of the affine frame. In the above description of affine transformations as special cases of projective maps, this corresponds to a suitable choice of an image of a suitable arithmetic basis of the vector space $V$.
In general it is not true that the image of an arithmetic basis of V determines the collineation uniquely. The basic problem is illustrated by a simple example of affine plane.
Choose four points A, B, C, D in the plane such that no three of them lie on a line. Then choose their images in the collineation as follows:
Choose arbitrarily their four images $A'$, $B'$, $C'$, $D'$ with the same property, and choose their homogeneous coordinates $u, v, w, z, u', v', w', z'$ in $\mathbb{R}^3$. The vectors $z$ and $z'$ can be written as linear combinations
$$z = c_1u + c_2v + c_3w, \qquad z' = c_1'u' + c_2'v' + c_3'w',$$
where all six coefficients must be nonzero, otherwise there exist three points not in general position.
[Figure: a conic section with its vertices, foci and axis labelled.]
Similarly, the hyperbola foci are the points $F_1, F_2$ which satisfy $\bigl|\,|F_2X| - |F_1X|\,\bigr| = 2a$ for an arbitrary $X$ on the hyperbola. You can check this in the same way as above for the ellipse, with $F_1 = [ae, 0]$, $F_2 = [-ae, 0]$, $ae = \sqrt{a^2 + b^2}$.
Parabola focus. If the parabola has equation $x^2 = 2py$, the focus is the point $F$ with coordinates $F = [0, \frac{p}{2}]$. It is characterized by the fact that the distance between this point and an arbitrary $X$ on the parabola is equal to the distance between $X$ and the line $y = -\frac{p}{2}$.
4.C.11. Find the foci of the ellipse $x^2 + 2y^2 = 2$.
Solution. Rewriting the equation as $\frac{x^2}{2} + y^2 = 1$, we see that the semi-axes lengths are $a = \sqrt2$ and $b = 1$. Compute (see 4.C.10): $ae = \sqrt{a^2 - b^2} = 1$. The foci are at $[-1, 0]$ and $[1, 0]$. □
4.C.12. Prove that the product of the distances from the two foci of an ellipse to any tangent line is constant. Find the value of the constant.
Solution. Every point $T$ on the ellipse has coordinates $T = (x, y)$ where $x = a\cos\theta$, $y = b\sin\theta$ for some $\theta$. The tangent line to the ellipse at $T$ has equation
$$y - b\sin\theta = -\frac{b\cos\theta}{a\sin\theta}(x - a\cos\theta),$$
that is,
$$a(\sin\theta)(y - b\sin\theta) = -b(\cos\theta)(x - a\cos\theta).$$
This meets the $x$-axis at the point $(a/\cos\theta, 0)$. The distance from the focus $F_1$ to the tangent line is $D_1 = [a/\cos\theta - ae]\sin\varphi$, where $\tan\varphi = \pm\frac{b\cos\theta}{a\sin\theta}$. Eliminate
Now choose new arithmetic representatives $\tilde u = c_1u$, $\tilde v = c_2v$ and $\tilde w = c_3w$ of the points $A$, $B$ and $C$ respectively. Similarly $\tilde u' = c_1'u'$, $\tilde v' = c_2'v'$ and $\tilde w' = c_3'w'$ for the points $A'$, $B'$ and $C'$. This choice defines a unique linear map $\varphi$ which maps successively
$$\varphi(\tilde u) = \tilde u', \qquad \varphi(\tilde v) = \tilde v', \qquad \varphi(\tilde w) = \tilde w'.$$
But then,
$$\varphi(z) = \varphi(\tilde u + \tilde v + \tilde w) = \tilde u' + \tilde v' + \tilde w' = z',$$
and so the constructed collineation maps the points as we have chosen in advance. The linear map $\varphi$ is given uniquely by the construction, thus the collineation is given uniquely by this choice.
The argument holds also in the case when some of the chosen points are infinite (i.e. one or two). The same phenomenon can be explained even more easily by the regular collineations of a projective line. These are defined by pairwise different images of three pairwise different points.
The procedure works in an arbitrary dimension n. Then we say that n + 2 points are in general position if no n + 1 of them lie in the same hyperplane. We also call these points linearly independent, forming a geometric basis of projective space.
Theorem. A regular collineation on an $n$-dimensional projective space is uniquely determined by linearly independent images of $n + 2$ linearly independent points.
Proof. The proof is exactly the same as in dimension two. We recommend writing it in detail as an exercise. □
4.3.7. Cross-ratio. Recall that affine maps preserve ratios of lengths of line segments on each line. Technically, for three points $A$, $B$ and $C \neq B$ with $C = rA + sB$, we defined this ratio as $\lambda = (C; A, B) = -\frac{s}{r}$. For central projection the ratios are not preserved. Moreover, even the relative position of points on a line is not necessarily preserved. On the contrary, we may determine uniquely a projective transformation by choosing arbitrarily images of three pairwise different points on a projective line. One can show relatively easily that the ratio of such ratios for two distinct points $C$, $D$ is preserved:
Consider four distinct points $A$, $B$, $C$, $D$ in a projective space with arithmetic coordinates $x$, $y$, $w$, $z$ respectively, which lie on a projective line. Since these four vectors lie in the subspace generated by $x$, $y$, we may write $w$ and $z$ as linear combinations
$$w = t_1x + s_1y, \qquad z = t_2x + s_2y.$$
Define the cross-ratio of the four points $(A, B, C, D)$ as
$$\rho = \frac{s_1}{t_1} \cdot \frac{t_2}{s_2}.$$
The definition is valid, since although the vectors $x$ and $y$ are determined only up to a scalar multiple, these multiples cancel out in the definition.
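The definition can be tried out on a projective line. The following is a minimal sketch, assuming the four points are given by arithmetic representatives in $\mathbb{R}^2$ (coordinates within the two-dimensional subspace carrying the line) and that the coefficients $t_1$ and $s_2$ are nonzero; the coefficients are found by Cramer's rule. The function names are hypothetical.

```python
from fractions import Fraction


def coefficients(x, y, w):
    """Solve w = t*x + s*y for (t, s) by Cramer's rule; x, y must be independent."""
    det = Fraction(x[0] * y[1] - x[1] * y[0])
    t = (w[0] * y[1] - w[1] * y[0]) / det
    s = (x[0] * w[1] - x[1] * w[0]) / det
    return t, s


def cross_ratio(x, y, w, z):
    """Cross-ratio (s1 / t1) * (t2 / s2) of four points on a projective line."""
    t1, s1 = coefficients(x, y, w)
    t2, s2 = coefficients(x, y, z)
    return (s1 / t1) * (t2 / s2)


def transform(M, v):
    """Image of the representative v under the 2x2 matrix M."""
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])


points = [(1, 0), (0, 1), (1, 1), (1, 3)]
M = [[2, 1], [1, 1]]                      # an invertible matrix, det = 1
print(cross_ratio(*points))
print(cross_ratio(*[transform(M, p) for p in points]))
```

Both printed values agree, and rescaling any of the four representatives by a nonzero factor does not change the result either.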
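$\varphi$ to get
$$\begin{aligned}
D_1^2 &= a^2(1 - e\cos\theta)^2\left[\frac{b^2}{a^2\sin^2\theta + b^2\cos^2\theta}\right] \\
&= (1 - e\cos\theta)^2\left[\frac{b^2}{\sin^2\theta + (1 - e^2)\cos^2\theta}\right] \\
&= (1 - e\cos\theta)^2\left[\frac{b^2}{1 - e^2\cos^2\theta}\right] \\
&= b^2\,\frac{1 - e\cos\theta}{1 + e\cos\theta}.
\end{aligned}$$
Since $D_2$ is the same as $D_1$ with $e$ replaced by $-e$, it follows that $D_1D_2 = b^2$. □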
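Solution. (Alternative). Consider the polar basis. The ellipse matrix has the diagonal shape $\operatorname{diag}\bigl(\frac{1}{a^2}, \frac{1}{b^2}, -1\bigr)$ and the tangent equation at $X = [x_0, y_0]$ is $\frac{x_0}{a^2}x + \frac{y_0}{b^2}y = 1$. The distance between $F_{1,2} = [\mp ae, 0]$ and this line equals
$$\frac{\bigl|\mp\frac{x_0e}{a} - 1\bigr|}{\sqrt{\frac{x_0^2}{a^4} + \frac{y_0^2}{b^4}}}.$$
The product of the two distances is
$$\frac{1 - \frac{e^2x_0^2}{a^2}}{\frac{x_0^2}{a^4} + \frac{y_0^2}{b^4}}.$$
If we substitute $a^2e^2 = a^2 - b^2$ and $\frac{y_0^2}{b^2} = 1 - \frac{x_0^2}{a^2}$ (the point $X$ lies on the ellipse), we find that the previous term equals $b^2$. □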
4.C.13. Projective approach to conic sections. Projective space makes it possible to approach conic sections from a new perspective (compare with 4.3.11). We can understand conic sections in $\mathcal{E}_2$ defined by the quadratic function
$$f(x, y) = a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33}$$
as sets of points in the projective plane $\mathcal{P}_2$ with homogeneous coordinates $(x : y : z)$, which are the zero points of the homogeneous quadratic form
$$f(x, y, z) = a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}xz + 2a_{23}yz + a_{33}z^2.$$
Or rather $f(v) = v^TAv$, where $v$ is a column vector with coordinates $(x, y, z)$ and $A$ is the symmetric matrix $(a_{ij})$. By theorem 4.2.6, there exists a basis in which this quadratic form has one of the following equations
f(x, y, z) = x2 + y2 + z2,    f(x, y, z) = x2 + y2 - z2.
In the former case there is only one solution of / (x, y, z) = 0 and therefore the original form does not represent a real conic section. The second quadratic form represents a cone in R3. We obtain the corresponding conic section by moving back to inhomogeneous coordinates. That means intersecting the
Similarly, each projective transformation preserves cross-ratios. Indeed, if the transformation is given in arithmetic coordinates by a matrix $A$, we have the image $A \cdot w = t_1A \cdot x + s_1A \cdot y$, and similarly for $A \cdot z$. Therefore the four images have the same cross-ratio.
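The invariance can be checked numerically. The following is a minimal sketch in Python with exact rational arithmetic. The helper names (`coefficients`, `cross_ratio`, `apply`) and the sample vectors and matrix are illustrative assumptions, not part of the text. It computes the cross-ratio from the coefficients of the two linear combinations, as in the definition, and compares it with the cross-ratio of the images under a regular matrix.

```python
from fractions import Fraction

def coefficients(x, y, w):
    """Return (t, s) with w = t*x + s*y, for w in the span of independent x, y."""
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            det = x[i] * y[j] - x[j] * y[i]
            if det != 0:
                # Cramer's rule on the coordinates i and j
                t = Fraction(w[i] * y[j] - w[j] * y[i], det)
                s = Fraction(x[i] * w[j] - x[j] * w[i], det)
                return t, s
    raise ValueError("x and y are linearly dependent")

def cross_ratio(x, y, w, z):
    """Cross-ratio (s1 / t1) * (t2 / s2) of four collinear projective points."""
    t1, s1 = coefficients(x, y, w)
    t2, s2 = coefficients(x, y, z)
    return (s1 / t1) * (t2 / s2)

def apply(A, v):
    """Image of the vector v under the matrix A."""
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

# Hypothetical sample data: w = x + 2y and z = 3x + y lie on the line through x, y.
x, y = [1, 0, 1], [0, 1, 1]
w, z = [1, 2, 3], [3, 1, 4]
A = [[2, 1, 0], [0, 1, 1], [1, 0, 3]]  # a regular matrix (determinant 7)

before = cross_ratio(x, y, w, z)
after = cross_ratio(apply(A, x), apply(A, y), apply(A, w), apply(A, z))
print(before, after)
```

Rescaling the representatives $x$ and $y$ by non-zero factors leaves the printed value unchanged, which illustrates why the definition does not depend on the choice of arithmetic coordinates.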
We discuss the characterization of projective transformations. These are exactly those maps which preserve cross-ratios. But this is not a very practical characterization, since it contains implicitly the claim that these maps map projective lines to projective lines.
One can prove a much stronger statement. A map of an arbitrarily small open region of the affine space $\mathbb{R}^n$ (e.g. a ball without boundary) into the same affine space which maps lines to lines is actually a restriction of a uniquely determined projective transformation of the projective extension $\mathcal{P}\mathbb{R}^{n+1}$ of the affine space $\mathbb{R}^n$. Thus these transformations also preserve cross-ratios.
Duality. The projective hyperplanes in the $n$-dimensional projective space $\mathcal{P}(V)$ are defined as the projectivizations of $n$-dimensional vector subspaces of the vector space $V$. Hence in homogeneous coordinates, they are defined as kernels of linear forms $\alpha \in V^*$, which in turn are determined up to a scalar multiple.
Thus in a chosen arithmetic basis, a projective hyperplane is given by a row vector $\alpha = (\alpha_0, \dots, \alpha_n)$. But the forms $\alpha$ are given uniquely up to a scalar multiple. Therefore, each hyperplane in $\mathcal{P}(V)$ is identified with exactly one geometric point in the projectivization of the dual space $\mathcal{P}(V^*)$. We call this space the dual projective space, and we talk about a duality between points and hyperplanes.
On forms, the linear map defining a given collineation acts by the multiplication of row vectors from the right by the same matrix,
$$\alpha = (\alpha_0, \dots, \alpha_n) \mapsto \alpha \cdot A.$$
The matrix of the dual map is $A^T$. But the dual map maps forms in the opposite direction, from the "target space" to the "initial one". Therefore the inverse map of the collineation $f$ is required in order to study the effect of regular collineations on points and their dual hyperplanes. The inverse is given by the matrix $A^{-1}$. Hence the matrix for the action of the corresponding collineation on forms is $(A^T)^{-1}$. Since the inverse matrix equals the algebraically adjoint matrix $A^*_{\mathrm{alg}}$ up to multiplication by the inverse of the determinant (see equation (1) on page 94), we can work directly with the projective transformation of the space $\mathcal{P}(V^*)$ given by the matrix $(A^*_{\mathrm{alg}})^T$ (or without transposing if we multiply row vectors from the right).
The projective point $X$ belongs to the hyperplane $\alpha$ if the arithmetic coordinates satisfy $\alpha \cdot x = 0$. This still holds after acting with an arbitrary collineation, since
$$(\alpha \cdot A^{-1}) \cdot (A \cdot x) = \alpha \cdot x = 0.$$
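The remark about the algebraically adjoint matrix can be illustrated in the projective plane. The sketch below is only an illustration under stated assumptions: the function names and the sample matrix, point and form are hypothetical. It moves a point by $A$, moves a line (a row vector) by the adjugate of $A$ instead of $A^{-1}$, and checks that incidence survives, since the two matrices differ only by the factor $\det A$.

```python
def adjugate(A):
    """Algebraically adjoint matrix of a 3x3 matrix: transpose of the cofactor matrix."""
    def minor(i, j):
        rows = [r for r in range(3) if r != i]
        cols = [c for c in range(3) if c != j]
        return (A[rows[0]][cols[0]] * A[rows[1]][cols[1]]
                - A[rows[0]][cols[1]] * A[rows[1]][cols[0]])
    # entry (i, j) of the adjugate is the cofactor of the entry (j, i)
    return [[(-1) ** (i + j) * minor(j, i) for j in range(3)] for i in range(3)]

def row_times_matrix(alpha, M):
    """Row vector alpha multiplied by the matrix M from the right."""
    return [sum(alpha[k] * M[k][j] for k in range(3)) for j in range(3)]

def matrix_times_column(M, x):
    """Matrix M applied to the column vector x."""
    return [sum(M[i][k] * x[k] for k in range(3)) for i in range(3)]

def dot(alpha, x):
    return sum(a * b for a, b in zip(alpha, x))

# Hypothetical data: the point x lies on the line alpha, because alpha . x = 0.
A = [[2, 1, 0], [0, 1, 1], [1, 0, 3]]
x = [1, 2, 3]
alpha = [3, 0, -1]

image_point = matrix_times_column(A, x)
image_line = row_times_matrix(alpha, adjugate(A))  # proportional to alpha . A^{-1}
print(image_line, dot(image_line, image_point))
```

Working with the adjugate keeps all entries integral, which is convenient for hand computations; the resulting row vector represents the same projective line as $\alpha \cdot A^{-1}$.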
cone with the plane which has the equation $z = 1$ in the original basis. We immediately obtain the classification of conic sections from 4.29, which corresponds to intersecting the cone in $\mathbb{R}^3$ with different planes. Non-degenerate sections are depicted. Degenerate sections are those which pass through the vertex of the cone.
We define the following useful terms for a conic section in the projective plane:
Points $P, Q \in \mathcal{P}_2$ corresponding to one-dimensional subspaces $\langle p \rangle$, $\langle q \rangle$ (generated by vectors $p, q \in \mathbb{R}^3$) are called polar conjugated with respect to the conic section $f$ if $F(p, q) = 0$, or rather $p^TAq = 0$, where $F$ is the symmetric bilinear form with $f(x) = F(x, x)$.
A point $P = \langle p \rangle$ is called a singular point of the conic section $f$ when it is polar conjugated with respect to $f$ with all points of the plane, that is, $F(p, x) = 0$ for all $X = \langle x \rangle \in \mathcal{P}_2$. In other words, $Ap = 0$. Hence the matrix $A$ of the conic section does not have maximal rank and therefore defines a degenerate conic section. Non-degenerate conic sections do not contain singular points.
The set of all points $X = \langle x \rangle$ polar conjugated with $P = \langle p \rangle$ is called the polar of the point $P$ with respect to the conic section $f$. It is therefore the set of points for which $F(p, x) = p^TAx = 0$. Because the polar is given by a linear equation in the coordinates, it is always (in the non-singular case) a line. The following explains the geometric interpretation of the polar.
4.C.14. Polar characterization. Consider a non-degenerate conic section $f$. The polar of a point $P \in f$ with respect to $f$ is the tangent to $f$ with the touch point $P$. The polar of a point $P \notin f$ is the line defined by the touch points of the tangents to $f$ passing through $P$.
Solution. First consider $P \in f$. Suppose that the polar of $P$, defined by $F(p, x) = 0$, intersects $f$ in a point $Q = \langle q \rangle \neq P$. Then $F(p, q) = 0$ and $f(q) = F(q, q) = 0$. For an arbitrary point $X = \langle x \rangle$ lying on the line through $P$ and $Q$, $x = \alpha p + \beta q$ for some $\alpha, \beta \in \mathbb{R}$.
4.3.9. Fixed points, centers and axes. Consider a regular collineation $f$ given in an arithmetic basis of the projective space $\mathcal{P}(V)$ by a matrix $A$.
By a fixed point of the collineation $f$, we mean a point $A$ which is mapped to itself, that is, $f(A) = A$. By a fixed hyperplane of the collineation $f$ is meant a hyperplane $\alpha$ which is mapped to itself, that is, $f(\alpha) \subset \alpha$.
Hence the arithmetic representatives of fixed points are exactly the eigenvectors of the matrix A.
In the geometry of the plane, we meet many types of collineations: reflection through a point, reflection across a line, translation, homothety etc. Perhaps we remember also some types of projections, e.g. the projection of a plane in $\mathbb{R}^3$ to another from a center $S \in \mathbb{R}^3$.
Note also that there appear fixed lines next to fixed points in all cases of such affine maps. For example, the reflection through a point preserves also all lines passing through this point. In the case of a translation the infinite points behave similarly.
Now we discuss this phenomenon in an arbitrary dimension. First, we define a classical notion related to the incidence of points and hyperplanes.
A set of hyperplanes passing through a point $A \in \mathcal{P}(V)$ is the set of all hyperplanes which contain the point $A$. For each point $A$, the corresponding set of hyperplanes is itself a hyperplane in the dual space $\mathcal{P}(V^*)$. It is given by one homogeneous linear equation in arithmetic coordinates.
For a collineation $f : \mathcal{P}(V) \to \mathcal{P}(V)$, a point $S \in \mathcal{P}(V)$ is called the center of the collineation $f$ if all hyperplanes in the set determined by $S$ are fixed hyperplanes. A hyperplane $\alpha$ is called the axis of the collineation $f$ if all its points are fixed points.
It follows that the axis of a collineation is the center of the dual collineation, while the set of hyperplanes defining the center of collineation is the axis of the dual collineation.
Since the matrices of a collineation on the former and the dual space differ only by the transposition, their eigenvalues coincide (the eigenvectors are column vectors, respectively row vectors corresponding to the same eigenvalues). For example in the projective plane (and for the same reason in each real projective space of even dimension) each collineation has at least one fixed point, since the characteristic polynomials of corresponding linear maps are of odd degree. Hence they have at least one real root.
Instead of discussing a general theory, we illustrate its usefulness in several results for projective planes.
Proposition. A projective transformation other than the identity has either exactly one center and exactly one axis, or it has neither a center nor an axis.
Proof. Consider a collineation $f$ on $\mathcal{P}\mathbb{R}^3$ and assume that it has two distinct centers $A$ and $B$. Denote by $\ell$ the line given by these two centers, and choose a point $X$ in the projective plane outside of $\ell$. If $p$ and $q$ are the lines passing through the pairs of points $(A, X)$ and $(B, X)$ respectively, then
Because of the bilinearity and symmetry of $F$,
$$f(x) = F(x, x) = \alpha^2F(p, p) + 2\alpha\beta F(p, q) + \beta^2F(q, q) = 0.$$
So every point $X$ of the line lies on the conic section $f$. However, when a conic section contains a line, it has to be degenerate, which is a contradiction.
The claim for $P \notin f$ follows from the symmetry of the bilinear form $F$: when $Q$ lies on the polar of $P$, then $P$ lies on the polar of $Q$.
□
Using polar conjugates we can find the axes and the centre of the conic sections without using the Lagrange algorithm.
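Consider the conic section matrix as a block matrix
$$A = \begin{pmatrix} \bar{A} & a \\ a^T & \alpha \end{pmatrix},$$
where $\bar{A} = (a_{ij})$ for $i, j = 1, 2$, $a$ is the vector $(a_{13}, a_{23})$ and $\alpha = a_{33}$. This means that the conic section is defined by the equation
$$u^T\bar{A}u + 2a^Tu + \alpha = 0$$
for a vector $u = (x, y)$. Now we show that
4.C.15. The axes of a conic section are the polars of the points at infinity determined by the eigenvectors of the matrix $\bar{A}$.
Solution. Because $\bar{A}$ is symmetric, it has the diagonal shape $D = \begin{pmatrix} \lambda & 0 \\ 0 & \mu \end{pmatrix}$ in the basis of its eigenvectors, where $\lambda, \mu \in \mathbb{R}$, and this basis can be chosen orthonormal. Denote by $U$ the matrix changing the basis to a basis of eigenvectors (columns); then the conic section matrix is
$$\begin{pmatrix} U^T & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \bar{A} & a \\ a^T & \alpha \end{pmatrix}\begin{pmatrix} U & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} D & U^Ta \\ a^TU & \alpha \end{pmatrix}.$$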
$f(p) = p$ and $f(q) = q$. In particular, $X$ is fixed. But then all points of the plane outside of $\ell$ are fixed. Hence each line different from $\ell$ has all its points outside of $\ell$ fixed, and thus its intersection with $\ell$ is fixed as well. It follows that $f$ is the identity mapping. So it is proved that every projective transformation other than the identity has at most one center. The same argument for the dual projective plane proves that there is at most one axis.
If $f$ has a center $A$, then all lines passing through $A$ are fixed. They correspond therefore to a two-dimensional subspace of row eigenvectors of the matrix corresponding to the transformation $f$. Therefore, there exists a two-dimensional subspace of column eigenvectors for the same eigenvalue. This represents exactly a line of fixed points, hence it represents the axis. The same consideration in the reversed order proves the opposite statement: if a projective transformation of the plane has an axis, then it also has a center. □
For practical problems it is useful to work with complex projective extensions also in the case of a real plane. Then the geometric behaviour can be easily read off the potential existence of real or imaginary centers and axes.
4.3.10. Pappus Theorem. The following result, known as the Pappus theorem, is a classic result of projective geometry.
Proposition. Let two triples of distinct consecutive collinear points $\{A, B, C\}$ and $\{A', B', C'\}$ lie on two lines that meet at the point $T$, which is closest to $A$ and $A'$, respectively. Define points $Q$, $R$ and $S$ as
$$Q = [AB'] \cap [BA'], \quad R = [AC'] \cap [CA'], \quad S = [BC'] \cap [CB'].$$
Then $\{Q, R, S\}$ are also collinear.
Proof. Without loss of generality, consider the plane passing through $\{T, A, B, C, A', B', C'\}$ as the 2-dimensional plane in $\mathcal{P}_2$ defined by $z = 1$ in the homogeneous coordinates $(x : y : z)$.
The points $\{T, A, B, C, A', B', C'\}$ may be considered as objects in $\mathcal{P}_2$, representing lines through the origin in $\mathbb{R}^3$ with directional vectors $\{t, a, b, c, a', b', c'\}$, respectively. These can be chosen up to a real non-zero factor. The condition $\{z = 1\}$ uniquely identifies those points in $\mathbb{R}^3$ regardless of the choice of $\{t, a, b, c, a', b', c'\}$. Since $\{T, A, B, C\}$ are collinear points (they lie in the same 2-dimensional linear subspace of $\mathbb{R}^3$), we may assume that this plane is generated by $t$ and $a$. Choose
$$b = t + a, \qquad c = \lambda t + a,$$
and analogously, for $\{T, A', B', C'\}$,
$$b' = t + a', \qquad c' = \lambda' t + a'$$
for some real constants $\lambda$ and $\lambda'$. It is only necessary to show that the vectors $q$, $r$, $s$ representing $Q$, $R$, $S$ in $\mathcal{P}_2$ generate a 2-dimensional subspace in $\mathbb{R}^3$. Since
$$(t + a) + a' = a + (t + a'),$$
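So in this basis the conic section has the canonical form determined by the vector $U^Ta$ (up to a translation). Specifically, denote the unit eigenvectors by $v_\lambda$, $v_\mu$; then
$$\lambda\left(x + \frac{a^Tv_\lambda}{\lambda}\right)^2 + \mu\left(y + \frac{a^Tv_\mu}{\mu}\right)^2 = \frac{(a^Tv_\lambda)^2}{\lambda} + \frac{(a^Tv_\mu)^2}{\mu} - \alpha.$$
This means that the eigenvectors are the direction vectors of the conic section axes (main directions). The axes equations in this basis are $x = -\frac{a^Tv_\lambda}{\lambda}$ and $y = -\frac{a^Tv_\mu}{\mu}$. The coordinates $u_\lambda$ and $u_\mu$ of the points of the axes in the standard basis satisfy $v_\lambda^Tu_\lambda = -\frac{a^Tv_\lambda}{\lambda}$ and $v_\mu^Tu_\mu = -\frac{a^Tv_\mu}{\mu}$, that is, $v_\lambda^T(\lambda u_\lambda + a) = 0$ and $v_\mu^T(\mu u_\mu + a) = 0$. These equations are equivalent to the equations $v_\lambda^T(\bar{A}u_\lambda + a) = 0$ and $v_\mu^T(\bar{A}u_\mu + a) = 0$, where $\bar{A} = (a_{ij})_{i,j=1,2}$, which are the polar equations of the points at infinity defined by the vectors $v_\lambda$ and $v_\mu$. □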
4.C.16. Remark. A corollary of the previous claim is that the centre of the conic section is polar conjugated with all points at infinity. The coordinates $s$ of the centre then satisfy the equation $\bar{A}s + a = 0$, where $\bar{A} = (a_{ij})_{i,j=1,2}$ is the quadratic block of the conic matrix $A$ and $a = (a_{13}, a_{23})$.
If $\det(A) \neq 0$, then the equation $\bar{A}s + a = 0$ for the centre coordinates has exactly one solution if $\delta = \det(\bar{A}) \neq 0$, and no solution if $\delta = 0$. That means that, among the non-degenerate conic sections, the ellipse and the hyperbola have exactly one centre. The parabola has no proper centre (its centre is a point at infinity).
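The computation of the centre can be sketched in a few lines. The following Python function is an illustrative assumption (its name and the two sample conics are not taken from the text). It solves the $2 \times 2$ system for the centre by Cramer's rule, with the coefficients read off from the form $a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33}$, and returns `None` when $\delta = 0$.

```python
from fractions import Fraction

def conic_centre(a11, a12, a22, a13, a23):
    """Solve [[a11, a12], [a12, a22]] s + (a13, a23) = 0 for the centre s.

    Returns None when delta = a11*a22 - a12^2 vanishes (no unique proper centre).
    """
    delta = a11 * a22 - a12 * a12
    if delta == 0:
        return None
    # Cramer's rule for the right-hand side (-a13, -a23)
    s1 = Fraction(-a13 * a22 + a12 * a23) / delta
    s2 = Fraction(-a11 * a23 + a12 * a13) / delta
    return s1, s2

# Hypothetical ellipse x^2 + 4y^2 - 2x - 8y + 1 = 0, i.e. (x - 1)^2 + 4(y - 1)^2 = 4
print(conic_centre(1, 0, 4, -1, -4))

# Hypothetical parabola x^2 + 2xy + y^2 + 2x + y + 2 = 0 (delta = 0)
print(conic_centre(1, 1, 1, 1, Fraction(1, 2)))
```

Note that the mixed and linear coefficients enter halved, because the conic matrix is built from $2a_{12}$, $2a_{13}$ and $2a_{23}$.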
4.C.17. Prove that the angle between the tangent to a parabola (with an arbitrary touch point) and the parabola axis is the same as the angle between the tangent and the line connecting the focus and the point of tangency.
Solution. The polar (i.e. tangent) of a point $X = [x_0, y_0]$ of the parabola defined by the canonical equation in the polar basis is the line satisfying
$$(x_0, y_0, 1)\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -p \\ 0 & -p & 0 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = 0.$$
The cosine of the angle between the tangent and the axis of the parabola ($x = 0$) is given by the dot product of the corresponding unit direction vectors. The unit direction vector of the tangent is $\frac{1}{\sqrt{p^2 + x_0^2}}(p, x_0)$ and therefore
$$\cos\alpha = \frac{(p, x_0) \cdot (0, 1)}{\sqrt{p^2 + x_0^2}} = \frac{x_0}{\sqrt{p^2 + x_0^2}}.$$
Now we show that this is the same as the cosine of the angle between the tangent and the line connecting the focus $F = [0, \frac{p}{2}]$,
$q = t + a + a'$ represents $Q$. Since
$$\lambda\lambda' t + \lambda' a + \lambda a' = \lambda(\lambda' t + a') + \lambda' a = \lambda'(\lambda t + a) + \lambda a',$$
$r = \lambda\lambda' t + \lambda' a + \lambda a'$ represents $R$. Finally,
$$s = q - r = t + a + a' - \lambda\lambda' t - \lambda' a - \lambda a' = (1 - \lambda')(t + a) + (1 - \lambda)(\lambda' t + a') = (1 - \lambda)(t + a') + (1 - \lambda')(\lambda t + a)$$
represents the point $S$. Thus, the points $\{Q, R, S\}$ lie in the 2-dimensional subspace generated by the vectors $q$ and $r$. Since $Q$, $R$, $S$ also belong to the plane $\{z = 1\}$, these points are collinear. □
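The statement is easy to test on concrete points. The sketch below is an illustration under stated assumptions: it uses the standard fact that in homogeneous coordinates of the plane the cross product of two points gives the line through them, and the cross product of two lines gives their intersection. The function names and the sample points are hypothetical. Collinearity of $Q$, $R$, $S$ is then the vanishing of a $3 \times 3$ determinant.

```python
def cross(u, v):
    """Cross product in R^3: the line through two points, or the meet of two lines."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def det3(u, v, w):
    """Determinant of the matrix with rows u, v, w; zero iff the points are collinear."""
    c = cross(v, w)
    return u[0] * c[0] + u[1] * c[1] + u[2] * c[2]

def pappus_points(A, B, C, A1, B1, C1):
    """Q = [AB'] n [BA'], R = [AC'] n [CA'], S = [BC'] n [CB'] (primes written as 1)."""
    Q = cross(cross(A, B1), cross(B, A1))
    R = cross(cross(A, C1), cross(C, A1))
    S = cross(cross(B, C1), cross(C, B1))
    return Q, R, S

# Hypothetical data in the plane z = 1: A, B, C on the line y = 0 and
# A', B', C' on the line y = x; the two lines meet at T = [0, 0].
A, B, C = (1, 0, 1), (2, 0, 1), (4, 0, 1)
A1, B1, C1 = (1, 1, 1), (2, 2, 1), (3, 3, 1)

Q, R, S = pappus_points(A, B, C, A1, B1, C1)
print(Q, R, S, det3(Q, R, S))
```

Integer homogeneous coordinates keep the check exact; dividing each result by its last coordinate gives the affine points in the plane $z = 1$.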
4.3.11. Projective classification of quadrics. To end this section, we return to conics and quadrics. A quadric $Q$ in the $n$-dimensional affine space $\mathbb{R}^n$ is defined by a general quadratic equation (1), see page 253. By viewing the affine space $\mathbb{R}^n$ as affine coordinates in the projective space $\mathcal{P}\mathbb{R}^{n+1}$, we may wish to describe the set $Q$ by homogeneous coordinates in projective space. The formula in these coordinates should contain only terms of second order, since only a homogeneous formula is independent of the choice of the multiple of the homogeneous coordinates $(x_0, x_1, \dots, x_n)$ of a point. Hence we search for a homogeneous formula whose restriction to affine coordinates (that is, the substitution $x_0 = 1$) gives the original formula (1).
But this is especially easy. Simply add enough factors of $x_0$ to all terms: nothing to the quadratic terms, $x_0$ to the linear terms and $x_0^2$ to the constant term in the original affine equation for $Q$.
We obtain a well defined quadratic form $f$ on the vector space $\mathbb{R}^{n+1}$ whose zero set correctly defines the projective quadric $\bar{Q}$.
The intersection of the "cone" $\bar{Q} \subset \mathbb{R}^{n+1}$ of the zero set of this form with the affine plane $x_0 = 1$ is the original quadric $Q$, whose points are called the proper points of the quadric. The other points $\bar{Q} \setminus Q$ in the projective extension are the infinite points.
The classification of real or complex projective quadrics, up to projective transformations, is a problem already considered. It is all about finding the canonical polar basis, see paragraph 4.29. From this classification, given by the signature of the form in the real case and by the rank only in the complex case, we can deduce also the classification of the affine quadrics. We show the essential part of the procedure in the case of conics in the affine and the projective plane.
The projective classification gives the following possibilities, described by homogeneous coordinates $(x : y : z)$ in the projective plane $\mathcal{P}\mathbb{R}^3$:
• imaginary regular conic given by $x^2 + y^2 + z^2 = 0$
• real regular conic given by $x^2 + y^2 - z^2 = 0$
• pair of imaginary lines given by $x^2 + y^2 = 0$
• pair of real lines given by $x^2 - y^2 = 0$
• one double line $x^2 = 0$.
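and the touch point $X$. The unit direction vector of the connecting line is
$$\frac{1}{\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}}\left(x_0, y_0 - \frac{p}{2}\right).$$
For the cosine of the angle,
$$\cos\beta = \frac{(p, x_0) \cdot (x_0, y_0 - \frac{p}{2})}{\sqrt{p^2 + x_0^2}\,\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}} = \frac{x_0(y_0 + \frac{p}{2})}{\sqrt{p^2 + x_0^2}\,\sqrt{x_0^2 + (y_0 - \frac{p}{2})^2}}.$$
Substitute $y_0 = \frac{x_0^2}{2p}$, so that $x_0^2 + (y_0 - \frac{p}{2})^2 = (y_0 + \frac{p}{2})^2$, to obtain $\frac{x_0}{\sqrt{p^2 + x_0^2}}$.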
This example shows that light rays striking a parabolic mirror parallel to its axis are reflected to the focus and, vice versa, light rays going through the focus are reflected in the direction parallel to the axis of the parabola. This is the principle of many devices such as parabolic reflectors. □
Solution. (Alternative) At the point $P = (at^2, 2at)$ on the parabola $y^2 = 4ax$, the tangent line has slope $1/t$ and the focus is at $(a, 0)$. So the line joining $P$ to the focus $F$ has slope $\frac{2at}{at^2 - a} = \frac{2t}{t^2 - 1}$. If $\theta$ is the angle between the tangent line and the $x$-axis, then $\tan\theta = 1/t$, so
$$\tan 2\theta = \frac{2\tan\theta}{1 - \tan^2\theta} = \frac{2/t}{1 - 1/t^2} = \frac{2t}{t^2 - 1}.$$
By subtraction, the angle between the tangent line and the line joining $P$ to the focus is $\theta$.
Note that the tangent line meets the $x$-axis at $Q = (-at^2, 0)$. The result also follows from showing that $|FP| = |FQ|$, and hence the triangle $QFP$ is isosceles. □
You can find many more examples on quadrics on D
We consider this classification as real, that is, the classification of quadratic forms is given not only by its rank but also by its signature. Nevertheless, the points of a quadric are considered also in the complex extension. In this way we should understand the stated names. For example the imaginary conic does not have any real points.
4.3.12. Affine classification of quadrics. For an affine classification we must restrict the projective transformations to those which preserve the line of infinite points. This can be seen also by the converse procedure: for a fixed projective type of conic $Q$, that is, its cone $\bar{Q} \subset \mathbb{R}^3$, we choose different affine planes $\alpha \subset \mathbb{R}^3$ which do not pass through the origin. We observe the changes to the set of points $\bar{Q} \cap \alpha$, which are the proper points of $Q$ in affine coordinates, as realized by the plane $\alpha$.
Hence in the case of a regular conic there is a real cone $\bar{Q}$ given by the equation $z^2 = x^2 + y^2$. As planes $\alpha$ we may for instance choose the tangent planes to the unit sphere. If we begin with the plane $z = 1$, the intersection consists only of finite points forming a unit circle $Q$. By a gradual change of the slope of $\alpha$ we obtain a more and more stretched ellipse until we get such a slope that $\alpha$ is parallel with one of the lines of the cone. At that moment there appears one (double) infinite point of the conic, whose finite points still form one connected component, and so we have a parabola. Continuing to change the slope gives rise to two infinite points. The set of finite points is no longer connected, and so we obtain the last regular quadric in the affine classification, a hyperbola.
This method also lets us continue the classification in higher dimensions. In particular, we notice that the intersection of the conic with the projective line of infinite points is always a quadric in dimension one less. It is either the empty set, or a double point, or two points, as the types of quadrics on a projective line. Next, we can find an affine transformation transforming one of the possible realizations of a fixed projective type to another one only if the corresponding quadrics in the infinite line are projectively equivalent. In this way, it is possible to continue the classification of quadrics in dimension three and above.
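For a real regular conic, the three affine types can thus be told apart by counting its infinite points. A minimal sketch, assuming the conic is regular and written as $a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + \dots = 0$ with at least one non-zero quadratic coefficient; the function names and the sample conics are illustrative only. The infinite points $(x : y : 0)$ are the roots of the binary form $a_{11}x^2 + 2a_{12}xy + a_{22}y^2$, so their number is decided by the sign of $a_{12}^2 - a_{11}a_{22}$.

```python
def points_at_infinity(a11, a12, a22):
    """Number of real points (x : y : 0) with a11 x^2 + 2 a12 x y + a22 y^2 = 0."""
    disc = a12 * a12 - a11 * a22
    if disc < 0:
        return 0   # empty intersection with the line at infinity
    if disc == 0:
        return 1   # one double point
    return 2       # two distinct points

def affine_type(a11, a12, a22):
    """Affine type of a real regular conic, read off from its infinite points."""
    return {0: "ellipse", 1: "parabola", 2: "hyperbola"}[points_at_infinity(a11, a12, a22)]

# Hypothetical examples (only the quadratic parts matter here)
print(affine_type(1, 0, 1))    # x^2 + y^2 - 1 = 0
print(affine_type(1, 0, 0))    # x^2 - 2y = 0
print(affine_type(4, -4, 3))   # 4x^2 - 8xy + 3y^2 - 2y - 5 = 0
```

The sketch does not test regularity; for a degenerate conic the same count distinguishes, for example, intersecting lines from parallel ones.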
D. Further exercise on this chapter
4.D.1. Find a parametric equation for the intersection of the following planes in $\mathbb{R}^3$:
$$\alpha : 2x + 3y - z + 1 = 0 \quad \text{and} \quad \rho : x - 2y + 5 = 0.$$
4.D.2. Find a common perpendicular for the skew lines
$$p : [1, 1, 1] + t(2, 1, 0), \qquad q : [2, 2, 0] + t(1, 1, 1).$$
O
4.D.3. Jarda is standing at $[-1, 1, 0]$ and has a stick of length 4. Can he simultaneously touch the lines $p$ and $q$, where
$$p : [0, -1, 0] + t(1, 2, 1), \qquad q : [3, 4, 8] + s(2, 1, 3)?$$
(The stick must pass through [—1,1,0].) O
4.D.4. A cube ABCDEFGH is given. The point T lies on the edge BF, with \BT\ = j\BF\. Compute the cosine of the angle between ATC and BDE. O
4.D.5. A cube ABCDEFGH is given. The point T lies on the edge AE, with |AT\ = \\AE\. S is the midpoint of AD. Compute the cosine of the angle between BDT and SCH. O
4.D.6. A cube ABCDEFGH is given. The point T lies on the edge BF, \BT\ = ^\BF\. Compute the cosine of the angle between ATC and BDE. O
4.D.7. Find the lengths of the semi-axes of an ellipse for which the sum of the lengths of the semi-axes and the distance between the foci are both equal to 1.
Solution. It is given that $a + b = 1$ and $2ae = 1$. Also $b^2 = a^2(1 - e^2)$. Eliminating $e$ gives $b^2 = a^2 - \frac{1}{4}$. So $\frac{1}{4} = a^2 - b^2 = (a - b)(a + b) = a - b$. So $a = \frac{5}{8}$ and $b = \frac{3}{8}$. □
Solution. (Alternative.) Solve the system
$$a + b = 1, \qquad 2e = 2\sqrt{a^2 - b^2} = 1,$$
where $e$ here denotes the linear eccentricity, and find the solution $a = \frac{5}{8}$, $b = \frac{3}{8}$. □
4.D.8. For which slopes $k$ are the lines passing through $[-4, 2]$ secant lines, respectively tangent lines, of the ellipse defined by
$$\frac{x^2}{9} + \frac{y^2}{4} = 1?$$
Solution. The direction vector of the line is $(1, k)$ and its parametric equations then are $x = -4 + t$, $y = 2 + kt$. The intersection with the ellipse satisfies
$$\frac{(-4 + t)^2}{9} + \frac{(2 + kt)^2}{4} = 1.$$
This quadratic equation has discriminant equal to
$$D = -\frac{k}{9}(7k + 16).$$
This implies that for $k \in (-\frac{16}{7}, 0)$ there are two solutions, and the line is a secant. For $k = -\frac{16}{7}$ and $k = 0$ there is only one solution and the line is a tangent to the ellipse. □
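The sign analysis of the discriminant can be reproduced mechanically. The following Python sketch is an illustration only; the function names and the tested slopes are assumptions. It substitutes the parametrized line through a given point into the ellipse $x^2/a^2 + y^2/b^2 = 1$ and classifies the line by the sign of the discriminant in $t$, using exact fractions.

```python
from fractions import Fraction

def discriminant(x0, y0, k, a2, b2):
    """Discriminant in t of (x0 + t)^2 / a2 + (y0 + k t)^2 / b2 = 1."""
    x0, y0, k, a2, b2 = (Fraction(v) for v in (x0, y0, k, a2, b2))
    qa = 1 / a2 + k * k / b2                   # coefficient of t^2
    qb = 2 * x0 / a2 + 2 * k * y0 / b2         # coefficient of t
    qc = x0 * x0 / a2 + y0 * y0 / b2 - 1       # constant term
    return qb * qb - 4 * qa * qc

def classify(k, x0=-4, y0=2, a2=9, b2=4):
    """'secant', 'tangent' or 'disjoint' for the line through [x0, y0] with slope k."""
    d = discriminant(x0, y0, k, a2, b2)
    if d > 0:
        return "secant"
    if d == 0:
        return "tangent"
    return "disjoint"

for k in (Fraction(-16, 7), -1, 0, 1):
    print(k, classify(k))
```

Vertical lines are not covered by the slope parametrization and would need a separate check.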
4.D.9. Find all lines tangent to the ellipse $3x^2 + 7y^2 = 30$ such that the distance from the centre of the ellipse to the tangent is 3.
Solution. All lines at distance 3 from the origin are tangents to the circle with centre $[0, 0]$ and radius 3. They all have an equation $x\cos\theta + y\sin\theta = 3$ for some $\theta$. This line meets the standard ellipse $x^2/a^2 + y^2/b^2 = 1$ where
$$\frac{x^2}{a^2} + \frac{(3 - x\cos\theta)^2}{b^2\sin^2\theta} = 1,$$
or
$$x^2(a^2\cos^2\theta + b^2\sin^2\theta) - 6a^2x\cos\theta - a^2(b^2\sin^2\theta - 9) = 0.$$
It is a tangent line if the above equation has a double root for $x$. Thus it is required that
$$36a^4\cos^2\theta = 4a^2(a^2\cos^2\theta + b^2\sin^2\theta)(9 - b^2\sin^2\theta).$$
This simplifies to requiring that
$$a^2\cos^2\theta + b^2\sin^2\theta = 9.$$
This implies
$$\cos^2\theta = \frac{9 - b^2}{a^2 - b^2}, \qquad \sin^2\theta = \frac{a^2 - 9}{a^2 - b^2}.$$
For the given problem $a^2 = 10$ and $b^2 = 30/7$. The solution is $\pm x\sqrt{33} \pm y\sqrt{7} = 3\sqrt{40}$. □
Solution. (Alternative.) The tangent line at the point $(a\cos\theta, b\sin\theta)$ is $(y - b\sin\theta) = -\frac{b\cos\theta}{a\sin\theta}(x - a\cos\theta)$, with $a^2 = 10$ and $b^2 = 30/7$. The distance to the origin, 3, implies $3 = (a/\cos\theta)\sin\rho$, where $\tan\rho = \frac{b\cos\theta}{a\sin\theta}$. Hence $3\cos\theta = a\sin\rho$, where $a\sin\theta\tan\rho = b\cos\theta$, so $3\sin\theta = b\cos\rho$. Therefore
$$\frac{9}{a^2}\cos^2\theta + \frac{9}{b^2}\sin^2\theta = 1, \qquad 9\cos^2\theta + 21\sin^2\theta = 10, \qquad 12\sin^2\theta = 1.$$
For the touch point in the first quadrant this gives the tangent
$$y - \sqrt{\tfrac{5}{14}} = -\sqrt{\tfrac{33}{7}}\left(x - \sqrt{\tfrac{55}{6}}\right).$$
□
Solution. (Alternative.) The centre of the ellipse is at the origin. The distance $d$ between the line $ax + by + c = 0$ and the origin is $d = \frac{|c|}{\sqrt{a^2 + b^2}}$. The tangent then satisfies $a^2 + b^2 = \frac{c^2}{9}$. The equation of the tangent at the point $[x_T, y_T]$ of the ellipse is $3xx_T + 7yy_T - 30 = 0$. For the coordinates of the point of tangency,
$$(3x_T)^2 + (7y_T)^2 = 100, \qquad 3x_T^2 + 7y_T^2 = 30.$$
Its solution is $x_T = \pm\sqrt{\frac{55}{6}}$, $y_T = \pm\sqrt{\frac{5}{14}}$. Considering the symmetry of the ellipse, there are four solutions
$$\pm 3\sqrt{\tfrac{55}{6}}\,x \pm 7\sqrt{\tfrac{5}{14}}\,y - 30 = 0.$$
□
4.D.10. A hyperbola $x^2 - y^2 = 2$ is given. Find an equation of the hyperbola having the same foci and passing through the point $[-2, 3]$.
Solution. The given hyperbola has $a^2 = b^2 = 2$, so $a^2e^2 = a^2 + b^2 = 4$, and the foci are at $(\pm ae, 0) = (\pm 2, 0)$. So the desired hyperbola has an equation
$$\sqrt{(x - 2)^2 + y^2} - \sqrt{(x + 2)^2 + y^2} = k$$
for some constant $k$. Since the hyperbola passes through $[-2, 3]$, $k = 2$. Rearranging and squaring gives
$$\sqrt{(x - 2)^2 + y^2} = \sqrt{(x + 2)^2 + y^2} + 2,$$
$$(x - 2)^2 + y^2 = (x + 2)^2 + y^2 + 4\sqrt{(x + 2)^2 + y^2} + 4,$$
$$(-2x - 1)^2 = (x + 2)^2 + y^2,$$
or $3x^2 = y^2 + 3$, which is the required hyperbola. □
Solution. (Alternative.) The equation of the desired hyperbola is $\frac{x^2}{a^2} - \frac{y^2}{b^2} = 1$, with its eccentricity $e$ satisfying $a^2e^2 = a^2 + b^2 = 4$, since the foci are at $[\pm ae, 0] = [\pm 2, 0]$. The point $[-2, 3]$ lies on the hyperbola, so $\frac{4}{a^2} - \frac{9}{b^2} = 1$. It follows that $a^2 = 1$, $b^2 = 3$. The desired hyperbola is $x^2 - \frac{y^2}{3} = 1$. □
4.D.11. Determine the equations of the tangent lines to the hyperbola $4x^2 - 9y^2 = 1$ which are perpendicular to the line $x - 2y + 7 = 0$.
Solution. All lines perpendicular to the given line have an equation $2x + y + c = 0$ for some $c$. Such a line is a tangent if its intersection with the given hyperbola is a double point, that is, if the equation $4x^2 - 9(-2x - c)^2 = 1$ has a double root. Hence $(36c)^2 - 4 \cdot 32 \cdot (9c^2 + 1) = 0$, and $c = \pm\frac{2}{3}\sqrt{2}$. □
4.D.12. Determine the tangents to the ellipse $\frac{x^2}{16} + \frac{y^2}{9} = 1$ which are parallel with the line $x + y - 7 = 0$.
Solution. The lines parallel with the given line intersect this line in the point at infinity $(1 : -1 : 0)$. Construct the tangents to the given ellipse passing through this point. The point of tangency $T = (t_1 : t_2 : t_3)$ lies on its polar and therefore satisfies $\frac{t_1}{16} - \frac{t_2}{9} = 0$, so $t_2 = \frac{9}{16}t_1$. Substituting into the ellipse equation (with $t_3 = 1$), we get $t_1 = \pm\frac{16}{5}$. The touching points of the desired tangents are $[\frac{16}{5}, \frac{9}{5}]$ and $[-\frac{16}{5}, -\frac{9}{5}]$. The tangents are the polars of those points. They have equations $x + y = 5$ and $x + y = -5$. □
Solution. (Alternative). The given line has slope $-1$. The tangent line at $(4\cos\theta, 3\sin\theta)$ has slope $-\frac{3\cos\theta}{4\sin\theta}$, so it is required that $\tan\theta = \frac{3}{4}$. The tangent line has equation $(y - 3\sin\theta) = (-1)(x - 4\cos\theta)$, where either $\sin\theta = 3/5$ and $\cos\theta = 4/5$, or $\sin\theta = -3/5$ and $\cos\theta = -4/5$. The two solutions are $x + y = \pm 5$. □
4.D.13. Determine the points at infinity and the asymptotes of the conic section
$$2x^2 + 4xy + 2y^2 - y + 1 = 0.$$
Solution. The equation for the points at infinity, $2x^2 + 4xy + 2y^2 = 0$, or rather $2(x + y)^2 = 0$, has the solution $x = -y$. The only point at infinity is therefore $(1 : -1 : 0)$, so the conic section is a parabola. The asymptote is the polar of this point, specifically the line at infinity $z = 0$. □
4.D.14. Prove that the product of the distances between an arbitrary point on a hyperbola and both of its asymptotes is constant. Find its value.
Solution. Denote the point lying on the hyperbola by $P = [x_0, y_0]$. The asymptotes of the hyperbola in canonical form have equations $bx \pm ay = 0$. Their normals are $(b, \pm a)$ and from here we determine the projections $P_1$, $P_2$ of the point $P$ to the asymptotes. For the distance between the point $P$ and the asymptotes we get $|PP_{1,2}| = \frac{|bx_0 \pm ay_0|}{\sqrt{a^2 + b^2}}$. The product is therefore equal to $\frac{|b^2x_0^2 - a^2y_0^2|}{a^2 + b^2} = \frac{a^2b^2}{a^2 + b^2}$, because $P$ lies on the hyperbola. □
4.D.15. Compute the angle between the asymptotes of the hyperbola $3x^2 - y^2 = 3$.
Solution. For the cosine of the angle between the asymptotes of the hyperbola in canonical form, $\cos\alpha = \frac{b^2 - a^2}{b^2 + a^2}$. In this case $a^2 = 1$ and $b^2 = 3$, so $\cos\alpha = \frac{1}{2}$ and the angle is 60 degrees. □
4.D.16.   Locate the centers of the conic sections
(a) $9x^2 + 6xy - 2y - 2 = 0$
(b) $x^2 + 2xy + y^2 + 2x + y + 2 = 0$
(c) $x^2 - 4xy + 4y^2 + 2x - 4y - 3 = 0$
(d) = l
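Solution. (a) The system $\bar{A}s + a = 0$ for computing centers, with $\bar{A}$ the quadratic block of the conic matrix and $a = (a_{13}, a_{23})$, is
$$9s_1 + 3s_2 = 0, \qquad 3s_1 - 1 = 0.$$
Solve it to obtain the center at $[\frac{1}{3}, -1]$.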
(b) In this case,
$$s_1 + s_2 + 1 = 0, \qquad s_1 + s_2 + \tfrac{1}{2} = 0.$$
Therefore there is no proper center (the conic section is a parabola). Moving to homogeneous coordinates we obtain the center at infinity $(1 : -1 : 0)$.
(c) The coordinates of the center in this case satisfy
$$s_1 - 2s_2 + 1 = 0, \qquad -2s_1 + 4s_2 - 2 = 0.$$
The solution is a line of centers. This is so because the conic section is degenerate: it is a pair of parallel lines.
(d) The center is at (a, 0). The coordinates of the center therefore give the translation of the coordinate system to the frame in which the ellipse has its basic form.
□
4.D.17. Find the equations of the axes of the conic section $6xy + 8y^2 + 4y + 2x - 13 = 0$.
Solution. The major and minor axes of the conic section are in the directions of the eigenvectors of the matrix $\begin{pmatrix} 0 & 3 \\ 3 & 8 \end{pmatrix}$. The characteristic equation has the form $\lambda^2 - 8\lambda - 9 = 0$. The eigenvalues are therefore $\lambda_1 = -1$, $\lambda_2 = 9$. The corresponding eigenvectors are then $(3, -1)$ and $(1, 3)$. The axes are the polars of the points at infinity defined by those directions. For $(3, -1)$, the axis equation is $-3x + y + 1 = 0$. For $(1, 3)$ it is $9x + 27y + 7 = 0$. □
4.D.18. Determine the equations of the axes of the conic section $4x^2 + 4xy + y^2 + 2x + 6y + 5 = 0$.
Solution. The eigenvalues of the matrix $\begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix}$ are $\lambda_1 = 0$, $\lambda_2 = 5$ and the corresponding eigenvectors are $(-1, 2)$ and $(2, 1)$. There is one axis $2x + y + 1 = 0$, and the conic section is a parabola. □
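The recipe "axes are polars of the points at infinity given by the eigenvectors" is easy to check by hand or by machine. The sketch below is an illustration with hypothetical function names. It verifies that a direction is an eigenvector of the quadratic block and computes the polar of the corresponding point at infinity, here for the conic $4x^2 + 4xy + y^2 + 2x + 6y + 5 = 0$.

```python
from fractions import Fraction

def eigenvalue_of(A, v):
    """Eigenvalue of the upper-left 2x2 block of A for the direction v, or None."""
    w0 = A[0][0] * v[0] + A[0][1] * v[1]
    w1 = A[1][0] * v[0] + A[1][1] * v[1]
    if w0 * v[1] != w1 * v[0]:      # the image is not parallel to v
        return None
    return Fraction(w0, v[0]) if v[0] != 0 else Fraction(w1, v[1])

def polar(A, p):
    """Coefficients (l1, l2, l3) of the polar line p^T A (x, y, z)^T = 0."""
    return [sum(p[i] * A[i][j] for i in range(3)) for j in range(3)]

# Conic matrix of 4x^2 + 4xy + y^2 + 2x + 6y + 5 = 0
A = [[4, 2, 1],
     [2, 1, 3],
     [1, 3, 5]]

for v in [(2, 1), (-1, 2)]:
    # polar of the point at infinity (v1 : v2 : 0)
    print(v, eigenvalue_of(A, v), polar(A, (v[0], v[1], 0)))
```

A polar of the form $(0, 0, c)$ is the line at infinity $z = 0$, which is what happens for the eigenvector with zero eigenvalue of a parabola; the other direction gives the single proper axis.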
4.D.19. The equation
$$x^2 + 3xy - y^2 + x + y + 1 = 0$$
defines a conic section. Determine its center, axes, asymptotes and foci.
4.D.20. Find the equation of the tangent at $P = [1, 1]$ to the conic section
$$4x^2 + 5y^2 - 8xy + 2y - 3 = 0.$$
Solution. In the projective extension, this is the conic section defined by the quadratic form $(x, y, z)A(x, y, z)^T$ with the matrix
$$A = \begin{pmatrix} 4 & -4 & 0 \\ -4 & 5 & 1 \\ 0 & 1 & -3 \end{pmatrix}.$$
Using the previous theorem, the tangent is the polar of $P$, which has homogeneous coordinates $(1 : 1 : 1)$. It is given by the equation $(1, 1, 1)A(x, y, z)^T = 0$, which in this case gives
$$2y - 2z = 0.$$
Moving back to inhomogeneous coordinates, the tangent line equation is $y = 1$.
□
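The same computation can be sketched in code. The helper names below are illustrative assumptions. The conic matrix is assembled from the coefficients of $a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + 2a_{13}x + 2a_{23}y + a_{33}$, the point is first checked to lie on the conic, and its polar then gives the tangent.

```python
def conic_matrix(a11, a12, a22, a13, a23, a33):
    """Symmetric 3x3 matrix of the homogenized quadratic form."""
    return [[a11, a12, a13],
            [a12, a22, a23],
            [a13, a23, a33]]

def polar(A, p):
    """Coefficients (l1, l2, l3) of the line p^T A (x, y, z)^T = 0."""
    return [sum(p[i] * A[i][j] for i in range(3)) for j in range(3)]

def form_value(A, v):
    """Value v^T A v of the quadratic form; zero iff v lies on the conic."""
    return sum(v[i] * A[i][j] * v[j] for i in range(3) for j in range(3))

# 4x^2 + 5y^2 - 8xy + 2y - 3 = 0 and the point P = [1, 1]
A = conic_matrix(4, -4, 5, 0, 1, -3)
p = (1, 1, 1)

print(form_value(A, p))   # 0 means that P lies on the conic
print(polar(A, p))        # coefficients of the tangent line at P
```

If `form_value` is non-zero, the polar is still a line, but by 4.C.14 it is then the line through the two touch points of the tangents from $P$ rather than a tangent.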
4.D.21. Find the coordinates of the intersection of the $y$ axis and the conic section defined by
$$5x^2 + 2xy + y^2 - 8x = 0.$$
Solution. The $y$ axis is the line $x = 0$. It is the polar of a point $P$ with homogeneous coordinates $(p) = (p_1 : p_2 : p_3)$. That means that the equation $x = 0$ is equivalent to the polar equation $F(p, v) = p^TAv = 0$, where $v = (x, y, z)^T$. This is satisfied when $Ap = (\alpha, 0, 0)^T$ for some $\alpha \in \mathbb{R}$. For the conic section matrix
$$A = \begin{pmatrix} 5 & 1 & -4 \\ 1 & 1 & 0 \\ -4 & 0 & 0 \end{pmatrix}$$
this condition gives the equation system
$$5p_1 + p_2 - 4p_3 = \alpha, \qquad p_1 + p_2 = 0, \qquad -4p_1 = 0.$$
We can find the coordinates of $P$ by the inverse matrix, $p = A^{-1}(\alpha, 0, 0)^T$, or solve the system directly by backward substitution. In this case we easily obtain the solution $p = (0, 0, -\frac{1}{4}\alpha)$. So the $y$ axis touches the conic section at the origin.
□
4.D.22. Find the touch point of the line $x = 2$ with the conic section from the previous exercise.
Solution. The line has the equation $x - 2z = 0$ in its projective extension and therefore we get the condition $Ap = (\alpha, 0, -2\alpha)^T$ for the touch point $P$, which gives
$$5p_1 + p_2 - 4p_3 = \alpha, \qquad p_1 + p_2 = 0, \qquad -4p_1 = -2\alpha.$$
Its solution is $p = (\frac{1}{2}\alpha, -\frac{1}{2}\alpha, \frac{1}{4}\alpha)$. These homogeneous coordinates are equivalent to $(2 : -2 : 1)$ and hence the touch point has coordinates $[2, -2]$. □
4.D.23. Find equations of the tangents passing through $P = [3, 4]$ to the conic defined by
$$2x^2 - 4xy + y^2 - 2x + 6y - 3 = 0.$$
Solution. Suppose that the point of tangency $T$ has homogeneous coordinates given by a multiple of the vector $t = (t_1, t_2, t_3)$. The condition that $T$ lies on the conic section is $t^TAt = 0$, which gives
$$2t_1^2 - 4t_1t_2 + t_2^2 - 2t_1t_3 + 6t_2t_3 - 3t_3^2 = 0.$$
The condition that $P$ lies on the polar of $T$ is $p^TAt = 0$, where $p = (3, 4, 1)$ are the homogeneous coordinates of the point $P$. In this case, the equation gives
$$(3, 4, 1)\begin{pmatrix} 2 & -2 & -1 \\ -2 & 1 & 3 \\ -1 & 3 & -3 \end{pmatrix}\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = -3t_1 + t_2 + 6t_3 = 0.$$
Now we can substitute $t_2 = 3t_1 - 6t_3$ into the previous quadratic equation. Then
$$-t_1^2 + 4t_1t_3 - 3t_3^2 = 0.$$
Because the equation is not satisfied by any point with $t_3 = 0$, we move to inhomogeneous coordinates $(\frac{t_1}{t_3}, \frac{t_2}{t_3}, 1)$, for which we get
$$-\left(\frac{t_1}{t_3}\right)^2 + 4\left(\frac{t_1}{t_3}\right) - 3 = 0 \quad \text{and} \quad \frac{t_2}{t_3} = 3\left(\frac{t_1}{t_3}\right) - 6,$$
that is, $\frac{t_1}{t_3} = 1$ and $\frac{t_2}{t_3} = -3$, or $\frac{t_1}{t_3} = 3$ and $\frac{t_2}{t_3} = 3$. So the touch points have homogeneous coordinates $(1 : -3 : 1)$ and $(3 : 3 : 1)$. The tangents are the polars of those points, with equations $7x - 2y - 13 = 0$ and $x = 3$. □
4.D.24. Find an equation of the tangent passing through the origin to the circle
$$x^2 + y^2 - 10x - 4y + 25 = 0.$$
Solution. The touch point $(t_1 : t_2 : t_3)$ satisfies
$$(0, 0, 1)\begin{pmatrix} 1 & 0 & -5 \\ 0 & 1 & -2 \\ -5 & -2 & 25 \end{pmatrix}\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = -5t_1 - 2t_2 + 25t_3 = 0.$$
From here we eliminate $t_2$ and substitute into the circle equation, which $(t_1 : t_2 : t_3)$ has to satisfy as well. Setting $t_3 = 1$, we obtain the quadratic equation $29t_1^2 - 250t_1 + 525 = 0$, with solutions $t_1 = 5$ and $t_1 = \frac{105}{29}$. We compute the coordinate $t_2$ and get the touch points $[5, 0]$ and $[\frac{105}{29}, \frac{100}{29}]$. The tangents are the polars of those points, with equations $y = 0$ and $20x - 21y = 0$. □
4.D.25.   Find equations of the tangents to the circle $x^2 + y^2 = 5$ which are parallel to $2x + y + 2 = 0$.
Solution. In the projective extension, these tangents intersect at the point at infinity of the line $2x + y + 2z = 0$, that is, at the point with homogeneous coordinates $(1 : -2 : 0)$. They are the tangents from this point to the circle, so we can use the same method as in the previous exercise. The matrix of the conic section is diagonal with the diagonal $(1, 1, -5)$ and therefore the touch point $(t_1 : t_2 : t_3)$ of the tangents satisfies $t_1 - 2t_2 = 0$. Substituting into the circle equation (with $t_3 = 1$) gives $5t_2^2 = 5$. Since $t_2 = \pm 1$, the touch points are $[2,1]$ and $[-2,-1]$, and the tangents are $2x + y - 5 = 0$ and $2x + y + 5 = 0$. □
Solution. Alternative. The point $P = (\sqrt{5}\cos\theta, \sqrt{5}\sin\theta)$ lies on the circle for all $\theta$. The tangent line at $P$ is $x\cos\theta + y\sin\theta = \sqrt{5}$. This has slope $-(\cos\theta)/(\sin\theta)$, which is $-2$ provided $\tan\theta = 1/2$. It follows that $P$ is at either $[2,1]$ or $[-2,-1]$. □
A tangent line touching the conic section at infinity is called an asymptote. The number of asymptotes of a conic section equals the number of intersections between the conic section and the line at infinity. So the ellipse has no real asymptotes, the parabola has one (which is, however, the line at infinity) and the hyperbola has two.
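4.D.26.   Find the points at infinity and the asymptotes of the conic section defined by
$$4x^2 - 8xy + 3y^2 - 2y - 5 = 0.$$
Solution. First, rewrite the conic section in homogeneous coordinates:
$$4x^2 - 8xy + 3y^2 - 2yz - 5z^2 = 0.$$
The points at infinity are the points with homogeneous coordinates $(x : y : 0)$ satisfying this equation, which means
$$4x^2 - 8xy + 3y^2 = (2x - y)(2x - 3y) = 0.$$
It follows that either $\frac{x}{y} = \frac{1}{2}$ or $\frac{x}{y} = \frac{3}{2}$. The conic section is therefore a hyperbola with the points at infinity $P = (1 : 2 : 0)$ and $Q = (3 : 2 : 0)$. The asymptotes are the polars of these points:
$$(1,2,0)\begin{pmatrix} 4 & -4 & 0\\ -4 & 3 & -1\\ 0 & -1 & -5\end{pmatrix}\begin{pmatrix} x\\ y\\ 1\end{pmatrix} = -4x + 2y - 2 = 0$$
and
$$(3,2,0)\begin{pmatrix} 4 & -4 & 0\\ -4 & 3 & -1\\ 0 & -1 & -5\end{pmatrix}\begin{pmatrix} x\\ y\\ 1\end{pmatrix} = 4x - 6y - 2 = 0,$$
that is, the lines $2x - y + 1 = 0$ and $2x - 3y - 1 = 0$.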
□
There are further exercises on conic sections on the page 270.
4.D.27. Harmonic cross-ratio. If the cross-ratio of four points lying on the line equals —1, we talk about a harmonic quadruple. Let ABCD be a quadrilateral. Denote by K the intersection of the lines AB and CD, by M the intersection of
the lines AD and BC. Further let L, N be the intersection of KM and AC, BD respectively. Show that the points K, L, M, N are a harmonic quadruple. O
Exercise solution
4.A.9. 2, 3, 4, 6, 7, 8. Find planes the positions of which correspond to each of those numbers.
4.B.12. For the normal vector $(a, b, c)$ of such planes $ax + by + cz = d$, we have $a + b = 0$, since $(a,b,c)$ must be orthogonal to the direction of $p$, and $a = d$, since the plane contains $(1,0,0)$. So the plane is $ax - ay + cz = a$. If $a = 0$, then the plane is $z = 0$, whose normal vector $(0,0,1)$ does not satisfy the conditions, so after a suitable rescaling we may take $a = -b = 1$. The angle condition
$$\cos 60^\circ = \frac{1}{2} = \frac{2}{\sqrt{2}\,\sqrt{2 + c^2}}$$
then gives $c = \pm\sqrt{6}$. Altogether, the sought equations are $x - y \pm \sqrt{6}\,z - 1 = 0$.
4.B.17. (-1,3,2).
4.D.1. Line (2t, t, It) + [-5,0, -9].
4.D.2. [3,2,1], [8/3,8/3,2/3].
4.D.3. The transversal [1,1,1][-3,1,-1] is of length $\sqrt{20}$, so the stick is not long enough.
4.D.4. ^ 4.D.5. ^.
CHAPTER 5
Establishing the ZOO
which functions do we need for our models? - a thorough menagerie
A. Polynomial interpolation
Let us start with some examples which will hopefully make us more comfortable with polynomials.
5.A.1.   Determine the sum of coefficients of the polynomial $(1 - 2x + 3x^2 - x^3)^r$, where $r$ is your age in years.
Solution. The sum of the coefficients of a polynomial is equal to its value at 1. Therefore the sum is $(1 - 2 + 3 - 1)^r = 1^r = 1$.
□
In this chapter, we start using tools allowing us to model dependencies which are neither linear, nor discrete. Such models are often needed when dealing with time dependent systems. We try to describe them not only at discrete moments of time, but "continuously". Sometimes this is advantageous, for instance in physical models of classical mechanics and engineering. It might also be appropriate and computationally effective to employ an approximation of discrete models in economics, chemistry, or biology. In particular such ideas may be appropriate in relation to stochastic models, as we shall see in Chapter 10.
The key concept is that of a function, also called a "signal" in practical applications. The larger the class of functions used, the more difficult is the development of effective tools. On the other hand, if there are only a few simple types of functions available, it may be that some real situations cannot be modelled at all.
The objective of the following two chapters is thus to introduce explicitly the most elementary functions of real variables. It is also to describe implicitly many more functions, and to build the standard tools to use them. This is the differential and integral calculus of one variable. While the focus has been mainly on the part of mathematics called algebra, the emphasis will now be on mathematical analysis. The link between the two is provided by a "geometric approach". If possible, this means building concepts and intuition independently of any choice of coordinates. Often this leads to a discrete (finite) description of the objects of interest. This is immediate when working with polynomials now.
5.A.2.   Determine the coefficient of $x^{120}$ in the polynomial
$$P(x) = (1 - x + x^2 - x^3 + \cdots - x^{49})(1 + x + x^2 + \cdots + x^{101}).$$
Solution.
$$P(x) = \frac{1 - x^{50}}{1 + x}\cdot\frac{1 - x^{102}}{1 - x} = (1 - x^{50})\,\frac{1 - x^{102}}{1 - x^2} = (1 - x^{50})(1 + x^2 + x^4 + \cdots + x^{100}) = 1 + x^2 + \cdots + x^{48} - x^{102} - \cdots - x^{150}.$$
The coefficient of $x^{120}$ is $-1$.
□
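The result of 5.A.2 can be confirmed by brute-force multiplication. The sketch below assumes the two factors are $1 - x + x^2 - \cdots - x^{49}$ and $1 + x + \cdots + x^{101}$, as the solution's closed forms suggest; the function name `poly_mul` is ours.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    result = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            result[i + j] += a * b
    return result

first = [(-1) ** k for k in range(50)]   # 1 - x + x^2 - ... - x^49
second = [1] * 102                       # 1 + x + ... + x^101
product = poly_mul(first, second)

print(product[120])                      # coefficient of x^120
```

The product has only even powers, with coefficient 1 up to $x^{48}$, coefficient 0 from $x^{50}$ to $x^{100}$, and coefficient $-1$ from $x^{102}$ to $x^{150}$.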
1. Polynomial interpolation
In the previous chapters, we often worked with sequences of real or complex numbers, i.e. with scalar functions N —> K or Z —> K, where K is a given set of numbers. We also worked with sequences of vectors over real or complex numbers.
Recall the discussion from paragraph 1.1.6 about dealing with scalar functions. This discussion is adequate for working with functions R —> R (real-valued functions of one real variable), or R —> C (complex-valued functions of one real variable), or sometimes more generally with vector-valued functions of one real variable R —> V. The results can usually be
5.A.3.   Prove that any real solution $x_0$ of the equation $x^3 + px + q = 0$ ($p, q \in \mathbb{R}$) satisfies the inequality $4qx_0 \le p^2$.
Solution. For $x_0 = 0$ the inequality is trivial. Otherwise, note that $x_0$ is a solution of the quadratic equation $x_0x^2 + px + q = 0$, therefore its discriminant $p^2 - 4x_0q$ is non-negative.
□
5.A.4. Let $P(x)$ be a polynomial of degree at most $n$, $n \ge 1$, such that
$$P(k) = \frac{n + 1 - k}{k + 1}$$
for $k = 0, 1, \dots, n$. Find $P(n + 1)$.
Solution. Let $Q(x) = (x + 1)P(x) - (n + 1 - x)$. Note that $Q(x)$ has degree at most $n + 1$ and the condition from the problem statement now says that the $n + 1$ numbers $0, 1, \dots, n$ are roots of this polynomial, that is, $Q(x) = K \cdot x \cdot (x - 1) \cdots (x - n)$. Now we use the two expressions for $Q(x)$ to determine $K$. On one hand $Q(-1) = -(n + 2)$, but $Q(-1) = K \cdot (-1)^{n+1}(n + 1)!$ as well. Thus
$$K = \frac{(-1)^n(n + 2)}{(n + 1)!},$$
that is, $Q(n + 1) = K\,(n + 1)! = (-1)^n(n + 2)$. On the other hand, from our definition of $Q(x)$ we get $Q(n + 1) = (n + 2)P(n + 1)$. Altogether, $P(n + 1) = (-1)^n$. □
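The claim $P(n+1) = (-1)^n$ can be tested numerically for small $n$. The following sketch assumes the interpolation conditions $P(k) = \frac{n+1-k}{k+1}$, which is the form implied by the auxiliary polynomial $Q$ in the solution; the helper `lagrange_eval` is our own name and evaluates the unique interpolating polynomial exactly with rational arithmetic.

```python
from fractions import Fraction

def lagrange_eval(xs, ys, x):
    """Evaluate the interpolation polynomial through (xs[i], ys[i]) at x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

results = []
for n in range(1, 8):
    xs = list(range(n + 1))                          # the points 0, 1, ..., n
    ys = [Fraction(n + 1 - k, k + 1) for k in xs]    # prescribed values P(k)
    results.append(lagrange_eval(xs, ys, n + 1))

print(results)   # alternates between -1 and 1
```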
5.A.5. Let $P$ be a polynomial with real non-negative coefficients. Prove that if $P(\frac{1}{x})P(x) \ge 1$ for $x = 1$, then the same inequality holds for every positive $x$.
Solution. Let $P(x) = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + a_0$. From the problem statement we have $P(1)^2 \ge 1$. Further,
$$P(x)P\left(\frac{1}{x}\right) = (a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + a_0)\cdot(a_nx^{-n} + a_{n-1}x^{-(n-1)} + \cdots + a_1x^{-1} + a_0)$$
$$= \sum_{i=0}^{n} a_i^2 + \sum_{i<j} a_ia_j\left(x^{j-i} + \frac{1}{x^{j-i}}\right) \ge \sum_{i=0}^{n} a_i^2 + 2\sum_{i<j} a_ia_j = P(1)^2 \ge 1,$$
where we have used the well-known inequality $x + \frac{1}{x} \ge 2$, which holds for any positive real number $x$ (it is equivalent to $(\sqrt{x} - \frac{1}{\sqrt{x}})^2 \ge 0$). □
Now we will try to approximate functions by polynomials. Suppose we have incomplete information about an unknown function, namely the values it takes at several points, or the values of its first or second derivatives at those points
extended to cases concerning vector values over the considered scalars, rather than just real and complex numbers. We begin with some easily computable functions.
5.1.1. Polynomials. We can add and multiply scalars. These operations satisfy a number of properties which we listed in paragraphs 1.1.1 and 1.1.5. If we admit any finite number of these operations, leaving one of the variables as an unknown and fixing the other scalars, we obtain the polynomial functions. We consider the scalars K = R, C, or Q.
Polynomials
A polynomial over a ring of scalars K is a mapping $f : K \to K$ given by the expression
$$f(x) = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + a_0,$$
where $a_i$, $i = 0, \dots, n$, are fixed scalars. Multiplication is indicated by juxtaposition of symbols, and "+" denotes addition. If $a_n \ne 0$, the polynomial $f$ is said to have degree $n$. The degree of the zero polynomial is undefined. The scalars $a_i$ are called the coefficients of the polynomial $f$.
The polynomials of degree zero are exactly the non-zero constant mappings. In algebra, polynomials are more often defined as formal expressions of the aforementioned form of $f(x)$, i.e. a polynomial is defined to be a sequence $a_0, a_1, \dots$ of coefficients such that only finitely many of them are non-zero. However, we will show shortly that these approaches are equivalent for our choices of scalars.
It is easy to verify that the polynomials over a given ring of scalars form a ring. Multiplication and addition of polynomials are defined through the operations in the original ring K applied to the values of the polynomials. Hence,
$$(f \cdot g)(x) = f(x) \cdot g(x), \qquad (f + g)(x) = f(x) + g(x),$$
where the operations on the left-hand side are interpreted in the ring of polynomials whereas the operations on the right-hand side are those of the ring of scalars (see the third part of Chapter 11 for a detailed algebraic treatment).
5.1.2. Division of polynomials. As already mentioned, the scalar fields used are Q, R, or C. In all of these fields, the following holds:
Euclidean division of polynomials
Proposition. For any two polynomials f of degree n and g of degree m, there is exactly one pair of polynomials q, r such that f = q ■ g + r, where either r = 0, or the degree of r is less than m.
Proof. This is a special simple case of the much more general algebraic result in 12.2.6. Write $f(x) = a_nx^n + \cdots + a_1x + a_0$ for the polynomial of degree $n$, and $g(x) = b_mx^m + b_{m-1}x^{m-1} + \cdots + b_1x + b_0$, with $a_n \ne 0$ and $b_m \ne 0$.
as well. We will try to find a polynomial (of the least degree possible) satisfying these dependencies.
5.A.6.   Find a polynomial $P$ satisfying the following conditions:
$$P(2) = 1, \quad P(3) = 0, \quad P(4) = -1, \quad P(5) = 6.$$
Solution. First, let us solve this task by creating a system of four linear equations in four variables. Suppose the polynomial is of the form $a_3x^3 + a_2x^2 + a_1x + a_0$. We know there is exactly one polynomial of degree less than four satisfying the given conditions.
$$\begin{aligned}
a_0 + 2a_1 + 4a_2 + 8a_3 &= 1\\
a_0 + 3a_1 + 9a_2 + 27a_3 &= 0\\
a_0 + 4a_1 + 16a_2 + 64a_3 &= -1\\
a_0 + 5a_1 + 25a_2 + 125a_3 &= 6.
\end{aligned}$$
Each equation arose from one of the given conditions.
Another option is to construct the required polynomial from the fundamental Lagrange polynomials (see 5.1.4):
$$P(x) = 1\cdot\frac{(x-3)(x-4)(x-5)}{(2-3)(2-4)(2-5)} + 0\cdot(\dots) + (-1)\cdot\frac{(x-2)(x-3)(x-5)}{(4-2)(4-3)(4-5)} + 6\cdot\frac{(x-2)(x-3)(x-4)}{(5-2)(5-3)(5-4)}$$
$$= \frac{4}{3}x^3 - 12x^2 + \frac{101}{3}x - 29.$$
The coefficients of the polynomial form, of course, the solution of the aforementioned system of linear equations. □
The methods from the previous example can be applied to complex-valued polynomials as well:
5A.7. Find a polynomial P satisfying the following conditions:
P(l + i)=i, P(2) = 1, P(3) = -i.
o
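The Lagrange construction used in 5.A.6 can be sketched in a few lines. This is a minimal illustration with exact rational arithmetic; the function names `poly_mul` and `lagrange_coeffs` are ours. Each elementary polynomial is built as a product of linear factors and the results are combined with the prescribed values.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply two coefficient lists (lowest power first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def lagrange_coeffs(xs, ys):
    """Coefficients (lowest power first) of the interpolation polynomial."""
    n = len(xs)
    coeffs = [Fraction(0)] * n
    for i in range(n):
        basis = [Fraction(1)]
        for j in range(n):
            if j != i:
                d = Fraction(xs[i] - xs[j])
                # multiply by the linear factor (x - xs[j]) / (xs[i] - xs[j])
                basis = poly_mul(basis, [-xs[j] / d, 1 / d])
        for k, c in enumerate(basis):
            coeffs[k] += ys[i] * c
    return coeffs

coeffs = lagrange_coeffs([2, 3, 4, 5], [1, 0, -1, 6])
print(coeffs)   # constant term first
```

For the data of 5.A.6 the list reads $-29, \frac{101}{3}, -12, \frac{4}{3}$, in agreement with the polynomial found above.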
5.A.8. For pairwise distinct points $x_0, \dots, x_n \in \mathbb{R}$, consider the elementary Lagrange polynomials (5.1.4):
$$\ell_i(x) = \frac{(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_n)}{(x_i - x_0)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_n)},$$
$x \in \mathbb{R}$, $i = 0, \dots, n$. Prove that
$$\sum_{i=0}^{n}\ell_i(x) = 1 \quad\text{for all } x \in \mathbb{R}.$$
Consider uniqueness. Suppose there are polynomials $q$, $q'$, $r$, and $r'$, such that
$$f = q \cdot g + r = q' \cdot g + r'.$$
Then subtraction gives $0 = (q - q') \cdot g + (r - r')$.
If $q = q'$, then also $r = r'$. If $q \ne q'$, then the term of highest degree in $(q - q') \cdot g$ cannot be cancelled by $r - r'$, whose degree is less than $m$. This leads to a contradiction. This proves uniqueness.
It remains to prove that $f$ can always be expressed in the desired form. If $m > n$, then $f = 0 \cdot g + f$ satisfies the requirements. So suppose that $n \ge m$. The result is proved by induction on the degree of $f$.
If $f$ is of degree zero, then the statement is trivial. Suppose the statement holds for all polynomials of degree less than $n > 0$. Put
$$h(x) = f(x) - \frac{a_n}{b_m}x^{n-m}g(x).$$
If $h(x)$ is the zero polynomial, then $f$ is of the desired form. Otherwise $h(x)$ is a polynomial of degree less than that of $f$ and so $h$ can be written in the desired form as $h(x) = q \cdot g + r$. But then
$$f(x) = h(x) + \frac{a_n}{b_m}x^{n-m}g(x) = \left(q + \frac{a_n}{b_m}x^{n-m}\right)g(x) + r$$
and the proof is complete. □
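The induction step of the proof is exactly the loop of the usual long-division algorithm: subtract $\frac{a_n}{b_m}x^{n-m}g(x)$ until the degree drops below $m$. A minimal sketch over the rationals follows; the names `trim` and `poly_divmod` are ours, and coefficient lists are stored lowest power first.

```python
from fractions import Fraction

def trim(p):
    """Drop zero coefficients of the highest powers."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def poly_divmod(f, g):
    """Return (q, r) with f = q*g + r and r zero or of degree less than deg g."""
    f, g = trim(map(Fraction, f)), trim(map(Fraction, g))
    if not g:
        raise ZeroDivisionError("division by the zero polynomial")
    q = [Fraction(0)] * max(len(f) - len(g) + 1, 0)
    r = f
    while len(r) >= len(g):
        shift = len(r) - len(g)          # this is n - m in the proof
        factor = r[-1] / g[-1]           # this is a_n / b_m in the proof
        q[shift] = factor
        # subtract factor * x^shift * g(x); the leading term cancels
        for k, b in enumerate(g):
            r[shift + k] -= factor * b
        r = trim(r)
    return q, r

# x^3 - 1 divided by x - 1, and x^2 + 1 divided by x - 1
print(poly_divmod([-1, 0, 0, 1], [-1, 1]))
print(poly_divmod([1, 0, 1], [-1, 1]))
```

An empty remainder list stands for the zero polynomial, so the first call shows that $x - 1$ divides $x^3 - 1$ with quotient $x^2 + x + 1$.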
If $f(b) = 0$ for some element $b \in K$, divide $f$ by $g(x) = x - b$. The remainder $r$ is a constant and $0 = f(b) = q(b) \cdot 0 + r$, so $r = 0$. Consequently $f(x) = (x - b)q(x)$. The element $b$ is called a root of the polynomial $f$. The degree of $q$ is then $n - 1$. If $q$ also has a root, we can continue and in no more than $n$ steps we arrive at a constant polynomial. It follows that the number of roots of any non-zero polynomial over the field K is at most the degree of the polynomial. Hence the following observation:
Corollary. If the field of scalars K is infinite, then the polynomials $f$ and $g$ are equal as mappings if and only if they are equal as sequences of coefficients.
Proof. Suppose that $f = g$, i.e. $f - g = 0$, as a mapping. Then the polynomial $(f - g)(x)$ has infinitely many roots, which is possible only if it is the zero polynomial. □
Notice that this statement does not hold for finite fields. A simple counter-example is the polynomial $x^2 + x$ over $\mathbb{Z}_2$, which represents the constant zero mapping.
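The counter-example over $\mathbb{Z}_2$ can be checked by tabulating values. The helper below (the name `as_mapping` is ours) lists the values of a polynomial at every element of $\mathbb{Z}_m$, with coefficients given lowest power first.

```python
def as_mapping(coeffs, modulus):
    """Values of the polynomial at every element of Z_modulus."""
    return [sum(c * x ** k for k, c in enumerate(coeffs)) % modulus
            for x in range(modulus)]

values_mod_2 = as_mapping([0, 1, 1], 2)   # x^2 + x over Z_2
values_mod_3 = as_mapping([0, 1, 1], 3)   # the same coefficients over Z_3
print(values_mod_2, values_mod_3)
```

Over $\mathbb{Z}_2$ the non-zero coefficient sequence $(0, 1, 1)$ gives the zero mapping, while over $\mathbb{Z}_3$ the same coefficients do not.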
5.1.3. Interpolation polynomial. It is often desirable to use an easily computable expression for a function which is given by its values at some given points $x_0, \dots, x_n$. Mostly this would be an approximation of an unknown function represented by finitely many values only. We look for such polynomials.
If the values were all zeros, we can immediately find a polynomial of degree $n + 1$, namely
$$f(x) = (x - x_0)(x - x_1)\cdots(x - x_n).$$
This is zero at these points and only at them. However, there are other polynomials which are zero at the given points. For
Solution. Apparently,
$$\sum_{i=0}^{n}\ell_i(x_0) = 1 + 0 + \cdots + 0 = 1,$$
$$\sum_{i=0}^{n}\ell_i(x_1) = 0 + 1 + \cdots + 0 = 1,$$
$$\vdots$$
$$\sum_{i=0}^{n}\ell_i(x_n) = 0 + 0 + \cdots + 1 = 1.$$
This means that the polynomial $\sum_{i=0}^{n}\ell_i(x)$ of degree not greater than $n$ takes the value 1 at the $n + 1$ points $x_0, \dots, x_n$. However, there is exactly one such polynomial, namely the constant polynomial $y = 1$. □
5.A.9. Find a polynomial P satisfying the following conditions:
P(l) = 0, P'(l) = 1, P(2) = 3, P'(2) = 3.
Solution. Once again, we will provide two methods of finding the polynomial.
The given conditions give rise to four linear equations for the coefficients of the wanted polynomial. So if we look for a polynomial of degree less than four, we get the same number of equations and unknown coefficients (let us say $P(x) = a_3x^3 + a_2x^2 + a_1x + a_0$):
$$P(1) = a_3 + a_2 + a_1 + a_0 = 0, \qquad P'(1) = 3a_3 + 2a_2 + a_1 = 1,$$
$$P(2) = 8a_3 + 4a_2 + 2a_1 + a_0 = 3, \qquad P'(2) = 12a_3 + 4a_2 + a_1 = 3.$$
By solving this system, we obtain the polynomial $P(x) = -2x^3 + 10x^2 - 13x + 5$.
Another solution. We will use the fundamental Hermite polynomials:
$$h_1^1(x) = \left(1 - \frac{2}{1 - 2}(x - 1)\right)(2 - x)^2 = (2x - 1)(x - 2)^2, \qquad h_2^1(x) = (5 - 2x)(x - 1)^2,$$
$$h_1^2(x) = (x - 1)(x - 2)^2, \qquad h_2^2(x) = (x - 2)(x - 1)^2.$$
Altogether,
$$P(x) = 0 \cdot h_1^1(x) + 3 \cdot h_2^1(x) + 1 \cdot h_1^2(x) + 3 \cdot h_2^2(x) = -2x^3 + 10x^2 - 13x + 5.$$
□
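The four Hermite conditions of 5.A.9 are quick to verify for the resulting polynomial. A small sketch follows; `poly_eval` and `poly_derivative` are our own helper names, and coefficients are listed from the highest power down.

```python
def poly_eval(coeffs, x):
    """Horner evaluation; coeffs are listed from the highest power down."""
    value = 0
    for c in coeffs:
        value = value * x + c
    return value

def poly_derivative(coeffs):
    """Coefficients of the derivative, highest power first."""
    n = len(coeffs) - 1
    return [c * (n - k) for k, c in enumerate(coeffs[:-1])]

P = [-2, 10, -13, 5]        # -2x^3 + 10x^2 - 13x + 5
dP = poly_derivative(P)     # -6x^2 + 20x - 13

conditions = (poly_eval(P, 1), poly_eval(dP, 1), poly_eval(P, 2), poly_eval(dP, 2))
print(conditions)           # P(1), P'(1), P(2), P'(2)
```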
instance the zero polynomial, which is the only such polynomial in the vector space of polynomials of degree at most n. The general situation is analogous:
Interpolation polynomials
Let K be an infinite field of scalars. An interpolation polynomial $f$ for the set of (pairwise distinct) points $x_0, \dots, x_n \in K$ and given values $y_0, \dots, y_n \in K$ is either the zero polynomial, or a polynomial of degree at most $n$ such that $f(x_i) = y_i$ for all $i = 0, 1, \dots, n$.
Theorem. For every set of $n + 1$ (pairwise distinct) points $x_0, \dots, x_n \in K$ and given values $y_0, \dots, y_n \in K$, there is exactly one interpolation polynomial $f$.
Proof. If $f$ and $g$ are interpolation polynomials with the same defining values, then their difference is a polynomial of degree at most $n$ which has at least $n + 1$ roots, and thus $f - g = 0$. This proves uniqueness.
It remains to prove the existence. Label the coefficients of the polynomial $f$ of degree $n$:
$$f = a_nx^n + \cdots + a_1x + a_0.$$
Substituting the desired values leads to a system of $n + 1$ equations for the same number of unknown coefficients $a_i$:
$$\begin{aligned}
a_0 + x_0a_1 + \cdots + (x_0)^na_n &= y_0\\
&\ \,\vdots\\
a_0 + x_na_1 + \cdots + (x_n)^na_n &= y_n.
\end{aligned}$$
The existence of a solution of this system is easily shown by constructing the polynomial by using the Lagrange polynomials for the given points xo,... ,xn. (See below).
However, the proof can be concluded by using only basic knowledge from linear algebra. This system of linear equations has a unique solution if the determinant of its matrix is a non-zero scalar (see 2.3.5 and 2.2.11). The determinant is the Vandermonde determinant, which was discussed in the exercise 2.B.7 on page 92.
Since it is verified that for zero right-hand sides, there is exactly one solution, we know that this determinant must be non-zero.
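5.A.10. Using Lagrange interpolation, approximate $\cos^2 1$. Use the values the function takes at the points $\frac{\pi}{4}$, $\frac{\pi}{3}$, and $\frac{\pi}{2}$.
Solution. First, we determine the mentioned values: $\cos^2(\frac{\pi}{4}) = 1/2$, $\cos^2(\frac{\pi}{3}) = 1/4$, $\cos^2(\frac{\pi}{2}) = 0$. Then, we determine the elementary Lagrange polynomials, calculating their values at the given point:
$$\ell_0(1) = \frac{(1 - \frac{\pi}{3})(1 - \frac{\pi}{2})}{(\frac{\pi}{4} - \frac{\pi}{3})(\frac{\pi}{4} - \frac{\pi}{2})}, \qquad
\ell_1(1) = \frac{(1 - \frac{\pi}{4})(1 - \frac{\pi}{2})}{(\frac{\pi}{3} - \frac{\pi}{4})(\frac{\pi}{3} - \frac{\pi}{2})}, \qquad
\ell_2(1) = \frac{(1 - \frac{\pi}{4})(1 - \frac{\pi}{3})}{(\frac{\pi}{2} - \frac{\pi}{4})(\frac{\pi}{2} - \frac{\pi}{3})}.$$
Altogether,
$$P(1) = \frac{1}{2}\cdot\ell_0(1) + \frac{1}{4}\cdot\ell_1(1) + 0\cdot\ell_2(1) = \frac{(7\pi - 12)(\pi - 2)}{4\pi^2} \doteq 0.288913.$$
We may notice we did not need to calculate the third elementary polynomial. The actual value is $\cos^2 1 = 0.291927$. □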
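5.A.11. Joe needs to calculate values of the sine function with a calculator capable of basic arithmetic operations only. As he remembers the sine's values at the points $0, \frac{\pi}{6}, \frac{\pi}{4}, \frac{\pi}{3}, \frac{\pi}{2}$ and knows that $\pi$, $\sqrt{2}$, and $\sqrt{3}$ are approximately 3.1416, 1.4142, and 1.7321, respectively, he decided to use interpolation. Help him build an approximate formula, using all of the given values.
Solution. We will construct the elementary Lagrange polynomials:
$$\ell_0(x) = \frac{(x - \frac{\pi}{6})(x - \frac{\pi}{4})(x - \frac{\pi}{3})(x - \frac{\pi}{2})}{(0 - \frac{\pi}{6})(0 - \frac{\pi}{4})(0 - \frac{\pi}{3})(0 - \frac{\pi}{2})} = 1.4783x^4 - 5.8052x^3 + 8.1057x^2 - 4.7746x + 1,$$
$$\ell_1(x) = \frac{(x - 0)(x - \frac{\pi}{4})(x - \frac{\pi}{3})(x - \frac{\pi}{2})}{(\frac{\pi}{6} - 0)(\frac{\pi}{6} - \frac{\pi}{4})(\frac{\pi}{6} - \frac{\pi}{3})(\frac{\pi}{6} - \frac{\pi}{2})} = -13.3046x^4 + 45.2808x^3 - 49.2419x^2 + 17.1887x,$$
$$\ell_2(x) = \frac{(x - 0)(x - \frac{\pi}{6})(x - \frac{\pi}{3})(x - \frac{\pi}{2})}{(\frac{\pi}{4} - 0)(\frac{\pi}{4} - \frac{\pi}{6})(\frac{\pi}{4} - \frac{\pi}{3})(\frac{\pi}{4} - \frac{\pi}{2})} = 23.6526x^4 - 74.3070x^3 + 71.3298x^2 - 20.3718x,$$
$$\ell_3(x) = \frac{(x - 0)(x - \frac{\pi}{6})(x - \frac{\pi}{4})(x - \frac{\pi}{2})}{(\frac{\pi}{3} - 0)(\frac{\pi}{3} - \frac{\pi}{6})(\frac{\pi}{3} - \frac{\pi}{4})(\frac{\pi}{3} - \frac{\pi}{2})} = -13.3046x^4 + 38.3146x^3 - 32.8279x^2 + 8.5943x,$$
$$\ell_4(x) = \frac{(x - 0)(x - \frac{\pi}{6})(x - \frac{\pi}{4})(x - \frac{\pi}{3})}{(\frac{\pi}{2} - 0)(\frac{\pi}{2} - \frac{\pi}{6})(\frac{\pi}{2} - \frac{\pi}{4})(\frac{\pi}{2} - \frac{\pi}{3})}$$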
Since polynomials are equal as mappings if and only if they are equal as sequences of coefficients, the theorem is proved. □
5.1.4. Applications of interpolations. At first sight, it may seem that real or rational polynomials, that is, polynomial functions R —> R or Q —> Q, form a very useful class of functions of one variable. We can arrange for them to attain any set of given values. Moreover, they are easily expressible, so their value at any point can be calculated without difficulties.
However, there are a number of problems when trying to use them in practice.
The first of the problems is to find quickly the polynomial which will interpolate the given data. Solving the aforementioned system of linear equations generally requires time proportional to the cube of the number of given points $x_i$. This is unacceptable for large data. We will demonstrate how to overcome this on one popular type of polynomial related to fixed points $x_0, \dots, x_n$:
Lagrange¹ interpolation polynomials
The Lagrange interpolation polynomial is expressed in terms of the elementary Lagrange polynomials $\ell_i$ of degree $n$ with the properties
$$\ell_i(x_j) = \begin{cases} 1 & i = j\\ 0 & i \ne j. \end{cases}$$
These polynomials must (up to a constant factor) equal the expressions $(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_n)$. So
$$\ell_i(x) = \frac{\prod_{j \ne i}(x - x_j)}{\prod_{j \ne i}(x_i - x_j)} = \frac{\ell(x)}{\ell'(x_i)(x - x_i)},$$
where $\ell(x) = \prod_{i=0}^{n}(x - x_i)$. The desired Lagrange interpolation polynomial is then given by
$$f(x) = y_0\ell_0(x) + y_1\ell_1(x) + \cdots + y_n\ell_n(x).$$
The usage of Lagrange polynomials is especially efficient when working with different values $y_i$ for the same set of points $x_i$. For in this case, the elementary polynomials $\ell_i$ are already prepared.
One of the disadvantages of this expression is a large sensitivity to inaccuracies in a computation when the differences of the given points $x_i$ are small. This is because division by these differences is required.
Another disadvantage (common to all ways of expressing the unique interpolation polynomial) is poor stability of the values of real or rational polynomials outside of the interval containing all its roots.
Soon we will develop tools for an exact description of the functions' behaviour. But even without such tools, it is
$= 1.4783x^4 - 3.4831x^3 + 2.6343x^2 - 0.6366x.$
1 Joseph-Louis Lagrange (1736-1813) was a famous Italian mathematician and astronomer, who contributed in particular to celestial mechanics. His famous Mecanique analytique appeared in 1788. His name appears often even in this elementary textbook.
Then, the value of the interpolation polynomial $P(x)$ is
$$0 \cdot \ell_0(x) + \frac{1}{2}\ell_1(x) + \frac{\sqrt{2}}{2}\ell_2(x) + \frac{\sqrt{3}}{2}\ell_3(x) + \ell_4(x) = 0.0288x^4 - 0.2043x^3 + 0.0214x^2 + 0.9956x.$$
□
Additional questions: Can Joe use this formula to calculate the sine's values on the interval $[\frac{\pi}{2}, \pi]$? If not, what should he do?
What would the approximate formulae look like if he used not all five knots, but only the three nearest ones for each point?
5.A.12. The day after, Joe needed to calculate the binary logarithm of 25. (Actually, he needed the natural logarithm of 25, but since he knows that $\ln 2$ is approximately 0.6931, the binary one will do.) So he took the points 16 and 32 (with values 4 and 5, respectively) and constructed the interpolation polynomial (a line) $P(x) = \frac{1}{16}x + 3$, hence $P(25) = \frac{73}{16} = 4.5625$. Then, he added the point 8 (with value 3) in order to arrive at a more accurate result. In this case, the interpolation polynomial equals $P(x) = -\frac{1}{384}x^2 + \frac{3}{16}x + \frac{5}{3}$, which gives $P(25) = 4.7266$. Joe wanted to obtain an even more accurate number, so he added two more points, namely 2 and 4 (with values 1 and 2, respectively). How shocked he was when he got the result $P(25) = 5.892$, which is apparently wrong, as the binary logarithm is an increasing function and $\log_2 32 = 5$. Can you explain the origin of this error?
Solution. Joe asked Google and learned that the interpolation error can be expressed as
$$f(x) - P_n(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!}\,(x - x_0)(x - x_1)\cdots(x - x_n),$$
where the point $\xi$ is not known, but lies in the interval given by the least and greatest knots. The product $(x - x_0)(x - x_1)\cdots(x - x_n)$ grows when farther knots are added, which causes the accuracy to deteriorate. □
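Joe's three attempts in 5.A.12 can be replayed exactly. The sketch below uses the Newton form of the interpolation polynomial (divided differences), which is not the method of the text but gives the same unique polynomial and lets knots be added one at a time. The function names are ours.

```python
from fractions import Fraction

def newton_coefficients(xs, ys):
    """Divided differences f[x0], f[x0,x1], ... for the Newton form."""
    coeffs = [Fraction(y) for y in ys]
    for level in range(1, len(xs)):
        for i in range(len(xs) - 1, level - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - level])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate the Newton form by a Horner-like scheme."""
    value = coeffs[-1]
    for xi, c in zip(reversed(xs[:-1]), reversed(coeffs[:-1])):
        value = value * (x - xi) + c
    return value

def log2_estimate(nodes, x=25):
    """Interpolate log2 at the given powers of two and evaluate at x."""
    values = [n.bit_length() - 1 for n in nodes]   # exact log2 of a power of two
    return newton_eval(nodes, newton_coefficients(nodes, values), x)

for nodes in ([16, 32], [16, 32, 8], [16, 32, 8, 2, 4]):
    print(nodes, float(log2_estimate(nodes)))
```

The last estimate exceeds 5 even though $\log_2 25 < \log_2 32 = 5$, which is the effect the error formula explains.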
5.A.13. A week later, Joe needed to approximate $\sqrt{7}$. He got the idea of reversing the problem and using inverse interpolation, i.e. interchanging the roles of arguments (function inputs) and values (function outputs) and approximating the value of an appropriate function at the point 0. Describe his procedure.
Solution. The function $x^2 - 7$ takes the value 0 at $\sqrt{7}$. Joe took the points $x_0 = 2$, $x_1 = 2.5$, and $x_2 = 3$, with the function values $-3$, $-0.75$, and $2$, respectively. Then he interchanged
clear that, according to the sign of the coefficient of the term with highest degree, the value of the polynomial will rapidly approach plus or minus infinity as x increases (or decreases).
However, the above-mentioned sign is not even stable under small changes of the defining values $y_i$. This is illustrated by the following two diagrams, displaying eleven values of the function $\sin(x)$ with two different small changes of values. The interpolated function $\sin(x)$ is the dotted line, the circles are the slightly perturbed values $y_i$, and the uniquely determined interpolation polynomial is the solid line. While the approximation is quite good inside the interval covering the eleven points, it is very poor at the margins.
There is a rich theory about the interpolation polynomials. If interested, consult the special literature.
their roles, thus obtaining the elementary Lagrange polynomials
$$\ell_0(x) = \frac{(x + 0.75)(x - 2)}{(-3 + 0.75)(-3 - 2)} = \frac{4}{45}x^2 - \frac{1}{9}x - \frac{2}{15},$$
$$\ell_1(x) = -\frac{16}{99}x^2 - \frac{16}{99}x + \frac{32}{33},$$
$$\ell_2(x) = \frac{6}{55}x^2 + \frac{3}{11}x + \frac{9}{55}.$$
For $\sqrt{7}$, he got the approximate value $2 \cdot \ell_0(0) + 2.5 \cdot \ell_1(0) + 3 \cdot \ell_2(0) = \frac{437}{165} \doteq 2.6485$.
Additional questions: Joe made a mistake while constructing one of the elementary polynomials; try to find it. Does this mistake affect the resulting value? How could we make use of the value of the derivative at the point 2.5? □
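Since the inverse interpolation in 5.A.13 is evaluated at 0, only the constant terms $\ell_i(0)$ of the elementary polynomials matter. The following sketch computes them directly with exact fractions; the helper name `basis_at` is ours.

```python
from fractions import Fraction

def basis_at(nodes, i, x):
    """Value at x of the i-th elementary Lagrange polynomial for the given nodes."""
    value = Fraction(1)
    for j, node in enumerate(nodes):
        if j != i:
            value *= (x - node) / (nodes[i] - node)
    return value

xs = [Fraction(2), Fraction(5, 2), Fraction(3)]
ys = [x * x - 7 for x in xs]             # the values -3, -3/4, 2

# Swap the roles: the ys become the nodes, the xs become the values.
weights = [basis_at(ys, i, Fraction(0)) for i in range(3)]
estimate = sum(w * x for w, x in zip(weights, xs))
print(weights, estimate, float(estimate))
```

The weights are the constant terms of $\ell_0, \ell_1, \ell_2$, and the estimate differs from $\sqrt{7} \doteq 2.6458$ by less than $0.003$.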
Finding a spline through given data is a tedious task for hand computation: if we are given $n$ ($n > 2$) points and the values at them, then we need to solve $4n - 4$ linear equations. The matrix of this system is special, though (see 5.1.9), and there are algorithms which reduce the task to solving only $n$ linear equations in $n$ unknowns.
We show an "ad hoc" approach which can simplify the problem even more in a special situation.
5.A.14.   Find a natural spline S which satisfies
S(-l) = 0, S(0) = 1, S(l) = 0.
Solution. The wanted spline consists of two cubic polynomials; let us denote them $S_1$ for the interval $[-1, 0]$ and $S_2$ for the interval $[0, 1]$. The word "natural" requires that the second derivatives of $S_1 = ax^3 + bx^2 + cx + d$ and $S_2 = ex^3 + fx^2 + gx + h$ be zero at the points $-1$ and $1$, respectively. Applying the definition of a spline only, we get eight linear equations. We can reduce the system in the following way: Thanks to the given value at 0, we know that the absolute coefficients of both polynomials are 1, that is, $d = h = 1$. The resulting spline has to be symmetric with respect to the $y$-axis, otherwise the reflection in this axis would give a second spline satisfying the conditions. But the spline is unique. Thus the only possible common value of the first derivatives of $S_1$ and $S_2$ at zero is zero, that is, $c = g = 0$. Further, the second derivatives at zero have to agree, that is, $b = f$, and the symmetry also gives $a = -e$.
So we have $S_1(x) = ax^3 + bx^2 + 1$ and $S_2(x) =$
5.1.5. Remark. Numerical instability caused by the closeness of (some of) the points $x_i$ is clearly seen in the system of equations from the proof of Theorem 5.1.3. When solving a system of linear equations, instability is closely related to the size of the determinant of the corresponding matrix. This is the Vandermonde determinant $V$ in our case.
Lemma. For any sequence of pairwise distinct scalars $x_0, \dots, x_n \in K$,
$$V(x_0, \dots, x_n) = \prod_{i>k=0}^{n}(x_i - x_k).$$
$-ax^3 + bx^2 + 1$. Confronting these forms with the conditions
Proof. The proof is by induction on the number of the points $x_i$. The result is true for $n = 1$. (The problem is completely uninteresting for $n = 0$.) Suppose that the result is true for $n - 1$, i.e.
$$V(x_0, \dots, x_{n-1}) = \prod_{i>k=0}^{n-1}(x_i - x_k).$$
Consider the values $x_0, \dots, x_{n-1}$ to be fixed, and vary the value of $x_n$. Expand the determinant by the last row (see 2.2.9). This exhibits the desired determinant as the polynomial
$$V(x_0, \dots, x_n) = (x_n)^nV(x_0, \dots, x_{n-1}) - (x_n)^{n-1}V(x_0, \dots, x_{n-2}, x_n) + \cdots. \tag{1}$$
This is a polynomial of degree $n$ since its coefficient at $(x_n)^n$ is non-zero, by the induction hypothesis. Evidently, it vanishes at any point $x_n = x_i$ for $i < n$ because in that case, the original determinant contains two identical rows. The polynomial is thus divisible by the expression
$$(x_n - x_0)(x_n - x_1)\cdots(x_n - x_{n-1}),$$
which itself is of degree $n$. It follows that the Vandermonde determinant (as a polynomial in the variable $x_n$) must, up to a multiplicative constant, be given by
$$V(x_0, \dots, x_n) = c \cdot (x_n - x_0)(x_n - x_1)\cdots(x_n - x_{n-1}).$$
Comparing the coefficients at the highest power in (1) with this expression yields
$$c = V(x_0, \dots, x_{n-1}),$$
which completes the proof. □
Notice that the value of the determinant is small if the points $x_i$ are close together.
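The product formula of the Lemma can be compared with a directly computed determinant for small examples. The sketch below uses the Leibniz formula over all permutations, which is only practical for tiny matrices; all function names are ours, and the rows of the matrix are $(1, x_i, x_i^2, \dots, x_i^n)$ as in the system from 5.1.3.

```python
from fractions import Fraction
from itertools import permutations
from math import prod

def determinant(M):
    """Leibniz formula; fine for the tiny matrices used here."""
    n = len(M)
    total = 0
    for perm in permutations(range(n)):
        inversions = sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))
        total += (-1) ** inversions * prod(M[i][perm[i]] for i in range(n))
    return total

def vandermonde(xs):
    """Matrix with rows (1, x_i, x_i^2, ..., x_i^n)."""
    return [[x ** k for k in range(len(xs))] for x in xs]

def product_formula(xs):
    """Product of (x_i - x_k) over all pairs i > k."""
    return prod(xs[i] - xs[k] for i in range(len(xs)) for k in range(i))

spread = [2, 3, 5, 7]
close = [Fraction(0), Fraction(1, 10), Fraction(2, 10)]
print(determinant(vandermonde(spread)), product_formula(spread))
print(determinant(vandermonde(close)), product_formula(close))
```

The second example illustrates the closing remark: three points spaced by $\frac{1}{10}$ already give a determinant of only $\frac{1}{500}$.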
5.1.6. Derivatives of polynomials. The values of polynomials rapidly tend to infinite values as the input variable grows. Hence polynomials are unable to describe periodic events, such as the values of the trigonometric functions. One could say that we will achieve much better results, at least between the points $x_i$, if we look not only at the function values, but also at the rate of increase of the function at those points.
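$S_1(-1) = 0$ and $S_1''(-1) = 0$ yields the following system of only two linear equations in $a$ and $b$:
$$-a + b + 1 = 0, \qquad -6a + 2b = 0.$$
Having solved that, we get $S_1(x) = -\frac{1}{2}x^3 - \frac{3}{2}x^2 + 1$, $S_2(x) = \frac{1}{2}x^3 - \frac{3}{2}x^2 + 1$. Altogether,
$$S(x) = \begin{cases} -\frac{1}{2}x^3 - \frac{3}{2}x^2 + 1 & \text{for } x \in [-1, 0],\\ \frac{1}{2}x^3 - \frac{3}{2}x^2 + 1 & \text{for } x \in [0, 1]. \end{cases}$$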
□
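All the defining conditions of the natural spline from 5.A.14 can be checked for the two cubics found in the solution: the three prescribed values, agreement of the first and second derivatives at 0, and vanishing second derivatives at the end points. A short sketch, with our own helper names and coefficients listed from the highest power down:

```python
from fractions import Fraction

def evaluate(coeffs, x):
    """Horner evaluation; coeffs are listed from the highest power down."""
    value = Fraction(0)
    for c in coeffs:
        value = value * x + c
    return value

def derivative(coeffs):
    """Coefficients of the derivative, highest power first."""
    n = len(coeffs) - 1
    return [c * (n - k) for k, c in enumerate(coeffs[:-1])]

half = Fraction(1, 2)
S1 = [-half, -3 * half, 0, 1]    # on [-1, 0]
S2 = [half, -3 * half, 0, 1]     # on [0, 1]

values = (evaluate(S1, -1), evaluate(S1, 0), evaluate(S2, 0), evaluate(S2, 1))
first_at_0 = (evaluate(derivative(S1), 0), evaluate(derivative(S2), 0))
second_at_0 = (evaluate(derivative(derivative(S1)), 0),
               evaluate(derivative(derivative(S2)), 0))
second_at_ends = (evaluate(derivative(derivative(S1)), -1),
                  evaluate(derivative(derivative(S2)), 1))
print(values, first_at_0, second_at_0, second_at_ends)
```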
You can use the same trick to solve the following problem.
5.A.15. Find a (cubic) spline $S$ which satisfies
$$S(-1) = 0, \quad S(0) = 1, \quad S(1) = 0, \quad S'(-1) = -1, \quad S'(1) = 1.$$
o
5.A.16. Find a polynomial of degree two or less such that its values at the points
$$x_0 = -1, \quad x_1 = 1, \quad x_2 = 2$$
are
$$y_0 = 1, \quad y_1 = -3, \quad y_2 = 4,$$
respectively. O
5A.17. Construct the Lagrange interpolation polynomial for
	CM	-1	1	2
Vi	1	-1	-1	1
Then find any polynomial of degree greater than three which satisfies the conditions in the table. O
5.A.18. Find a polynomial $p(x) = ax^3 + bx^2 + cx + d$ which satisfies $p(0) = 1$, $p(1) = 0$, $p(2) = 1$, $p(3) = 10$. O
5.A.19. Construct a polynomial $p$ of degree three or less which satisfies $p(0) = 2$, $p(1) = 3$, $p(2) = 12$, $p(5) = 147$.
O
5.A.20. Let the values $y_0, \dots, y_n \in K$ at pairwise distinct points $x_0, \dots, x_n \in K$, respectively, be given. How many polynomials of degree exactly $n + 1$ taking the given values at the given points are there? O
5.A.21. Determine the Hermite interpolation polynomials $P$, $Q$ satisfying
$P(-1) = -11$, $P(1) = 1$, $P'(-1) = 12$, $P'(1) = 4$; $Q(-1) = -9$, $Q(1) = -1$, $Q'(-1) = 10$, $Q'(1) = 2$. O
For this purpose, we introduce (only intuitively, for the time being) the concept of a derivative for polynomials. Again, we can work with real, complex or rational polynomials. The rate of increase of a real-valued polynomial $f(x)$ at a point $x \in \mathbb{R}$ should be related to the values
$$\frac{f(x + \delta x) - f(x)}{\delta x}, \tag{1}$$
where $\delta x$ is a small value in K expressing the increment of the argument $x$. Since we can calculate (over an arbitrary ring)
$$(x + \delta x)^k = x^k + \cdots + \binom{k}{l}x^l(\delta x)^{k-l} + \cdots + (\delta x)^k,$$
we get for the polynomial $f(x) = a_nx^n + \cdots + a_0$ the above quotient (1) in the form
$$\frac{f(x + \delta x) - f(x)}{\delta x} = a_n\frac{nx^{n-1}\delta x + \cdots + (\delta x)^n}{\delta x} + \cdots + a_1\frac{\delta x}{\delta x}$$
$$= na_nx^{n-1} + (n - 1)a_{n-1}x^{n-2} + \cdots + a_1 + \delta x(\dots),$$
where the expression in parentheses at the end depends polynomially on $\delta x$. Clearly, for values $\delta x$ very close to zero, the quotient is arbitrarily close to the following expression:
Derivatives of polynomials
The derivative of the polynomial $f(x) = a_n x^n + \dots + a_0$ with respect to the variable $x$ is the polynomial
$$f'(x) = n a_n x^{n-1} + (n-1) a_{n-1} x^{n-2} + \dots + a_1.$$
From the definition, it is clear that it is just the value $f'(x_0)$ of the derivative which gives a good approximation of the polynomial's behaviour near the point $x_0$. More precisely, the lines
$$y = \frac{f(x_0 + \delta x) - f(x_0)}{\delta x}(x - x_0) + f(x_0),$$
that is, the secant lines of the graph of the polynomial going through the points $[x_0, f(x_0)]$ and $[x_0 + \delta x, f(x_0 + \delta x)]$, approach, as $\delta x$ decreases, the line
$$y = f'(x_0)(x - x_0) + f(x_0),$$
which is the "tangent" to the graph of the polynomial $f$. This is the linear approximation of the polynomial $f$ by its tangent line. The exact meaning of all these concepts is given later.
The derivative of polynomials is a linear mapping which, to polynomials of degree at most $n$, assigns polynomials of degree at most $n - 1$.
Iterating this procedure, we obtain the second derivative $f''$, the third derivative $f^{(3)}$, and generally, after $k$-fold iteration, the polynomial $f^{(k)}$ of degree $n-k$. Thus the $(n+1)$-st derivative is the zero polynomial. This linear mapping is an example of a cyclic nilpotent mapping; such mappings are examined more thoroughly in paragraph 3.4.10.
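The nilpotency of the derivative can be tried out on a small example. The following is only a sketch under our own conventions: a polynomial $a_0 + a_1 x + \dots + a_n x^n$ is stored as the coefficient list `[a_0, ..., a_n]`, and the helper names `derivative` and `iterate_derivative` are illustrative, not taken from the text.

```python
def derivative(coeffs):
    """Derivative of a_0 + a_1 x + ... + a_n x^n, given as [a_0, ..., a_n]."""
    # The term a_k x^k becomes k * a_k x^(k-1); the constant term disappears.
    result = [k * a for k, a in enumerate(coeffs)][1:]
    return result or [0]  # the derivative of a constant is the zero polynomial


def iterate_derivative(coeffs, times):
    """Apply the derivative `times` times."""
    for _ in range(times):
        coeffs = derivative(coeffs)
    return coeffs


f = [1, -2, 0, 4]  # f(x) = 1 - 2x + 4x^3, so n = 3
for k in range(5):
    print(k, iterate_derivative(f, k))
```

For this $f$ of degree $n = 3$, the third derivative is a non-zero constant and the fourth one is the zero polynomial, as the paragraph above predicts.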
The derivative behaves well also with respect to the multiplication of polynomials. A straightforward combinatorial
CHAPTER 5. ESTABLISHING THE ZOO
5.A.22. Replace the function $f$ with a Hermite polynomial, knowing the following values of $f$:
$x_i$	-1	1	2
$f(x_i)$	4	-4	-8
$f'(x_i)$	8	-8	11
○
In the following exercises let $y_i := f(x_i)$.
5.A.23. Without calculation, determine the Hermite interpolation polynomial if the following is given:
$x_0 = 0$, $x_1 = 2$, $x_2 = 1$, $y_0 = 0$, $y_1 = 4$, $y_2 = 1$, $y_0' = 0$, $y_1' = 4$, $y_2' = 2$.
○
5.A.24. Find a polynomial of degree three or less taking the value $y = 4$ at the point $x = 1$ and $y = 9$ at $x = 2$, having its derivative equal to $-2$ at $x = 0$ and to $1$ at $x = 1$. Then find a polynomial of degree three or less taking the value $y = 6$ at both the points $x = 1$ and $x = -1$ and having its derivative equal to $2$ at both these points. ○
5.A.25. How many polynomials satisfying the following conditions are there? The degree is four or less, the values at $x_0 = 5$ and $x_1 = 55$ are $y_0 = 55$ and $y_1 = 5$, respectively, and both the first and second derivatives at the point $x_0$ are zero. ○
5.A.26. Find any polynomial $P$ satisfying
$P(0) = 6$, $P(1) = 4$, $P(2) = 4$, $P'(2) = 1$.
○
5.A.27. Construct the natural cubic interpolation spline for the values $y_0 = 1$, $y_1 = 0$, $y_2 = 1$ at the points $x_0 = -1$, $x_1 = 0$, $x_2 = 1$, respectively. ○
5.A.28. Construct the natural cubic interpolation spline for the function
$f(x) = |x|$, $x \in [-1, 1]$, selecting the points $x_0 = -1$, $x_1 = 0$, $x_2 = 1$. ○
5.A.29. Construct the natural cubic interpolation spline for the points $x_0 = -3$, $x_1 = 0$, $x_2 = 3$ and the values $y_0 = -3$, $y_1 = 0$, $y_2 = 3$. ○
5.A.30. Without calculation, construct the natural cubic interpolation spline for the points $x_0 = -1$, $x_1 = 0$ and $x_2 = 2$ and the values $y_0 = y_1 = y_2 = 1$ at these points. ○
check reveals the derivation property or Leibniz rule for this linear operator:
$$(f(x) \cdot g(x))' = f'(x) \cdot g(x) + f(x) \cdot g'(x).$$
Actually this is a purely algebraic result (which holds over any ring of scalars!) and you may either check it yourself or consult the formal proof in 12.2.7.
5.1.7. Hermite's interpolation problem. Consider $m + 1$ pairwise distinct real numbers $x_0, \dots, x_m$, i.e. $x_i \neq x_j$ for all $i \neq j$. We want to place a polynomial through given values at these points, and also to prescribe the first derivatives of the interpolating polynomial at these points. Set values $y_i$ and $y_i'$ for all $i$. A polynomial $f$ is wanted which satisfies these conditions on the values and derivatives.
As in the case of interpolating the values only, we obtain the following system of $2(m+1)$ equations for the coefficients of the polynomial $f(x) = a_n x^n + \dots + a_0$:
$$\begin{aligned}
a_0 + x_0 a_1 + \dots + (x_0)^n a_n &= y_0 \\
&\ \ \vdots \\
a_0 + x_m a_1 + \dots + (x_m)^n a_n &= y_m \\
a_1 + 2 x_0 a_2 + \dots + n (x_0)^{n-1} a_n &= y_0' \\
&\ \ \vdots \\
a_1 + 2 x_m a_2 + \dots + n (x_m)^{n-1} a_n &= y_m'.
\end{aligned}$$
We could verify that with the choice $n = 2m + 1$, the determinant of this system is non-zero, and thus there is exactly one solution.
The polynomial $f$ can be constructed immediately. Simply create a set of polynomials with values 0 or 1 respectively for the derivatives and the values, in order to express the desired polynomial as their linear combination. We sketch briefly how to construct them, leaving the details to the reader.
The elementary Lagrange polynomials $\ell_i$ serve well for this purpose. The derivative of $f(x) = (\ell_i(x))^2$ is $2\ell_i'(x)\ell_i(x)$, and thus all $x_j$ are roots of $f$, except for $j = i$; the same holds for the derivative $f'(x)$. But a polynomial of degree $2m + 1$ is wanted. So we consider rather $g(x) = (x - x_i) f(x)$. Now the values at all the points $x_j$ are zero, while the derivative $g'(x) = f(x) + (x - x_i) f'(x)$ has the required properties too: it vanishes at all $x_j$ with $j \neq i$ and equals $f(x_i) = 1$ at $x_i$. Thus we take $h_i^{(2)}(x) = (x - x_i)(\ell_i(x))^2$. This is called the fundamental Hermitian polynomial² of the second type.
Finally we look for a polynomial which has zero derivatives at all points $x_j$ and the same values as $\ell_i$ at the given points $x_j$. We can apply a very similar trick. Look for polynomials of the form $h_i^{(1)}(x) = (1 - a(x - x_i))(\ell_i(x))^2$.
² Charles Hermite (1822-1901) was a Frenchman active in many areas of mathematics. His name is mostly linked to the Hermitian operators and matrices, cf. 3.4.6.
5.A.31. Construct the complete (i.e., the derivatives at the marginal points are given) cubic interpolation spline for the points $x_0 = -3$, $x_1 = -2$, $x_2 = -1$ and the values
$y_0 = 0$, $y_1 = 1$, $y_2 = 2$, $y_0' = 1$, $y_2' = 1$. ○
5.A.32. Construct the natural cubic interpolation spline for the function
$$y = \frac{1}{1 + x^2},$$
selecting the points $x_0 = 0$, $x_1 = 1$, $x_2 = 3$. ○
More problems concerning polynomial interpolation can be found on page 346.
B. Topology of real numbers and their subsets
5.B.1. Find limit, isolated, boundary, and interior points of the sets $\mathbb{N}$, $\mathbb{Q}$, $X = \{x \in \mathbb{R};\ 0 \le x < 1\}$ as subsets of $\mathbb{R}$.
Solution. The set $\mathbb{N}$. For any $n \in \mathbb{N}$, we have that
$$O_1(n) \cap \mathbb{N} = (n - 1, n + 1) \cap \mathbb{N} = \{n\}.$$
Hence, there is a neighborhood of $n \in \mathbb{N}$ in $\mathbb{R}$ which contains only one natural number (the number $n$), therefore every point $n \in \mathbb{N}$ is isolated. There are thus no interior points (an isolated point cannot be interior). A point $a \in \mathbb{R}$ is a limit point of $A$ if and only if every neighborhood of $a$ contains infinitely many points of $A$. However, the set
$$O_1(a) \cap \mathbb{N} = (a - 1, a + 1) \cap \mathbb{N}, \quad \text{where } a \in \mathbb{R},$$
is finite, hence $\mathbb{N}$ has no limit points. By finiteness of this set, we have that
$$\delta_b := \inf_{n \in \mathbb{N}} |b - n| = \inf_{n \in O_1(b) \cap \mathbb{N}} |b - n| > 0 \quad \text{for } b \in \mathbb{R} \setminus \mathbb{N}.$$
Therefore, $O_{\delta_b}(b) \cap \mathbb{N} = \emptyset$, so no $b \in \mathbb{R} \setminus \mathbb{N}$ is a boundary point of $\mathbb{N}$. We also know that every point of a given set which is not its interior point is necessarily its boundary point. The set of boundary points of $\mathbb{N}$ thus contains $\mathbb{N}$, and so it equals $\mathbb{N}$.
The set Q. The rational numbers are a dense subset of the real numbers. This means that for every real number, there is a sequence of rational numbers converging to it. (We can, for instance, imagine the decimal representation of a real number and the corresponding sequence whose i-th term will be the representation truncated to the first i decimal digits. Furthermore, we can suppose that the terms of this sequence are pairwise distinct, for example by deliberately changing the
All $x_j$ will be roots of this polynomial, except for $j = i$, where the value is 1. The derivative is
$$(h_i^{(1)}(x))' = -a(\ell_i(x))^2 + 2(1 - a(x - x_i))\,\ell_i(x)\,\ell_i'(x).$$
All $x_j$, $j \neq i$, are roots of $\ell_i(x)$, thus they are also roots of this polynomial. Finally, at the point $x_i$ we want $0 = -a + 2\ell_i'(x_i)$. Thus we choose $a = 2\ell_i'(x_i)$.
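The combinatorial check that $2\ell_i'(x_i) = \ell''(x_i)/\ell'(x_i)$, with $\ell(x)$ as defined below, is left to the reader. We summarize:
Hermite's 1st order interpolation polynomial
The fundamental Hermite polynomials are defined as follows:
$$h_i^{(1)}(x) = \left[1 - \frac{\ell''(x_i)}{\ell'(x_i)}(x - x_i)\right](\ell_i(x))^2, \qquad h_i^{(2)}(x) = (x - x_i)(\ell_i(x))^2,$$
where $\ell(x) = \prod_{i=0}^{m}(x - x_i)$ and $\ell_i(x)$ is the elementary Lagrange polynomial. These polynomials satisfy:
$$h_i^{(1)}(x_j) = \delta_i^j = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases}, \qquad (h_i^{(1)})'(x_j) = 0,$$
$$h_i^{(2)}(x_j) = 0, \qquad (h_i^{(2)})'(x_j) = \delta_i^j.$$
The Hermite interpolation polynomial is given by the expression
$$f(x) = \sum_{i=0}^{m}\left(y_i h_i^{(1)}(x) + y_i' h_i^{(2)}(x)\right).$$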
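The summary formula can be tried out numerically. The sketch below evaluates $f(x) = \sum_i (y_i h_i^{(1)}(x) + y_i' h_i^{(2)}(x))$ at a single point, with $h_i^{(1)}(x) = (1 - 2\ell_i'(x_i)(x - x_i))(\ell_i(x))^2$ and $h_i^{(2)}(x) = (x - x_i)(\ell_i(x))^2$. It assumes the identity $\ell_i'(x_i) = \sum_{k \neq i} 1/(x_i - x_k)$, which is not stated in the text, and the function name `hermite_eval` is our own. Exact rational arithmetic is used so that the results can be compared without rounding errors.

```python
from fractions import Fraction


def hermite_eval(xs, ys, dys, x):
    """Value at x of the polynomial of degree <= 2m+1 with f(xs[i]) = ys[i], f'(xs[i]) = dys[i]."""
    xs = [Fraction(v) for v in xs]
    x = Fraction(x)
    total = Fraction(0)
    for i, xi in enumerate(xs):
        li = Fraction(1)   # elementary Lagrange polynomial l_i evaluated at x
        dli = Fraction(0)  # l_i'(x_i), assumed to equal the sum of 1/(x_i - x_k), k != i
        for k, xk in enumerate(xs):
            if k != i:
                li *= (x - xk) / (xi - xk)
                dli += 1 / (xi - xk)
        h1 = (1 - 2 * dli * (x - xi)) * li**2  # fundamental polynomial of the first type
        h2 = (x - xi) * li**2                  # fundamental polynomial of the second type
        total += ys[i] * h1 + dys[i] * h2
    return total


# Values and derivatives of x^2 at the points 0, 2, 1:
print(hermite_eval([0, 2, 1], [0, 4, 1], [0, 4, 2], 3))
```

Since $x^2$ itself has degree at most $2m + 1 = 5$ and matches all six prescribed conditions, uniqueness implies that the interpolant is $x^2$, so the call above should print 9.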
5.1.8. Examples of Hermite's polynomials. The simplest example is the one of prescribing the value and the derivative at one point. This determines a polynomial of degree one
$$f(x) = f(x_0) + f'(x_0)(x - x_0).$$
This is exactly the equation of the straight line given by the value and slope at the point $x_0$. When we set the values and the derivatives at two points, i.e. $y_0 = f(x_0)$, $y_0' = f'(x_0)$, $y_1 = f(x_1)$, $y_1' = f'(x_1)$ for two distinct points $x_i$, we still obtain an easily computable problem.
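Consider the simple case when $x_0 = 0$, $x_1 = 1$. Then the matrix of the system and its inverse are
$$A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 3 & 2 & 1 & 0 \end{pmatrix}, \qquad A^{-1} = \begin{pmatrix} 2 & -2 & 1 & 1 \\ -3 & 3 & -2 & -1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}.$$
The multiplication $A^{-1} \cdot (y_0, y_1, y_0', y_1')^T$ gives the vector $(a_3, a_2, a_1, a_0)^T$ of coefficients of the polynomial $f$, i.e.
$$f(x) = (2y_0 - 2y_1 + y_0' + y_1')x^3 + (-3y_0 + 3y_1 - 2y_0' - y_1')x^2 + y_0' x + y_0.$$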
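The two-point formula is easy to check by machine. The sketch below assumes the unknowns are ordered as $(a_3, a_2, a_1, a_0)$ and the equations as $f(0) = y_0$, $f(1) = y_1$, $f'(0) = y_0'$, $f'(1) = y_1'$; the helper names `matmul` and `hermite_coeffs` are ours.

```python
# Rows: f(0), f(1), f'(0), f'(1); columns: a_3, a_2, a_1, a_0.
A = [[0, 0, 0, 1],
     [1, 1, 1, 1],
     [0, 0, 1, 0],
     [3, 2, 1, 0]]
A_inv = [[2, -2, 1, 1],
         [-3, 3, -2, -1],
         [0, 0, 1, 0],
         [1, 0, 0, 0]]


def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def hermite_coeffs(y0, y1, dy0, dy1):
    """Coefficients (a_3, a_2, a_1, a_0) obtained as A_inv times the data vector."""
    data = (y0, y1, dy0, dy1)
    return [sum(r * v for r, v in zip(row, data)) for row in A_inv]


print(matmul(A, A_inv))            # expected to be the identity matrix
print(hermite_coeffs(1, 2, 3, 4))  # coefficients for y0=1, y1=2, y0'=3, y1'=4
```

For the sample data the coefficients come out as $5, -7, 3, 1$, i.e. $f(x) = 5x^3 - 7x^2 + 3x + 1$, and one checks directly that $f(0) = 1$, $f(1) = 2$, $f'(0) = 3$, $f'(1) = 4$.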
last digit, or by taking the representation with recurring nines rather than zeros, i.e. 0.999... for the integer 1 and so on). The set of limit points of $\mathbb{Q}$ is thus the whole $\mathbb{R}$, and every point $x \in \mathbb{R} \setminus \mathbb{Q}$ is a boundary point. In particular, we get that any $\delta$-neighborhood
$$\left(\frac{p}{q} - \delta,\ \frac{p}{q} + \delta\right), \quad \text{where } p, q \in \mathbb{Z},\ q \neq 0,$$
of a rational number $p/q$ contains infinitely many rational numbers, hence there are no isolated points. The number $\sqrt{2}/10^n$ is rational for no $n \in \mathbb{N}$. Supposing the contrary (again, $p, q \in \mathbb{Z}$, $q \neq 0$)
$$\frac{\sqrt{2}}{10^n} = \frac{p}{q}, \quad \text{i.e.} \quad \sqrt{2} = \frac{10^n p}{q},$$
we arrive at an immediate contradiction, as we know that the number $\sqrt{2}$ is not rational. Every neighborhood of a rational number $p/q$ thus contains infinitely many real numbers $p/q + \sqrt{2}/10^n$ ($n \in \mathbb{N}$) which are not rational ($\mathbb{Q}$, as a field, is closed under subtraction). Therefore, every point $p/q \in \mathbb{Q}$ is a boundary point as well, and there are no interior points of the set $\mathbb{Q}$.
The set $X = [0, 1)$. Let $a \in [0, 1)$ be an arbitrary number. Apparently, the sequences $\{a + \frac{1}{n}\}_{n=1}^{\infty}$, $\{1 - \frac{1}{n}\}_{n=1}^{\infty}$ converge to $a$ and $1$, respectively. So we have easily shown that the set of limit points of $X$ contains the interval $[0, 1]$. There are no other limit points: for any $b \notin [0, 1]$ there is $\delta > 0$ such that $O_\delta(b) \cap [0, 1] = \emptyset$ (for $b < 0$ it suffices to take $\delta = -b$, and for $b > 1$ we can choose $\delta = b - 1$). Since every point of the interval $[0, 1)$ is a limit point, there are no isolated points. For $a \in (0, 1)$, let $\delta_a$ be the lesser of the two positive numbers $a$, $1 - a$. Considering
$$O_{\delta_a}(a) = (a - \delta_a, a + \delta_a) \subseteq (0, 1), \quad a \in (0, 1),$$
we see that every point of the interval $(0, 1)$ is an interior point of $X$. For every $\delta \in (0, 1)$, we have that
$$O_\delta(0) \cap [0, 1) = (-\delta, \delta) \cap [0, 1) = [0, \delta),$$
$$O_\delta(1) \cap [0, 1) = (1 - \delta, 1 + \delta) \cap [0, 1) = (1 - \delta, 1),$$
so every $\delta$-neighborhood of the point 0 contains some points of the interval $[0, 1)$ and some points of the interval $(-\delta, 0)$, and every $\delta$-neighborhood of 1 has a non-empty intersection with the intervals $[0, 1)$, $[1, 1 + \delta)$. Therefore, 0 and 1 are boundary points. Altogether, we have found that the set of interior points of $X$ is the interval $(0, 1)$ and the set of boundary points of $X$ is the two-element set $\{0, 1\}$, as we know that no point can be both interior and boundary and that a boundary point must be an isolated or limit point. □
5.1.9. Spline interpolation. We can prescribe any finite number of derivatives at the particular points, together with a convenient choice for the upper bound on the degree of the desired polynomial, leading to a unique interpolation. We will not consider the details here. Unfortunately, these interpolations do not solve the problems mentioned already: the complexity of the computations and the instability. However, a smarter usage of derivatives allows an improvement.
As seen in the diagrams demonstrating the instability of interpolation by a single polynomial of sufficiently large degree, small local changes of the values may dramatically affect the overall behaviour of the resulting polynomial. In particular, this may happen outside of the interval covered by the points $x_i$. We try gluing together polynomial pieces of low degree.
The simplest possibility is to interpolate each pair of adjacent points by a polynomial of degree at most one. This corresponds either to the interpolation by values with two points, or by guessing the slope and employing Hermite's first order interpolation at a single point. This is a common way of displaying data. This means that derivatives will be constant on each of the segments, with a discontinuous 'jump' at the given points. There is no freedom for improvements.
A more sophisticated method is to prescribe the value and the derivative at each point. We then have four values for each pair of neighbouring points. As seen earlier, this uniquely determines Hermite's polynomial of degree three. This polynomial can then be used for all the values of the input variable between the distinguished points $x_0 < x_1$. Such a piece-wise polynomial approximation has the property that the first derivatives will be compatible (equal) at the meeting points $x_i$ in the interval $[x_0, x_1]$.
In practice, mere compatibility of the first derivatives is often insufficient. Consider for instance railway tracks, where the second derivative corresponds to acceleration. Discontinuous jumps would be very undesirable for the second derivative. Furthermore, the values of the first derivatives are usually predetermined. So instead of requiring fixed values of the first derivatives, we insist on equality of both first and second derivatives at adjacent pieces of the cubic polynomials, as well as fixing the values at the given points. This requirement yields the same number of equations and unknowns, and so the problem is solvable similarly to the 1st order Hermite interpolation problem:
5.B.2. Determine the suprema and infima of the following
sets in R:
yl=(-3,0]U(l,7r)U{6};   B= j^-n <en|; C = (-9, 9) nQ.
O
5.B.3. Find sup A and inf A for
o
5.B.4. The following sets are given:
N = {1,2,...,n, ...},   M = |-i;neNJ, J=(0, 2]U[3,5]x{4}.
Determine $\inf N$, $\sup M$, $\inf J$ and $\sup J$ in $\mathbb{R}$. ○
5.B.5. Find a set $M \subset \mathbb{R}$ which does not have an infimum in $\mathbb{R}$ but has a supremum there. Similarly, find a set $N \subset \mathbb{R}$ which does not have a supremum in $\mathbb{R}$ but has an infimum there. ○
5.B.6. Find a subset $X$ of the set $\mathbb{R}$ such that $\sup X < \inf X$. ○
5.B.7. Find sets $A, B, C \subset \mathbb{R}$ such that $A \cap B = \emptyset$, $A \cap C = \emptyset$, $B \cap C = \emptyset$, $\sup A = \inf B = \inf C = \sup C$. ○
C. Limits
In the subsequent exercises, we will deal with calculating limits of sequences, that is what the sequences "look like at infinity". Then, if we were to determine the n-th term of a given sequence for a very large n, the limit of the sequence (supposing it exists) can approximate it very well. We devote much space to computation of limits of sequences (and limits of functions) in this exercise column, that is why they begin earlier (and end later) than in the part concerning theory.
Let us begin with limits of sequences. The necessary definitions can be found on page 293.
Cubic splines
Let $x_0 < x_1 < \dots < x_n$ be real values at which the required values $y_0, \dots, y_n$ are given. A cubic interpolation spline for this assignment is a function $S : \mathbb{R} \to \mathbb{R}$ which satisfies the following conditions:
• the restrictions of $S$ to the intervals $[x_{i-1}, x_i]$ are polynomials $S_i$ of degree at most three, for all $i = 1, \dots, n$,
• $S_i(x_{i-1}) = y_{i-1}$ and $S_i(x_i) = y_i$ for all $i = 1, \dots, n$,
• $S_i'(x_i) = S_{i+1}'(x_i)$ for all $i = 1, \dots, n - 1$,
• $S_i''(x_i) = S_{i+1}''(x_i)$ for all $i = 1, \dots, n - 1$.
The cubic spline³ for $n + 1$ points consists of $n$ cubic polynomials. There are $4n$ free parameters (the first condition from the definition). The other conditions then yield $2n + (n - 1) + (n - 1)$ equalities. Two parameters remain free. The values of the derivatives at the marginal points may be prescribed explicitly (the complete spline), or the second derivatives there can be set to zero (the natural spline).
Unfortunately, the computation of the whole spline is not as easy as the independent computations of Hermite's cubic polynomials, because the data mingles between adjacent intervals. However, ordering the variables and equations properly gives a matrix of the system such that all of its nonzero elements appear only on three diagonals. Such systems are nice enough to be solved in a time proportional to the number of points, using a suitable numerical method. The results are stunning.
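To make the remark about three diagonals concrete, here is a minimal sketch of a natural spline solver. It is not the text's own derivation: we assume the common parametrization by the second derivatives $M_i = S''(x_i)$ (with $M_0 = M_n = 0$), for which the continuity of the first derivative gives, with $h_i = x_{i+1} - x_i$, the equations $\frac{h_{i-1}}{6} M_{i-1} + \frac{h_{i-1} + h_i}{3} M_i + \frac{h_i}{6} M_{i+1} = \frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}}$. The tridiagonal system is solved by forward elimination and back substitution in linear time. The name `natural_cubic_spline` is ours.

```python
def natural_cubic_spline(xs, ys):
    """Return a function S with S(xs[i]) = ys[i] and S'' = 0 at both end points."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # One equation per interior point i = 1..n-1 (stored at index k = i - 1).
    sub = [h[i - 1] / 6 for i in range(1, n)]             # coefficient of M[i-1]
    diag = [(h[i - 1] + h[i]) / 3 for i in range(1, n)]   # coefficient of M[i]
    sup = [h[i] / 6 for i in range(1, n)]                 # coefficient of M[i+1]
    rhs = [(ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1]
           for i in range(1, n)]
    # Forward elimination on the three diagonals.
    for k in range(1, n - 1):
        w = sub[k] / diag[k - 1]
        diag[k] -= w * sup[k - 1]
        rhs[k] -= w * rhs[k - 1]
    # Back substitution; M[0] and M[n] stay zero (natural boundary conditions).
    M = [0.0] * (n + 1)
    for k in range(n - 2, -1, -1):
        M[k + 1] = (rhs[k] - sup[k] * M[k + 2]) / diag[k]

    def S(x):
        # Find the interval [xs[i], xs[i+1]] containing x.
        i = 0
        while i < n - 1 and x > xs[i + 1]:
            i += 1
        a, b = xs[i + 1] - x, x - xs[i]
        return ((M[i] * a**3 + M[i + 1] * b**3) / (6 * h[i])
                + (ys[i] - M[i] * h[i]**2 / 6) * a / h[i]
                + (ys[i + 1] - M[i + 1] * h[i]**2 / 6) * b / h[i])

    return S


S = natural_cubic_spline([-1, 0, 1], [1, 0, 1])
print(S(-1), S(0), S(0.5), S(1))
```

For the data $(-1, 1), (0, 0), (1, 1)$ this yields $M_1 = 3$ and the piece $\frac{3}{2}x^2 - \frac{1}{2}x^3$ on $[0, 1]$; for data lying on a straight line all $M_i$ vanish and the spline is that line.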
For comparison, look at the interpolation of the same data as in the case of the Lagrange polynomial, now using splines. The spline is the solid line, the interpolated function is again the dotted line.
Although the diagrams look nearly identical, the data is different.
³ The name comes from the name of an elastic ruler used by engineers to draw smooth curves through given points. In fact, the requirement on the equality of the first and second derivatives is a good model for natural elasticity behaviour.
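5.C.1. Calculate the following limits of sequences:
i) $\lim_{n\to\infty} \frac{2n^2 + 3n + 1}{n + 1}$,
ii) $\lim_{n\to\infty} \frac{2n^2 + 3n + 1}{3n^2 + n + 1}$,
iii) $\lim_{n\to\infty} \frac{n + 1}{2n^2 + 3n + 1}$,
iv) $\lim_{n\to-\infty} \frac{2^n - 2^{-n}}{2^n + 2^{-n}}$,
v) $\lim_{n\to\infty} \frac{\sqrt{4n^2 + n}}{n}$,
vi) $\lim_{n\to\infty} \left(\sqrt{4n^2 + n} - 2n\right)$.
Solution.
i) $\lim_{n\to\infty} \frac{2n^2 + 3n + 1}{n + 1} = \lim_{n\to\infty} \frac{2n + 3 + \frac{1}{n}}{1 + \frac{1}{n}} = \infty$.
ii) $\lim_{n\to\infty} \frac{2n^2 + 3n + 1}{3n^2 + n + 1} = \lim_{n\to\infty} \frac{2 + \frac{3}{n} + \frac{1}{n^2}}{3 + \frac{1}{n} + \frac{1}{n^2}} = \frac{2}{3}$.
iii) $\lim_{n\to\infty} \frac{n + 1}{2n^2 + 3n + 1} = \lim_{n\to\infty} \frac{1 + \frac{1}{n}}{2n + 3 + \frac{1}{n}} = 0$.
iv) $\lim_{n\to-\infty} \frac{2^n - 2^{-n}}{2^n + 2^{-n}} = \lim_{n\to-\infty} \frac{2^{2n} - 1}{2^{2n} + 1} = -1$.
2. Real numbers and limit processes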
Polynomials and splines do not supply a sufficiently large stock of functions to express many dependencies.
Actually, the first problem to solve is how to define the values of more general functions at all. In principle, all we can get with a finite number of multiplications and additions is polynomial functions. Perhaps division by polynomial quantities, and some efficient manipulation with rational numbers can be done. However, we cannot restrict ourselves to rational numbers. For instance, $\sqrt{2}$ is not a rational number.
Thus the first step is a thorough introduction to limit processes. We define precisely what it means for a sequence of numbers to approach another number.
An important property of polynomials is the "continuous" dependency of their values on the input variable. We expect intuitively that if $x$ is changed by a little, then the value of $f(x)$ also changes only a little. This behaviour is not possessed by piece-wise constant functions $f : \mathbb{R} \to \mathbb{R}$ near the "jump discontinuities". For instance, the Heaviside function⁴
$$f(x) = \begin{cases} 0 & \text{for all } x < 0, \\ 1/2 & \text{for } x = 0, \\ 1 & \text{for all } x > 0 \end{cases}$$
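v) By the squeeze theorem (5.2.12): for all $n \in \mathbb{N}$,
$$\frac{2n}{n} = \frac{\sqrt{4n^2}}{n} < \frac{\sqrt{4n^2 + n}}{n} < \frac{\sqrt{4n^2 + n + \frac{1}{16}}}{n} = \frac{2n + \frac{1}{4}}{n}.$$
Then $\lim_{n\to\infty} \frac{2n}{n} = 2$ and $\lim_{n\to\infty} \frac{2n + \frac{1}{4}}{n} = 2$. So $\lim_{n\to\infty} \frac{\sqrt{4n^2 + n}}{n} = 2$ as well.
vi)
$$\lim_{n\to\infty} \left(\sqrt{4n^2 + n} - 2n\right) = \lim_{n\to\infty} \frac{\left(\sqrt{4n^2 + n} - 2n\right)\left(\sqrt{4n^2 + n} + 2n\right)}{\sqrt{4n^2 + n} + 2n} = \lim_{n\to\infty} \frac{n}{\sqrt{4n^2 + n} + 2n} = \lim_{n\to\infty} \frac{1}{\frac{\sqrt{4n^2 + n}}{n} + 2} = \frac{1}{4}.$$
□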
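A numerical experiment cannot replace the arguments above, but it is a cheap sanity check of the computed limits. The sketch below tabulates the six sequences $\frac{2n^2+3n+1}{n+1}$, $\frac{2n^2+3n+1}{3n^2+n+1}$, $\frac{n+1}{2n^2+3n+1}$, $\frac{2^n - 2^{-n}}{2^n + 2^{-n}}$ (for $n \to -\infty$), $\frac{\sqrt{4n^2+n}}{n}$ and $\sqrt{4n^2+n} - 2n$ for a few large indices. The dictionary `sequences` and the function `iv` are our own names, and floating-point rounding limits the accuracy.

```python
import math

sequences = {
    "i": lambda n: (2 * n**2 + 3 * n + 1) / (n + 1),                # grows without bound
    "ii": lambda n: (2 * n**2 + 3 * n + 1) / (3 * n**2 + n + 1),    # approaches 2/3
    "iii": lambda n: (n + 1) / (2 * n**2 + 3 * n + 1),              # approaches 0
    "v": lambda n: math.sqrt(4 * n**2 + n) / n,                     # approaches 2
    "vi": lambda n: math.sqrt(4 * n**2 + n) - 2 * n,                # approaches 1/4
}


def iv(n):
    """The sequence from part iv); it is examined for n going to minus infinity."""
    return (2.0**n - 2.0**(-n)) / (2.0**n + 2.0**(-n))


for name, a in sequences.items():
    print(name, [a(10**k) for k in (1, 3, 6)])
print("iv", [iv(-k) for k in (1, 10, 50)])
```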
5.C.2. Let $c \in \mathbb{R}^+$ (a positive real number). We will show that $\lim_{n\to\infty} \sqrt[n]{c} = 1$.
Solution. First, let us consider $c > 1$. The sequence $\sqrt[n]{c}$ is decreasing (in $n$), yet all its terms are greater than 1, hence the sequence has a limit, and this limit is equal to the infimum of the sequence's terms. Let us suppose, for a while, that this limit is greater than 1, that is, $1 + \varepsilon$ for some $\varepsilon > 0$. Then by the definition of a limit, all the sequence's terms will eventually (from some index $m$ on) be less than $1 + \varepsilon + \frac{\varepsilon^2}{4}$,
has this type of "discontinuity" for x = 0. We formalize these intuitive statements.
5.2.1. Real numbers. We have dealt with algebraic properties of real numbers, summarized by the claim that $\mathbb{R}$ is a field. However, we have also used the relation of the standard (total) ordering of the real numbers, denoted by "$\le$". See the paragraph 1.6.3 on page 43.
The properties (axioms) of the real numbers, including the connections between the relations and other operations, are enumerated in the following table. The bars indicate how the axioms guarantee that the real numbers form an abelian (commutative) group with respect to addition, that $\mathbb{R} \setminus \{0\}$ is an abelian group with respect to multiplication, that $\mathbb{R}$ is a field, and that the set $\mathbb{R}$ together with the operations $+$, $\cdot$ and the order relation is an ordered field. The last axiom can be considered as claiming that $\mathbb{R}$ is "sufficiently dense".
⁴ Oliver Heaviside was an unconventional English electrical engineer (1850-1925) with an innovative and very original approach to practical mathematical modelling. His famous sayings include "Mathematics is an experimental science, and definitions do not come first, but later on". Or, defending his incomplete argumentation, "I do not refuse my dinner simply because I do not understand the process of digestion". Is this suggestive of the methodology of this textbook?
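especially $\sqrt[m]{c} < 1 + \varepsilon + \frac{\varepsilon^2}{4}$. But then we have that
$$\sqrt[2m]{c} = \sqrt{\sqrt[m]{c}} < \sqrt{1 + \varepsilon + \frac{\varepsilon^2}{4}} = 1 + \frac{\varepsilon}{2} < 1 + \varepsilon,$$
which contradicts our assumption that $1 + \varepsilon$ is the infimum of the considered sequence.
The theorem is trivial for $c = 1$, and for a number $c \in (0, 1)$ it follows from the above, if we invoke the theorem for the number $1/c$. □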
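Solution. Apparently, we have $\sqrt[n]{n} \ge 1$, $n \in \mathbb{N}$. So we can set
$$\sqrt[n]{n} = 1 + a_n \quad \text{for certain numbers } a_n \ge 0,\ n \in \mathbb{N}.$$
By the binomial theorem we get that
$$n = (1 + a_n)^n = 1 + \binom{n}{1} a_n + \binom{n}{2} a_n^2 + \dots + a_n^n, \quad n \ge 2\ (n \in \mathbb{N}).$$
Hence we have the bound (all the numbers $a_n$ are non-negative)
$$n \ge \binom{n}{2} a_n^2 = \frac{n(n-1)}{2} a_n^2, \quad n \ge 2\ (n \in \mathbb{N}),$$
which leads to
$$0 \le a_n \le \sqrt{\frac{2}{n-1}}, \quad n \ge 2\ (n \in \mathbb{N}).$$
By the squeeze theorem,
$$0 = \lim_{n\to\infty} 0 \le \lim_{n\to\infty} a_n \le \lim_{n\to\infty} \sqrt{\frac{2}{n-1}} = 0.$$
Thus we have obtained the result
$$\lim_{n\to\infty} \sqrt[n]{n} = \lim_{n\to\infty} (1 + a_n) = 1 + 0 = 1.$$
We can notice that by further application of the squeeze theorem, we get
$$1 = \lim_{n\to\infty} 1 \le \lim_{n\to\infty} \sqrt[n]{c} \le \lim_{n\to\infty} \sqrt[n]{n} = 1$$
for every real number $c \ge 1$. □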
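The bound $0 \le a_n \le \sqrt{2/(n-1)}$ for $a_n = \sqrt[n]{n} - 1$ can also be observed numerically. This is only an illustration in floating-point arithmetic; the function name `a` mirrors the notation $a_n$ and is otherwise our own choice.

```python
import math


def a(n):
    """a_n = n-th root of n, minus 1."""
    return n ** (1.0 / n) - 1.0


for n in (2, 10, 100, 10_000, 1_000_000):
    bound = math.sqrt(2 / (n - 1))
    print(n, a(n), bound)
```

In every printed row the middle value stays between 0 and the bound on the right, and both columns shrink towards 0 as $n$ grows.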
Axioms of the real numbers
(R1) $(a + b) + c = a + (b + c)$, for all $a, b, c \in \mathbb{R}$
(R2) $a + b = b + a$, for all $a, b \in \mathbb{R}$
(R3) there is an element $0 \in \mathbb{R}$ such that for all $a \in \mathbb{R}$, $a + 0 = a$
(R4) for all $a \in \mathbb{R}$, there is an additive inverse $(-a) \in \mathbb{R}$ such that $a + (-a) = 0$
(R5) $(a \cdot b) \cdot c = a \cdot (b \cdot c)$, for all $a, b, c \in \mathbb{R}$
(R6) $a \cdot b = b \cdot a$, for all $a, b \in \mathbb{R}$
(R7) there is an element $1 \in \mathbb{R}$, $1 \neq 0$, such that for all $a \in \mathbb{R}$, $1 \cdot a = a$
(R8) for all $a \in \mathbb{R}$, $a \neq 0$, there is a multiplicative inverse $a^{-1} \in \mathbb{R}$ such that $a \cdot a^{-1} = 1$
(R9) $a \cdot (b + c) = a \cdot b + a \cdot c$, for all $a, b, c \in \mathbb{R}$
(R10) the relation $\le$ is a total order, i.e. reflexive, antisymmetric, transitive, and total on $\mathbb{R}$
(R11) for all $a, b, c \in \mathbb{R}$, $a \le b$ implies $a + c \le b + c$
(R12) for all $a, b \in \mathbb{R}$, $a > 0$ and $b > 0$ implies $a \cdot b > 0$
(R13) every non-empty set $A \subset \mathbb{R}$ which has an upper bound has a least upper bound.
The concept of a least upper bound from axiom (R13), also called the supremum, is very important. It makes sense for any partially ordered set. This is a set with a (not necessarily total) ordering relation. Recall that an ordering relation is a binary relation on a set which is reflexive, antisymmetric, and transitive; see the paragraph 1.6.3.
Supremum and infimum
Consider a subset $A \subset B$ in a partially ordered set $B$. An upper bound of the set $A$ is any element $b \in B$ such that $b \ge a$ for all $a \in A$.
Similarly, a lower bound of the set $A$ is an element $b \in B$ such that $b \le a$ for all $a \in A$.
The least upper bound of the set $A$, if it exists, is called its supremum and it is denoted by $\sup A$. Similarly, the greatest lower bound, if it exists, is called the infimum and it is denoted by $\inf A$.
Thus, the last axiom (R13) from the table of properties of the real numbers can be reformulated as follows: Every non-empty set $A$ of real numbers which is bounded from above has a supremum. This means that if there is a number $a$ which is larger than or equal to all numbers $x \in A$, then there is a smallest number with this property.
For instance, the choice $A = \{x \in \mathbb{Q};\ x^2 < 2\}$ gives the supremum $\sup A$ which is called $\sqrt{2}$, the square root of two.
An immediate consequence of this axiom is the existence of the infima for any non-empty set of real numbers bounded from below. Observe that changing the sign of all the numbers in a set interchanges suprema and infima.
For the formal construction, it is necessary to know whether or not there is such a set $\mathbb{R}$ with the operations and ordering relation which satisfies the thirteen axioms. So far,
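5.C.4. Calculate the limit
$$\lim_{n\to\infty} \left(\sqrt{2} \cdot \sqrt[4]{2} \cdot \sqrt[8]{2} \cdots \sqrt[2^n]{2}\right).$$
Solution. To determine the limit, it is sufficient to express the terms in the form
$$2^{\frac{1}{2}} \cdot 2^{\frac{1}{4}} \cdot 2^{\frac{1}{8}} \cdots 2^{\frac{1}{2^n}} = 2^{\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots + \frac{1}{2^n}}.$$
Thus we get
$$\lim_{n\to\infty} \left(\sqrt{2} \cdot \sqrt[4]{2} \cdot \sqrt[8]{2} \cdots \sqrt[2^n]{2}\right) = \lim_{n\to\infty} 2^{\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots + \frac{1}{2^n}} = 2^{\sum_{n=1}^{\infty} \frac{1}{2^n}}.$$
By the well-known formula for the sum of geometric series,
$$\sum_{n=1}^{\infty} \left(\frac{1}{2}\right)^n = \frac{\frac{1}{2}}{1 - \frac{1}{2}} = 1,$$
whence it follows that
$$\lim_{n\to\infty} \left(\sqrt{2} \cdot \sqrt[4]{2} \cdot \sqrt[8]{2} \cdots \sqrt[2^n]{2}\right) = 2^1 = 2.$$
□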
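As a quick numerical illustration (a sketch only, with our own function name `partial_product`), the partial products $\sqrt{2} \cdot \sqrt[4]{2} \cdots \sqrt[2^n]{2}$ can be multiplied out directly and compared with the closed form $2^{1 - 2^{-n}}$ that follows from the geometric sum.

```python
def partial_product(n):
    """Product of the 2^k-th roots of 2 for k = 1, ..., n."""
    p = 1.0
    for k in range(1, n + 1):
        p *= 2.0 ** (0.5 ** k)  # the 2^k-th root of 2 is 2^(1/2^k)
    return p


for n in (1, 2, 5, 10, 30):
    print(n, partial_product(n), 2.0 ** (1 - 0.5 ** n))
```

The two columns agree up to rounding, and both approach 2.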
5.C.5. Determine
$$\lim_{n\to\infty} \left(\frac{1}{n^2} + \frac{2}{n^2} + \dots + \frac{n-2}{n^2} + \frac{n-1}{n^2}\right).$$
○
5.C.6. Calculate
lim
\Jn? — lln2 + 2 + \JvJ — 2ti5 — ti3 — n + sin2 ti
2 - ^5n4 + 2ti3 + 5
5.C.7. Determine the limit
n! + (n-2)!-(n-4)!
lim -r-^--1-----TT-1—-
n-)-oo    n&u + ?i! — (n — 1)!
o
o
5.C.8. Find two sequences (let us denote their terms by $x_n$ and $y_n$ ($n \in \mathbb{N}$), respectively) having infinite limits and such that
lim (xn + yn) = 1,        lim (xnyn) = +oo.
n—¥oo n—yoo
O
5.C.9. Determine the limit points of the sequence given by
(-l)n2n
V^n2 + 5n + 3'
ri G N.
O
5.C.10. Calculate
lim sup an   and   lim inf an
only the rational numbers have been constructed formally. These form an ordered field. This means that $\mathbb{Q}$ satisfies the axioms (R1)-(R12), and this can easily be verified.
We do not go into the details of a consistent construction of the real numbers now. We will be satisfied with an intuitive idea of the real line, and we will work with the axioms (R1) through (R13). But we shall come back to this issue in a more general framework in chapter 7, see the paragraph 5.2.4 and the discussion started in 7.3.6. Actually, we shall see that if the real numbers can be constructed, then the construction is unique up to isomorphism (that is, a bijection preserving all algebraic structures of two different realizations of the field $\mathbb{R}$).
5.2.2. The complex plane. Recall that the complex numbers are given as pairs of real numbers. We usually write them as $z = \operatorname{Re} z + i \operatorname{Im} z$. Therefore, the plane $\mathbb{C} = \mathbb{R}^2$ is an appropriate image of the complex numbers. With addition and multiplication, the complex numbers satisfy the axioms (R1)-(R9) and thus form a field. There is, however, no natural ordering defined on them which would satisfy the axioms (R10)-(R13). Nevertheless, we work with them, since extending real scalars to the complex numbers is highly advantageous or sometimes even necessary.
There is an important operation on the complex numbers called complex conjugation. It is the reflection symmetry with respect to the line of real numbers. We denote it by a bar over the number $z \in \mathbb{C}$:
$$\bar{z} = \operatorname{Re} z - i \operatorname{Im} z.$$
It changes the sign of the imaginary part. Since for $z = x + iy$,
$$z \cdot \bar{z} = (x + iy)(x - iy) = x^2 + y^2,$$
this value expresses the squared distance of the complex number from the origin. The square root of this non-negative real number is called the absolute value of the complex number $z$; written
$$|z|^2 = z \cdot \bar{z}. \tag{1}$$
The absolute value can be defined on any ordered field of scalars $\mathbb{K}$. Define the absolute value $|a|$ as follows:
$$|a| = \begin{cases} a & \text{if } a \ge 0 \\ -a & \text{if } a < 0. \end{cases}$$
For any numbers $a, b \in \mathbb{K}$,
$$|a + b| \le |a| + |b|. \tag{2}$$
This property is called the triangle inequality. It holds also for the absolute value of the complex numbers.
For the fields of rational numbers or real numbers, both of which are subfields of the complex numbers, both definitions of the absolute value coincide. The absolute value must be understood in the context of whichever field $\mathbb{K}$ of rational, real, or complex numbers is involved. The triangle inequality holds in all these cases.
if
sin —,    n e N.
n2 + 9 4 '
o
5.C.11. Determine
liminf ( (-1)™ ( l + A +sin^
o
5.C.12. Now let us proceed with limits of functions. The definition can be found on page 300. Determine
(a) $\lim_{x\to\pi/3} \sin x$;
(b) $\lim_{x\to 2} \frac{x^2 + x - 6}{x^2 - 3x + 2}$;
(c) $\lim_{x\to+\infty} \left(\arccos \frac{1}{x+1}\right)^3$;
(d) $\lim_{x\to-\infty} \operatorname{arctg} \frac{1}{x}$, $\lim_{x\to-\infty} \operatorname{arctg} x^4$, $\lim_{x\to-\infty} \operatorname{arctg}(\sin x)$.
Solution. (a) Let us recall that a function $f$ is, by definition, continuous at a given point $x$ iff the limit of $f$ at $x$ is equal to the function value $f(x)$. However, we know that the function $y = \sin x$ is continuous at every real number. Thus we get that
$$\lim_{x\to\pi/3} \sin x = \sin \frac{\pi}{3} = \frac{\sqrt{3}}{2}.$$
(b) The immediate substitution $x = 2$ leads to both zero numerator and zero denominator. Despite that, the problem can be solved very easily. The reduction
$$\lim_{x\to 2} \frac{x^2 + x - 6}{x^2 - 3x + 2} = \lim_{x\to 2} \frac{(x - 2)(x + 3)}{(x - 2)(x - 1)} = \lim_{x\to 2} \frac{x + 3}{x - 1} = \frac{2 + 3}{2 - 1} = 5$$
leads to the correct result (thanks to continuity of the obtained function at the point $x_0 = 2$). Let us realize that the limit of a function can be calculated from the function values in an arbitrarily small deleted neighborhood of a given point $x_0$ and that the limit does not depend on the function value at the point. We can thus make use of multiplying or reducing by factors which do not change the function values in an arbitrarily selected deleted neighborhood of the point $x_0$.
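The role of the deleted neighborhood in part (b) can be seen numerically: the original fraction $\frac{x^2 + x - 6}{x^2 - 3x + 2}$ is undefined at $x = 2$, yet near that point it agrees with the reduced fraction $\frac{x + 3}{x - 1}$. The following sketch (function names `f` and `g` are ours) compares the two on both sides of $2$.

```python
def f(x):
    """The original fraction; not defined at x = 2 (nor at x = 1)."""
    return (x * x + x - 6) / (x * x - 3 * x + 2)


def g(x):
    """The fraction after cancelling the common factor (x - 2)."""
    return (x + 3) / (x - 1)


for h in (1e-1, 1e-3, 1e-6):
    print(h, f(2 - h), f(2 + h), g(2 + h))
```

Evaluating `f(2)` itself would raise `ZeroDivisionError`, which is consistent with the remark that the limit does not depend on the value at the point.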
5.2.3. Convergence of a sequence. We wish to formalize the notion of a sequence of numbers approaching a limit. The key object of interest is a sequence of numbers $a_i$, where the index $i$ usually goes through all the natural numbers. Denote the sequences loosely either as $a_0, a_1, \dots$, or as infinite vectors $(a_0, a_1, \dots)$, or as $(a_i)_{i=0}^{\infty}$.
Cauchy⁵ sequences
Consider a sequence $(a_0, a_1, \dots)$ of elements of $\mathbb{K}$ such that for any fixed positive number $\varepsilon > 0$,
$$|a_i - a_j| < \varepsilon$$
for all but finitely many terms $a_i$, $a_j$ of the sequence.
In other words, for any fixed $\varepsilon > 0$, there is an index $N$ such that the above inequality holds for all $i, j > N$. Loosely put, the elements of the sequence are eventually arbitrarily close to each other. Such a sequence is called a Cauchy sequence.
Intuitively, either all but finitely many of the sequence's terms are equal (then $|a_i - a_j| = 0$ from some index $N$ on), or they "approach" some particular value. This is easily imaginable in the complex plane. Choose an arbitrarily small disc (with radius equal to $\varepsilon$). Suppose a Cauchy sequence is given. It must be possible to put the disc into the complex plane in such a way that it covers all but finitely many of the elements of the infinite sequence $a_i$. Imagine that the disc has very small radius, and contains a number $a$; see the diagram. If such a value $a \in \mathbb{K}$ exists for a Cauchy sequence, we would expect the sequence to have the property of convergence:
Convergent sequences
A sequence $(a_i)_{i=0}^{\infty}$ converges to a value $a$ iff for any positive real number $\varepsilon$,
$$|a_i - a| < \varepsilon$$
for all but finitely many indices $i$. (Notice that the set of those $i$ for which the inequality does not hold may depend on $\varepsilon$.) The number $a$ is called the limit of the sequence $(a_i)_{i=0}^{\infty}$.
If a sequence $a_i \in \mathbb{K}$, $i = 0, 1, \dots$, converges to $a \in \mathbb{K}$, then for any fixed positive $\varepsilon$, $|a_i - a| < \varepsilon$ for all $i$ greater than a certain $N \in \mathbb{N}$. By the triangle inequality,
$$|a_i - a_j| = |a_i - a + a - a_j| \le |a_i - a| + |a - a_j| < 2\varepsilon$$
for all pairs of indices $i, j > N$. Thus:
⁵ Augustin-Louis Cauchy (1789-1857) was a French mathematician pioneering a rigorous approach to infinitesimal analysis. He was very productive and wrote about 800 research articles. There are dozens of concepts and theorems named after him.
(c) By moving the limit inwards twice, the original limit transforms to
$$\left(\arccos\left(\lim_{x\to+\infty}\frac{1}{x+1}\right)\right)^3.$$
It can easily be shown that
$$\lim_{x\to+\infty}\frac{1}{x+1} = 0.$$
As the function $y = \arccos x$ is continuous at the point $0$ and takes the value $\pi/2$ there, and the function $y = x^3$ is continuous at $\pi/2$, we get that
$$\lim_{x\to+\infty}\left(\arccos\frac{1}{x+1}\right)^3 = \left(\arccos\left(\lim_{x\to+\infty}\frac{1}{x+1}\right)\right)^3 = \left(\frac{\pi}{2}\right)^3.$$
(d) The function $y = \operatorname{arctg} x$ has properties which are "useful when calculating limits": it is continuous and injective (increasing) on the whole domain. These properties always (with no further conditions or limitations) allow us to move the examined limit into the argument of such a function. Therefore, let us consider
$$\operatorname{arctg}\left(\lim_{x\to-\infty}\frac{1}{x}\right),\qquad \operatorname{arctg}\left(\lim_{x\to-\infty}x^4\right),\qquad \operatorname{arctg}\left(\lim_{x\to-\infty}\sin x\right).$$
Apparently,
$$\lim_{x\to-\infty}\frac{1}{x} = 0,\qquad \lim_{x\to-\infty}x^4 = +\infty,$$
and the limit $\lim_{x\to-\infty}\sin x$ does not exist, which implies
$$\lim_{x\to-\infty}\operatorname{arctg}\frac{1}{x} = \operatorname{arctg} 0 = 0,$$
$$\lim_{x\to-\infty}\operatorname{arctg} x^4 = \lim_{y\to+\infty}\operatorname{arctg} y = \frac{\pi}{2},$$
and the last limit does not exist, either. □
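5.C.13. Determine the limit
$$\lim_{x\to0}\frac{1-\cos x}{x^2\sin(x^2)}.$$
Solution.
$$\lim_{x\to0}\frac{1-\cos x}{x^2\sin(x^2)} = \lim_{x\to0}\frac{2\sin^2\left(\frac{x}{2}\right)}{x^2\sin(x^2)} = \frac{1}{2}\left(\lim_{x\to0}\frac{\sin\frac{x}{2}}{\frac{x}{2}}\right)^2\cdot\lim_{x\to0}\frac{1}{\sin(x^2)} = \frac{1}{2}\cdot 1\cdot(+\infty) = +\infty.$$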
Lemma. Every convergent sequence is a Cauchy sequence.
However, in the field of rational numbers, it can happen that for a Cauchy sequence a corresponding value $a$ does not exist. For instance, the number $\sqrt{2}$ can be approached by a sequence of rational numbers $a_i$, thereby obtaining a sequence converging to $\sqrt{2}$, but the limit is not rational.
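The remark about $\sqrt{2}$ can be made concrete with exact rational arithmetic. The following is a minimal sketch, assuming the iteration $a \mapsto (a + 2/a)/2$ started at $a = 1$ (an illustrative choice, not taken from the text); the function name is hypothetical.

```python
from fractions import Fraction

def sqrt2_approximations(steps):
    """Return the first `steps` terms of a rational sequence approaching sqrt(2).

    Assumption: the iteration a -> (a + 2/a) / 2 with a = 1 is just one
    convenient way to produce such a sequence.
    """
    terms = []
    a = Fraction(1)
    for _ in range(steps):
        terms.append(a)
        a = (a + 2 / a) / 2
    return terms

terms = sqrt2_approximations(6)
# Distances between consecutive terms shrink (Cauchy-like behaviour) ...
gaps = [abs(terms[i + 1] - terms[i]) for i in range(len(terms) - 1)]
# ... while the squares get close to 2 without ever being equal to it.
square_errors = [abs(t * t - 2) for t in terms]
```

Every term is a rational number, so none of the entries of `square_errors` can vanish; the sequence is Cauchy in the rationals but has no rational limit.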
Ordered fields of scalars in which every Cauchy sequence converges are called complete. The following theorem states that the axiom (R13) guarantees that the real numbers are such a field:
Theorem. Every Cauchy sequence of real numbers $a_i$ converges to a real number $a \in \mathbb{R}$.
Proof. The terms of any Cauchy sequence form a bounded set, since any choice of $\varepsilon$ bounds all but finitely many of them. Let $B$ be the set of those real numbers $x$ for which $x < a_j$ for all but finitely many terms $a_j$ of the sequence. $B$ has an upper bound, and thus $B$ has a supremum as well, by (R13).
Define $a = \sup B$. Fix $\varepsilon > 0$, and choose $N$ so that $|a_i - a_j| < \varepsilon$ for all $i, j \ge N$. Then $a_j > a_N - \varepsilon$ and $a_j < a_N + \varepsilon$ for all indices $j > N$, and so $a_N - \varepsilon$ belongs to $B$, while $a_N + \varepsilon$ does not. Hence $|a - a_N| \le \varepsilon$, and thus
$$|a - a_j| \le |a - a_N| + |a_N - a_j| < 2\varepsilon$$
for all $j > N$. So $a$ is the limit of the given sequence. □
Corollary. Every Cauchy sequence of complex numbers $z_i$ converges to a complex number $z$.
Proof. Write $z_i = a_i + \mathrm{i}\,b_i$. Since $|a_i - a_j| \le |z_i - z_j|$, and similarly for the values $b_i$, both sequences of real numbers $a_i$ and $b_i$ are Cauchy sequences. They converge to $a$ and $b$, respectively. It is easily verified that $z = a + \mathrm{i}\,b$ is the limit of the sequence $z_i$. □
5.2.4. Remark. The previous discussion proposes a construction method for the real numbers. Proceed similarly to building the integers from the natural numbers (adding in all additive inverses). Build the rational numbers from the integers (adding all multiplicative inverses of non-zero numbers). Then "complete" the rational numbers by adding in all limits of Cauchy sequences.
Cauchy sequences $(a_i)_{i=0}^{\infty}$ and $(b_i)_{i=0}^{\infty}$ of rational numbers are equivalent if and only if the distances $|a_i - b_i|$ converge to zero. This is the same as the condition that merging these sequences into a single sequence also yields a Cauchy sequence. For example, a sequence can be formed by selecting alternately terms from the first sequence and the second sequence. Check the properties of an equivalence relation: clearly the relation is reflexive, it is symmetric (since the distance of rational numbers is symmetric in its arguments), and transitivity follows easily from the triangle inequality. Thus, we may define $\mathbb{R}$ as the set of equivalence classes on the above set of sequences.
The previous calculation must be considered "from the back". Since the limits on the right-hand side exist (no matter whether finite or infinite) and the expression $\frac{1}{2}\cdot\infty$ is meaningful (see the note after theorem 5.2.13), the original limit exists as well. If we split the original limit into the product
$$\lim_{x\to0}(1-\cos x)\cdot\lim_{x\to0}\frac{1}{x^2\sin(x^2)},$$
we would get the $0\cdot\infty$ type, which is an indeterminate form, but this tells us nothing about existence of the original limit.
□
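5.C.14. Determine the following limits:
i) $\lim_{x\to2}\frac{x-2}{\sqrt{x^2-4}}$,
ii) $\lim_{x\to0}\frac{\sin(\sin x)}{x}$,
iii) $\lim_{x\to0}\frac{\sin^2 x}{x}$,
iv) $\lim_{x\to0}e^{1/x}$.
Solution.
i)
$$\lim_{x\to2}\frac{x-2}{\sqrt{x^2-4}} = \lim_{x\to2}\frac{x-2}{\sqrt{(x-2)(x+2)}} = \lim_{x\to2}\frac{\sqrt{x-2}}{\sqrt{x+2}} = 0.$$
ii)
$$\lim_{x\to0}\frac{\sin(\sin x)}{x} \overset{(5.2.20)}{=} \lim_{y\to0}\frac{\sin y}{y} = 1,$$
where we made use of the fact that $\lim_{x\to0}\sin x = 0$.
iii)
$$\lim_{x\to0}\frac{\sin^2 x}{x} = \lim_{x\to0}\sin x\cdot\lim_{x\to0}\frac{\sin x}{x} = 0\cdot1 = 0;$$
again, the original limit exists because both the right-hand side limits exist and their product is well-defined.
iv) One must be cautious when calculating this limit. Both one-sided limits exist, but are different, which implies that the examined limit does not exist:
$$\lim_{x\to0^+}e^{1/x} = +\infty,\qquad \lim_{x\to0^-}e^{1/x} = 0.$$
□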
In the following exercise, we will be concerned with so-called indeterminate forms. We recommend perceiving indeterminate forms as an auxiliary concept which only facilitates the first approach to limit calculations, because obtaining an indeterminate form only means that one "has found out nothing". We know the limit of a sum is the sum of the limits, the limit of a product is the product of the limits, and the limit of a quotient is the quotient of the
We introduce algebraic structures on this set $\mathbb{R}$ and check their properties. Of course, the rational numbers can be represented by constant sequences, so that $\mathbb{Q} \subset \mathbb{R}$, as expected. Next, define the sum and product of equivalence classes by taking the sum and product of sequences representing them, respectively. It is easy to check that the results represent a class independent of the choices.
Ordering is dealt with similarly. Here it is required to prove that $a < b$ if and only if there are representatives with $a_i < b_i$. Finally, it is necessary to show that all Cauchy sequences in $\mathbb{R}$ converge. We do not go into details here and advise the reader to return and check all the details when going through the full discussion of the completion of metric spaces in paragraph 7.3.6. The arguments used there, with the real scalars replaced by rational ones, provide an adequate proof. The arguments proving that the axioms (R1)-(R13) define the real numbers uniquely up to isomorphism are also to be found there.
5.2.5. Closed sets. For further work with real or complex numbers, we need to understand the notions of closeness, boundedness, convergence, and so on. These concepts belong to the topic "topology"6. As before, we work with $\mathbb{K} = \mathbb{R}$ or $\mathbb{K} = \mathbb{C}$. We advise the reader to draw many diagrams for all the concepts and their properties for both the real line and the complex plane.
For any subset $A$ of points in $\mathbb{K}$, we are interested not only in the points belonging to $A$, but also in the ones which can be approached by limits of sequences in $A$.
Limit points of a set
Let $A$ be a subset of $\mathbb{K}$. A point $x \in \mathbb{K}$ is called a limit point of $A$ if and only if there is a sequence $a_0, a_1, \dots$ of elements in $A$ such that all its terms differ from $x$, yet its limit is $x$.
Notice that a limit point of a set may or may not belong to the set.
For every non-empty set $A \subset \mathbb{K}$ and a fixed point $x \in \mathbb{K}$, the set of all distances $|x - a|$, $a \in A$, is a set of real numbers bounded from below, and so it has an infimum $d(x, A)$, which is called the distance of the point $x$ from the set $A$. Notice that $d(x, A) = 0$ if and only if either $x \in A$ or $x$ is a limit point of $A$. (We suggest the reader proves this in detail directly from the definitions.)
The name of this mathematical discipline comes from the Greek "studying the shape" (topos + logos). The main concepts are built on the formalism of open and closed sets, compactness etc. We use the same names here but only in the realm of real and complex numbers. Later on, we go further in metric spaces in chapter 7.
limits, supposing the particular limits exist and do not lead to one of the following expressions: $\infty - \infty$, $0\cdot\infty$, $0/0$, $\infty/\infty$, which are called indeterminate forms. For completeness, let us add that these rules can be combined and that an expression containing an indeterminate form is itself considered an indeterminate form. For instance, the forms
Closed sets
$-\infty + \infty = \infty - \infty,$
t-vn— = 0 ■ (oo — oo
( — 00)^+00 v
3+00
0 r>
are all indeterminate, but the forms
$$-\infty - \infty,\qquad \frac{0}{3+\infty},\qquad \frac{0}{(-\infty)^3-\infty}$$
can be called "determinate" (one can immediately determine the limit; they correspond to the values $-\infty$, $0$, $0$, respectively).
5.C.15. Calculate
(a) $\lim_{x\to2}\frac{x+2}{(x-2)^6}$,
(b) lim^a
(c) $\lim_{x\to+\infty}\left(2+\frac{1}{x}\right)^{1/x}$,
(d) $\lim_{x\to+\infty}x^{-x}$.
Solution. In exercise (a), the quotient of the numerator and the denominator gives us $4/0$. Expressions containing division by zero are inappropriate (later, we should be able to avoid them). Yet it leads to the result; it is not an indeterminate form. We may notice that the denominator approaches zero from the right (for $x \neq 2$ we have $(x-2)^6 > 0$). We write this as $4/+0$. Thus the numerator and denominator are both positive in some deleted neighborhood of the point $x_0 = 2$, and one can say that the denominator, at the limit point, is "infinitely many times less" than the numerator, that is
$$\lim_{x\to2}\frac{x+2}{(x-2)^6} = +\infty,$$
which corresponds to setting $4/+0 = +\infty$ (similarly, we can set $4/-0 = -\infty$).
When calculating the limit of (b), one can proceed analogously. Since the numbers have the same sign, we get that
x + 2 , x + 2
lim--ttt = +OO =£ —OO =    lim--ttt,
x^2+(x-2f ^ x^2-(x-2f
so the examined limit does not exist. We can write $4/\pm0$ (or, more generally, $a/\pm0$, $a \neq 0$, $a \in \mathbb{R}^*$), which is a "determinate form". When thoroughly distinguishing the symbols $+0$ and $-0$ from $\pm0$, the form $a/\pm0$ for $a \neq 0$ always means the limit in question does not exist.
Exercises (c), (d). If $f(x) > 0$ for all considered $x \in \mathbb{R}$, then
$$f(x)^{g(x)} = e^{g(x)\ln f(x)}.$$
The closure $\bar{A}$ of a set $A \subset \mathbb{K}$ is the set of those points which have zero distance from $A$ (note that the distance from the empty set of points is undefined, therefore $\bar{\emptyset} = \emptyset$).
A closed subset in $\mathbb{K}$ is a set which coincides with its closure. Equivalently, a set is closed if and only if it contains all of its limit points. On the real line, a closed interval
$$[a, b] = \{x \in \mathbb{R};\ a \le x \le b\}$$
of real numbers, where $a$ and $b$ are fixed real numbers, is a closed set.
The sets $(-\infty, b]$, $[a, \infty)$, and $(-\infty, \infty)$ are also closed sets.
A closed set may also be formed by a sequence of real numbers without a limit point, or a sequence with a finite number of limit points together with these points.
The unit disc (including its boundary circle) in the complex plane is another example of a closed set.
An arbitrary intersection of closed sets is again a closed set. A finite union of closed sets is again a closed set. Indeed, if all of the points of some sequence belong to the considered intersection of closed sets, then they belong to each of the sets, and so do all the limit points. However, if we wanted to say the same about an arbitrary union, we would get in trouble: singleton sets are closed, but a set formed by a sequence of points (a union of singletons) need not be. On the other hand, if we restrict our attention to finite unions and consider a limit point of some sequence lying in this union, then infinitely many terms of the sequence lie in one of the finitely many sets, and the limit point is also the limit point of this subsequence. As this set is assumed to be closed, the limit point lies in it, and thus it lies in the union.
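A concrete instance of the failure for infinite unions may help here. It is an illustrative example, not taken from the text. Consider the singletons
$$A_n = \left\{\tfrac{1}{n}\right\},\quad n = 1, 2, \dots,\qquad \bigcup_{n=1}^{\infty}A_n = \left\{1, \tfrac{1}{2}, \tfrac{1}{3}, \dots\right\}.$$
Each $A_n$ is closed, but $0$ is a limit point of the union (it is the limit of the sequence $1/n$, all of whose terms differ from $0$), and $0$ does not belong to the union. Hence the union is not closed, while the union of any finitely many of the sets $A_n$ is.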
5.2.6. Open sets. There is another useful type of subset of the real numbers: open intervals
$$(a, b) = \{x \in \mathbb{R};\ a < x < b\},$$
where, again, $a$ and $b$ are fixed real numbers or infinite values $\pm\infty$. It is an open set, in the following sense:
Open sets and neighbourhoods of points
An open set in $\mathbb{K}$ is a set whose complement is a closed set.
A neighbourhood of a point $a \in \mathbb{K}$ is an open set $O$ which contains $a$. If the neighbourhood is defined as
$$O_\delta(a) = \{x \in \mathbb{K};\ |x - a| < \delta\}$$
for some positive number $\delta$, then we call it the $\delta$-neighbourhood of the point $a$.
Clearly, for $\mathbb{K} = \mathbb{R}$, $O_\delta(a)$ is an open interval of length $2\delta$ centered at $a$, while in the complex plane it is the interior of the circle with radius $\delta$ and center at $a$.
Notice that for any set $A$, $a \in \mathbb{K}$ is a limit point of $A$ if and only if every neighbourhood of $a$ contains at least one point $b \in A$, $b \neq a$.
Lemma. A set $A \subset \mathbb{K}$ of numbers is open if and only if every point $a \in A$ has a neighbourhood contained in $A$.
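Making use of the fact that the exponential function is continuous and injective on the whole of its domain ($\mathbb{R}$), we can replace the limit
$$\lim_{x\to x_0}f(x)^{g(x)}$$
with
$$e^{\lim_{x\to x_0}\left(g(x)\cdot\ln f(x)\right)}.$$
Let us remind that either of these limits exists if and only if the other one exists. Further,
$$\lim_{x\to x_0}\left(g(x)\cdot\ln f(x)\right) = a \in \mathbb{R} \implies \lim_{x\to x_0}f(x)^{g(x)} = e^a,$$
$$\lim_{x\to x_0}\left(g(x)\cdot\ln f(x)\right) = +\infty \implies \lim_{x\to x_0}f(x)^{g(x)} = +\infty,$$
$$\lim_{x\to x_0}\left(g(x)\cdot\ln f(x)\right) = -\infty \implies \lim_{x\to x_0}f(x)^{g(x)} = 0.$$
Thus we can write
$$\lim_{x\to x_0}f(x)^{g(x)} = e^{\lim_{x\to x_0}g(x)\,\cdot\,\lim_{x\to x_0}\ln f(x)}$$
if both limits on the right-hand side exist and do not lead to the indeterminate form $0\cdot\infty$. It is not difficult to realize that this indeterminate form can only be obtained in three cases, corresponding to the remaining indeterminate forms $0^0$, $\infty^0$, $1^\infty$, when we have, respectively, that
$$\lim_{x\to x_0}f(x) = 0 \quad\text{and}\quad \lim_{x\to x_0}g(x) = 0,$$
$$\lim_{x\to x_0}f(x) = +\infty \quad\text{and}\quad \lim_{x\to x_0}g(x) = 0,$$
$$\lim_{x\to x_0}f(x) = 1 \quad\text{and}\quad \lim_{x\to x_0}g(x) = \pm\infty.$$
In other cases, knowledge (and existence) of the limits
$$\lim_{x\to x_0}f(x),\qquad \lim_{x\to x_0}g(x)$$
allows us to determine the result (having defined some more expressions)
$$\lim_{x\to x_0}f(x)^{g(x)} = \left(\lim_{x\to x_0}f(x)\right)^{\lim_{x\to x_0}g(x)}.$$
Since
$$\lim_{x\to+\infty}\left(2+\frac{1}{x}\right) = 2,\qquad \lim_{x\to+\infty}\frac{1}{x} = 0,\qquad \lim_{x\to+\infty}x = +\infty,$$
we have that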
Proof. Let $A$ be an open set and let $a \in A$. If there is no neighbourhood of the point $a$ inside $A$, then there is a sequence $a_n \notin A$ with $|a - a_n| < 1/n$. But then the point $a \in A$ is a limit point of the set $\mathbb{K} \setminus A$, which is impossible since the complement of $A$ is closed.
Suppose every $a \in A$ has a neighbourhood contained in $A$. This prevents a limit point $b$ of the set $\mathbb{K} \setminus A$ from lying in $A$. Thus the set $\mathbb{K} \setminus A$ is closed, and so $A$ is open. □
From this lemma, it follows immediately that any union of open sets is an open set. A finite intersection of open sets is also an open set.
For real numbers, the $\delta$-neighbourhood of a point $a$ is the open interval of length $2\delta$, centered at $a$. In the complex plane, it is the disc of radius $\delta$, also centered at $a$.
5.2.7. Bounded and compact sets. The closed and open sets are basic concepts of topology. Without going into deeper connections, the above material describes the topology of the real line and the topology of the complex plane. The following concepts are extremely useful:
Bounded and compact sets
A set $A$ of rational, real, or complex numbers is called bounded if and only if there is a positive real number $r$ such that $|z| < r$ for all numbers $z \in A$. Otherwise, the set is called unbounded.
A set which is both bounded and closed is called compact.
An interior point of a set A is a point such that one of its neighbourhoods is contained in A.
A boundary point of a set A is a point for which all its neighbourhoods have a non-trivial intersection with both A and its complement K \ A. A boundary point of the set A may or may not belong to it.
An open cover of a set $A$ is a collection of open sets $U_i$, $i \in I$, whose union contains the whole of $A$.
An isolated point of a set $A$ is a point $a \in A$ such that there is a neighbourhood $N$ of $a$ satisfying $N \cap A = \{a\}$.
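$$\lim_{x\to+\infty}\left(2+\frac{1}{x}\right)^{1/x} = 2^0 = 1,$$
$$\lim_{x\to+\infty}x^{-x} = \lim_{x\to+\infty}\left(\frac{1}{x}\right)^{x} = 0,$$
$$\lim_{x\to+\infty}x^{-x} = \lim_{x\to+\infty}\left(x^x\right)^{-1} = 0.$$
The last result can be expressed as $0^\infty = 0$ or $\infty^\infty = \infty$, $\infty^{-1} = 0$ (let us emphasize that these are not indeterminate forms).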
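The two results can be sanity-checked numerically. The sketch below is only an illustration with hypothetical function names; floating-point evaluation at a few large arguments suggests the limits but does not prove them.

```python
def power_one_over_x(x):
    """(2 + 1/x) ** (1/x): the base tends to 2 and the exponent to 0."""
    return (2 + 1 / x) ** (1 / x)

def x_to_minus_x(x):
    """x ** (-x): the base grows while the exponent tends to minus infinity."""
    return x ** (-x)

# Sample both expressions at growing arguments.
samples = [(x, power_one_over_x(x), x_to_minus_x(x)) for x in (10.0, 50.0, 100.0)]
for x, first, second in samples:
    print(x, first, second)
```

The first column of values approaches 1 and the second approaches 0, in agreement with the determinate forms $2^0 = 1$ and $\infty^{-\infty} = 0$.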
Although we have strongly encouraged the reader to prefer reasoning about the limit behavior of functions to mindless labeling of forms as determinate and indeterminate, it is, we hope, clear now why we will focus on the indeterminate ones. □
5.C.16. Calculate
$$\lim_{x\to+\infty}\frac{\sin x+\pi x^2}{2\cos x-1-x^2},$$
$$\lim_{x\to+\infty}\frac{3^{x+1}+x^5-4x}{3^x+2^x+x^2},$$
Ax - 8a;6 - 2X - 167
lim -==-,
x^+oo 3* _ 452; _ v+IttH-iz
fx — sin3 x + x arctg x
lim ■
x—>-+oo
Vl + 2a; + :
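Solution. Having reduced the first fraction by the polynomial $x^2$, we get
$$\lim_{x\to+\infty}\frac{\sin x+\pi x^2}{2\cos x-1-x^2} = \lim_{x\to+\infty}\frac{\frac{\sin x}{x^2}+\pi}{\frac{2\cos x-1}{x^2}-1}.$$
Boundedness of the expressions
$$|\sin x| \le 1,\qquad |2\cos x-1| \le 3\qquad\text{for } x \in \mathbb{R},$$
and $x^2 \to +\infty$ for $x \to +\infty$ give us the result
$$\lim_{x\to+\infty}\frac{\frac{\sin x}{x^2}+\pi}{\frac{2\cos x-1}{x^2}-1} = \frac{0+\pi}{0-1} = -\pi.$$
In the last argumentation, we actually used the squeeze theorem and the notation $c/\infty = 0$, which is valid for any $c \in \mathbb{R}$ (or bounded$/\infty = 0$, where "bounded" denotes a bounded function).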
This procedure can be generalized. Any limit of the form
$$\lim_{x\to x_0}\frac{f_1(x)+f_2(x)+\dots+f_m(x)}{g_1(x)+g_2(x)+\dots+g_n(x)},$$
where
$$\lim_{x\to x_0}\frac{f_i(x)}{f_1(x)} = 0,\quad i \in \{2,\dots,m\},\qquad \lim_{x\to x_0}\frac{g_i(x)}{g_1(x)} = 0,\quad i \in \{2,\dots,n\},$$
satisfies
$$\lim_{x\to x_0}\frac{f_1(x)+f_2(x)+\dots+f_m(x)}{g_1(x)+g_2(x)+\dots+g_n(x)} = \lim_{x\to x_0}\frac{f_1(x)}{g_1(x)},$$
supposing the limit on the right-hand side exists. It is advantageous to realize (the third limit can be determined, for example, by l'Hospital's rule, with which we will make ourselves familiar later)
$$\lim_{x\to+\infty}\frac{c}{x^\alpha} = 0,\qquad \lim_{x\to+\infty}\frac{x^\alpha}{x^\beta} = 0,\qquad \lim_{x\to+\infty}\frac{x^\beta}{a^x} = 0,$$
5.2.8. Theorem. All subsets $A \subset \mathbb{K}$ of real or complex numbers satisfy:
(1) a non-empty set $A \subset \mathbb{R}$ is open if and only if it is a union of countably (or finitely) many open intervals; similarly, $A \subset \mathbb{C}$ is open if and only if it is a union of countably (or finitely) many open circles,
(2) every point $a \in A$ is either an interior or a boundary point,
(3) every boundary point of $A$ is either an isolated point or a limit point of $A$,
(4) $A$ is compact if and only if every infinite sequence contained in it has a subsequence converging to a point in it,
(5) $A$ is compact if and only if each of its open covers contains a finite subcover of $A$.
Proof. (1) Every open set is a union of some neighbourhoods of its points, i.e., we may consider open intervals in the reals, or open circles in $\mathbb{C}$. So the question that remains is whether it suffices to take countably many of them. Let us first prove the claim for the complex plane. For each $z \in A$, there is an open circle $O_\delta(z)$ contained in $A$, with some $\delta > 0$, and let $\delta_z$ be the supremum of the values of such $\delta$. Clearly, $A = \bigcup_{z\in A}O_{\delta_z}(z)$. Consider an arbitrary $z \in A$ and pick $w$ with both real and imaginary parts rational, such that $|w - z| < \delta_z/4$. Thus, $z \in O_{\delta_w/2}(w)$ (draw a picture!) and we have checked that $A$ is actually the union of the countably many open circles $O_{\delta_w/2}(w)$ for all $w \in A$ with rational real and imaginary coordinates.
If $A$ is an open subset in $\mathbb{R}$, then we may repeat the above argument with the circles $O_\delta(z)$ replaced by the intervals $O_\delta(x)$ and $x \in A$. Think about the details!
(2) It follows immediately from the definitions that no point can be both an interior and a boundary point. Let $a \in A$ be a point that is not interior. Then there is a sequence of points $a_i \notin A$ with $a$ as its limit point. At the same time, $a$ belongs to each of its neighbourhoods. Thus $a$ is a boundary point.
(3) Suppose that $a \in A$ is a boundary point but not isolated. Then, similarly to the reasoning from the previous paragraph, there are points $a_i$, this time inside $A$, whose limit point is $a$.
(4) Suppose $A \subset \mathbb{R}$ is a compact set, i.e., both closed and bounded. Consider an infinite sequence of points $a_i \in A$. $A$ has both a supremum $b$ and an infimum $a$. Divide the interval $[a, b]$ into halves: $[a, \tfrac{1}{2}(a+b)]$ and $[\tfrac{1}{2}(a+b), b]$. At least one of them contains infinitely many of the terms $a_i$. Select this half
This result for real numbers is usually referred to as the Bolzano-Weierstrass theorem. Karl Weierstrass (1815-1897) was a famous German mathematician and his name is linked to many theorems in mathematics. Bernard Bolzano (1781-1848) was a Bohemian mathematician, logician, philosopher, theologian and Catholic priest working in Prague at the beginning of the 19th century. He laid the basis of rigorous mathematical analysis a few decades before all the theory was worked out by Weierstrass and others. In particular, he was skeptical about the effective use of Leibniz's infinitesimals without the necessary rigour.
$$\lim_{x\to+\infty}\frac{a^x}{b^x} = 0,\qquad c \in \mathbb{R},\quad 0 < \alpha < \beta,\quad 1 < a < b.$$
Hence we immediately have that
$$\lim_{x\to+\infty}\frac{3^{x+1}+x^5-4x}{3^x+2^x+x^2} = \lim_{x\to+\infty}\frac{3\cdot3^x}{3^x} = 3,$$
Ax - 8a;6 - 2X - 167       , 4X lim -==- = lim
x^+oo %x _ 45a, _ x^+oo -^TlTT12 ■ TTX
= —oo.
If we realize that
$$\lim_{x\to+\infty}\operatorname{arctg} x = \frac{\pi}{2} > 1,$$
we will also obtain that
Jx — sin3 x + x arctg x x arctg x
lim ---= lim
z-H-°o        Vl + 2a; + :
lim arctg x = —.
□
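5.C.17. Determine the limits
$$\lim_{n\to\infty}\left(\frac{1}{1\cdot2}+\frac{1}{2\cdot3}+\frac{1}{3\cdot4}+\dots+\frac{1}{(n-1)\cdot n}\right),$$
$$\lim_{n\to\infty}\left(\frac{1}{\sqrt{n^2+1}}+\frac{1}{\sqrt{n^2+2}}+\dots+\frac{1}{\sqrt{n^2+n}}\right).$$
Solution. Since for every natural number $k \ge 2$ it holds that (what we do here is called partial fraction decomposition; we will present it in detail in the chapter concerning integration of rational functions)
$$\frac{1}{(k-1)k} = \frac{1}{k-1}-\frac{1}{k},$$
we get that
$$\lim_{n\to\infty}\left(\frac{1}{1\cdot2}+\frac{1}{2\cdot3}+\frac{1}{3\cdot4}+\dots+\frac{1}{(n-1)\cdot n}\right)$$
$$= \lim_{n\to\infty}\left(\frac{1}{1}-\frac{1}{2}+\frac{1}{2}-\frac{1}{3}+\frac{1}{3}-\frac{1}{4}+\dots+\frac{1}{n-1}-\frac{1}{n}\right)$$
$$= \lim_{n\to\infty}\left(1-\frac{1}{n}\right) = 1.$$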
Let us remark that this limit is quite important: it determines the sum of one of the so-called telescoping series (with which Johann I Bernoulli (1667-1748) worked).
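The telescoping identity can be confirmed with exact rational arithmetic. This is a small illustrative sketch; the function name `telescoping_sum` is hypothetical.

```python
from fractions import Fraction

def telescoping_sum(n):
    """Exact value of 1/(1*2) + 1/(2*3) + ... + 1/((n-1)*n) for n >= 2."""
    return sum(Fraction(1, (k - 1) * k) for k in range(2, n + 1))

# The partial sums agree with the closed form 1 - 1/n obtained by cancellation.
for n in (2, 5, 100):
    assert telescoping_sum(n) == 1 - Fraction(1, n)
```

Since the partial sums equal $1 - 1/n$ exactly, they approach 1 as $n$ grows.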
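To determine the second limit, we invoke the squeeze theorem. The bounds
$$\frac{n}{\sqrt{n^2+n}} = \frac{1}{\sqrt{n^2+n}}+\dots+\frac{1}{\sqrt{n^2+n}} \le \frac{1}{\sqrt{n^2+1}}+\dots+\frac{1}{\sqrt{n^2+n}},$$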
and one of the terms contained in it; then cut the selected interval into halves. Again, select such a half which contains infinitely many of the sequence's terms and select one of those points. By this procedure, a Cauchy sequence is established. Cauchy sequences have limit points or are constant up to finitely many exceptions. Thus there is a subsequence with the desired limit. The fact that A is closed implies that the point obtained lies in A.
Now the other direction: if every infinite subset of A has a limit point in A, then all limit points are in A, and so A is closed. If A were not bounded, we would be able to find an increasing or decreasing sequence such that the differences of absolute values of adjacent numbers would be at least 1, for instance. However, such a sequence of points in A cannot have a limit point at all.
Finally, we have to deal with the general case $A \subset \mathbb{C}$. The arguments of the latter implication remain the same. Thus we have to show that any sequence $z_n$ of complex numbers in $A$ has a limit point in $A$. Consider the sequences of real and imaginary parts, $x_n$ and $y_n$. Since they both have to be in the bounded subsets $A_{\mathrm{Re}}$ and $A_{\mathrm{Im}}$ of the real and imaginary projections of $A$, there is a subsequence $z_{n_k} = (x_{n_k}, y_{n_k})$ such that $x_{n_k} \to x$, $y_{n_k} \to y$ with the
$$\frac{1}{\sqrt{n^2+1}}+\dots+\frac{1}{\sqrt{n^2+n}} \le \frac{1}{\sqrt{n^2+1}}+\dots+\frac{1}{\sqrt{n^2+1}} = \frac{n}{\sqrt{n^2+1}}$$
limits sitting in the closures of $A_{\mathrm{Re}}$ and $A_{\mathrm{Im}}$, by virtue of the already proved real case. Obviously, $z_{n_k} \to z = (x, y)$, and the latter limit has to sit in $A$ since $A$ is closed.
(5) First, focus on the easier implication. That is, suppose that every open cover contains a finite subcover. It is required to prove that $A$ is both closed and bounded. $A \subset \mathbb{C}$ can be covered by a countable union of neighbourhoods $O_n(z)$, with integers $n$ and centers $z$ with integral real and imaginary parts. Any choice of a finite subcover of them witnesses that $A$ is bounded. If $A \subset \mathbb{R}$, then the same argument applies with intervals $O_n(x)$, $n, x \in \mathbb{Z}$.
Now suppose that $a \in \mathbb{C} \setminus A$ is the limit point of a sequence $a_n \in A$. Further, assume that $|a - a_n| < \frac{1}{n}$ (otherwise select a subsequence satisfying this property). The sets
$$J_n = \mathbb{C} \setminus \overline{O_{1/n}(a)}$$
for all $n \in \mathbb{N}$, $n > 0$, are open and they also cover our set $A$. Since it is possible to choose a finite cover of $A$, the point $a$ lies in the complement $\mathbb{C} \setminus A$ together with one of its neighbourhoods, and thus it is not a limit point. Therefore, all of $A$'s limit points must again lie in $A$. Hence $A$ is closed.
If $A \subset \mathbb{R}$, the same argument applies with circles replaced by intervals.
Finally, we have to prove the other implication. So assume $A \subset \mathbb{C}$ is complete and bounded, but there is an open covering $U_\alpha$, $\alpha \in I$, of $A$, which does not contain any finite covering. Consider the sequence of positive real numbers $\varepsilon_n = 1/n$ converging to $0$ and sets
$$B_n = \left\{z = \left(\tfrac{k}{n}, \tfrac{m}{n}\right) \in A;\ k, m \in \mathbb{Z}\right\}$$
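for $n \in \mathbb{N}$ give that
$$\lim_{n\to\infty}\frac{n}{\sqrt{n^2+n}} \le \lim_{n\to\infty}\left(\frac{1}{\sqrt{n^2+1}}+\dots+\frac{1}{\sqrt{n^2+n}}\right) \le \lim_{n\to\infty}\frac{n}{\sqrt{n^2+1}}.$$
Since
$$\lim_{n\to\infty}\frac{n}{\sqrt{n^2+n}} = \lim_{n\to\infty}\frac{n}{\sqrt{n^2}} = 1,\qquad \lim_{n\to\infty}\frac{n}{\sqrt{n^2+1}} = \lim_{n\to\infty}\frac{n}{\sqrt{n^2}} = 1,$$
we also have that
$$\lim_{n\to\infty}\left(\frac{1}{\sqrt{n^2+1}}+\frac{1}{\sqrt{n^2+2}}+\dots+\frac{1}{\sqrt{n^2+n}}\right) = 1.$$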
□
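The squeeze argument for the sum of reciprocal square roots can be illustrated numerically. The following is a minimal sketch with hypothetical helper names; floating-point values only illustrate the bounds, they do not replace the proof.

```python
import math

def root_sum(n):
    """1/sqrt(n^2 + 1) + 1/sqrt(n^2 + 2) + ... + 1/sqrt(n^2 + n)."""
    return sum(1 / math.sqrt(n * n + k) for k in range(1, n + 1))

def bounds(n):
    """Lower and upper bounds n/sqrt(n^2 + n) and n/sqrt(n^2 + 1) of the sum."""
    return n / math.sqrt(n * n + n), n / math.sqrt(n * n + 1)

for n in (1, 10, 1000):
    lower, upper = bounds(n)
    print(n, lower, root_sum(n), upper)
```

For growing $n$ both bounds approach 1, and the sum is trapped between them.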
5.C.18. Calculate
(a)
$$\lim_{x\to0}\frac{\sqrt{1+x}-\sqrt{1-x}}{x},$$
(b)
$$\lim_{x\to\pi/4}\frac{\cos x-\sin x}{\cos(2x)},$$
(c)
of complex numbers with real and imaginary parts in the "$1/n$-net of coordinates". Clearly all sets $B_n$ are finite. Further, for each $k$, consider the system $\mathcal{A}_k$ of closed circles with centres in the points of $B_k$ and diameters $2\varepsilon_k$. Clearly each such system $\mathcal{A}_k$ covers the entire set $A$. Altogether, there must be at least one closed circle $C$ in the system $\mathcal{A}_1$ which is not covered by a finite number of the sets $U_\alpha$. Call it $C_1$ and notice that $\operatorname{diam} C_1 = 2\varepsilon_1$.
Next, consider the sets $C_1 \cap C$, with circles $C \in \mathcal{A}_2$, which cover the entire set $C_1$. Again, at least one of them cannot be covered by a finite number of the sets $U_\alpha$; we call it $C_2$. This way, we inductively construct a sequence of sets $C_k$ satisfying $C_{k+1} \subset C_k$, $\operatorname{diam} C_k \le 2\varepsilon_k$, $\varepsilon_k \to 0$, and none of them can be covered by a finite number of the open sets $U_\alpha$.
Finally, we choose one point $z_k \in C_k$ in each of these sets. By construction, this must be a Cauchy sequence. Consequently, this sequence of complex numbers has a limit $z$. Thus there is $U_{\alpha_0}$ containing $z$ and containing also some $\delta$-neighbourhood $O_\delta(z)$. But now, if $\operatorname{diam} C_k \le 2\varepsilon_k < \delta$, then $C_k \subset O_\delta(z) \subset U_{\alpha_0}$, which is a contradiction. The proof is complete when considering $A \subset \mathbb{C}$.
Dealing with a real subset $A \subset \mathbb{R}$, again the same line of arguments applies; just the 2-dimensional nets $B_k$ become 1-dimensional and the open circles are replaced by open intervals. □
lim   Va-1 ( \/x2 + 2x + 3 - \Va;2 + 2a; + 2
x—>-+oo V
Solution. We will calculate the wanted limits using the method of multiplying both the numerator and the denominator by a suitable expression. The first fraction can be conveniently extended by
$$\sqrt{1+x}+\sqrt{1-x},$$
making use of the well-known formula $(a-b)(a+b) = a^2-b^2$. Thus we obtain
$$\lim_{x\to0}\frac{\sqrt{1+x}-\sqrt{1-x}}{x} = \lim_{x\to0}\frac{(1+x)-(1-x)}{x\left(\sqrt{1+x}+\sqrt{1-x}\right)} = \lim_{x\to0}\frac{2}{\sqrt{1+x}+\sqrt{1-x}} = \frac{2}{\sqrt{1}+\sqrt{1}} = 1.$$
Similarly we can calculate
$$\lim_{x\to\pi/4}\frac{\cos x-\sin x}{\cos(2x)} = \lim_{x\to\pi/4}\frac{(\cos x+\sin x)(\cos x-\sin x)}{(\cos x+\sin x)\cos(2x)}$$
$$= \lim_{x\to\pi/4}\frac{\cos(2x)}{(\cos x+\sin x)\cos(2x)} = \lim_{x\to\pi/4}\frac{1}{\cos x+\sin x} = \frac{1}{\frac{\sqrt{2}}{2}+\frac{\sqrt{2}}{2}} = \frac{\sqrt{2}}{2}.$$
The reduction was made thanks to the identity
$$\cos(2x) = \cos^2x-\sin^2x,\qquad x \in \mathbb{R}.$$
5.2.9. Limits of functions and sequences. For the discussion of limits, it is advantageous to extend the set $\mathbb{R}$ of real numbers by the two infinite values $\pm\infty$, as we have done when defining intervals. A neighbourhood of infinity is any interval $(a, \infty)$. Similarly, any interval $(-\infty, a)$ is a neighbourhood of $-\infty$. Further, we will extend the concept of a limit point so that $\infty$ is a limit point of a set $A \subset \mathbb{R}$ if and only if every neighbourhood of $\infty$ has a non-empty intersection with it, i.e. if the set $A$ is unbounded from above. Similarly for $-\infty$. We talk about the infinite limit points, sometimes also called improper limit points of the set $A$.
"Calculations" with infinities
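We also introduce rules for calculation with the formally added values $\pm\infty$ and arbitrary "finite" numbers $a \in \mathbb{R}$:
$$\begin{aligned}
a+\infty &= \infty, \\
a-\infty &= -\infty, \\
a\cdot\infty &= \infty, &&\text{if } a > 0, \\
a\cdot\infty &= -\infty, &&\text{if } a < 0, \\
a\cdot(-\infty) &= -\infty, &&\text{if } a > 0, \\
a\cdot(-\infty) &= \infty, &&\text{if } a < 0, \\
\frac{a}{\pm\infty} &= 0, &&\text{for all } a \neq 0.
\end{aligned}$$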
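As an analogy only, IEEE 754 floating-point arithmetic, as exposed by Python's `math.inf`, follows the same conventions. The sketch below assumes nothing beyond the standard library; the dictionary of examples is purely illustrative.

```python
import math

inf = math.inf
a = 3.0  # an arbitrary positive "finite" number

examples = {
    "a + inf": a + inf,
    "a - inf": a - inf,
    "a * inf": a * inf,
    "(-a) * inf": (-a) * inf,
    "a * (-inf)": a * (-inf),
    "(-a) * (-inf)": (-a) * (-inf),
    "a / inf": a / inf,
    "a / (-inf)": a / (-inf),
}

# The indeterminate form "inf - inf" has no assigned value; floats yield NaN.
undefined = inf - inf
```

The expressions in `examples` evaluate to the values prescribed by the rules above, while `undefined` is not a number, mirroring the fact that $\infty - \infty$ is left undefined.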
As for the last limit, to make use of the formula
$$(a-b)\left(a^2+ab+b^2\right) = a^3-b^3,$$
we consider the expression
$$\sqrt[3]{(x^2+2x+3)^2}+\sqrt[3]{x^2+2x+3}\cdot\sqrt[3]{x^2+2x+2}+\sqrt[3]{(x^2+2x+2)^2},$$
which corresponds to $a^2+ab+b^2$, so we choose
$$a = \sqrt[3]{x^2+2x+3},\qquad b = \sqrt[3]{x^2+2x+2}.$$
By this extension, we transform the original limit to
for some polynomials $P$, $Q$. Let us emphasize that this really holds for all $n \in \mathbb{N}$. For $n = 1$, we set $\binom{1}{2} = 0$ and the polynomials $P$, $Q$ may be constant zeros. So we get for all real $x$:
$$(1+2nx)^n = 1+2n^2x+2n^3(n-1)x^2+P(x)\,x^3,$$
$$(1+nx)^{2n} = 1+2n^2x+n^3(2n-1)x^2+Q(x)\,x^3.$$
Mere substitution and simple rearrangements give us
$$\lim_{x\to0}\frac{(1+2nx)^n-(1+nx)^{2n}}{x^2} = \lim_{x\to0}\frac{\left(2n^3(n-1)-n^3(2n-1)\right)x^2+\left(P(x)-Q(x)\right)x^3}{x^2}$$
$$= \lim_{x\to0}\left(-n^3+\left(P(x)-Q(x)\right)x\right) = -n^3+0 = -n^3.$$
□
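The value $-n^3$ can be checked with exact rational arithmetic. The sketch below assumes the quotient has denominator $x^2$, as the final simplification indicates; the function name `quotient` is hypothetical.

```python
from fractions import Fraction

def quotient(n, x):
    """((1 + 2nx)^n - (1 + nx)^(2n)) / x^2, evaluated exactly for rational x."""
    return ((1 + 2 * n * x) ** n - (1 + n * x) ** (2 * n)) / x ** 2

# Evaluate close to 0; the remainder term (P(x) - Q(x)) * x is then tiny.
x = Fraction(1, 10**9)
deviations = {n: abs(quotient(n, x) + n**3) for n in (1, 2, 3, 5)}
```

For $n = 1$ the quotient equals $-1$ exactly, since $(1+2x)-(1+x)^2 = -x^2$; for larger $n$ the deviation from $-n^3$ is of the order of $x$.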
5.C.19. Calculate
$$\lim_{x\to\pi/4}(\tan x)^{\tan(2x)}.$$
Solution. Limits of the type $1^{\pm\infty}$ (like the examined one) can be calculated using the formula
$$\lim_{x\to x_0}f(x)^{g(x)} = e^{\lim_{x\to x_0}\left(f(x)-1\right)g(x)},$$
supposing the limit on the right-hand side exists and $f(x) \neq 1$ for all $x$ of some deleted neighborhood of the point $x_0 \in \mathbb{R}$. Therefore, let us determine
$$\lim_{x\to\pi/4}(\tan x-1)\tan(2x) = \lim_{x\to\pi/4}\left(\frac{\sin x}{\cos x}-1\right)\frac{\sin(2x)}{\cos(2x)}$$
$$= \lim_{x\to\pi/4}\frac{\sin x-\cos x}{\cos x}\cdot\frac{2\sin x\cos x}{\cos^2x-\sin^2x}$$
$$= \lim_{x\to\pi/4}\frac{-2\sin x}{\cos x+\sin x} = \frac{-2\cdot\frac{\sqrt{2}}{2}}{\frac{\sqrt{2}}{2}+\frac{\sqrt{2}}{2}} = -1.$$
Hence we have that
$$\lim_{x\to\pi/4}(\tan x)^{\tan(2x)} = e^{-1}.$$
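A quick floating-point check of the value $e^{-1}$ is sketched below. It is only an illustration; the function name `f` and the chosen offsets from $\pi/4$ are arbitrary.

```python
import math

def f(x):
    """(tan x) ** tan(2x), defined near pi/4 where tan x is positive."""
    return math.tan(x) ** math.tan(2 * x)

target = math.exp(-1)
# Approach pi/4 from the left and record the distance to e^(-1).
errors = [abs(f(math.pi / 4 - h) - target) for h in (1e-2, 1e-3, 1e-4)]
```

The recorded errors shrink as the argument approaches $\pi/4$, in agreement with the computed limit.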
The following definition covers many cases of limit processes and needs to be thoroughly understood. Some particular cases are considered in great detail below.
Real and complex limits
Definition. Consider a subset $A \subset \mathbb{R}$ and a real-valued function $f : A \to \mathbb{R}$ or a complex-valued function $f : A \to \mathbb{C}$, defined on $A$. Further, consider a limit point $x_0$ of the set $A$ (i.e. a real number or $\pm\infty$).
We say that $f$ has limit $a \in \mathbb{R}$ (or a complex limit $a \in \mathbb{C}$) at $x_0$ and write
$$\lim_{x\to x_0}f(x) = a$$
if and only if for every neighbourhood $O(a)$ of the point $a$, there is a neighbourhood $O(x_0)$ of $x_0$ such that for all $x \in A \cap (O(x_0) \setminus \{x_0\})$, $f(x) \in O(a)$.
In the case of a real-valued function, $a = \pm\infty$ can also be the limit. Such a limit is called infinite or improper. In the other case, i.e. $a \in \mathbb{R}$, we say the limit is finite or proper.
It is important to notice that the value of $f$ at $x_0$ does not occur in the definition, and that the function $f$ may even not be defined at this limit point (and in the case of an improper limit point, it cannot be defined, of course)!
We shall not deal with improper limits of complex functions now.
5.2.10. The most important cases of domains. Our definition of a limit covers several very dissimilar situations:
(1) Limits of sequences. If $A = \mathbb{N}$, i.e. the function $f$ is defined for the natural numbers only, we talk about limits of sequences of real or complex numbers. In this case, the only limit point of the domain is $\infty$, and we mostly write the values (terms) of the sequence as $f(n) = a_n$ and the limit in the form
$$\lim_{n \to \infty} a_n = a.$$
According to the definition, this means that for any neighbourhood $O(a)$ of the limit value $a$, there is an index $N \in \mathbb{N}$ such that $a_n \in O(a)$ for all $n \ge N$. Actually, we have only reformulated the definition of convergence of a sequence (see 5.2.3). We have only added the possibility of infinite limits. As before, we also say that the sequence $a_n$ converges to $a$.
We can easily see from our definition for complex numbers that a sequence of complex values $a_n$ has limit $a$ if and only if the real parts of $a_n$ converge to $\operatorname{Re} a$ and the imaginary parts converge to $\operatorname{Im} a$.
(2) Limits of functions at interior points of intervals. If $f$ is defined on the interval $A = (a, b)$ and $x_0$ is an interior point of this interval, we talk about the limit of a function at an interior point of its domain. Usually, we write
$$\lim_{x \to x_0} f(x) = a.$$
Let us examine why it is important to require $f(x) \in O(a)$ only for the points $x \neq x_0$ in this case as well. As an example,
Let us remark that the formula used holds more generally for "the type iwhatever"; that is, with no further conditions on the limit $\lim_{x \to x_0} g(x)$, which need not even exist. □
5.C.20. Show that
$$\lim_{x \to 0} \frac{\sin x}{x} = 1.$$
Solution. Let us consider the unit circle (especially its quarter lying in the first quadrant) and its point $[\cos x, \sin x]$, $x \in (0, \pi/2)$. The length of the arc between the points $[\cos x, \sin x]$ and $[1, 0]$ is equal to $x$. So we apparently have
$$\sin x < x, \qquad x \in \left(0, \frac{\pi}{2}\right).$$
The value $\tan x$ is then the distance between the points $[1, \sin x / \cos x]$ and $[1, 0]$. We can see that (feel free to draw a picture)
$$x < \tan x, \qquad x \in \left(0, \frac{\pi}{2}\right).$$
This inequality also follows from the fact that the area of the triangle with vertices $[0, 0]$, $[1, 0]$, $[1, \tan x]$ is greater than the area of the considered circular sector. Altogether, we have obtained that
$$\sin x < x < \frac{\sin x}{\cos x}, \qquad x \in \left(0, \frac{\pi}{2}\right),$$
that is
$$1 < \frac{x}{\sin x} < \frac{1}{\cos x}, \qquad x \in \left(0, \frac{\pi}{2}\right),$$
or equivalently
$$1 > \frac{\sin x}{x} > \cos x, \qquad x \in \left(0, \frac{\pi}{2}\right).$$
Invoking the squeeze theorem, we get the inequalities
$$1 = \lim_{x \to 0^+} 1 \ge \lim_{x \to 0^+} \frac{\sin x}{x} \ge \lim_{x \to 0^+} \cos x = \cos 0 = 1.$$
Thus we have proved that
$$\lim_{x \to 0^+} \frac{\sin x}{x} = 1.$$
The function $y = (\sin x)/x$ defined for $x \neq 0$ is even, whence it follows that
$$\lim_{x \to 0^-} \frac{\sin x}{x} = \lim_{x \to 0^+} \frac{\sin x}{x} = 1.$$
Since both one-sided limits exist and have the same value, the examined limit exists as well and satisfies
$$\lim_{x \to 0} \frac{\sin x}{x} = \lim_{x \to 0^{\pm}} \frac{\sin x}{x} = 1.$$
Let us remark that at first sight, one might try to calculate the limit using l'Hospital's rule. However, this would require knowing the derivative of the sine at zero, which is exactly the limit in question. Thus we may not invoke l'Hospital's rule in this case. □
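The squeeze used in the solution of 5.C.20 can also be observed numerically. The following is only an illustrative sketch (the helper name `squeeze_bounds` and the sample points are our own choices, not part of the text); it checks that $\sin x / x$ is trapped between $\cos x$ and $1$ for a few small positive $x$.

```python
import math

def squeeze_bounds(x):
    """Return (cos x, sin x / x, 1) for a nonzero x in (-pi/2, pi/2)."""
    return math.cos(x), math.sin(x) / x, 1.0

for k in range(1, 7):
    x = 10.0 ** (-k)
    lower, middle, upper = squeeze_bounds(x)
    # The middle value is trapped between cos x and 1 (up to rounding).
    assert lower < middle <= upper
    print(f"x = {x:.0e}: cos x = {lower:.12f}, sin x / x = {middle:.12f}")
```

Of course, such a table proves nothing; it only illustrates the inequalities that the geometric argument establishes.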
let us consider the function $f$ given by
$$f(x) = \begin{cases} 0 & \text{if } x \neq 0, \\ 1 & \text{if } x = 0. \end{cases}$$
Apparently, the limit at zero is well-defined, and in accordance with our expectations, $\lim_{x \to 0} f(x) = 0$ even though the value $f(0) = 1$ does not belong to small neighbourhoods of the limit value 0.
An equivalent definition using $\varepsilon$-neighbourhoods of the limits $a$ and $\delta$-neighbourhoods of the limit points $x_0$ is the following: $\lim_{x \to x_0} f(x) = a$ if for each $\varepsilon > 0$ there is a $\delta > 0$ such that for all $x \neq x_0$ satisfying $|x - x_0| < \delta$, we have $|f(x) - a| < \varepsilon$.
(3) One-sided limits. If $A = [a, b]$ is a bounded interval and $x_0 = a$ or $x_0 = b$, we talk about a one-sided limit of the function $f$ at the point $x_0$, from the right and from the left respectively.
If the point $x_0$ is an interior point of the domain of $f$, we can, in order to determine the limit, consider the domain restricted to $[x_0, b]$ or $[a, x_0]$. The resulting limits are also called a right-sided limit and left-sided limit, respectively, of the function $f$ at the point $x_0$. We denote them by $\lim_{x \to x_0^+} f(x)$ and $\lim_{x \to x_0^-} f(x)$, respectively. As an example, we can consider the one-sided limits at $x_0 = 0$ for Heaviside's function $h$ from the beginning of this part. Apparently,
$$\lim_{x \to 0^+} h(x) = 1, \qquad \lim_{x \to 0^-} h(x) = 0.$$
However, the limit $\lim_{x \to 0} h(x)$ does not exist.
It follows from the definitions that the limit at an interior point of the domain of an arbitrary function f exists if and only if both one-sided limits exist and are equal.
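To make the one-sided limits of Heaviside's function concrete, here is a small sketch that samples the function along points approaching $0$ from either side. The value $h(0) = 1/2$ used below is an assumed convention (it plays no role in the limits), and the helper `one_sided_samples` is a hypothetical name introduced only for this illustration.

```python
def heaviside(x):
    # Assumed convention: h(0) = 1/2; the limits do not depend on this value.
    if x < 0:
        return 0.0
    if x > 0:
        return 1.0
    return 0.5

def one_sided_samples(f, x0, side, steps=8):
    """Values f(x0 + side * 2**-k) for k = 1..steps; side is +1 or -1."""
    return [f(x0 + side * 2.0 ** (-k)) for k in range(1, steps + 1)]

right = one_sided_samples(heaviside, 0.0, +1)
left = one_sided_samples(heaviside, 0.0, -1)
print("from the right:", right)
print("from the left: ", left)
```

The samples from the right are constantly 1 and those from the left constantly 0, in agreement with the two different one-sided limits.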
5.2.11. Further examples of limits. (1) The limit of a complex function $f : A \to \mathbb{C}$ in a limit point $x_0$ of its domain exists if and only if the limits of both the real part and the imaginary part exist. In this case, we have
$$\lim_{x \to x_0} f(x) = \lim_{x \to x_0} (\operatorname{Re} f(x)) + i \lim_{x \to x_0} (\operatorname{Im} f(x)).$$
The proof is straightforward and makes direct use of the definitions of distances and neighbourhoods of the points in the complex plane. Indeed, the membership in a $\delta$-neighbourhood of a complex value $z$ is guaranteed by the real $(1/\sqrt{2})\delta$-neighbourhoods of the real and the imaginary parts of $z$. Hence the proposition follows immediately.
(2) Let $f$ be a real or complex polynomial. Then for every point $x_0 \in \mathbb{R}$,
$$\lim_{x \to x_0} f(x) = f(x_0).$$
Indeed, if $f(x) = a_n x^n + \dots + a_0$, then the identity
$$(x_0 + \delta)^k = x_0^k + k\delta x_0^{k-1} + \dots + \delta^k,$$
substituted for $k = 0, \dots, n$, shows that choosing a sufficiently small $\delta$ makes the values arbitrarily close to $f(x_0)$.
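5.C.21. Determine the limits
$$\lim_{n \to \infty} \left(\frac{n}{n+1}\right)^n, \qquad \lim_{n \to \infty} \left(1 + \frac{1}{n^2}\right)^n, \qquad \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n^2},$$
$$\lim_{x \to 0} \frac{\sin^2 x}{x}, \qquad \lim_{x \to 0} \frac{x}{\sin^2 x}, \qquad \lim_{x \to 0} \frac{\arcsin x}{x},$$
$$\lim_{x \to 0} \frac{3\tan^2 x}{5x^2}, \qquad \lim_{x \to 0} \frac{\sin(3x)}{\sin(5x)}, \qquad \lim_{x \to 0} \frac{\tan(3x)}{\sin(5x)},$$
$$\lim_{x \to 0} \frac{e^{5x} - e^{2x}}{x}, \qquad \lim_{x \to 0} \frac{e^{5x} - e^{-x}}{\sin(2x)}.$$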
Solution. When calculating these limits, we will use our knowledge of the following limits ($a \in \mathbb{R}$):
$$\lim_{n \to \infty} \left(1 + \frac{a}{n}\right)^n = e^a, \qquad \lim_{x \to 0} \frac{\sin x}{x} = 1, \qquad \lim_{x \to 0} \frac{e^x - 1}{x} = 1.$$
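Thus we know that
$$e^{-1} = \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \lim_{n \to \infty} \left(\frac{n-1}{n}\right)^n.$$
The substitution $m = n - 1$ gives us
$$\lim_{n \to \infty} \left(\frac{n-1}{n}\right)^n = \lim_{m \to \infty} \left(\frac{m}{m+1}\right)^{m+1} = \lim_{m \to \infty} \left(\frac{m}{m+1}\right)^m \cdot \lim_{m \to \infty} \frac{m}{m+1}.$$
Altogether, we have
$$e^{-1} = \lim_{m \to \infty} \left(\frac{m}{m+1}\right)^m \cdot \lim_{m \to \infty} \frac{m}{m+1}.$$
Clearly, the second limit is equal to 1. Changing the variables (replacing $m$ with $n$), we can write the result
$$e^{-1} = \lim_{n \to \infty} \left(\frac{n}{n+1}\right)^n.$$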
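Further, it holds that
$$\lim_{n \to \infty} \left(1 + \frac{1}{n^2}\right)^n = \lim_{n \to \infty} \left(\left(1 + \frac{1}{n^2}\right)^{n^2}\right)^{1/n} = e^0 = 1$$
and
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n^2} = \lim_{n \to \infty} \left(\left(1 - \frac{1}{n}\right)^n\right)^n = e^{-\infty} = 0.$$
Let us point out that the first result follows from the limits
$$\lim_{n \to \infty} \left(1 + \frac{1}{n^2}\right)^{n^2} = \lim_{m \to \infty} \left(1 + \frac{1}{m}\right)^m = e, \qquad \lim_{n \to \infty} \frac{1}{n} = 0$$
and the second one from
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1}, \qquad \lim_{n \to \infty} n = +\infty,$$
where $e^{-\infty} = 0$ is shorthand for $\lim_{x \to -\infty} e^x = 0$.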
(3) Now consider the following function defined on the whole real line:
$$f(x) = \begin{cases} 1 & \text{if } x \in \mathbb{Q}, \\ 0 & \text{if } x \notin \mathbb{Q}. \end{cases}$$
It is apparent from the definition that this function cannot have (even one-sided) limits at any point of its domain.
(4) The following function is even trickier than the previous one. Let $f : \mathbb{R} \to \mathbb{R}$ be the function defined as follows:⁸
$$f(x) = \begin{cases} 1 & \text{if } x = 0, \\ \frac{1}{q} & \text{if } x = \frac{p}{q} \in \mathbb{Q},\ p,\ q > 0 \text{ are co-prime}, \\ 0 & \text{if } x \notin \mathbb{Q}. \end{cases}$$
Choose any point $x_0$, no matter whether rational or irrational. Our goal is to show that $\lim_{x \to x_0} f(x) = 0$. Thus fix any $\varepsilon > 0$ and look at the possible values of $f(x)$ close to $x_0$. Notice that $f(x) \ge \varepsilon$, i.e. $\frac{1}{q} \ge \varepsilon$, can be true for only a finite number of $q \in \mathbb{N}$. This behaviour is illustrated in the diagram. In particular, there can be only a finite number of points $x$ in the interval $(x_0 - 1, x_0 + 1)$ for which $f(x) \ge \varepsilon$. Label them $x_1, \dots, x_n$. Finally, choose $\delta > 0$ smaller than 1 and smaller than the distance from $x_0$ to each of the points $x_i$ different from $x_0$. Then $f(x) < \varepsilon$ for all $x \in O_\delta(x_0) \setminus \{x_0\}$. This finishes the proof.
Notice that this limit equals the function value only at the irrational points.
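The finiteness argument in (4) can be replayed with exact rational arithmetic. The sketch below is an illustration only: it works with rational inputs (irrational points, where the function vanishes, cannot be represented this way), and the helper names `thomae`, `large_value_points` and `safe_delta` are our own. Note that `Fraction` stores $0$ as $0/1$, which matches the value $f(0) = 1$.

```python
import math
from fractions import Fraction

def thomae(x):
    """Thomae function on exact rationals: 1/q for x = p/q in lowest terms."""
    x = Fraction(x)  # Fraction normalises to lowest terms with q > 0
    return Fraction(1, x.denominator)

def large_value_points(x0, eps):
    """All rationals x in (x0 - 1, x0 + 1) with thomae(x) >= eps."""
    x0, eps = Fraction(x0), Fraction(eps)
    points = set()
    q = 1
    while Fraction(1, q) >= eps:  # only finitely many denominators qualify
        for p in range(math.floor((x0 - 1) * q), math.ceil((x0 + 1) * q) + 1):
            x = Fraction(p, q)
            if x0 - 1 < x < x0 + 1:
                points.add(x)
        q += 1
    return points

def safe_delta(x0, eps):
    """A delta such that thomae(x) < eps whenever 0 < |x - x0| < delta."""
    x0 = Fraction(x0)
    distances = [abs(x - x0) for x in large_value_points(x0, eps) if x != x0]
    return min(distances + [Fraction(1)]) / 2

delta = safe_delta(Fraction(1, 3), Fraction(1, 10))
print("a suitable delta for x0 = 1/3, eps = 1/10:", delta)
```

The choice of half the minimal distance is one convenient option; any positive number below that minimal distance would do.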
5.2.12. The squeeze theorem. The following result is elementary, but extremely useful. We meet it when computing limits of all types discussed above, i.e. limits of sequences, limits of functions at interior points, one-sided limits, and so on.
Theorem. Let $f$, $g$, $h$ be three real-valued functions with the same domain $A$ and such that there is a deleted neighbourhood of a limit point $x_0 \in \mathbb{R}$ of the domain where
$$f(x) \le g(x) \le h(x).$$
Suppose there are limits
$$\lim_{x \to x_0} f(x) = f_0, \qquad \lim_{x \to x_0} h(x) = h_0$$
This function is called the Thomae function after the German mathematician J. Thomae, 1840-1921. You may find it under many other names too, e.g. Riemann function, pop-corn function, raindrop function, etc. It illustrates how densely the "discontinuity" points of a function can be distributed even though the function has limits everywhere.
We can easily get that
$$\lim_{x \to 0} \frac{\sin^2 x}{x} = \lim_{x \to 0} \sin x \cdot \lim_{x \to 0} \frac{\sin x}{x} = 0 \cdot 1 = 0.$$
Apparently,
$$\lim_{x \to 0} \frac{x}{\sin x} = 1^{-1} = 1$$
and the limit
$$\lim_{x \to 0} \frac{1}{\sin x}$$
does not exist (we write $1/\pm 0$). If we used the rule for the limit of a product to determine the limit
$$\lim_{x \to 0} \frac{x}{\sin^2 x},$$
we would obtain $1 \cdot 1/\pm 0 = 1/\pm 0$. This means that the limit does not exist (this, again, is a determinate form). For the calculation of
$$\lim_{x \to 0} \frac{\arcsin x}{x}$$
we will make use of the identity $x = \sin(\arcsin x)$ which holds for any $x \in (-1, 1)$, that is, in some neighborhood of the point 0. Substituting $y = \arcsin x$, we get
$$\lim_{x \to 0} \frac{\arcsin x}{x} = \lim_{x \to 0} \frac{\arcsin x}{\sin(\arcsin x)} = \lim_{y \to 0} \frac{y}{\sin y} = 1.$$
Let us remark that $y \to 0$ follows from substituting $x = 0$ into $y = \arcsin x$ and from continuity of this function at 0 (this also guarantees that such a substitution can be made). We can immediately see that
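$$\lim_{x \to 0} \frac{3\tan^2 x}{5x^2} = \lim_{x \to 0} \left(\frac{3}{5} \cdot \frac{\sin x}{x} \cdot \frac{\sin x}{x} \cdot \frac{1}{\cos^2 x}\right) = \frac{3}{5} \cdot \lim_{x \to 0} \frac{\sin x}{x} \cdot \lim_{x \to 0} \frac{\sin x}{x} \cdot \lim_{x \to 0} \frac{1}{\cos^2 x} = \frac{3}{5} \cdot 1 \cdot 1 \cdot 1 = \frac{3}{5}.$$
By appropriate extension and substitution, we get
$$\lim_{x \to 0} \frac{\sin(3x)}{\sin(5x)} = \lim_{x \to 0} \left(\frac{\sin(3x)}{3x} \cdot \frac{5x}{\sin(5x)} \cdot \frac{3}{5}\right) = \lim_{x \to 0} \frac{\sin(3x)}{3x} \cdot \lim_{x \to 0} \frac{5x}{\sin(5x)} \cdot \frac{3}{5} = \lim_{y \to 0} \frac{\sin y}{y} \cdot \lim_{z \to 0} \frac{z}{\sin z} \cdot \frac{3}{5} = 1 \cdot 1 \cdot \frac{3}{5} = \frac{3}{5}.$$
Thanks to the previous result, it can easily be calculated that
$$\lim_{x \to 0} \frac{\tan(3x)}{\sin(5x)} = \lim_{x \to 0} \left(\frac{\sin(3x)}{\sin(5x)} \cdot \frac{1}{\cos(3x)}\right) = \lim_{x \to 0} \frac{\sin(3x)}{\sin(5x)} \cdot \lim_{x \to 0} \frac{1}{\cos(3x)} = \frac{3}{5} \cdot 1 = \frac{3}{5}.$$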
and $f_0 = h_0$. Then the limit
$$\lim_{x \to x_0} g(x) = g_0$$
exists, and it satisfies $g_0 = f_0 = h_0$.
Proof. From the assumptions of the theorem, it follows that for any $\varepsilon > 0$, there is a neighbourhood $O(x_0)$ of the point $x_0$ (a limit point of $A \subset \mathbb{R}$) in which both $f(x)$ and $h(x)$ lie in the interval $(f_0 - \varepsilon, f_0 + \varepsilon)$, for all $x \neq x_0$. From the condition $f(x) \le g(x) \le h(x)$, it follows that $g(x) \in (f_0 - \varepsilon, f_0 + \varepsilon)$, and so $\lim_{x \to x_0} g(x) = f_0$.
The above reasoning is easily modified for infinite limit values or for limits at infinite points $x_0$. In the first case, choose a large $N$ instead of $\varepsilon$. The condition on the values reads: both $f(x)$ and $h(x)$ have values larger than $N$ on the neighbourhood $O(x_0) \setminus \{x_0\}$, and thus the same will be true for $g(x)$. In the second case, the neighbourhood $O$ will be an interval $(M, \infty)$. The other infinite limit point $-\infty$ is dealt with similarly. □
The next theorem reveals the elementary properties of limits, again for all types together. Think about the individual cases, including the limits taken at $x_0 = \pm\infty$!
5.2.13. Theorem. Let $A \subset \mathbb{R}$ be the domain of real or complex functions $f$ and $g$, let $x_0$ be a limit point of $A$ and let the limits
$$\lim_{x \to x_0} f(x) = a \in \mathbb{K}, \qquad \lim_{x \to x_0} g(x) = b \in \mathbb{K}$$
exist. Then:
(1) the limit $a$ is unique,
(2) the limit of the sum $f + g$ exists and satisfies
$$\lim_{x \to x_0} (f(x) + g(x)) = a + b,$$
(3) the limit of the product $f \cdot g$ exists and satisfies
$$\lim_{x \to x_0} (f(x) \cdot g(x)) = a \cdot b.$$
In particular, if $f(x) = a$ is a constant function then
$$\lim_{x \to x_0} a \cdot g(x) = a \cdot b,$$
(4) if $b \neq 0$, the limit of the quotient $f/g$ exists and satisfies
$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \frac{a}{b}.$$
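Similarly, we can determine
$$\lim_{x \to 0} \frac{e^{5x} - e^{2x}}{x} = \lim_{x \to 0} \left(e^{2x} \cdot \frac{e^{(5-2)x} - 1}{(5-2)x} \cdot (5 - 2)\right) = \lim_{x \to 0} e^{2x} \cdot \lim_{x \to 0} \frac{e^{3x} - 1}{3x} \cdot 3 = e^0 \cdot \lim_{y \to 0} \frac{e^y - 1}{y} \cdot 3 = 1 \cdot 1 \cdot 3 = 3$$
and also
$$\lim_{x \to 0} \frac{e^{5x} - e^{-x}}{\sin(2x)} = \lim_{x \to 0} \left(\frac{e^{5x} - 1}{\sin(2x)} - \frac{e^{-x} - 1}{\sin(2x)}\right)$$
$$= \lim_{x \to 0} \left(\frac{e^{5x} - 1}{5x} \cdot \frac{2x}{\sin(2x)} \cdot \frac{5}{2}\right) - \lim_{x \to 0} \left(\frac{e^{-x} - 1}{-x} \cdot \frac{2x}{\sin(2x)} \cdot \left(-\frac{1}{2}\right)\right)$$
$$= \frac{5}{2} \lim_{u \to 0} \frac{e^u - 1}{u} \cdot \lim_{z \to 0} \frac{z}{\sin z} + \frac{1}{2} \lim_{v \to 0} \frac{e^v - 1}{v} \cdot \lim_{z \to 0} \frac{z}{\sin z} = \frac{5}{2} + \frac{1}{2} = 3.$$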
□
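The values obtained in 5.C.21 can be sanity-checked by evaluating the expressions at a large index or at a small argument. This is only a rough numerical sketch (the sample points $n = 10^6$, $n = 1000$ and $x = 10^{-4}$, as well as the helper `near_zero`, are arbitrary choices of ours); it does not replace the calculation above, and the divergent case $x/\sin^2 x$ is left out.

```python
import math

def near_zero(f, x=1e-4):
    """Evaluate f at a small positive x as a rough stand-in for the limit x -> 0."""
    return f(x)

n = 10**6
sequence_estimates = {
    "(n/(n+1))^n": (n / (n + 1)) ** n,             # claimed limit: 1/e
    "(1+1/n^2)^n": (1 + 1 / n**2) ** n,            # claimed limit: 1
    "(1-1/n)^(n^2)": (1 - 1 / 1000) ** (1000**2),  # uses n = 1000; claimed limit: 0
}

function_estimates = {
    "sin^2(x)/x": near_zero(lambda x: math.sin(x) ** 2 / x),                  # 0
    "arcsin(x)/x": near_zero(lambda x: math.asin(x) / x),                     # 1
    "3tan^2(x)/(5x^2)": near_zero(lambda x: 3 * math.tan(x) ** 2 / (5 * x * x)),  # 3/5
    "sin(3x)/sin(5x)": near_zero(lambda x: math.sin(3 * x) / math.sin(5 * x)),    # 3/5
    "tan(3x)/sin(5x)": near_zero(lambda x: math.tan(3 * x) / math.sin(5 * x)),    # 3/5
    "(e^5x-e^2x)/x": near_zero(lambda x: (math.exp(5 * x) - math.exp(2 * x)) / x),  # 3
    "(e^5x-e^-x)/sin(2x)": near_zero(
        lambda x: (math.exp(5 * x) - math.exp(-x)) / math.sin(2 * x)),        # 3
}

for name, value in {**sequence_estimates, **function_estimates}.items():
    print(f"{name:>22}: {value:.6f}")
```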
5.C.22. Calculate the limits
$$\lim_{x \to 0} \frac{1 - \cos(2x)}{x \sin x}; \qquad \lim_{x \to 0} \frac{1 - \cos x}{x^2}.$$
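Solution. We will utilize the fact that
$$\lim_{x \to 0} \frac{\sin x}{x} = 1.$$
Then we get
$$\lim_{x \to 0} \frac{1 - \cos(2x)}{x \sin x} = \lim_{x \to 0} \frac{1 - (\cos^2 x - \sin^2 x)}{x \sin x} = \lim_{x \to 0} \frac{(1 - \cos^2 x) + \sin^2 x}{x \sin x} = \lim_{x \to 0} \frac{2\sin^2 x}{x \sin x} = \lim_{x \to 0} 2 \frac{\sin x}{x} = 2;$$
and
$$\lim_{x \to 0} \frac{1 - \cos x}{x^2} = \lim_{x \to 0} \left(\frac{1 - \cos x}{x^2} \cdot \frac{1 + \cos x}{1 + \cos x}\right) = \lim_{x \to 0} \frac{1 - \cos^2 x}{x^2 (1 + \cos x)} = \lim_{x \to 0} \frac{\sin^2 x}{x^2 (1 + \cos x)} = \left(\lim_{x \to 0} \frac{\sin x}{x}\right)^2 \cdot \lim_{x \to 0} \frac{1}{1 + \cos x} = \frac{1}{2}.$$
Let us remark that we could also use the identity
$$1 - \cos(2x) = 2\sin^2 x, \qquad x \in \mathbb{R}.$$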
□
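The alternative route via the double-angle identity can be written out in full. The following worked computation is a sketch that uses only the identity $1 - \cos(2x) = 2\sin^2 x$ and the limit of $\sin x / x$; the substitution $t = x/2$ in the second limit is our own choice of notation.

```latex
\lim_{x\to 0}\frac{1-\cos(2x)}{x\sin x}
  = \lim_{x\to 0}\frac{2\sin^2 x}{x\sin x}
  = 2\lim_{x\to 0}\frac{\sin x}{x} = 2,
\qquad
\lim_{x\to 0}\frac{1-\cos x}{x^2}
  = \lim_{x\to 0}\frac{2\sin^2(x/2)}{x^2}
  = \frac{1}{2}\lim_{t\to 0}\left(\frac{\sin t}{t}\right)^2 = \frac{1}{2}.
```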
Proof. (1) Suppose $a$ and $a'$ are two values of the limit $\lim_{x \to x_0} f(x)$. If $a \neq a'$, then there are disjoint neighbourhoods $O(a)$ and $O(a')$. However, for sufficiently small neighbourhoods of $x_0$, the values of $f$ should lie in both neighbourhoods. This is a contradiction. Thus $a = a'$.
(2) Choose a neighbourhood of $a + b$, for instance $O_{2\varepsilon}(a + b)$. For a sufficiently small neighbourhood of $x_0$ and $x \neq x_0$, both $f(x)$ and $g(x)$ will lie in $\varepsilon$-neighbourhoods of the points $a$ and $b$. Hence their sum will lie in the $2\varepsilon$-neighbourhood of $a + b$. The proposition is proved.
(3) Similarly to the above paragraph: for sufficiently small neighbourhoods of $x_0$, the values of both $f$ and $g$ will lie in $\varepsilon$-neighbourhoods of the values $a$ and $b$. Therefore,
$$|f(x)g(x) - ab| \le |f(x) - a| \cdot |g(x)| + |a| \cdot |g(x) - b| < \varepsilon(|b| + \varepsilon) + |a|\varepsilon,$$
and the right-hand side can be made smaller than the radius of any chosen neighbourhood of $ab$ by taking $\varepsilon$ small enough.
Clearly, the limit of the constant function $f(x) = a$ is $a$ at all limit points of its domain.
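(4) In view of the previous results, it suffices to prove $\lim_{x \to x_0} \frac{1}{g(x)} = \frac{1}{b}$ for $b \neq 0$. We need to be careful when considering complex valued functions. We need to estimate
$$\left|\frac{1}{g(x)} - \frac{1}{b}\right| = \frac{|b - g(x)|}{|g(x)| \cdot |b|}.$$
Since $|b| > 0$, we may restrict ourselves to a neighbourhood $U$ of $x_0$ such that $|g(x)| > \frac{|b|}{2}$. Then $|g(x)| \cdot |b| > \frac{|b|^2}{2}$. Thus, if $|g(x) - b| < \varepsilon$, then
$$\left|\frac{1}{g(x)} - \frac{1}{b}\right| < \frac{2|b - g(x)|}{|b|^2} < \frac{2\varepsilon}{|b|^2}.$$
This verifies the claim.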
□
5.2.14. Remarks on infinite values of limits. The statement of the theorem can be extended to some infinite values of the limits of real-valued functions: For sums, either at least one of the two limits must be finite or both limits share the same sign. Then the limit of the sum is the sum of the limits, with the conventions from 5.2.9. However, "$\infty - \infty$" is excluded.
For products, if one of the limits is infinite, then the other limit must be non-zero. Then the limit of the product is the product of the limits. The case "$0 \cdot (\pm\infty)$" is excluded.
For a quotient, it may be that $a \in \mathbb{R}$ and $b = \pm\infty$, then the resulting limit will be zero; or $a = \pm\infty$ and $b \in \mathbb{R}$, then it will be $\pm\infty$ according to the signs of the numerator and the denominator. The case "$\frac{\pm\infty}{\pm\infty}$" is excluded.
The theorem also covers, as a special case, the corresponding statements about the convergence of sequences as well as about one-sided limits of functions defined on an interval.
The following provides a "convergence test" useful in many situations. It relates to limits of sequences and functions in general.
5.2.15. Proposition. Consider a real or complex valued function $f$ defined on a set $A \subset \mathbb{R}$ and a limit point $x_0$ of the set $A$. The function $f$ has a limit $y$ at $x_0$ if and only if for every sequence of
D. Continuity of functions
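5.D.1. Let us examine existence of limits and continuity of the function $(x - 1)^{-\operatorname{sgn} x}$ at the points 0 and 1.
Solution. First, let us calculate the one-sided limits at the point 0:
$$\lim_{x \to 0^-} (x - 1)^{-\operatorname{sgn} x} = \lim_{x \to 0^-} (x - 1) = -1, \qquad \lim_{x \to 0^+} (x - 1)^{-\operatorname{sgn} x} = \lim_{x \to 0^+} \frac{1}{x - 1} = -1,$$
whence $\lim_{x \to 0} (x - 1)^{-\operatorname{sgn} x} = -1$. However, the function value at 0 equals 1, so the examined function is not continuous at the point 0. Further, we have that
$$\lim_{x \to 1^-} (x - 1)^{-\operatorname{sgn} x} = \lim_{x \to 1^-} \frac{1}{x - 1} = -\infty, \qquad \lim_{x \to 1^+} (x - 1)^{-\operatorname{sgn} x} = \lim_{x \to 1^+} \frac{1}{x - 1} = \infty.$$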
Both one-sided limits at the point 1 exist, yet they differ, which implies that the (two-sided) limit of this function at 1 does not exist, and the function is not continuous here, either.
□
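5.D.2. Without invoking the squeeze theorem, prove that the function
$$R(x) = \begin{cases} x, & x \in \{\frac{1}{n};\ n \in \mathbb{N}\}, \\ 0, & x \in \mathbb{R} \setminus \{\frac{1}{n};\ n \in \mathbb{N}\} \end{cases}$$
is continuous at the point 0.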
Solution. The function $R$ is continuous at the point 0 if and only if
$$\lim_{x \to 0} R(x) = R(0) = 0.$$
We will show that, by the definition of a limit, the examined limit equals 0. Let $\delta > 0$ be arbitrary. For any $x \in (-\delta, \delta)$ we have that $R(x) = 0$, or $R(x) = x$, hence (in both cases) we get $R(x) \in (-\delta, \delta)$. In other words, having chosen any $\delta$-neighborhood $(-\delta, \delta)$ of the point $a$, we can take the $\delta$-neighborhood $(-\delta, \delta)$ of the point $x_0$, as then for any $x \in (-\delta, \delta)$ (the considered neighborhood of $x_0$) it holds that $R(x) \in (-\delta, \delta)$ (here, the interval $(-\delta, \delta)$ is the neighborhood of $a$). This matches the definition of a limit (we did not even have to require $x \neq x_0$).
The considered function $R$ is called the Riemann function (hence the name $R$). In literature, it can be found in many modifications. For instance, the function
$$f(x) = \begin{cases} 1, & x \in \mathbb{Z}; \\ \frac{1}{q}, & x = \frac{p}{q} \in \mathbb{Q};\ p, q \text{ relatively prime},\ q > 1; \\ 0, & x \notin \mathbb{Q} \end{cases}$$
is also "often" called the Riemann function. □
points $x_n \in A$ converging to $x_0$, $x_n \neq x_0$, the sequence of the values $f(x_n)$ has limit $y$.
[Figure: convergence test]
Proof. Suppose first that the limit of $f$ at $x_0$ is $y$. Then for any neighbourhood $U$ of the point $y$, there is a neighbourhood $V$ of $x_0$ such that for all $x \in V \cap A$, $x \neq x_0$, we have $f(x) \in U$. For every sequence $x_n \to x_0$ of points different from $x_0$, the terms $x_n$ lie in $V$ for all $n$ greater than a suitable $N$. Therefore, the sequence $f(x_n)$ converges to $y$.
Now suppose that the function $f$ does not converge to $y$ as $x \to x_0$. Then for some neighbourhood $U$ of $y$, there is a sequence of points $x_m \neq x_0$ in $A$ which are closer to $x_0$ than $1/m$, with $f(x_m)$ not belonging to $U$. In this way, we have constructed a sequence of points lying in $A$ and different from $x_0$, with $\lim_{m \to \infty} x_m = x_0$, for which the values $f(x_m)$ do not converge to $y$. The proof is finished.
□
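The proposition is often used in the negative direction: two sequences converging to $x_0$ with different limits of values show that no limit exists. As an illustration (the function $\sin(1/x)$ and the two sequences below are our own example, not taken from the text), consider the point $x_0 = 0$.

```python
import math

def f(x):
    return math.sin(1.0 / x)

# x_n = 1/(n*pi) -> 0 with f(x_n) = sin(n*pi) = 0
values_on_first = [f(1.0 / (n * math.pi)) for n in range(1, 50)]

# y_n = 1/(pi/2 + 2*n*pi) -> 0 with f(y_n) = sin(pi/2 + 2*n*pi) = 1
values_on_second = [f(1.0 / (math.pi / 2 + 2 * n * math.pi)) for n in range(1, 50)]

print("first sequence, last value: ", values_on_first[-1])
print("second sequence, last value:", values_on_second[-1])
```

Up to rounding, the values along the first sequence are 0 and along the second are 1, so by the proposition the limit of $\sin(1/x)$ at 0 cannot exist.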
5.2.16. Continuity. Continuity was discussed intuitively when polynomials were discussed. Now all the tools for a proper formulation of continuity are prepared. This is the basic class of functions in the sequel.
Continuity of functions
Definition. Let $f$ be a real or complex valued function defined on an interval $A \subset \mathbb{R}$. The function $f$ is continuous at a point $x_0 \in A$ if and only if
$$\lim_{x \to x_0} f(x) = f(x_0).$$
The function $f$ is continuous on an interval $A$ if and only if it is continuous at every point $x_0 \in A$.
The diagram explains the meaning of continuity. Firstly, the limit has to exist. Thus, after choosing a neighbourhood $U$ of the limit value $f(x_0)$ (the $\varepsilon$-neighbourhood $O_\varepsilon(f(x_0))$ is shown), there is a neighbourhood of $x_0$ (the $\delta$-neighbourhood is shown), for which all images lie in $U$. In words, if we decide how close we want to be to $f(x_0)$, we always may choose
5.D.3. By defining the values at the points $-1$ and $1$, extend the function
$$f(x) = (x^2 - 1) \sin\frac{2x - 1}{x^2 - 1}, \qquad x \neq \pm 1 \ (x \in \mathbb{R}),$$
so that the resulting function is continuous on the whole $\mathbb{R}$.
Solution. The original function is continuous at every point of its domain. Thus the extended function will be continuous if and only if we set
$$f(-1) := \lim_{x \to -1} \left((x^2 - 1) \sin\frac{2x - 1}{x^2 - 1}\right), \qquad f(1) := \lim_{x \to 1} \left((x^2 - 1) \sin\frac{2x - 1}{x^2 - 1}\right).$$
If either of these limits did not exist (or were infinite), the function could not be extended to a continuous one. Clearly we have that
$$\left|\sin\frac{2x - 1}{x^2 - 1}\right| \le 1, \qquad x \neq \pm 1 \ (x \in \mathbb{R}),$$
whence it follows that
$$-|x^2 - 1| \le f(x) \le |x^2 - 1|, \qquad x \neq \pm 1 \ (x \in \mathbb{R}).$$
Since
$$\lim_{x \to \pm 1} |x^2 - 1| = 0,$$
by the squeeze theorem, we get the result $f(\pm 1) := 0$. □
5.D.4. Determine whether the equation $e^{2x} - x^4 + 3x^3 - 6x^2 = 5$ has a positive solution.
Solution. Let us consider the function
$$f(x) := e^{2x} - x^4 + 3x^3 - 6x^2 - 5, \qquad x \ge 0,$$
for which
$$f(0) = -4, \qquad \lim_{x \to +\infty} f(x) = \lim_{x \to +\infty} e^{2x} = +\infty.$$
From the fact that $f$ is continuous on the whole domain it thus follows that it takes on all values $y \in [-4, +\infty)$. In particular, its graph necessarily intersects the positive semiaxis $x$, i.e. the equation $f(x) = 0$ has a solution. □
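The existence argument in 5.D.4 can be turned into a numerical search. The sketch below uses plain bisection, which relies on exactly the same property (a continuous function changing sign on an interval has a zero there); the bracket $[0, 2]$, the tolerance, and the helper name `bisect` are choices made for this illustration.

```python
import math

def f(x):
    return math.exp(2 * x) - x**4 + 3 * x**3 - 6 * x**2 - 5

def bisect(func, lo, hi, tol=1e-10):
    """Locate a zero of a continuous func with func(lo) < 0 < func(hi)."""
    assert func(lo) < 0 < func(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid   # the sign change stays in [mid, hi]
        else:
            hi = mid   # the sign change stays in [lo, mid]
    return (lo + hi) / 2

root = bisect(f, 0.0, 2.0)
print("approximate positive solution:", root)
```

Since $f(1) < 0 < f(2)$, the approximation returned lies between 1 and 2.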
5.D.5. At which points $x \in \mathbb{R}$ is the function
„cos(x+2)— x3 „21   ■        I e
y — cos I arctg     12xzl + 11
-11 - a12 + sin (sin (sina))
a sufficiently small neighbourhood of $x_0$ where this is guaranteed.
Notice that for the boundary points of the interval $A$, the definition says that the value of $f$ equals the value of the one-sided limit there. The function is said to be right-continuous or left-continuous at such a point. Every polynomial is a continuous function on the whole $\mathbb{R}$, see 5.2.11(2). The Thomae function is continuous at irrational real numbers only although it has limits at all rational points as well, see 5.2.11(4). The previous theorem 5.2.13 about limit properties immediately implies all of the following claims. The same properties are true for right-continuity or left-continuity, as is easily checked.
5.2.17. Theorem. Let $f$ and $g$ be (real or complex valued) functions defined on an interval $A \subset \mathbb{R}$ and continuous at a point $x_0 \in A$. Then
(1) the sum $f + g$ is continuous at $x_0$,
(2) the product $f \cdot g$ is continuous at $x_0$,
(3) if $g(x_0) \neq 0$, then the quotient $f/g$ is well-defined on some neighbourhood of $x_0$ and is continuous at $x_0$.
(4) If a continuous function $h$ is defined on a neighbourhood of the value $f(x_0)$ of the real-valued function $f$, then the composite function $h \circ f$ is defined on a neighbourhood of $x_0$ and is continuous at $x_0$.
Proof. Statements (1) and (2) are clear. For property (3): if $g(x_0) \neq 0$, then the $\varepsilon$-neighbourhood of the number $g(x_0)$ does not contain zero for a sufficiently small $\varepsilon > 0$. By the continuity of $g$, it follows that on a sufficiently small $\delta$-neighbourhood of the point $x_0$, $g$ is non-zero and the quotient $f/g$ is thus well-defined there. It is continuous at $x_0$ by the previous theorem.
(4) Choose a neighbourhood $O$ of $h(f(x_0))$. By the continuity of $h$, there is a neighbourhood $O'$ of $f(x_0)$ which is mapped into $O$ by $h$. The continuous function $f$ maps some sufficiently small neighbourhood of the point $x_0$ into the neighbourhood $O'$. This is the defining property of continuity, so the proof is finished. □
(considering maximum domain) continuous?
o
5.D.6. Determine whether the function
a < 0; 0 < x < 1;
x = 1; 1< x < 2; 2 < x < 3; a; > 3
is continuous; left-continuous; right-continuous at the points $-\pi$, 0, 1, 2, 3, $\pi$. O
i
x-3'
5.D.7. Extend the function
/(a) = arctg ( 1 +     ) ■ sin" ab,      a G R \ {0}
at a = 0 so that it is continuous at this point.
5.D.8. Find all $p \in \mathbb{R}$ for which the function
$$f(x) = \frac{\sin(6x)}{3x}, \quad x \in \mathbb{R} \setminus \{0\}; \qquad f(0) = p$$
is continuous at the origin.
h{x) =
5.D.9. Choose a real number a so that the function
a4 - 1
is continuous on R.
a > 1;       h (a) = a,    a < 1
5.D.10. Calculate
lim
x-s-0+ aJ
lim
o
o
o
o
5.D.11. Find all possible values of the parameter $a \in \mathbb{R}$ so that the inequality
$$(a - 2)x^2 - (a - 2)x + 1 > 0$$
holds for all real numbers $x$.
Solution. We can notice that for $a = 2$, the inequality holds trivially (there is the constant 1 on the left side). For $a \neq 2$, the left side is a quadratic function $f(x)$ in the variable $x$, and further $f(0) = 1$. Since the function $f(x)$ is continuous, the inequality $f(x) > 0$ will hold for all real $x$ if and only if there is no solution to the equation $f(x) = 0$ in $\mathbb{R}$ (the whole of the graph of the function $f$ will then be "above" the $x$-axis). This will occur if and only if the discriminant of the quadratic equation $(a - 2)x^2 - (a - 2)x + 1 = 0$ (in $x$) is negative. Thus we get the following necessary and sufficient condition:
$$D = (a - 2)^2 - 4(a - 2) = (a - 2)(a - 6) < 0.$$
This is true for $a \in (2, 6)$. Altogether, the inequality holds for all real $x$ iff $a \in [2, 6)$. □
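The condition found in 5.D.11 is easy to cross-check on a grid of sample points. The sketch below is only an illustration under our own choices: `holds_for_all_x` encodes the discriminant criterion (with the leading coefficient case split spelled out), and `sampled_minimum` evaluates the quadratic on a finite grid, which can refute but never prove positivity.

```python
def holds_for_all_x(a):
    """True iff (a-2)x^2 - (a-2)x + 1 > 0 for every real x."""
    c = a - 2
    if c == 0:
        return True            # the left side is the constant 1
    if c < 0:
        return False           # parabola opens downward, so it takes negative values
    return c * c - 4 * c < 0   # negative discriminant (a-2)(a-6)

def sampled_minimum(a, xs):
    """Smallest value of the quadratic over the sample points xs."""
    c = a - 2
    return min(c * x * x - c * x + 1 for x in xs)

grid = [k / 100 for k in range(-1000, 1001)]
for a in (1, 2, 4, 5.9, 6, 7):
    print(a, holds_for_all_x(a), sampled_minimum(a, grid))
```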
5.2.18. We consider some basic relations between continuous mappings and the topology of the real numbers. They exploit the highly non-trivial characterization of compact sets in Theorem 5.2.8.
Topological characterization of continuity
Theorem. Let $f : A \subset \mathbb{R} \to \mathbb{R}$ be a continuous function defined on an interval $A$. Then
(1) $f$ is continuous if and only if the inverse image $f^{-1}(U)$ of every open set $U \subset \mathbb{R}$ is an open set,
(2) the inverse image $f^{-1}(W)$ of every closed set $W \subset \mathbb{R}$ is a closed set,
(3) the image $f(K)$ of every compact set $K \subset A$ is a compact set,
(4) $f$ attains both its maximum and its minimum on every compact set $K$.
Proof. (1) Consider a point $x_0 \in f^{-1}(U)$. There is a neighbourhood $O$ of $f(x_0)$ which is contained in $U$ since $U$ is open. Hence there is a neighbourhood $O'$ of $x_0$ which is mapped into $O$ and thus is contained in the inverse image. Therefore, every point of the inverse image is an interior point, which finishes the proof.
Conversely, if $f^{-1}(U)$ is open for each open $U$, then taking any $\varepsilon$-neighbourhood of $f(x_0)$, its pre-image will be an open neighbourhood of $x_0$ satisfying the condition from the definition of continuity.
(2) Consider a limit point $x_0$ of the inverse image $f^{-1}(W)$ and a sequence $x_i$, $f(x_i) \in W$, which converges to $x_0$. From the continuity of $f$, it follows that $f(x_i)$ converges to $f(x_0)$ (cf. the convergence test 5.2.15). Since $W$ is closed, $f(x_0) \in W$. Thus, all limit points of the inverse image of the set $W$ are contained in this inverse image.
(3) Choose any open cover of $f(K)$. The inverse images of the particular intervals are unions of open intervals and thus create a cover of the set $K$. Select a finite subcover from it. Then finitely many of the corresponding images cover the original set $f(K)$.
(4) Since the image of a compact set is again a compact set, the image must be bounded and it contains both the supremum and the infimum. Hence it follows that these must also be the maximum and the minimum, respectively. □
5.2.19. There are two very useful consequences of the previous theorem.
5.D.12. In $\mathbb{R}$, solve the equation
$$2^x + 3^x + 4^x + 5^x + 6^x = 5.$$
Maxima and minima of continuous functions9
Corollary. Let $f : \mathbb{R} \to \mathbb{R}$ be continuous. Then
(1) the image of every interval is again an interval,
(2) $f$ takes all the values between the maximal and the minimal one on the closed interval $[a, b]$.
Solution. The function on the left side is a sum of five increasing functions on $\mathbb{R}$, so it must be increasing as well. For $x = 0$, its value is 5, so $x = 0$ is the only solution of the equation. □
5.D.13. In $\mathbb{R}$, solve the equation
$$2^x + 3^x + 6^x = 1.$$
O
5.D.14. Determine whether the polynomial
$$P(x) = x^{37} + 5x^{21} - 4x^9 + 5x^4 - 2x - 3$$
has a real root in the interval $(-1, 1)$. O
E. Derivatives
First of all, let us show that the derivatives listed in the table of paragraph 5.3.1 are correct. We will derive them right from the definition of a derivative.
5.E.1. From the definition (see 5.3.1), find the derivatives of the functions $x^n$ ($x$ is the variable, $n$ is a constant positive integer), $\sqrt{x}$, $\sin x$.
Solution. First, let us remark that by substituting $h$ for $x - x_0$ in the definition of a derivative, we get
$$f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}.$$
In the following calculations, we will work with the latter expression of the limit.
Proof. (1) Consider an open interval $A$ and suppose there is a point $y \in \mathbb{R}$ such that $f(A)$ contains points less than $y$ as well as points greater than $y$, but $y \notin f(A)$. Put $B_1 = (-\infty, y)$ and $B_2 = (y, \infty)$. These are open sets, and the union of their inverse images $A_1 = f^{-1}(B_1) \subset A$ and $A_2 = f^{-1}(B_2) \subset A$ contains $A$. The sets $A_1$ and $A_2$ are open, disjoint, and they both have a non-empty intersection with $A$. Thus there is a point $x \in A$ which does not lie in $A_1$ but is a limit point of $A_1$. It is in $A_2$, which is impossible for two disjoint open sets.
Thus it is proved that if there is a point y which does not belong to the image of the interval, then either all of the values must be above y or they all must be below. It follows that the image is again an interval. Notice that the boundary points of this interval may or may not lie in the image.
If the domain interval A contains one of its boundary points, then the continuous function must map it to a limit point or an interior point of the image of the interior of A. This verifies the statement.
(2) This statement immediately follows from the previous one (and the above theorem) since the image of a closed bounded interval (i.e. a compact set) is again a closed interval. □
5.2.20. We conclude this introductory discussion by two more theorems which provide useful tools for calculating limits. Notice that we assume that functions are defined on all of R. Actually we are only interested in / on a neighbourhood of one point a, while g has to be defined on a neighbourhood of one point b only.
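$$(x^n)' = \lim_{h \to 0} \frac{(x + h)^n - x^n}{h} = \lim_{h \to 0} \frac{n x^{n-1} h + \binom{n}{2} x^{n-2} h^2 + \dots + h^n}{h} = n x^{n-1} + \lim_{h \to 0} \left(\binom{n}{2} x^{n-2} h + \binom{n}{3} x^{n-3} h^2 + \dots + h^{n-1}\right) = n x^{n-1}.$$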
This result is usually called the Weierstrass theorem, but it is also known (especially in Czech literature) as Bolzano's theorem. Bernard Bolzano worked in Prague at the beginning of the 19th century and apparently used such a result as a technical lemma when proving the Bolzano-Weierstrass theorem mentioned earlier.
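$$(\sqrt{x})' = \lim_{h \to 0} \frac{\sqrt{x + h} - \sqrt{x}}{h} = \lim_{h \to 0} \frac{(\sqrt{x + h} - \sqrt{x})(\sqrt{x + h} + \sqrt{x})}{h(\sqrt{x + h} + \sqrt{x})} = \lim_{h \to 0} \frac{h}{h(\sqrt{x + h} + \sqrt{x})} = \lim_{h \to 0} \frac{1}{\sqrt{x + h} + \sqrt{x}} = \frac{1}{2\sqrt{x}}.$$
$$(\sin x)' = \lim_{h \to 0} \frac{\sin(x + h) - \sin x}{h} = \lim_{h \to 0} \frac{\sin x \cos h + \cos x \sin h - \sin x}{h} = \lim_{h \to 0} \frac{\cos x \sin h}{h} + \lim_{h \to 0} \frac{\sin x (\cos h - 1)}{h}$$
$$= \cos x \cdot \lim_{h \to 0} \frac{\sin h}{h} - \sin x \cdot \lim_{h \to 0} \frac{2\left(\sin\frac{h}{2}\right)^2}{h} = \cos x \cdot 1 - \sin x \cdot \lim_{t \to 0} \left(\sin t \cdot \frac{\sin t}{t}\right) = \cos x.$$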
□
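The three derivatives found in 5.E.1 can be compared with difference quotients for a small $h$. This is only a numerical sketch (the step $h = 10^{-6}$, the sample points, and the helper name `difference_quotient` are arbitrary choices of ours); the quotient approximates the limit, it does not compute it.

```python
import math

def difference_quotient(f, x0, h=1e-6):
    """(f(x0 + h) - f(x0)) / h, the expression whose limit for h -> 0 is f'(x0)."""
    return (f(x0 + h) - f(x0)) / h

checks = [
    ("x^5 at 2", lambda x: x**5, 2.0, 5 * 2.0**4),              # n x^(n-1)
    ("sqrt(x) at 4", math.sqrt, 4.0, 1 / (2 * math.sqrt(4.0))),  # 1 / (2 sqrt(x))
    ("sin(x) at 1", math.sin, 1.0, math.cos(1.0)),               # cos x
]

for name, func, x0, exact in checks:
    print(f"{name}: quotient = {difference_quotient(func, x0):.6f}, formula = {exact:.6f}")
```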
5.E.2. Differentiate:
i) $x \sin x$,
ii) $\frac{\sin x}{x}$, $x \neq 0$,
iii) $\ln(x + \sqrt{x^2 - a^2})$, $a \neq 0$, $|x| > |a|$,
iv) $\arctan\frac{x}{\sqrt{1 - x^2}}$, $|x| < 1$,
v) $x^x$, $x > 0$.
Solution. (i) By the formula for the derivative of a product (the Leibniz rule, see 5.3.4) we get
$$(x \sin x)' = x' \cdot \sin x + x \cdot (\sin x)' = \sin x + x \cos x.$$
(ii) By the formula for the derivative of a quotient (5.3.5), we have that
$$\left(\frac{\sin x}{x}\right)' = \frac{(\sin x)' \cdot x - \sin x \cdot x'}{x^2} = \frac{x \cos x - \sin x}{x^2}.$$
(iii) This time, we will use the formula for the derivative of function composition (the chain rule, see 5.3.4). Setting $h(x) = \ln(x)$, $f(x) = x + \sqrt{x^2 - a^2}$, we obtain
$$\left(\ln(x + \sqrt{x^2 - a^2})\right)' = h(f(x))' = h'(f(x)) \cdot f'(x) = \frac{(x + \sqrt{x^2 - a^2})'}{x + \sqrt{x^2 - a^2}} = \frac{1 + \frac{x}{\sqrt{x^2 - a^2}}}{x + \sqrt{x^2 - a^2}},$$
where we used the chain rule once again when differentiating
Limits of composite functions
Theorem. Let $f, g : \mathbb{R} \to \mathbb{R}$ be two functions and $\lim_{x \to a} f(x) = b$.
(1) If the function $g$ is continuous at the point $b$, then
$$\lim_{x \to a} g(f(x)) = g\left(\lim_{x \to a} f(x)\right) = g(b).$$
(2) If the limit $\lim_{y \to b} g(y)$ exists and $f(x) \neq b$ holds for all $x \neq a$ from some neighbourhood of the point $a$, then
$$\lim_{x \to a} g(f(x)) = \lim_{y \to b} g(y).$$
Proof. The first proposition can be proved similarly to 5.2.17(4). From the continuity of $g$ at the point $b$, it follows that for any neighbourhood $V$ of the value $g(b)$, we can find a sufficiently small neighbourhood $U$ of the point $b$ which is mapped into $V$ by $g$. However, if $f$ has limit $b$ at the point $a$, then $f(x)$ lies in $U$ for all $x \neq a$ from some sufficiently small neighbourhood of the point $a$, which verifies the first statement.
(2) Even if we cannot use the continuity of $g$ at the point $b$, the previous reasoning remains valid if we ensure that all points $x \neq a$ from sufficiently small neighbourhoods of $a$ are mapped into a neighbourhood of $b$ by the function $f$, and $f(x) \neq b$ for all such points. □
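The assumption $f(x) \neq b$ in part (2) cannot be dropped. A minimal counterexample, given here as our own illustration, takes $f$ constant and $g$ equal to the function from 5.2.10(2) that vanishes everywhere except at the origin:

```latex
f(x) = 0 \ \text{for all } x, \qquad
g(y) = \begin{cases} 0 & \text{if } y \neq 0, \\ 1 & \text{if } y = 0, \end{cases}
\qquad b = \lim_{x \to a} f(x) = 0,
\\[4pt]
\lim_{y \to 0} g(y) = 0
\quad\text{but}\quad
\lim_{x \to a} g(f(x)) = \lim_{x \to a} g(0) = 1 .
```

Here $g$ is not continuous at $b = 0$ and $f(x) = b$ for every $x$, so neither part of the theorem applies, and the two limits indeed differ.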
5.2.21. Who or what is in the ZOO. We have begun to build a menagerie of functions with polynomials and functions which can be created from them "piece-wise". Moreover, we have derived many properties for a huge class of continuous functions. However, except for polynomials, we do not have many practically manageable examples at our disposal. We consider the quotients of polynomials.
Let $f$ and $g$ be two polynomials in the real variable $x$ with complex coefficients. The function
$$h : \mathbb{R} \setminus \{x \in \mathbb{R},\ g(x) = 0\} \to \mathbb{C}, \qquad h(x) = \frac{f(x)}{g(x)},$$
is well-defined for all real x except for the roots of the polynomial g. Such functions are called rational functions. From theorem 5.2.17, it follows that rational functions are continuous at all points of their domains. At the points where they are undefined, they can have
(iv) Again, we are looking for the derivative of a composed function:
arctan
l—xz
1-x2
vT
xz +
1
(v) First, we transform the function to a function with constant base (preferably the base e) which we are able to differentiate.
^(xln^'-e^^^^^ + lna;)-^
□
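5.E.3. Find the derivative of the function $y = x^{\sin x}$, $x > 0$.
Solution. We have
$$\left(x^{\sin x}\right)' = \left(e^{\sin x \ln x}\right)' = e^{\sin x \ln x}\left(\cos x \ln x + \frac{\sin x}{x}\right) = x^{\sin x}\left(\cos x \ln x + \frac{\sin x}{x}\right).$$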
□
5.E.4. For positive x, determine the derivative of the function
f(x) = xl*x O
5.E.5. For $x \in (0, \pi/2)$, calculate the derivative of the function $y = (\sin x)^{\cos x}$. ○
We advise the reader to make up some functions and find their derivatives. The results can be verified in many mathematical software packages. In the following exercise, we look at the geometrical meaning of the derivative at a given point, namely that it determines the slope of the tangent line to the graph of the function at that point (see 5.3.2).
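As an illustration of such a verification, the derivative of $x^{\sin x}$ from 5.E.3 can be compared with a central difference quotient. This is only a minimal sketch in plain Python; the helper names `f`, `f_prime` and `central_difference` and the step size are our own choices, not part of the exercise.

```python
import math

def f(x):
    # y = x^(sin x), defined for x > 0
    return x ** math.sin(x)

def f_prime(x):
    # closed form obtained from x^(sin x) = e^(sin x * ln x)
    return f(x) * (math.cos(x) * math.log(x) + math.sin(x) / x)

def central_difference(func, x, step=1e-6):
    # symmetric difference quotient approximating func'(x)
    return (func(x + step) - func(x - step)) / (2 * step)

for x in (0.5, 1.0, 2.0, 3.0):
    print(x, f_prime(x), central_difference(f, x))
```

The two printed columns should agree to several decimal places; a visible disagreement would indicate an error in the hand computation.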
5.E.6. Using the differential, approximate $\operatorname{arccotg} 1.02$.
Solution. The differential of a function $f$ having a continuous derivative at the point $x_0$ is equal to
$$f'(x_0)\,dx = f'(x_0)(x - x_0).$$
The equation of the tangent to the graph of $f$ at the point $[x_0, f(x_0)]$ is then
$$y - f(x_0) = f'(x_0)(x - x_0).$$
Hence we can see that the differential is the increment along the tangent line. However, the values on the tangent approximate those of $f$, provided that the difference $x - x_0$ is "small". Thus
• a finite limit,
• an infinite limit, supposing the one-sided infinite limits are equal,
• different one-sided infinite limits.
For the case of a finite limit, it is necessary that the point is a common root of both $f$ and $g$ and that its multiplicity in $f$ is at least as large as in $g$. Then the domain of the function can be extended to this point by defining the value there to be the limit. The extended function is then continuous at this point.
The possibilities are illustrated in the diagram showing the values of the function
$$h(x) = \frac{(x - 0.05a)(x - 2 - 0.2a)(x - 5)}{x(x - 2)(x - 4)}$$
for $a = 0$ on the left (thus it is the rational function $(x - 5)/(x - 4)$) and for $a = 5/3$ on the right.
5.2.22. Power functions and the exponential. We have met the simple power functions $x \mapsto x^n$ with natural exponents $n = 0, 1, 2, \dots$ when building the polynomials. The meaning of the function
$$x \mapsto x^{-1},$$
defined for all $x \neq 0$, is also clear.
Now, we extend this definition to a general power function $x^a$ with an arbitrary $a \in \mathbb{R}$.
We use the well known properties of powers and roots. For a negative integer $-a$, we define
$$x^{-a} = (x^a)^{-1} = (x^{-1})^a.$$
Further, we want the equality $b^n = x$ for $n \in \mathbb{N}$ to imply that $b$ is the $n$-th root of $x$, i.e. $b = x^{1/n}$. We verify that such $b$'s always exist for positive real numbers $x$.
By factoring out $y_2 - y_1$ in $y_2^n - y_1^n$, or otherwise, we see that the function $y \mapsto y^n$ is strictly increasing for $y > 0$. Choose $x > 0$ and consider the set $B = \{y \in \mathbb{R},\ y > 0,\ y^n \leq x\}$. This is a non-empty set bounded from above, so set $b = \sup B$. Since a power function with natural exponent $n$ is continuous, $b^n = x$. Indeed, surely $b^n \leq x$. If the inequality is strict, then there is a number $y$ such that $b^n < y^n < x$, which implies that $b < y$. This contradicts the definition of $b$ as a supremum.
Thus the power function is suitably defined for all rational numbers $a = \frac{p}{q}$. For $x > 0$, we set $x^a = (x^p)^{1/q} = (x^{1/q})^p$.
Finally, we notice that for $0 < a \in \mathbb{Q}$ and fixed $x > 1$, the value $x^a$ is strictly increasing in the rational exponent $a$. Therefore, for
we obtain the following formula for approximating the function value by its differential:
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0).$$
So, setting
$$f(x) := \operatorname{arccotg} x, \qquad x_0 := 1,$$
we get
$$\operatorname{arccotg} 1.02 \approx \operatorname{arccotg} 1 + \left(-\frac{1}{1 + 1^2}\right)(1.02 - 1) = \frac{\pi}{4} - 0.01.$$
Finally, let us remark that the point $x_0$ is of course selected so that the expression $x - x_0$ is as close to zero as possible, yet we must be able to calculate the values of $f$ and $f'$ at this point. □
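The quality of this approximation can be examined numerically. The following sketch assumes only the standard `math` module; the helper names `arccot`, `arccot_prime` and `linear_approximation` are ours, and we use the identity $\operatorname{arccotg} x = \arctan(1/x)$, valid for $x > 0$.

```python
import math

def arccot(x):
    # for x > 0, arccot(x) = arctan(1/x)
    return math.atan(1.0 / x)

def arccot_prime(x):
    # derivative of arccot
    return -1.0 / (1.0 + x * x)

def linear_approximation(x, x0):
    # value on the tangent line at x0: f(x0) + f'(x0) (x - x0)
    return arccot(x0) + arccot_prime(x0) * (x - x0)

approx = linear_approximation(1.02, 1.0)
exact = arccot(1.02)
print(approx, exact, abs(approx - exact))
```

The printed difference is of the order of $(x - x_0)^2$, which is what one expects from replacing the graph by its tangent line.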
5.e.7. Using differential, approximate arcsin 0.497. O
5.e.8. Using differential, approximate a = arctan 1.02 and b = </70. O
5.e.9. Using differential, approximate a) sin (fff) b)sin(^).
O
5.E.10. For which $a \in \mathbb{R}$ is the cubic polynomial $P$ which satisfies the conditions $P(0) = 1$, $P'(0) = 1$, $P(1) = 2a + 2$, $P'(1) = 5a + 1$ a monotonic function on the whole $\mathbb{R}$?
Solution. From the conditions $P(0) = 1$ and $P'(0) = 1$ it follows that $P(x) = bx^3 + cx^2 + x + 1$ where $b, c \in \mathbb{R}$; the two remaining conditions determine two equations for the variables $b$ and $c$: $b + c + 2 = 2a + 2$, $3b + 2c + 1 = 5a + 1$, with the unique solution $b = c = a$. The polynomials which satisfy the desired conditions are thus of the form $P(x) = ax^3 + ax^2 + x + 1$, $a \in \mathbb{R}$. The monotonicity of the polynomial is equivalent to having no local extrema. The extrema can occur only at those points where the derivative changes sign. Therefore, the polynomial is monotonic if and only if its derivative keeps its sign on the whole $\mathbb{R}$. The derivative is
$$P'(x) = 3ax^2 + 2ax + 1$$
and it keeps its sign if and only if the discriminant is non-positive. Thus we get the condition
$$4a^2 - 12a \leq 0, \quad\text{i.e.}\quad 4a(a - 3) \leq 0,$$
which is true for $a \in [0, 3]$. However, for $a = 0$ the polynomial $P$ is monotonic, yet not cubic. Thus the set of satisfactory numbers $a$ is the interval $(0, 3]$. □
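The discriminant condition is easy to test for individual values of the parameter. The function names below are hypothetical helpers introduced only for this sketch.

```python
def derivative_keeps_sign(a):
    # P'(x) = 3a x^2 + 2a x + 1 keeps its sign iff its discriminant
    # (2a)^2 - 4 * 3a * 1 = 4a^2 - 12a is non-positive
    return 4 * a * a - 12 * a <= 0

def is_admissible(a):
    # P must be a genuine cubic (a != 0) and monotonic
    return a != 0 and derivative_keeps_sign(a)

for a in (-1, 0, 1, 3, 3.5):
    print(a, is_admissible(a))
```

Only the values inside $(0, 3]$ are reported as admissible, in agreement with the solution.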
general $0 < a \in \mathbb{R}$ and $1 < x$ we can define
$$x^a = \sup\{x^y,\ y \in \mathbb{Q},\ y < a\}.$$
As before, $x^{-a} = \frac{1}{x^a}$.
For $0 < x < 1$, proceed analogously with care for the inequality signs, or set $x^a = \left(\frac{1}{x}\right)^{-a}$. For $x = 1$, define $1^a = 1$ for any $a$, while $0^a = 0$.
Now, the power function $x \mapsto x^a$ is defined for all $x \in [0, \infty)$ and $a \in \mathbb{R}$. There is another view of the construction: For every fixed real number $c > 0$, there is a well-defined function $y \mapsto c^y$ on the whole real line. This function is called the exponential function with base $c$. The definition ensures that the resulting function $c^y$ is continuous in $y$ for fixed $c$ and is continuous in $c$ for fixed $y$.
The properties used when defining the power function and the exponential function $f(y) = c^y$, with $f(0) = 1$, can be summarized in a single equality for any real $x$ and real $y$:
$$f(x + y) = f(x) \cdot f(y) \tag{1}$$
together with the condition of continuity.
Indeed, for $y = 0$ we get $f(0) = 1$, and hence $1 = f(0) = f(x - x) = f(x) \cdot f(-x)$, so that $f(-x) = (f(x))^{-1}$. Furthermore, for a natural number $n$, $f(nx) = (f(x))^n$. Thus $a^x$ is determined for all $a > 0$ and $x \in \mathbb{Q}$. The continuity condition determines the values of the function uniquely at the remaining points, if such a function exists. So we should check that the function constructed above has the property (1). This should be clear from the continuity of the operations of sum and product in both arguments; the details are left to the reader.
Thus, the exponential function satisfies both the well-known formulas
$$a^x \cdot a^y = a^{x+y}, \qquad (a^x)^y = a^{x \cdot y}. \tag{2}$$
5.2.23. Logarithmic functions. The exponential function $f(x) = a^x$ is increasing for $a > 1$ and decreasing for $0 < a < 1$. Thus in both cases, there is a function $f^{-1}(x)$ inverse to it. This function is called the logarithmic function with base $a$, and we write $\ln_a(x)$. Its defining property is $\ln_a(a^x) = x$.
The equalities (2) are thus equivalent to
$$\ln_a(x \cdot y) = \ln_a(x) + \ln_a(y), \qquad \ln_a(x^y) = y \cdot \ln_a(x).$$
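These identities can be observed in floating-point arithmetic. In the sketch below, `log_base` is our own helper built on the natural logarithm from the standard `math` module, and the sample values of $a$, $x$, $y$ are arbitrary.

```python
import math

def log_base(a, x):
    # logarithm with base a, expressed through the natural logarithm
    return math.log(x) / math.log(a)

a, x, y = 3.0, 2.5, 1.7

# defining property: log_a(a^x) = x
print(log_base(a, a ** x), x)
# product rule: log_a(x y) = log_a(x) + log_a(y)
print(log_base(a, x * y), log_base(a, x) + log_base(a, y))
# power rule: log_a(x^y) = y log_a(x)
print(log_base(a, x ** y), y * log_base(a, x))
```

Each printed pair agrees up to rounding errors.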
5.E.11. Determine the parameter $c \in \mathbb{R}$ so that the tangent line to the graph of the function $\frac{\ln(cx)}{\sqrt{x}}$ at the point $[1, 0]$ goes through the point $[2, 2]$.
Solution. From the statement of the problem it follows that the tangent's slope is $\frac{2 - 0}{2 - 1} = 2$. The slope is determined by the derivative at the given point, thus we get the condition
$$\left.\frac{2 - \ln(cx)}{2x\sqrt{x}}\right|_{x=1} = 2, \quad\text{that is}\quad 2 - \ln(c) = 4,$$
hence $c = e^{-2}$. Yet for $c = e^{-2}$, the function $\frac{\ln(cx)}{\sqrt{x}}$ takes the value $-2$ at the point $1$, so its graph does not pass through $[1, 0]$. Therefore, there is no such $c$. □
5.E.12. A highway patrol helicopter is flying 3 kilometers above a highway at the speed of 120 kph. Its pilot locates a car whose straight-line distance from the helicopter is 5 kilometers. The car goes in the opposite direction and approaches the helicopter at 160 kph (with regard to the helicopter). Determine the car's speed with regard to a tin lying on the highway.
Solution. For the sake of simplicity, we omit units of measurement (distances are expressed in kilometers, times in hours, and speeds in kph). The helicopter's position at time $t$ can be expressed by the point $[y(t), 3]$, and the car's position by $[x(t), 0]$. (We choose the axes so that the helicopter and the car are moving along the x-axis.) Let us denote by $s(t)$ the straight-line distance of the car from the helicopter and by $t_0$ the moment mentioned in the problem's statement. Let us calculate the car's speed with respect to the origin. We can suppose that $x(t) > y(t) > 0$. Then $x'(t) < 0$ for the considered time moments $t$, since the car is approaching the point $[0, 0]$ from the right: the value $x(t)$ decreases as $t$ increases. Similarly, we get that $y'(t) > 0$ and also $s'(t) < 0$. Let us add that, for instance, $y'(t)$ is the rate of change of the function $y$ at time $t$, i.e. the helicopter's speed. We know that
$$s(t_0) = 5, \qquad s'(t_0) = -160, \qquad y'(t_0) = 120$$
and that ($s(t)$ is the hypotenuse of the right triangle)
$$(x(t) - y(t))^2 + 3^2 = s^2(t). \tag{1}$$
Hence it follows (as $x(t) > y(t) > 0$) that
$$(x(t_0) - y(t_0))^2 + 3^2 = 5^2, \quad\text{i.e.}\quad x(t_0) - y(t_0) = 4.$$
By differentiating the identity (1), we get
$$2(x(t) - y(t))(x'(t) - y'(t)) = 2s(t)s'(t)$$
and then for $t = t_0$,
Logarithmic functions are defined only for positive arguments and they are, on the entire domain, increasing if $a > 1$ and decreasing for $0 < a < 1$. Moreover, $\ln_a(1) = 0$ holds for every $a$.
There is an extremely important value of $a$, namely the number $e$ (see paragraph 5.4.1), sometimes known as Euler's number. The function $\ln_e(x)$ is called the natural logarithm and is denoted by $\ln(x)$ (i.e. omitting the base $e$).
3. Derivatives
When talking about polynomials, the rate at which the function changes at a given point of its domain was already discussed (see paragraph 5.1.6). It is the quotient 5.1.6(1), which expresses the slope of the secant line between the points $[x, f(x)]$ and $[x + \delta x, f(x + \delta x)] \in \mathbb{R}^2$ for a (small) change $\delta x$ of the input variable. This reasoning is correct for any real or complex function $f$. It is only necessary to work with the concept of the limit, instead of the intuitive "small change" $\delta x$.
Derivative of a function of a real variable
5.3.1. Definition. Let $f$ be a real or complex function defined on an interval $A \subset \mathbb{R}$ and $x_0 \in A$. If the limit
$$\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = a$$
exists, the function $f$ is said to be differentiable at $x_0$, provided $a$ is finite. The value of the derivative at $x_0$, namely $a$, is denoted by $f'(x_0)$ or $\frac{df}{dx}(x_0)$ or $\frac{d}{dx}f(x_0)$.
If a is finite, the derivative is also sometimes called proper. If a is infinite, it is improper.
If x0 is one of the boundary points of A, we arrive at one-sided derivatives (i.e. left-sided derivative and right-sided derivative).
If a function has a derivative at $x_0$, the function is said to be differentiable at $x_0$. A function which is differentiable at every point of a given interval is said to be differentiable on the interval.
Obviously, the derivative of a complex valued function $f(x) + i\,g(x)$ exists if and only if the derivatives of both real
$$2 \cdot 4\,(x'(t_0) - 120) = 2 \cdot 5 \cdot (-160), \quad\text{i.e.}\quad x'(t_0) = -80.$$
We have calculated that the car is approaching the tin at 80 kph; it suffices to recall the units of measurement we worked with. The negative value is caused by our choice of the coordinate system. □
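The arithmetic of this related-rates computation can be retraced in a few lines. This is merely a sketch; the variable names are ours and the units are kilometers and hours, as in the solution.

```python
import math

height = 3.0          # altitude of the helicopter
s0 = 5.0              # straight-line distance s(t0)
s_rate = -160.0       # s'(t0), the distance is decreasing
heli_speed = 120.0    # y'(t0)

# horizontal gap x(t0) - y(t0) from (x - y)^2 + 3^2 = s^2
gap = math.sqrt(s0 ** 2 - height ** 2)

# differentiating the identity gives gap * (x' - y') = s * s'
car_velocity = heli_speed + s0 * s_rate / gap
print(gap, car_velocity)
```

The sign of `car_velocity` reflects the orientation of the x-axis; its absolute value is the speed of the car with respect to the road.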
5.E.13. A 13 feet long ladder is leaned against a house. Suddenly the base of the ladder slips off and the ladder begins to go down (still touching the house at its other end). When the base of the ladder is 12 feet from the house, it is moving at 5 feet per second from it. At this moment:
(a) What is the speed of the top of the ladder?
(b) What is the rate of change of the area of the triangle delimited by the house, the ladder, and ground?
(c) What is the rate of change of the angle enclosed by the ladder and the ground?
o
5.E.14. Determine whether there is a point in the interval (0,4) such that the tangent line at this point to the polynomial x (x — 4)5 is parallel to the x-axis. O
5.E.15. Let $p \in (0, +\infty)$. Write the equation of the tangent to the parabola $2py = x^2$ at a general point $[x_0, ?]$. ○
5.E.16. Find the equation of the normal line to the graph of the function y = 1 — e"?, x G R at the point where the graph intersects the x-axis. O
5.E.17. Find the equations of the tangent and normal lines to the curve y=(x + l) ^3 - x, x G R, at the point [-1, 0].
o
5.E.18. Let the function y = H^3^*2-*); x e (I,+oo) be given. Determine the equations of the tangent and normal lines to the graph of this function at the point [1, ?]. O
5.E.19. At which points is the tangent to the parabola $y = 2 + x - x^2$, $x \in \mathbb{R}$, parallel to the x-axis? ○
5.E.20. Determine the equations of the tangent line t and the normal line n to the graph of the function
y = Vx2 -3x + 11, xeR
at the point [2, ?]. Further, determine all points at which the tangent is parallel to the x-axis. O
5.E.21. What is the angle between the x-axis and the graph of the function $y = \ln x$? (We mean the angle between the
and imaginary parts $f$ and $g$ exist (see the elementary properties of limits). Then
$$(f(x) + i\,g(x))' = f'(x) + i\,g'(x).$$
Derivatives are handled rather easily, but it takes time to derive the proper formulae for derivatives of the elementary functions in the zoo. Therefore, we present a table of derivatives of several such functions in advance. In the last column, there are references to the corresponding paragraph where the result is proved. Notice that even though we are unable to express inverse functions to some of our functions by elementary means, we can calculate their derivatives; see 5.3.6.
Derivatives of some functions
function	domain	derivative	proved in
polynomials $f(x)$	whole $\mathbb{R}$	$f'(x)$ is again a polynomial	5.1.6
cubic splines $h(x)$	whole $\mathbb{R}$	only the first derivative $h'(x)$ is continuous	5.1.9
rational functions $f(x)/g(x)$	whole $\mathbb{R}$, except for roots of $g$	rational functions: $\frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}$	5.3.5
power functions $f(x) = x^a$	interval $(0, \infty)$	$f'(x) = a x^{a-1}$	5.3.7
exponential functions $f(x) = a^x$, $a > 0$, $a \neq 1$	whole $\mathbb{R}$	$f'(x) = \ln(a) \cdot a^x$	5.3.7
logarithmic function $f(x) = \ln_a(x)$, $a > 0$, $a \neq 1$	interval $(0, \infty)$	$f'(x) = (\ln(a))^{-1} \cdot \frac{1}{x}$	5.3.7
The initial idea of the definition suggests that $f'(x_0)$ allows an approximation to the function $f$ by the straight line
$$y = f(x_0) + f'(x_0)(x - x_0).$$
This is the meaning of the following lemma, which says that replacing the constant coefficient $f'(x_0)$ in the line's equation with a certain continuous function gives exactly the values of $f$. The difference between $\psi(x)$ and $\psi(x_0)$ on a neighbourhood of $x_0$ then says how much the slopes of the secant lines and the tangent line at $x_0$ differ.
Lemma. A real or complex function $f(x)$ has a finite derivative at $x_0$ if and only if there is a neighbourhood $O$ of $x_0$ and a function $\psi$ defined on this neighbourhood which is continuous at $x_0$ and such that for all $x \in O$,
$$f(x) = f(x_0) + \psi(x)(x - x_0).$$
Furthermore, $\psi(x_0) = f'(x_0)$, and $f$ is continuous at the point $x_0$.
tangent line and the positive x-axis in the "positive sense of rotation", i.e. counterclockwise.) ○
5.E.22. Determine the equations of the tangent and normal line to the curve given by the equation $x^3 + y^3 - 2xy = 0$ at the point $[1, 1]$. ○
Proof. If $\psi$ exists, it is of the form
5.E.23. Prove: ^ < In (1 + x) < x for all x > 0.
F. Extremal problems
o
The simple observation 5.3.2 about the geometrical meaning of the derivative also tells us that a differentiable real-valued function of a real variable can have extremes only at the points where its derivative is zero. We can utilize this mere fact when solving miscellaneous practical problems.
5.F.1. Consider the parabola $y = x^2$. Determine the x-coordinate $x_A$ of the parabola's point which is nearest to the point $A = [1, 2]$.
Solution. It is not difficult to realize that there is a unique solution to this problem and that we are actually looking for the absolute minimum of the function
$$f(x) = \sqrt{(x - 1)^2 + (x^2 - 2)^2}, \qquad x \in \mathbb{R}.$$
Since taking the square root is an increasing function, the function $f$ takes its least value at the same point where the function
$$g(x) = (x - 1)^2 + (x^2 - 2)^2, \qquad x \in \mathbb{R},$$
does. Since
$$g'(x) = 4x^3 - 6x - 2, \qquad x \in \mathbb{R},$$
by solving the equation $0 = 2x^3 - 3x - 1$, we first get the stationary point $x = -1$, and after dividing the polynomial $2x^3 - 3x - 1$ by the polynomial $x + 1$, we then obtain the remaining two stationary points
$$\frac{1 - \sqrt{3}}{2} \quad\text{and}\quad \frac{1 + \sqrt{3}}{2}.$$
As the function $g$ is a polynomial (differentiable on the whole domain), from the geometrical sense of the problem, we get
$$x_A = \frac{1 + \sqrt{3}}{2}.$$
□
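Instead of the geometric argument, the three stationary points can also be compared directly. The sketch below evaluates the squared distance at each of them; the names `g`, `g_prime`, `stationary_points` and `nearest` are our own.

```python
import math

def g(x):
    # squared distance from [x, x^2] to the point [1, 2]
    return (x - 1) ** 2 + (x * x - 2) ** 2

def g_prime(x):
    return 4 * x ** 3 - 6 * x - 2

stationary_points = [-1.0, (1 - math.sqrt(3)) / 2, (1 + math.sqrt(3)) / 2]

for x in stationary_points:
    print(x, g_prime(x), g(x))

# the stationary point with the smallest squared distance
nearest = min(stationary_points, key=g)
print(nearest)
```

Since $g$ grows without bound as $|x| \to \infty$, its absolute minimum must be attained at one of the stationary points, so comparing the three values suffices.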
5.F.2. Consider an isosceles triangle with base length b and height (above the base) h. Inscribe the rectangle having the greatest possible area into it (one of the rectangle's sides will lie on the triangle's base). Determine the area S of this rectangle.
$$\psi(x) = \frac{f(x) - f(x_0)}{x - x_0}$$
for all $x \in O \setminus \{x_0\}$.
Suppose $f'(x_0)$ is the proper derivative. Then we can define the value at the point $x_0$ as $\psi(x_0) = f'(x_0)$. Certainly,
$$\lim_{x \to x_0} \psi(x) = f'(x_0) = \psi(x_0).$$
Thus $\psi$ is continuous at $x_0$ as desired.
On the other hand, if such a function $\psi$ exists, the same procedure calculates its limit at $x_0$. Thus the derivative $f'(x_0)$ exists as well and equals $\psi(x_0)$.
The function $f$ is expressed in terms of the sum and product of functions continuous at $x_0$. Thus $f$ is continuous at $x_0$. □
5.3.2. Geometrical meaning of the derivative. The previous lemma leads to a geometric interpretation of the derivative in terms of the slope of the secant lines of the graph of $f$ through $[x_0, f(x_0)]$. The derivative exists if and only if the slope of the secant line through the points $[x_0, f(x_0)]$ and $[x, f(x)]$ changes continuously when the argument $x$ approaches $x_0$. If so, the limit of this slope is the value of the derivative. This observation leads to the important corollary:
Functions increasing and decreasing at a point
Corollary. If a real-valued function $f$ has derivative $f'(x_0) > 0$ at a point $x_0 \in \mathbb{R}$, then there is a neighbourhood $O(x_0)$ such that $f(x) > f(x_0)$ for all points $x \in O(x_0)$, $x > x_0$, and $f(x) < f(x_0)$ holds for all $x \in O(x_0)$, $x < x_0$.
On the other hand, if the derivative satisfies $f'(x_0) < 0$, then there is a neighbourhood $O(x_0)$ such that $f(x) < f(x_0)$ for all points $x \in O(x_0)$, $x > x_0$, and $f(x) > f(x_0)$ for all $x \in O(x_0)$, $x < x_0$.
Proof. Suppose $f'(x_0) > 0$. By the previous lemma, $f(x) = f(x_0) + \psi(x)(x - x_0)$ and $\psi(x_0) = f'(x_0) > 0$. Since $\psi$ is continuous at $x_0$, there exists a neighbourhood $O(x_0)$ on which $\psi(x) > 0$. Hence for $x \in O(x_0)$ with $x > x_0$ we get $f(x) > f(x_0)$, and analogously $f(x) < f(x_0)$ for $x < x_0$. The case with a negative derivative is proved similarly. □
A function is called increasing at a point $x_0$ of its domain if, for all points $x$ of some neighbourhood of $x_0$, $f(x) > f(x_0)$ if $x > x_0$ and $f(x) < f(x_0)$ if $x < x_0$. A function is increasing on an interval $A$ if $f(x) - f(y) > 0$ for all $x > y$, $x, y \in A$.
Similarly, a function is said to be decreasing at a point $x_0$ if and only if there is a neighbourhood of the point $x_0$ such that $f(x) < f(x_0)$ for all $x > x_0$, while $f(x) > f(x_0)$ for all $x < x_0$ from this neighbourhood. A function is decreasing on an interval $A$ if $f(x) - f(y) < 0$ for all $x > y$, $x, y \in A$.
Solution. To solve this problem, it suffices to consider the problem of inscribing the largest rectangle into a right triangle with legs of lengths $b/2$ and $h$ so that two of the rectangle's sides lie on the legs of the triangle. Thus we reduce the problem to maximizing the function
$$f(x) = x\left(h - \frac{2hx}{b}\right)$$
on the interval $I = [0, b/2]$. Since we have that
$$f'(x) = h - \frac{4hx}{b} \quad\text{for all } x \in I$$
and further
$$f(0) = f\left(\tfrac{b}{2}\right) = 0, \qquad f(x) \geq 0, \quad x \in I,$$
the function $f$ must take the greatest value on $I$ at its only stationary point $x_0 = b/4$. Thus the sides of the wanted rectangle are $b/2$ long (twice $x_0$, considering the original problem) and $h/2$ (which can be obtained by substituting $b/4$ for $x$ into the expression $h - 2hx/b$). Hence we get $S = hb/4$. □
5.F.3. Among rectangles such that two of their vertices lie on the x-axis and the other two have positive y-coordinates and lie on the parabola $y = 8 - 2x^2$, find the one which has the greatest area.
Solution. The base of the largest rectangle is $4/\sqrt{3}$ long, the rectangle's height is then $16/3$. This result can be obtained by finding the absolute maximum of the function
$$S(x) = 2x(8 - 2x^2)$$
on the interval $I = [0, 2]$. Since this function is non-negative on $I$, takes zero at the boundary points of $I$, is differentiable on the whole of $I$ and its derivative is zero at a unique point of $I$, namely $x = 2/\sqrt{3}$, it has the maximum there. □
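The location of the maximum can be confirmed by a crude search over a grid of points in $I$. This is only a sanity check, not a substitute for the argument above; the grid size `steps` is an arbitrary choice of ours.

```python
import math

def area(x):
    # rectangle with base 2x and height 8 - 2x^2
    return 2 * x * (8 - 2 * x * x)

steps = 20000
grid = [2 * k / steps for k in range(steps + 1)]   # points of I = [0, 2]
best_x = max(grid, key=area)

print(best_x, 2 / math.sqrt(3))
print(2 * best_x, 8 - 2 * best_x ** 2)   # base and height of the rectangle
```

The grid maximum lies within one grid step of $2/\sqrt{3}$, and the printed base and height are close to $4/\sqrt{3}$ and $16/3$.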
5.F.4. Let the ellipse $3x^2 + y^2 = 2$ be given. Write the equation of its tangent line which forms the smallest triangle possible in the first quadrant and determine the triangle's area.
Solution. The line corresponding to the equation $ax + by + c = 0$ intersects the axes at the points $[-\frac{c}{a}, 0]$, $[0, -\frac{c}{b}]$, and the area of the triangle whose vertices are these two points and the origin is $S = \frac{c^2}{2ab}$. The line which touches the ellipse at $[x_T, y_T]$ has the equation $3x\,x_T + y\,y_T - 2 = 0$. The area of the triangle corresponding to it is thus $S = \frac{2}{3x_T y_T}$. Further, in the first quadrant, we have that $x_T, y_T > 0$. To minimize this area means to maximize the product $x_T y_T = x_T\sqrt{2 - 3x_T^2}$, which is (in the first quadrant) the same as to maximize $(x_T y_T)^2 = x_T^2(2 - 3x_T^2) = -3\left(x_T^2 - \frac{1}{3}\right)^2 + \frac{1}{3}$. Hence, the wanted minimum is at $x_T = \frac{\sqrt{3}}{3}$. The tangent's equation is $\sqrt{3}\,x + y = 2$ and the triangle's area is $S_{\min} = \frac{2\sqrt{3}}{3}$. □
Thus a function having a non-zero finite derivative at a point is either increasing or decreasing at that point, according to the sign of the derivative.
A function increasing on an interval is increasing at each of its points. The converse is true as well. In order to see this, assume that $f$ is increasing at all points of the interval $A$. Consider two points $x < y$ in $A$ with $f(y) < f(x)$. By the assumption, there is a $\delta$-neighbourhood of $y$ on which $z < y$ implies $f(z) < f(y)$. Let $\delta_0$ be the infimum of all such $\delta < y - x$ and $w = y - \delta_0$. Then $f(w)$ cannot be larger than $f(y)$ (there would be such a point on the right of it too, which is excluded). But, unless $w = x$, $w$ is a limit point of a sequence of points less than $w$, for which the value of $f$ is larger than $f(y) \geq f(w)$. This contradicts $f$ being increasing at $w$. But if $w$ were $x$, then we would have $f(x) > f(z)$ for points $z > x$ arbitrarily close to $x$, contradicting the assumption that $f$ is increasing at $x$.
The same arguments work for decreasing functions. The following is now proved:
Proposition. A function is increasing or decreasing on an open interval $A$ if and only if it is increasing or decreasing at each of its points, respectively.
5.3.3. Examples. There is a function which is increasing at the origin $x_0 = 0$ but is neither increasing nor decreasing on any neighbourhood of $x_0$. Consider the (continuous) function
$$f(x) = x + 5x^2 \sin(1/x), \qquad f(0) = 0.$$
The choice $f(0) = 0$ makes $f$ a continuous function on $\mathbb{R}$ (sin is a bounded function with values between $1$ and $-1$). Its derivative at zero exists too:
$$\lim_{x \to 0} \frac{x + 5x^2 \sin(1/x)}{x} = \lim_{x \to 0} \left(1 + 5x \sin(1/x)\right) = 1.$$
For $x \neq 0$,
$$f'(x) = 1 + 10x \sin(1/x) - 5\cos(1/x)$$
(cf. the rules for computing derivatives in 5.3.4 below). The derivative is not continuous at the origin.
The function $f$ is increasing at $x = 0$ but is not increasing on any neighbourhood of this point. The one-sided limits of the derivative $f'$ do not exist at zero either.
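The oscillation of the derivative near the origin can be made visible by evaluating $f'$ along two sequences tending to zero. The sample points below are our own choice: at $x = 1/(2\pi k)$ we have $\cos(1/x) = 1$, and at $x = 1/((2k + 1)\pi)$ we have $\cos(1/x) = -1$.

```python
import math

def f_prime(x):
    # derivative of x + 5 x^2 sin(1/x), valid for x != 0
    return 1 + 10 * x * math.sin(1 / x) - 5 * math.cos(1 / x)

for k in (1, 10, 100, 1000):
    x_down = 1 / (2 * math.pi * k)         # cos(1/x) = 1, derivative near -4
    x_up = 1 / ((2 * k + 1) * math.pi)     # cos(1/x) = -1, derivative near 6
    print(x_down, f_prime(x_down), x_up, f_prime(x_up))
```

Arbitrarily close to zero there are points with negative derivative as well as points with positive derivative, so $f$ is monotonic on no neighbourhood of the origin.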
As another illustration of a simple usage of the relation between the derivatives and the properties of being an increasing (or decreasing) function, we can consider the existence of inverses to polynomials.
5.F.5. At the time $t = 0$, the three points $P$, $Q$, $R$ began moving in the plane as follows: The point $P$ is moving from the point $[-2, 1]$ in the direction $(3, 1)$ at the constant speed $\sqrt{10}$ m/s, the point $Q$ is moving from $[0, 0]$ in the direction $(-1, 1)$ with the constant acceleration $2\sqrt{2}$ m/s² (beginning at zero speed), and the point $R$ is going from $[0, 1]$ in the direction $(1, 0)$ at the constant speed 2 m/s. At which time will the area of the triangle $PQR$ be minimal?
Solution. The positions of the points $P$, $Q$, $R$ at time $t$ are
$$P: [-2, 1] + (3, 1)t, \qquad Q: [0, 0] + (-1, 1)t^2, \qquad R: [0, 1] + (2, 0)t.$$
The area of the triangle $PQR$ is determined, for instance, by half the absolute value of the determinant whose rows are the coordinates of two edge vectors of the triangle, here $P - R$ and $Q - R$ (see 1.5.11). So we minimize the determinant:
$$\begin{vmatrix} -2 + t & t \\ -t^2 - 2t & -1 + t^2 \end{vmatrix} = 2t^3 - t + 2.$$
The derivative is $6t^2 - 1$, so the extrema occur at $t = \pm\frac{1}{\sqrt{6}}$. Since we consider non-negative time only, we are interested in $t = \frac{1}{\sqrt{6}}$. The second derivative of the considered function is positive at this point, thus the function has a local minimum there. Further, its value at this point is positive and less than the value at the point $0$ (the boundary point of the interval where we are looking for the extremum), so this point is the wanted global minimum. □
5.F.6. At 9 o'clock in the morning, the old wolf left his den $D$ and, as a part of his everyday warm-up, he began running counterclockwise around his favorite stump $S$ at the constant speed 4 kph (not very quick, is he), keeping the constant distance of 1 km from it. At the same time, Little Red Riding Hood set out from her house $H$ straight to her Grandma's cottage $C$ at the constant speed 4 kph. When will they be closest to each other and what will their distance be at that time? The coordinates (in kilometers) are: $D = [2, 3]$, $S = [2, 2]$, $H = [0, 0]$, $C = [5, 5]$.
Solution. The wolf is moving along a unit circle, so his angular speed equals his absolute speed and his position in time can be described by the following parametric equations:
$$x(t) = 2 - \cos(4t), \qquad y(t) = 2 - \sin(4t),$$
Polynomials of degree at least two need not be either increasing or decreasing functions. Hence we cannot expect a globally defined inverse function to exist. On the other hand, the inverse exists for every restriction of $f$ to an interval between adjacent roots of the derivative $f'$, i.e. where the derivative of the polynomial is non-zero and keeps its sign. These inverse functions will never be polynomials, except for the case of polynomials of degree one. The equation
$$y = ax + b$$
implies
$$x = \frac{1}{a}(y - b).$$
For a polynomial of degree two, the equation
$$y = ax^2 + bx + c$$
leads to the equation
$$x = \frac{-b \pm \sqrt{b^2 - 4a(c - y)}}{2a}.$$
Thus the inverse (given by the above equation) exists only for those $x$ which are in either of the intervals $\left(-\infty, -\frac{b}{2a}\right)$, $\left(-\frac{b}{2a}, \infty\right)$.
It can be shown that the roots of polynomials of degree larger than four cannot in general be expressed by means of power functions. Thus piece-wise defined inverses to polynomials may represent new items in the zoo.
5.3.4. Elementary properties of derivatives. We introduce several basic facts about the calculation of derivatives. We shall see that the derivatives are quite nicely compatible with the algebraic operations of addition and multiplication of real or complex functions. The last formula then allows us to efficiently determine the derivatives of composite functions. It is also called the chain rule.
Intuitively, they can be understood very easily if we imagine that the derivative of a function $y = f(x)$ is the quotient of the rates of increase of the output variable $y$ and the input variable $x$:
$$f' = \frac{\delta y}{\delta x}.$$
Of course, for $y = h(x) = f(x) + g(x)$, the increase in $y$ is given by the sum of the increases of $f$ and $g$, and the increase of the input variable is still the same. Therefore, the derivative of a sum is the sum of the derivatives.
The derivative of a product is not the product of the derivatives. For $y = f(x)g(x)$, the increase is
$$\delta y = f(x + \delta x)g(x + \delta x) - f(x)g(x) = f(x + \delta x)\big(g(x + \delta x) - g(x)\big) + \big(f(x + \delta x) - f(x)\big)g(x).$$
Little Red Riding Hood is then moving along the line
$$x(t) = 2\sqrt{2}\,t, \qquad y(t) = 2\sqrt{2}\,t.$$
Let us find the extrema of the (squared) distance $p$ of their paths in time:
$$p(t) = [2 - \cos(4t) - 2\sqrt{2}\,t]^2 + [2 - \sin(4t) - 2\sqrt{2}\,t]^2,$$
$$p'(t) = 16(\cos(4t) - \sin(4t))(\sqrt{2}\,t - 1) + 32t + 4\sqrt{2}(\cos(4t) + \sin(4t)) - 16\sqrt{2}.$$
It is impossible to solve the equation $p'(t) = 0$ algebraically; we can only find the solution numerically (using some computational software). Apparently, there will be infinitely many extrema: every round, the wolf's direction is at some moment parallel to that of Little Red Riding Hood, so their distance is decreasing for some period; however, Little Red Riding Hood is moving away from the wolf's favorite stump around which he is moving. We find out that the first local minimum occurs at $t = 0.31$, and then at $t = 0.97$, when the distance of our heroes will be approximately 5 meters. Clearly this is the global minimum as well.
The situation when we cannot solve a given problem explicitly is quite common in practice and the use of numerical methods is of great importance. □
5.F.7. Halley's problem, 1686. A basketball player is standing in front of a basket, at distance $l$ from its rim, which is at height $h$ from the throwing point. Determine the minimal initial speed $v_0$ which the player must give to the ball in order to score, and the angle $\varphi$ corresponding to this $v_0$. See the picture.
Solution. Once again, we omit units of measurement: we can assume that distances are given in meters and times in seconds (and speeds in meters per second, then). Suppose the player throws the ball at time $t = 0$ and it goes through the rim at time $t_0 > 0$. We express the ball's position (while flying) by the points $[x(t), y(t)]$ for $t \in [0, t_0]$, and we require that $x(0) = 0$, $y(0) = 0$, $x(t_0) = l$, $y(t_0) = h$. Apparently,
$$x'(t) = v_0 \cos\varphi, \qquad y'(t) = v_0 \sin\varphi - gt$$
for $t \in (0, t_0)$, where $g$ is the gravitational acceleration of the Earth, since the values $x'(t)$ and $y'(t)$ are, respectively, the horizontal and vertical speed of the ball. By integrating these equations, we get
$$x(t) = v_0 t \cos\varphi + c_1, \qquad y(t) = v_0 t \sin\varphi - \tfrac{1}{2} g t^2 + c_2$$
for $t \in (0, t_0)$ and $c_1, c_2 \in \mathbb{R}$. From the initial conditions
Now, if we make the increase $\delta x$ small, we actually calculate the limit of a sum of products, which is the sum of the products of the limits. Thus the derivative of a product $fg$ is given by the expression $fg' + f'g$, which is called the Leibniz rule10.
The derivative of a composite function is even more interesting: Consider a function
$$g = h \circ f,$$
where the domain of the function $z = h(y)$ contains the codomain of the function $y = f(x)$. By writing out the increases,
$$g' = \frac{\delta z}{\delta x} = \frac{\delta z}{\delta y}\,\frac{\delta y}{\delta x}.$$
Thus we expect that the formula will be of the form
$$(h \circ f)'(x) = h'(f(x))\,f'(x).$$
Now we provide correct formulations together with proofs:
Rules for differentiation
Theorem. Let $f$ and $g$ be real or complex functions defined on a neighbourhood of a point $x_0 \in \mathbb{R}$ and having finite derivatives at this point. Then
(1) for every real or complex number $c$, the function $x \mapsto c \cdot f(x)$ has a derivative at $x_0$ and
$$(cf)'(x_0) = c\,(f'(x_0)),$$
(2) the function $f + g$ has a derivative at $x_0$, and
$$(f + g)'(x_0) = f'(x_0) + g'(x_0),$$
(3) the function $f \cdot g$ has a derivative at $x_0$, and
$$(f \cdot g)'(x_0) = f'(x_0) g(x_0) + f(x_0) g'(x_0).$$
(4) Further, suppose $h$ is a function defined on a neighbourhood of the image $y_0 = f(x_0)$ with a derivative at $y_0$. Then the composite function $h \circ f$ also has a derivative at $x_0$, and
$$(h \circ f)'(x_0) = h'(f(x_0)) \cdot f'(x_0).$$
Proof. (1) and (2): A straightforward application of the theorem about sums and products of function limits yields the result.
(3) Rewrite the quotient of the increases (already mentioned) in the following way:
$$\frac{(fg)(x) - (fg)(x_0)}{x - x_0} = f(x)\,\frac{g(x) - g(x_0)}{x - x_0} + \frac{f(x) - f(x_0)}{x - x_0}\, g(x_0).$$
The limit of this as $x \to x_0$ gives the desired result because $f$ is continuous at $x_0$.
(4) By lemma 5.3.1, there are functions $\psi$ and $\varphi$ which are continuous at $x_0$ and $y_0 = f(x_0)$, respectively. Further, they satisfy
$$h(y) = h(y_0) + \varphi(y)(y - y_0), \qquad f(x) = f(x_0) + \psi(x)(x - x_0)$$
10Gottfried Wilhelm von Leibniz (1646-1716) was a great German mathematician and philosopher. He developed the differential and integral calculus in terms of the infinitesimal quantities, arguing similarly as above.
CHAPTER 5. ESTABLISHING THE ZOO
$$\lim_{t \to 0+} x(t) = x(0) = 0, \qquad \lim_{t \to 0+} y(t) = y(0) = 0,$$
it follows that $c_1 = c_2 = 0$. Substituting the remaining conditions
$$\lim_{t \to t_0-} x(t) = x(t_0) = l, \qquad \lim_{t \to t_0-} y(t) = y(t_0) = h$$
then gives
$$l = v_0 t_0 \cos\varphi, \qquad h = v_0 t_0 \sin\varphi - \tfrac{1}{2} g t_0^2.$$
According to the first equation, we have that
$$(1) \qquad t_0 = \frac{l}{v_0 \cos\varphi},$$
and thus we get only one equation
$$(2) \qquad h = l \tan\varphi - \frac{g l^2}{2 v_0^2 \cos^2\varphi},$$
where $v_0 \in (0, +\infty)$, $\varphi \in (0, \pi/2)$.
Recall that our task is to determine the minimal $v_0$ and the corresponding $\varphi$ which satisfy this equation. More precisely, we want to find the minimal value of $v_0$ for which there is an angle $\varphi$ satisfying (2). Since
$$\frac{1}{\cos^2\varphi} = \frac{\cos^2\varphi + \sin^2\varphi}{\cos^2\varphi} = 1 + \tan^2\varphi, \qquad \varphi \in \left(0, \tfrac{\pi}{2}\right),$$
the equation (2) can be written in the form
$$h - l \tan\varphi + \frac{g l^2}{2 v_0^2}\left(1 + \tan^2\varphi\right) = 0,$$
on some neighbourhoods of $x_0$ and $y_0$. They satisfy $\psi(x_0) = f'(x_0)$ and $\varphi(y_0) = h'(y_0)$. Then,
$$h(f(x)) - h(f(x_0)) = \varphi(f(x))\,(f(x) - f(x_0)) = \varphi(f(x))\,\psi(x)\,(x - x_0)$$
for all $x$ from the neighbourhood of $x_0$. However, the product $\varphi(f(x))\psi(x)$ is a function which is continuous at $x_0$, and its value at $x_0$ is just the desired derivative of the composite function, again by lemma 5.3.1. □
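The Leibniz rule and the chain rule can be checked numerically. The following is only a sketch: the functions `f`, `g`, `h`, the point `x0` and the helper `derivative` (a central difference quotient) are illustrative assumptions, not part of the text.

```python
import math


def derivative(func, x, step=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + step) - func(x - step)) / (2 * step)


# Illustrative choices (assumptions): f = sin, g = exp, h(y) = y^3.
f = math.sin
g = math.exp
h = lambda y: y ** 3
x0 = 0.7

f_prime = math.cos(x0)   # exact derivative of sin at x0
g_prime = math.exp(x0)   # exact derivative of exp at x0

# Leibniz rule: (f g)'(x0) = f'(x0) g(x0) + f(x0) g'(x0)
product_numeric = derivative(lambda x: f(x) * g(x), x0)
product_formula = f_prime * g(x0) + f(x0) * g_prime

# Chain rule: (h o f)'(x0) = h'(f(x0)) f'(x0), with h'(y) = 3 y^2
chain_numeric = derivative(lambda x: h(f(x)), x0)
chain_formula = 3 * f(x0) ** 2 * f_prime

print(product_numeric, product_formula)
print(chain_numeric, chain_formula)
```

The two numbers printed on each line should agree up to the error of the difference quotient.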
5.3.5. Derivative of quotients. Consider first the special case of $h(x) = x^{-1}$. From the definition of the derivative,
$$h'(x) = \lim_{\delta x \to 0} \frac{\frac{1}{x + \delta x} - \frac{1}{x}}{\delta x} = \lim_{\delta x \to 0} \frac{x - x - \delta x}{\delta x\,(x^2 + x\,\delta x)} = \lim_{\delta x \to 0} \frac{-1}{x^2 + x\,\delta x} = -x^{-2}.$$
Thus, the above leads to:
Derivative of a quotient
Corollary. Let $f$ and $g$ be real-valued functions which have finite derivatives at a point $x_0$ and $g(x_0) \neq 0$. Then the function $h(x) = f(x)(g(x))^{-1}$ satisfies
$$h'(x_0) = \left(\frac{f}{g}\right)'(x_0) = \frac{f'(x_0) g(x_0) - f(x_0) g'(x_0)}{(g(x_0))^2}.$$
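i.e.
$$\tan^2\varphi - \frac{2 v_0^2}{g l} \tan\varphi + \frac{2 h v_0^2}{g l^2} + 1 = 0.$$
From the last equation (a quadratic equation in $p = \tan\varphi$), it follows that
$$\tan\varphi = \frac{1}{2}\left[\frac{2 v_0^2}{g l} \pm \sqrt{\frac{4 v_0^4}{g^2 l^2} - 4\left(\frac{2 h v_0^2}{g l^2} + 1\right)}\,\right],$$
i.e.
$$(3) \qquad \tan\varphi = \frac{v_0^2}{g l} \pm \frac{\sqrt{v_0^4 - 2 g h v_0^2 - g^2 l^2}}{g l}.$$
Therefore, the angle $\varphi$ satisfying (2) exists if and only if
$$v_0^4 - 2 g h v_0^2 - g^2 l^2 \geq 0.$$
Once again, a suitable substitution (this time $q = v_0^2$) allows us to reduce the left side to a quadratic expression and subsequently to get
$$\left(v_0^2 - g\left[h + \sqrt{h^2 + l^2}\right]\right)\left(v_0^2 - g\left[h - \sqrt{h^2 + l^2}\right]\right) \geq 0.$$
As $h < \sqrt{h^2 + l^2}$, it must be that
$$v_0^2 \geq g\left[h + \sqrt{h^2 + l^2}\right], \qquad \text{i.e.} \qquad v_0 \geq \sqrt{g\left[h + \sqrt{h^2 + l^2}\right]}.$$
The least value
$$(4) \qquad v_0 = \sqrt{g\left[h + \sqrt{h^2 + l^2}\right]}$$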
Proof. Using the formula $(x^{-1})' = -x^{-2}$, the chain rule says
$$(g^{-1})' = -g^{-2} \cdot g'.$$
The Leibniz rule implies
$$(f/g)' = (f \cdot g^{-1})' = f' g^{-1} - f g^{-2} g' = \frac{f'g - g'f}{g^2}.$$
□
5.3.6. Derivatives of inverse functions. In paragraph 1.6.1, while talking about relations and mappings in general, the concept of an inverse function was introduced. If the inverse function $f^{-1}$ to a given function $f : \mathbb{R} \to \mathbb{R}$ exists (do not confuse this notation with the function $x \mapsto (f(x))^{-1}$), then it is uniquely determined by either of the following identities
$$f^{-1} \circ f = \mathrm{id}_{\mathbb{R}}, \qquad f \circ f^{-1} = \mathrm{id}_{\mathbb{R}}.$$
Then the other identity is also true. If $f$ is defined on a set $A$ and $f(A) = B$, the existence of $f^{-1}$ is conditioned by the same statements with identity mappings $\mathrm{id}_A$ and $\mathrm{id}_B$, respectively, on the right-hand sides. As seen in the diagram, the graph of the inverse function is obtained simply by interchanging the axes of the input and output variables.
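is then matched (see (3)) by
$$(5) \qquad \tan\varphi = \frac{v_0^2}{g l} = \frac{h + \sqrt{h^2 + l^2}}{l}, \qquad \text{i.e.} \qquad \varphi = \operatorname{arctg} \frac{h + \sqrt{h^2 + l^2}}{l}.$$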
The previous calculation was based upon the conditions $x(t_0) = l$, $y(t_0) = h$ only. However, these only describe the position of the ball at the time $t_0$, and the ball could get through the rim from below. Therefore, let us add the condition $y'(t_0) < 0$, which says that the ball was falling at that time, and let us prove that it holds for $v_0$ from (4) and $\varphi$ from (5).
Recall that we have (see (1), (2))
$$t_0 = \frac{l}{v_0 \cos\varphi}, \qquad v_0^2 = \frac{g l^2}{2(l \tan\varphi - h)\cos^2\varphi}.$$
Using this, from
$$y'(t_0) = \lim_{t \to t_0-} y'(t) = v_0 \sin\varphi - g t_0 < 0$$
we get
$$v_0^2 < \frac{g l}{\sin\varphi \cos\varphi}, \qquad \text{that is,} \qquad \frac{l}{2(l \tan\varphi - h)\cos^2\varphi} < \frac{1}{\sin\varphi\cos\varphi},$$
i.e. the inequality
$$l \sin\varphi \cos\varphi < 2(l \tan\varphi - h)\cos^2\varphi,$$
from which we can easily see that $\frac{2h}{l} < \tan\varphi$.
Comparing with (5), we see that the last inequality really holds because
$$\tan\varphi = \frac{h + \sqrt{h^2 + l^2}}{l} > \frac{h + \sqrt{h^2}}{l} = \frac{2h}{l}.$$
Thus we have shown that for the initial speed from (4), the player is able to score.
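During the free throw, supposing the player releases the ball at the height of 2 m, we have
$$h = 1.05\ \text{m}, \qquad l = 4.225\ \text{m}, \qquad g = 9.806\,65\ \text{m}\cdot\text{s}^{-2},$$
and so the minimal initial speed of the ball is
$$v_0 = \sqrt{9.806\,65\left[1.05 + \sqrt{(1.05)^2 + (4.225)^2}\,\right]}\ \text{m}\cdot\text{s}^{-1} \approx 7.28\ \text{m}\cdot\text{s}^{-1}.$$
The corresponding angle is then
$$\varphi = \operatorname{arctg} \frac{v_0^2}{9.806\,65 \cdot 4.225} \approx 0.907\ \text{rad} \approx 52^\circ.$$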
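The formulas (4) and (5) are easy to evaluate by machine. The sketch below is only an illustration; the function name `minimal_throw` and its parameter names are assumptions introduced here, while the numerical inputs are those of the free throw above.

```python
import math


def minimal_throw(l, h, g=9.80665):
    """Minimal initial speed (formula (4)) and elevation angle (formula (5)).

    l ... horizontal distance to the rim, h ... height of the rim above
    the point of release, g ... gravitational acceleration.
    """
    root = math.hypot(h, l)              # sqrt(h^2 + l^2)
    v0 = math.sqrt(g * (h + root))       # formula (4)
    phi = math.atan((h + root) / l)      # formula (5), in radians
    return v0, phi


v0, phi = minimal_throw(l=4.225, h=1.05)
print(round(v0, 2), round(phi, 3), round(math.degrees(phi), 1))
```

The printed values can be compared with the speed of about 7.28 m/s and the angle of about 52° obtained above.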
Let us think for a while about the obtained value of the angle $\varphi$ for the initial speed $v_0$. According to the picture, we have
$$2\beta + (\pi - \alpha) = \pi \qquad \text{and} \qquad \alpha + \gamma = \frac{\pi}{2},$$
whence it follows that
If it is known that the inverse $y = f^{-1}(x)$ of a differentiable function $x = f(y)$ is also differentiable, then the chain rule yields immediately
$$1 = (\mathrm{id})'(x) = (f \circ f^{-1})'(x) = f'(y) \cdot (f^{-1})'(x).$$
Notice that $f'(y)$ must be non-zero.
This corresponds to the intuitive idea that for $y = f(x)$, the value of $f'$ is approximately $\frac{\delta y}{\delta x}$, while for $x = f^{-1}(y)$ it is approximately $(f^{-1})'(y) = \frac{\delta x}{\delta y}$. And this indeed is the way the derivatives of inverse functions are calculated.
Derivative of the inverse function
So it holds that
Theorem. If $f$ is a real-valued function differentiable at $y_0$, such that the inverse $f^{-1}(x)$ exists on a neighbourhood of the value $x_0 = f(y_0)$ and $f'(y_0) \neq 0$, then
$$(1) \qquad (f^{-1})'(x_0) = \frac{1}{f'(f^{-1}(x_0))} = \frac{1}{f'(y_0)}.$$
Proof. To prove the proposition, it suffices to read the proof of the fourth statement of theorem 5.3.4. We work with the composition $f \circ f^{-1} = \mathrm{id}$ there. The composite function is differentiable. By lemma 5.3.1, there is a function $\varphi$ continuous at $y_0$ such that
$$f(y) - f(y_0) = \varphi(y)(y - y_0),$$
on some neighbourhood of $y_0$. Further, it satisfies $\varphi(y_0) = f'(y_0) \neq 0$, and $\varphi$ has constant sign on a neighbourhood of $y_0$. Next, notice that the existence of the inverse $f^{-1}$ around the point $x_0$ and the continuity of $f$ at $y_0$ guarantee the continuity of $f^{-1}$ at $x_0$ (the $\varepsilon$ and $\delta$ neighbourhoods map to each other bijectively). The substitution $y = f^{-1}(x)$ then yields
$$x - x_0 = \varphi(f^{-1}(x))\,(f^{-1}(x) - f^{-1}(x_0)),$$
$$\varphi = \frac{\pi}{2} - \beta = \frac{\pi}{4} + \frac{\gamma}{2} = \frac{1}{2}\left(\frac{\pi}{2} + \gamma\right) = \frac{1}{2}\left(\frac{\pi}{2} + \operatorname{arctg}\frac{h}{l}\right).$$
We have obtained that the elevation angle corresponding to the throw with minimal energy is the arithmetic mean of the right angle and the angle at which the rim is seen (from the ball's position).
The problem of finding the minimal speed of the thrown ball was actually solved by Edmond Halley as early as 1686, when he determined the minimal amount of gunpowder necessary for a cannonball to hit a target which lies at greater height (beyond a rampart, for instance). Halley proved (the so-called Halley's calibration rule) that to hit a target at the point $[l, h]$ (shooting from $[0, 0]$) one needs the same minimal amount of gunpowder as when hitting a horizontal target at distance $h + \sqrt{h^2 + l^2}$ (at the angle $\varphi = 45^\circ$). Halley also demonstrated that the value of $\varphi$ is stable with regard to small differences in the amount of gunpowder used and insignificant errors in estimating the target's distance. □
5.F.8. A bullet is shot at angle $\varphi$ from a point at height $h$ above ground at initial speed $v_0$. It will fall on the ground at distance $R$ from the point of shot (see the picture). Determine the angle $\varphi$ for which the value of $R$ is maximal.
Solution. We will express the bullet's position in time by the points $[x(t), y(t)]$. We assume that it was shot at time $t = 0$ from the point $[0, 0]$ and that it will fall on the ground at the point $[R, -h]$ at a certain time $t = t_0$, i.e. $x(0) = 0$, $y(0) = 0$, $x(t_0) = R$, $y(t_0) = -h$. Similarly to Halley's problem, we will consider the equations
$$x'(t) = v_0 \cos\varphi, \qquad y'(t) = v_0 \sin\varphi - gt, \qquad t \in (0, t_0)$$
for the horizontal and vertical speeds of the bullet, where g is the gravity of Earth.
We can continue as when solving the previous problem: by integrating these equations (taking x(0) = y(0) =0 into consideration), we get
for all $x$ lying in some neighbourhood $O(x_0)$ of $x_0$. Further, $f^{-1}(x_0) = y_0$, and $\varphi(f^{-1}(x))$ is continuous at $x_0$ and remains non-zero on a neighbourhood $O(x_0)$ of $x_0$ with constant sign. Thus
$$\frac{f^{-1}(x) - f^{-1}(x_0)}{x - x_0} = \frac{1}{\varphi(f^{-1}(x))}$$
for all $x \in O(x_0) \setminus \{x_0\}$. The right-hand side of this expression is continuous at $x_0$. The limit is
$$\lim_{x \to x_0} \frac{1}{\varphi(f^{-1}(x))} = \frac{1}{\varphi(f^{-1}(x_0))} = \frac{1}{f'(y_0)}.$$
Therefore, the limit of the left-hand side also exists, and it follows that
$$(f^{-1})'(x_0) = \frac{1}{f'(y_0)}$$
as required. □
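The formula for the derivative of the inverse function can be tested on a function whose inverse has no convenient closed form. The following sketch is only an illustration: the function $f(y) = y^3 + y$, the point $y_0 = 1$, and the bisection helper `f_inverse` are assumptions chosen here, not taken from the text.

```python
def f(y):
    return y ** 3 + y            # strictly increasing, hence invertible on R


def f_prime(y):
    return 3 * y ** 2 + 1


def f_inverse(x, lo=-10.0, hi=10.0, iterations=200):
    """Solve f(y) = x by bisection (valid because f is increasing)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


y0 = 1.0
x0 = f(y0)                       # x0 = f(y0) = 2
step = 1e-5

# Difference quotient of the inverse function at x0 ...
numeric = (f_inverse(x0 + step) - f_inverse(x0 - step)) / (2 * step)
# ... compared with 1 / f'(y0) from the theorem.
formula = 1 / f_prime(y0)
print(numeric, formula)
```

Both values should be close to $1/4$, since $f'(1) = 4$.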
5.3.7. Derivatives of the elementary functions. Consider the exponential function $f(x) = a^x$ for any fixed real $a > 0$. If the derivative of $a^x$ exists for all $x$, then
$$f'(x) = \lim_{\delta x \to 0} \frac{a^{x + \delta x} - a^x}{\delta x} = a^x \lim_{\delta x \to 0} \frac{a^{\delta x} - 1}{\delta x} = f'(0)\, a^x.$$
On the other hand, if the derivative at zero exists, then this formula guarantees the existence of the derivative at any point of the domain and also determines its value. At the same time, the validity of this formula for one-sided derivatives is also verified.
Unfortunately, it takes some time to verify that the derivatives of exponential functions indeed exist (see 5.4.2, i, and 6.3.7).
There is an especially important base e, sometimes known as Euler's number, for which the derivative at zero equals one.
Remember the formula $(e^x)' = e^x$ for a while and draw on its consequences.
For the general exponential function (using standard rules of differentiation),
$$(a^x)' = (e^{\ln(a)\, x})' = \ln(a)\,(e^{\ln(a)\, x}) = \ln(a) \cdot a^x.$$
Thus exponential functions are special since their derivatives are proportional to their values.
Next, we determine the derivative $(\ln x)'$. The definition of the natural logarithm as the inverse to $e^x$,
$$e^{\ln x} = x,$$
allows the calculation:
$$(1) \qquad (\ln)'(y) = (\ln)'(e^x) = \frac{1}{(e^x)'} = \frac{1}{e^x} = \frac{1}{y}.$$
The formula
$$(2) \qquad (x^a)' = a\, x^{a-1}$$
for differentiating a general power function can also be derived using the derivatives of the exponential and logarithmic functions:
$$(x^a)' = (e^{a \ln x})' = e^{a \ln x}\,(a \ln x)' = \frac{a}{x}\, x^a = a\, x^{a-1}.$$
$$x(t) = v_0 t \cos\varphi, \qquad y(t) = v_0 t \sin\varphi - \tfrac{1}{2} g t^2, \qquad t \in (0, t_0),$$
and from the conditions $\lim_{t \to t_0-} x(t) = x(t_0) = R$, $\lim_{t \to t_0-} y(t) = y(t_0) = -h$, we then have that
$$R = v_0 t_0 \cos\varphi, \qquad -h = v_0 t_0 \sin\varphi - \tfrac{1}{2} g t_0^2.$$
From the first equation, it follows that
$$t_0 = \frac{R}{v_0 \cos\varphi},$$
so we can express the previous two equations by the single equation
$$(1) \qquad -h = R \tan\varphi - \frac{g R^2}{2 v_0^2 \cos^2\varphi},$$
where $\varphi \in (0, \pi/2)$.
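Unlike in Halley's problem, the value of $v_0$ is given and $R$ is variable (dependent on $\varphi$). So there is a function $R = R(\varphi)$ in the variable $\varphi$ which is determined by the equation (1); that is, this function is given implicitly. The equation (1) can be written as (with $R$ replaced by $R(\varphi)$)
$$R(\varphi) \tan\varphi \cdot 2 v_0^2 \cos^2\varphi - g R^2(\varphi) + h \cdot 2 v_0^2 \cos^2\varphi = 0.$$
Using the relation
$$2 \tan\varphi \cos^2\varphi = \sin 2\varphi,$$
we can transform (1) into the form
$$(2) \qquad R(\varphi)\, v_0^2 \sin 2\varphi - g R^2(\varphi) + 2 h v_0^2 \cos^2\varphi = 0.$$
Differentiating with respect to $\varphi$ now gives
$$R'(\varphi)\, v_0^2 \sin 2\varphi + 2 R(\varphi)\, v_0^2 \cos 2\varphi - 2 g R(\varphi) R'(\varphi) - 2 h v_0^2\,(2 \cos\varphi \sin\varphi) = 0,$$
i.e.
$$R'(\varphi)\left[v_0^2 \sin 2\varphi - 2 g R(\varphi)\right] = -2 R(\varphi)\, v_0^2 \cos 2\varphi + 2 h v_0^2 \sin 2\varphi.$$
Thus we have calculated that
$$R'(\varphi) = \frac{2 v_0^2\left[h \sin 2\varphi - R(\varphi) \cos 2\varphi\right]}{v_0^2 \sin 2\varphi - 2 g R(\varphi)}, \qquad \varphi \in \left(0, \tfrac{\pi}{2}\right).$$
It suffices to verify that $v_0^2 \sin 2\varphi - 2 g R(\varphi) \neq 0$ for every $\varphi \in (0, \pi/2)$. Let us suppose the contrary and substitute
$$R = \frac{v_0^2 \sin 2\varphi}{2g} = \frac{v_0^2 \sin\varphi \cos\varphi}{g}$$
into (1), obtaining
$$-h = \frac{v_0^2 \sin\varphi \cos\varphi}{g} \tan\varphi - \frac{g\, v_0^4 \sin^2\varphi \cos^2\varphi}{2 g^2 v_0^2 \cos^2\varphi}.$$
Simple rearrangements lead to
$$-h = \frac{v_0^2 \sin^2\varphi}{2g},$$
which cannot happen (the left side is surely negative while the right one is positive).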
5.3.8. Mean value theorems. Before continuing the journey of finding new interesting functions, we derive several simple statements about derivatives. The meaning of all of them is intuitively clear from the diagrams. The proofs follow the visual imagination.
Rolle's theorem11
Theorem. Assume that the function $f : \mathbb{R} \to \mathbb{R}$ is continuous on a closed bounded interval $[a, b]$ and differentiable inside this interval. If $f(a) = f(b)$, then there is a number $c \in (a, b)$ such that $f'(c) = 0$.
Proof. Since the function $f$ is continuous on the closed interval (i.e. on a compact set), it attains its maximum and its minimum there. Either its maximum value is greater than $f(a) = f(b)$, or the minimum value is less than $f(a) = f(b)$, or $f$ is constant. If the third case applies, the derivative is zero at all points of the interval $(a, b)$. If the second case applies, then the first case applies to the function $-f$. If the first case applies, the maximum occurs at an interior point $c$. If $f'(c) \neq 0$, then the function $f$ would be either increasing or decreasing at $c$ (see 5.3.2), implying the existence of larger values than $f(c)$ in a neighbourhood of $c$, contradicting that $f(c)$ is a maximum value. □
5.3.9. The latter result immediately implies the following corollary.
11 The French mathematician Michel Rolle (1652-1719) proved this theorem only for polynomials. The principle was perhaps known much earlier, but the rigorous proof comes from the 19th century only.
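So we were able to determine $R'(\varphi)$ for all $\varphi \in (0, \pi/2)$. What is more, we can immediately see that this derivative is zero if and only if
$$h \sin 2\varphi = R(\varphi) \cos 2\varphi, \qquad \text{i.e.} \qquad R(\varphi) = h \tan 2\varphi.$$
Since the function $R$ must have a maximum on the interval $(0, \pi/2)$ (according to the physical meaning of the problem, the value of $R$ decreases for $\varphi \to 0+$ and for $\varphi \to \pi/2-$) and is differentiable at every point of this interval, it has its maximum at a point where its derivative is zero. This means that $R(\varphi)$ can be maximal only if
$$(3) \qquad R(\varphi) = h \tan 2\varphi.$$
Let us thus substitute (3) into (2). We obtain
$$h \tan 2\varphi\; v_0^2 \sin 2\varphi - g h^2 \tan^2 2\varphi + 2 h v_0^2 \cos^2\varphi = 0,$$
and let us transform this equation:
$$\tan 2\varphi\; v_0^2 \sin 2\varphi + 2 v_0^2 \cos^2\varphi = g h \tan^2 2\varphi,$$
$$v_0^2\, \frac{\sin^2 2\varphi}{\cos 2\varphi} + v_0^2 (\cos 2\varphi + 1) = g h\, \frac{\sin^2 2\varphi}{\cos^2 2\varphi},$$
$$v_0^2 \sin^2 2\varphi + v_0^2 \cos^2 2\varphi + v_0^2 \cos 2\varphi = g h\, \frac{1 - \cos^2 2\varphi}{\cos 2\varphi},$$
$$v_0^2 (1 + \cos 2\varphi) = g h\, \frac{(1 - \cos 2\varphi)(1 + \cos 2\varphi)}{\cos 2\varphi},$$
$$v_0^2 \cos 2\varphi = g h (1 - \cos 2\varphi), \qquad \text{and} \qquad \cos 2\varphi = \frac{g h}{v_0^2 + g h}.$$
By this we have uniquely determined the point
$$\varphi_0 = \frac{1}{2} \arccos \frac{g h}{v_0^2 + g h}$$
at which $R$ is highest. Since
$$\sin 2\varphi_0 = \sqrt{1 - \cos^2 2\varphi_0} = \sqrt{1 - \frac{g^2 h^2}{(v_0^2 + g h)^2}} = \frac{\sqrt{v_0^4 + 2 g h v_0^2}}{v_0^2 + g h},$$
we have
$$R(\varphi_0) = h \tan 2\varphi_0 = h \cdot \frac{\sqrt{v_0^4 + 2 g h v_0^2}}{g h} = \frac{v_0}{g} \sqrt{v_0^2 + 2 g h}.$$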
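Let, for instance, javelin thrower Barbora Špotáková give a javelin the speed $v_0 = 27.778$ m/s $= 100$ km/h at the height $h = 1.8$ m (with $g = 9.806\,65\ \text{m}\cdot\text{s}^{-2}$). Then the javelin can fly up to the distance
$$R(\varphi_0) = \frac{27.778}{9.806\,65} \sqrt{27.778^2 + 2 \cdot 9.806\,65 \cdot 1.8}\ \text{m} \approx 80.46\ \text{m}.$$
This distance is achieved for
$$\varphi_0 = \frac{1}{2} \arccos \frac{9.806\,65 \cdot 1.8}{27.778^2 + 9.806\,65 \cdot 1.8} \approx 0.774\,2\ \text{rad} \approx 44.36^\circ.$$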
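The javelin computation can be reproduced, and the optimality of $\varphi_0$ cross-checked, with a few lines of code. This is a sketch under stated assumptions: the function names `best_throw` and `throw_range` are introduced here for illustration, and `throw_range` evaluates the range directly from the equations of motion rather than from the implicit equation (1).

```python
import math


def best_throw(v0, h, g=9.80665):
    """Optimal angle phi0 (radians) and maximal distance R(phi0)."""
    phi0 = 0.5 * math.acos(g * h / (v0 ** 2 + g * h))
    distance = v0 / g * math.sqrt(v0 ** 2 + 2 * g * h)
    return phi0, distance


def throw_range(phi, v0, h, g=9.80665):
    """Distance R for a given angle: x(t0), where t0 > 0 solves y(t0) = -h."""
    vx = v0 * math.cos(phi)
    vy = v0 * math.sin(phi)
    t0 = (vy + math.sqrt(vy ** 2 + 2 * g * h)) / g
    return vx * t0


phi0, distance = best_throw(v0=27.778, h=1.8)
print(round(distance, 2), round(phi0, 4), round(math.degrees(phi0), 2))

# Neighbouring angles should give a shorter throw than phi0.
print(throw_range(phi0 - 0.01, 27.778, 1.8),
      throw_range(phi0, 27.778, 1.8),
      throw_range(phi0 + 0.01, 27.778, 1.8))
```

The first line of output can be compared with the distance of about 80.46 m and the angle of about 44.36° computed above.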
However, the world record of Barbora Špotáková does not even approach 80 m, although the impact of other phenomena (air resistance, for example) can be neglected. Still, we must not forget that from 1 April 1999, the center of gravity of the women's javelin was moved towards its tip upon the decision of the IAAF (International Association of Athletics Federations). This reduced the flight distance by around 10 %.
Lagrange's mean value theorem
Theorem. Assume the function $f : \mathbb{R} \to \mathbb{R}$ is continuous on an interval $[a, b]$ and differentiable at all points inside this interval. Then there is a number $c \in (a, b)$ such that
$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$
Proof. The proof is a simple restatement of the geometrical meaning of the theorem: there is a tangent line to the graph which is parallel to the secant line between the points $[a, f(a)]$ and $[b, f(b)]$ (have a look at the diagram). The equation of the secant line is
$$y = g(x) = f(a) + \frac{f(b) - f(a)}{b - a}\,(x - a).$$
The difference $h(x) = f(x) - g(x)$ determines the (vertical) distance of the graph and the secant line (in the values of $y$). Surely $h(a) = h(b)$ and
$$h'(x) = f'(x) - \frac{f(b) - f(a)}{b - a}.$$
By the previous theorem, there is a point $c$ at which $h'(c) = 0$. □
The mean value theorem can also be written in the form:
$$(1) \qquad f(b) = f(a) + f'(c)(b - a).$$
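The theorem guarantees only the existence of the point $c$; for a concrete function it can be located numerically. The following sketch uses an illustrative choice which is not from the text, namely $f(x) = x^3$ on $[0, 2]$. Because $f'$ is increasing there, the equation $f'(c) = (f(b) - f(a))/(b - a)$ can be solved by bisection.

```python
def f(x):
    return x ** 3


def f_prime(x):
    return 3 * x ** 2


a, b = 0.0, 2.0
secant_slope = (f(b) - f(a)) / (b - a)    # slope of the secant line

# Bisection for f'(c) = secant_slope; valid here because f' is increasing
# on [a, b] with f'(a) < secant_slope < f'(b).
lo, hi = a, b
for _ in range(100):
    mid = (lo + hi) / 2
    if f_prime(mid) < secant_slope:
        lo = mid
    else:
        hi = mid
c = (lo + hi) / 2

print(c, f_prime(c), secant_slope)
```

For this choice the exact solution is $c = 2/\sqrt{3}$, since $3c^2 = 4$.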
In the case of a parametrically given curve in the plane, i.e. a pair of functions $y = f(t)$, $x = g(t)$, the same result about the existence of a tangent line parallel to the secant line going through the boundary points is described by Cauchy's mean value theorem:
The original record (with "correctly balanced" javelin) was 80.00 m.
The reasoning performed and the result obtained can be applied to other athletic disciplines and sports. In golf, for instance, $h$ is close to 0, and thus it is just the angle
$$\varphi_0 = \lim_{h \to 0+} \frac{1}{2} \arccos \frac{g h}{v_0^2 + g h} = \frac{1}{2} \arccos 0 = \frac{\pi}{4}\ \text{rad} = 45^\circ$$
at which the ball falls at the greatest distance
$$R(\varphi_0) = \lim_{h \to 0+} \frac{v_0}{g} \sqrt{v_0^2 + 2 g h} = \frac{v_0^2}{g}.$$
Let us realize that our calculation cannot be used for $h = 0$ ($\varphi_0 = \pi/4$) since then we would get the undefined expression $\tan(\pi/2)$ for the distance $R$. However, we have solved the problem for any $h > 0$, and therefore we can get a helping hand from the corresponding one-sided limit. □
Cauchy's mean value theorem
Corollary. Let functions $y = f(t)$ and $x = g(t)$ be continuous on an interval $[a, b]$ and differentiable inside this interval, and further let $g'(t) \neq 0$ for all $t \in (a, b)$. Then there is a point $c \in (a, b)$ such that
$$\frac{f(b) - f(a)}{g(b) - g(a)} = \frac{f'(c)}{g'(c)}.$$
Proof. Put
$$h(t) = (f(b) - f(a))\, g(t) - (g(b) - g(a))\, f(t).$$
Now $h(a) = f(b) g(a) - f(a) g(b)$, $h(b) = f(b) g(a) - f(a) g(b)$, so by Rolle's theorem, there is a number $c \in (a, b)$ such that $h'(c) = 0$.
Finally, the function $g$ is either strictly increasing or strictly decreasing on $[a, b]$ and thus $g(b) \neq g(a)$. Moreover, $g'(c) \neq 0$ and the desired formula follows. □
5.3.10. A reasoning similar to the one in the above proof leads to a supremely useful tool for calculating limits of quotients of functions.
L'Hopital's rule12
Theorem. Suppose $f$ and $g$ are functions differentiable on some neighbourhood of a point $x_0 \in \mathbb{R}$, yet not necessarily at $x_0$ itself. Suppose
$$\lim_{x \to x_0} f(x) = 0, \qquad \lim_{x \to x_0} g(x) = 0.$$
If the limit
$$\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$$
exists, then the limit
$$\lim_{x \to x_0} \frac{f(x)}{g(x)}$$
also exists, and the two limits are equal.
5.F.9.   Regiomontanus' problem, 1471.
In the museum, there is a painting on the wall. Its lower edge is $a$ meters above ground and its upper edge $b$ meters (its height thus equals $b - a$). A tourist is looking at the painting, her eyes being at height $h < a$ meters above ground. (The reason for the inequality $h < a$ can, for instance, be to allow more equally tall visitors to view the painting simultaneously in several rows.) How far from the wall should the tourist stand if she wants to maximize her angle of view of the painting?
12Guillaume Francois Antoine, Marquis de l'Hopital, (1661-1704) became famous for his textbook on Calculus. This rule was first published there, perhaps originally proved by one of the famous Bernoulli brothers.
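Solution. Let us denote by $x$ the distance (in meters) of the tourist from the wall and by $\varphi$ her angle of view of the painting. Further, let us define (see the picture) the angles $\alpha, \beta \in (0, \pi/2)$ by
$$\tan\alpha = \frac{b - h}{x}, \qquad \tan\beta = \frac{a - h}{x}.$$
Our task is to maximize $\varphi = \alpha - \beta$. Let us add that for $h > b$, one can proceed analogously, and for $h \in [a, b]$, the angle $\varphi$ increases as $x$ decreases ($\varphi = \pi$ for $x = 0$ and $h \in (a, b)$).
From the condition $h < a$ it follows that the angle $\varphi$ is acute, i.e. $\varphi \in (0, \pi/2)$. Since the function $y = \tan x$ is increasing on the interval $(0, \pi/2)$, we can turn our attention to maximizing the value $\tan\varphi$. We have that
$$\tan\varphi = \tan(\alpha - \beta) = \frac{\tan\alpha - \tan\beta}{1 + \tan\alpha \tan\beta} = \frac{\frac{b - h}{x} - \frac{a - h}{x}}{1 + \frac{(b - h)(a - h)}{x^2}} = \frac{x (b - a)}{x^2 + (b - h)(a - h)}.$$
So it suffices to find the global maximum of the function
$$f(x) = \frac{x (b - a)}{x^2 + (b - h)(a - h)}, \qquad x \in [0, +\infty).$$
From the expression
$$f'(x) = \frac{(b - a)\left[x^2 + (b - h)(a - h)\right] - 2 x^2 (b - a)}{\left[x^2 + (b - h)(a - h)\right]^2} = \frac{(b - a)\left[(b - h)(a - h) - x^2\right]}{\left[x^2 + (b - h)(a - h)\right]^2}, \qquad x \in (0, +\infty),$$
we can see that
$$f'(x) > 0 \ \text{ for } \ x \in \left(0, \sqrt{(b - h)(a - h)}\right), \qquad f'(x) < 0 \ \text{ for } \ x \in \left(\sqrt{(b - h)(a - h)}, +\infty\right).$$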
Proof. Without loss of generality, the functions $f$ and $g$ are zero at the point $x_0$. The quotient of the values then corresponds to the slope of the secant line between the points $[0, 0]$ and $[g(x), f(x)]$. At the same time, the quotient of the derivatives corresponds to the slope of the tangent line at the given point. Thus it is necessary to verify that the limit of the slopes of the secant lines exists from the fact that the limit of the slopes of the tangent lines exists.
Technically, we can use the mean value theorem in Cauchy's parametric form. First of all, the existence of the expression $f'(x)/g'(x)$ on some neighbourhood of the point $x_0$ (excluding $x_0$ itself) is implicitly assumed. Thus especially for points $c$ sufficiently close to $x_0$, $g'(c) \neq 0$.13 By the mean value theorem,
$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{g(x) - g(x_0)} = \lim_{x \to x_0} \frac{f'(c_x)}{g'(c_x)},$$
where $c_x$ is a number lying between $x_0$ and $x$, dependent on $x$. From the existence of the limit
$$\lim_{x \to x_0} \frac{f'(x)}{g'(x)},$$
it follows that this value will be shared by the limit of any sequence created by substituting the values $x = x_n$ approaching $x_0$ into $f'(x)/g'(x)$ (cf. the convergence test 5.2.15). In particular, we can substitute any sequence $c_{x_n}$ for $x_n \to x_0$, and thus the limit
$$\lim_{x \to x_0} \frac{f'(c_x)}{g'(c_x)}$$
exists, and the last two limits are equal. Hence the desired limit exists and has the same value. □
From the proof of the theorem, it is true for one-sided limits as well.
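A quick numerical experiment illustrates the rule. In the sketch below, the functions $f(x) = e^x - 1$ and $g(x) = \sin x$ at $x_0 = 0$ are an illustrative choice made here, not an example from the text; both quotients should approach the same value, namely $e^0/\cos 0 = 1$.

```python
import math


def f(x):
    return math.expm1(x)         # e^x - 1, accurate for small x


def g(x):
    return math.sin(x)


def f_prime(x):
    return math.exp(x)


def g_prime(x):
    return math.cos(x)


# Both f and g tend to 0 at x0 = 0; compare f/g with f'/g' near x0,
# approaching from the right and from the left.
for x in (0.1, 0.01, 0.001, -0.001, -0.01):
    print(x, f(x) / g(x), f_prime(x) / g_prime(x))
```

Negative values of `x` are included because the rule holds for one-sided limits as well.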
5.3.11. Corollaries. L'Hospital's rule can easily be extended to limits at the improper points $\pm\infty$ and to the case of infinite values of the limits. If, for instance, we have
$$\lim_{x \to \infty} f(x) = 0, \qquad \lim_{x \to \infty} g(x) = 0,$$
then $\lim_{x \to 0+} f(1/x) = 0$ and $\lim_{x \to 0+} g(1/x) = 0$.
At the same time, from the existence of the limit of the quotient of the derivatives at infinity,
$$\lim_{x \to 0+} \frac{(f(1/x))'}{(g(1/x))'} = \lim_{x \to 0+} \frac{f'(1/x)(-1/x^2)}{g'(1/x)(-1/x^2)} = \lim_{x \to 0+} \frac{f'(1/x)}{g'(1/x)} = \lim_{x \to \infty} \frac{f'(x)}{g'(x)}.$$
Applying the previous theorem, the limit
$$\lim_{x \to \infty} \frac{f(x)}{g(x)} = \lim_{x \to 0+} \frac{f(1/x)}{g(1/x)} = \lim_{x \to \infty} \frac{f'(x)}{g'(x)}$$
13 This is not always necessary for the existence of the limit in a general sense. Nevertheless, for the statement of l'Hospital's rule, it is. A thorough discussion can be found (googled) in the popular article 'R. P. Boas, Counterexamples to L'Hospital's Rule, The American Mathematical Monthly, October 1986, Volume 93, Number 8, pp. 644-645.'
Hence the function $f$ has its global maximum at the point $x_0 = \sqrt{(b - h)(a - h)}$ (recall the inequalities $h < a < b$).
The point $x_0$ can, of course, be determined by other means. For instance, we can (instead of looking for the maximum of the positive function $f$ on the interval $(0, +\infty)$) try to find the global minimum of the function
$$g(x) = \frac{1}{f(x)} = \frac{x^2 + (b - h)(a - h)}{x (b - a)} = \frac{x}{b - a} + \frac{(b - h)(a - h)}{x (b - a)}, \qquad x \in (0, +\infty),$$
with the help of the so-called AM-GM inequality (between the arithmetic and geometric means)
$$\frac{y_1 + y_2}{2} \geq \sqrt{y_1 y_2}, \qquad y_1, y_2 \geq 0,$$
where the equality occurs iff $y_1 = y_2$. The choice
$$y_1(x) = \frac{x}{b - a}, \qquad y_2(x) = \frac{(b - h)(a - h)}{x (b - a)}$$
then gives
$$g(x) = y_1(x) + y_2(x) \geq 2 \sqrt{y_1(x)\, y_2(x)} = \frac{2}{b - a} \sqrt{(b - h)(a - h)}.$$
Therefore, if there is a number $x > 0$ for which $y_1(x) = y_2(x)$, then the function $g$ has its global minimum at $x$. The equation
$$y_1(x) = y_2(x), \qquad \text{i.e.} \qquad \frac{x}{b - a} = \frac{(b - h)(a - h)}{x (b - a)},$$
has a unique positive solution $x_0 = \sqrt{(b - h)(a - h)}$.
We have determined the ideal distance of the tourist from the wall in two different ways. The angle corresponding to $x_0$ is
$$\varphi_0 = \arctan \frac{x_0 (b - a)}{x_0^2 + (b - h)(a - h)} = \arctan \frac{b - a}{2 \sqrt{(b - h)(a - h)}}.$$
When looking at the painting from the ground (being an ant, for instance), we have $h = 0$, and so
$$x_0 = \sqrt{ab}, \qquad \varphi_0 = \arctan \frac{b - a}{2 \sqrt{ab}}.$$
If the painting is 1 meter high and its lower edge is 2 meters above ground ($a = 2$, $b = 3$), then the ant will see the painting at the largest angle $\varphi_0 \approx 0.201\,4$ rad $\approx 11.5^\circ$ at the distance $x_0 \approx 2.45$ meters from the wall. If this painting is viewed by a man whose eyes are at the height of 1.8 meters, together with his son whose eyes are 1 meter above ground, then the father should stand $x_0 \approx 0.49$ meters from the wall and his son $x_0 \approx 1.41$ meters. We can notice that the father has $\varphi_0 \approx 0.795\,6$ rad $\approx 45.6^\circ$ whereas his son has $\varphi_0 \approx 0.339\,8$ rad $\approx 19.5^\circ$. The quotient
$$\frac{0.795\,6}{0.339\,8} \approx \frac{45.6}{19.5} \approx 2.3$$
shows how much better a view the father has.
□
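The three numerical cases of Regiomontanus' problem can be recomputed as follows. This is only a sketch: the function names `best_view` and `view_angle` are assumptions introduced here, and `view_angle` evaluates $\varphi = \alpha - \beta$ directly so that the optimum can be compared with nearby distances.

```python
import math


def best_view(a, b, h):
    """Optimal distance x0 and maximal viewing angle phi0 (radians), h < a < b."""
    x0 = math.sqrt((b - h) * (a - h))
    phi0 = math.atan((b - a) / (2 * x0))
    return x0, phi0


def view_angle(x, a, b, h):
    """Viewing angle phi = alpha - beta at distance x from the wall."""
    return math.atan((b - h) / x) - math.atan((a - h) / x)


for name, eye_height in (("ant", 0.0), ("father", 1.8), ("son", 1.0)):
    x0, phi0 = best_view(a=2.0, b=3.0, h=eye_height)
    print(name, round(x0, 2), round(phi0, 4), round(math.degrees(phi0), 1))
```

Evaluating `view_angle` slightly closer to or farther from the wall than `x0` should give a smaller angle than `phi0`.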
exists in this case as well.
The limit calculation is even simpler in the case when
$$\lim_{x \to x_0} f(x) = \pm\infty, \qquad \lim_{x \to x_0} g(x) = \pm\infty.$$
Then it suffices to write
$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{1/g(x)}{1/f(x)},$$
which is already the case covered by l'Hospital's rule from the previous theorem. It can be proved that l'Hospital's rule has the same form for infinite limits as well:
Theorem. Let $f$ and $g$ be functions differentiable on some neighbourhood of a point $x_0 \in \mathbb{R}$, not necessarily at $x_0$ itself. Further, let the limits $\lim_{x \to x_0} f(x) = \pm\infty$ and $\lim_{x \to x_0} g(x) = \pm\infty$ exist. If the limit
$$\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$$
exists, then the limit
$$\lim_{x \to x_0} \frac{f(x)}{g(x)}$$
also exists and they equal each other.
Proof. Apply the mean value theorem. The key step is to express the quotient in a form where the derivative arises:
$$\frac{f(x)}{g(x)} = \frac{f(x)}{f(x) - f(y)} \cdot \frac{f(x) - f(y)}{g(x) - g(y)} \cdot \frac{g(x) - g(y)}{g(x)},$$
where $y$ is fixed, from a selected neighbourhood of $x_0$, and $x$ is approaching $x_0$. Since the limits of $f$ and $g$ at $x_0$ are infinite, we can surely assume that the differences of the values of both functions at $x$ and $y$, having fixed $y$, are non-zero.
Using the mean value theorem, replace the fraction in the middle with the quotient of the derivatives at an appropriate point $c$ between $x$ and $y$. The expression of the examined limit thus gets the form
$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{1 - \frac{g(y)}{g(x)}}{1 - \frac{f(y)}{f(x)}} \cdot \frac{f'(c)}{g'(c)},$$
where $c$ depends on both $x$ and $y$. As $x$ approaches $x_0$, the former fraction converges to one. If $y$ is simultaneously moved towards $x_0$, the latter fraction becomes arbitrarily close to the limit value of the quotient of the derivatives. □
5.3.12. Example. By making suitable modifications of the examined expressions, one can also apply l'Hopital's rule to forms of the types $\infty - \infty$, $1^\infty$, $0 \cdot \infty$, and so on. Often one simply rearranges the expressions or uses some continuous function, for instance the exponential one.
For an illustration of such a procedure, we show the connection between the arithmetic and geometric means of $n$ non-negative values $x_i$. The arithmetic mean
$$M^1(x_1, \ldots, x_n) = \frac{x_1 + \cdots + x_n}{n}$$
5.F.10. Snell's law. Determine the refracted light ray between the point $A$ in a homogeneous space with speed of light $v_1$ and the point $B$ in a homogeneous space with speed of light $v_2$. See the picture.
Solution. Once again, we omit units of measurement. We assume that distances are given in meters and the speeds $v_1$, $v_2$ in meters per second (so time is in seconds). The ray is determined by Fermat's principle of least time: of all the paths between the points $A$ and $B$, the light will go along the one which can be traversed in the least time. In homogeneous spaces, the ray will be a straight line (in this case, we consider its segment). So it suffices to determine the point $R$ (given by the value $x$) where the ray refracts. The distance between the points $A$ and $R$ is $\sqrt{h_1^2 + x^2}$, and between the points $R$ and $B$ it is $\sqrt{h_2^2 + (d - x)^2}$. The total time of the transmission of energy between the points $A$ and $B$ is thus given by the function
$$T(x) = \frac{\sqrt{h_1^2 + x^2}}{v_1} + \frac{\sqrt{h_2^2 + (d - x)^2}}{v_2}$$
in the variable $x \in [0, d]$. Let us emphasize that we want to find the point $x \in [0, d]$ at which the value $T(x)$ is minimal. The derivative
$$T'(x) = \frac{x}{v_1 \sqrt{h_1^2 + x^2}} - \frac{d - x}{v_2 \sqrt{h_2^2 + (d - x)^2}}$$
is a continuous function on the interval $[0, d]$, so its sign can be easily described by its zero points. From the equation
$$T'(x) = 0, \qquad \text{i.e.} \qquad \frac{x}{v_1 \sqrt{h_1^2 + x^2}} = \frac{d - x}{v_2 \sqrt{h_2^2 + (d - x)^2}},$$
it follows that
$$\frac{v_1}{v_2} = \frac{\dfrac{x}{\sqrt{h_1^2 + x^2}}}{\dfrac{d - x}{\sqrt{h_2^2 + (d - x)^2}}}.$$
This expression is useful for us because (see the picture)
$$\sin\varphi_1 = \frac{x}{\sqrt{h_1^2 + x^2}}, \qquad \sin\varphi_2 = \frac{d - x}{\sqrt{h_2^2 + (d - x)^2}}.$$
Thus there is at most one stationary point; it is determined by
$$(1) \qquad \frac{\sin\varphi_1}{\sin\varphi_2} = \frac{v_1}{v_2}.$$
Let us realize that as $\varphi_1 \in [0, \pi/2]$ increases (when $x$ increases), the angle $\varphi_2 \in [0, \pi/2]$ decreases. The sine is non-negative and increasing on the interval $[0, \pi/2]$, so the quotient $(\sin\varphi_1)/(\sin\varphi_2)$ is increasing with respect to $x$. Since $T'(0) < 0$ and $T'(d) > 0$, there is exactly one stationary point $x_0$. From the inequalities $T'(x) < 0$ for $x \in [0, x_0)$ and $T'(x) > 0$ for $x \in (x_0, d]$, it follows that there is the global minimum at the stationary point $x_0$.
is a special case of the power mean with exponent $r$, also known as the generalized mean:
$$M^r(x_1, \ldots, x_n) = \left(\frac{x_1^r + \cdots + x_n^r}{n}\right)^{1/r}.$$
The special value $M^{-1}$ is called the harmonic mean. Calculate the limit value of $M^r$ for $r$ approaching zero. For this purpose, determine the limit by l'Hopital's rule (we treat it as an expression of the form 0/0 and differentiate with respect to $r$, with $x_i$ as constant parameters).
The following calculation uses the chain rule and knowledge of the derivative of the power function, and it must be read in reverse. The existence of the last limit implies the existence of the last-but-one, and so on.
$$\lim_{r \to 0} \ln\left(M^r(x_1, \ldots, x_n)\right) = \lim_{r \to 0} \frac{\ln\left(\frac{1}{n}(x_1^r + \cdots + x_n^r)\right)}{r} = \lim_{r \to 0} \frac{\dfrac{x_1^r \ln x_1 + \cdots + x_n^r \ln x_n}{n}}{\dfrac{x_1^r + \cdots + x_n^r}{n}} = \frac{\ln x_1 + \cdots + \ln x_n}{n} = \ln \sqrt[n]{x_1 \cdots x_n}.$$
Hence
$$\lim_{r \to 0} M^r(x_1, \ldots, x_n) = \sqrt[n]{x_1 \cdots x_n},$$
which is known as the geometric mean.
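The convergence of the power means to the geometric mean can be observed numerically. In this sketch the function names and the sample values are illustrative assumptions; the sample consists of positive numbers so that all logarithms and negative exponents make sense.

```python
import math


def power_mean(values, r):
    """Power mean M^r for a non-zero exponent r and positive values."""
    n = len(values)
    return (sum(x ** r for x in values) / n) ** (1 / r)


def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


sample = [1.0, 2.0, 4.0, 8.0]    # illustrative positive values

for r in (1.0, 0.1, 0.001, -0.001, -0.1, -1.0):
    print(r, power_mean(sample, r))
print("geometric mean:", geometric_mean(sample))
```

For exponents close to zero, from either side, the printed power means should approach the geometric mean, with the arithmetic mean ($r = 1$) above it and the harmonic mean ($r = -1$) below it.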
4. Infinite sums and power series
The last part of this chapter is mainly devoted to infinite sums of numbers, aiming at infinite extension of polynomials - the so called power series.
We first complete the basic discussion of the exponential function, expressing it as a limit of polynomial approximations. This illustrates the more general need to develop effective tools to deal with sequences of numbers or functions. If the reader finds the next paragraphs too demanding, we suggest jumping to 5.4.3 starting the general discussion on infinite sums of numbers and maybe return later.
5.4.1. The calculation of ex. For numerical computations, manipulation with limits of sequences is needed as well as addition and multiplication of scalars. Thus it might be a good idea to approximate non-polynomial functions by sequences of numbers that can be calculated easily, keeping control of the approximation errors.
We approach the function ex this way. In view of the expected property (ex)' = ex (cf. 5.3.7), we look for a function whose rate of increase equals the function's value at every point. This can be imagined as a splendid interest rate equal to the current value of your money.
If we apply the interest rate per year once a month, once a day, once an hour, and so on, we obtain the following values for the yield x of the deposit after one year:
Let us summarize the preceding: The ray is given by the point $R$ of refraction (i.e. the value $x_0$), and the point $R$ is given by the identity (1), which is called Snell's law in physics.
The quotient of $v_1$ and $v_2$ is constant for the given homogeneous spaces and determines an important quantity which describes the interface of optical spaces. It is called the refractive index and denoted by $n$. Usually, the first space is vacuum, i.e. $v_1 = c$, and $v_2 = v$, thus obtaining the (absolute) index of refraction $n = c/v$. For vacuum, we get $n = 1$, of course. This value is also used for air since its refractive index at the standard conditions (i.e. pressure of 101 325 Pa, temperature of 293 K and absolute humidity of 0.9 g·m$^{-3}$) is $n = 1.000272$. Other spaces have $n > 1$ ($n = 1.31$ for ice, $n = 1.33$ for water, $n = 1.5$ for glass).
However, the refractive index also depends on the wavelength of the electromagnetic radiation in question (for example, for water and light, it ranges from $n = 1.331$ to $n = 1.344$), where the index ordinarily decreases as the wavelength increases. The speed of light in an optical space having $n > 1$ depends on its frequency. We talk about the dispersion of light. The dispersion causes rays of different colors to refract at different angles. (The violet ray refracts the most and the red ray refracts the least.) This is also the origin of a rainbow. We can further recall Newton's well-known experiment with a glass prism from 1666.
Finally, let us remark that our task always has a solution because we can choose the point $R$ arbitrarily. If, together with the speeds $v_1$ and $v_2$, the angle $\varphi_1$ were given as well (our task could then be to calculate where the ray going from the point $A$ intersects the line $y = c$ for a certain $c < 0$ when the interface of optical spaces is on the $x$-axis), then the angle $\varphi_2 \in (0, \pi/2)$ satisfying (1) might not exist. This corresponds to total reflection (there is no refracted light at all). □
Further miscellaneous problems concerning externa of functions of a real variable can be found at ??
5.F.11. Prove that the polynomial $P(x) = x^5 - x^4 + 2x^3 - x^2 + x + 1$ has exactly one real root.
Solution. Any polynomial of odd degree has at least one real root: for negative $x$ of large absolute value, the values $P(x)$ are negative and large in absolute value; for large positive $x$, the values $P(x)$ are large and positive; and since $P(x)$ is a continuous function, it must attain the value zero. One can also argue
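$$\left(1 + \frac{x}{12}\right)^{12}, \quad \left(1 + \frac{x}{365}\right)^{365}, \quad \left(1 + \frac{x}{8760}\right)^{8760}, \quad \dots$$
Therefore, we could guess that
$$e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n.$$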
At the same time, we can imagine that the finer we apply the interest, the higher the yield will be. So the sequence on the right-hand side should be an increasing sequence. In detail, we examine the sequence of numbers
$$a_n = \left(1 + \frac{1}{n}\right)^n.$$
Bernoulli's inequality will come in handy:
Lemma. For every real number $b > -1$, $b \neq 0$, and a natural number $n \ge 2$, $(1 + b)^n > 1 + nb$.
Proof. For $n = 2$,
$$(1 + b)^2 = 1 + 2b + b^2 > 1 + 2b.$$
Proceed by induction on $n$, supposing $b > -1$. Assume that the proposition holds for some $k \ge 2$ and calculate:
$$(1 + b)^{k+1} = (1 + b)^k (1 + b) > (1 + kb)(1 + b) = 1 + (k + 1)b + kb^2 > 1 + (k + 1)b.$$
The statement is, of course, true for $b = -1$ as well. □
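Now
$$\frac{a_n}{a_{n-1}} = \frac{\left(1 + \frac{1}{n}\right)^n}{\left(1 + \frac{1}{n-1}\right)^{n-1}} = \frac{(n^2 - 1)^n\, n}{n^{2n}(n - 1)} = \left(1 - \frac{1}{n^2}\right)^n \frac{n}{n-1} > \left(1 - \frac{1}{n}\right)\frac{n}{n-1} = 1$$
by using Bernoulli's inequality with $b = -1/n^2$. So $a_n > a_{n-1}$ for all natural numbers $n \ge 2$, and it follows that the sequence $a_n$ is indeed increasing.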
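The following similar calculation (also using Bernoulli's inequality) verifies that the sequence of numbers
$$b_n = \left(1 + \frac{1}{n}\right)^{n+1} = \frac{n+1}{n}\left(1 + \frac{1}{n}\right)^n$$
is decreasing. Notice that $b_n > a_n$. Also,
$$\frac{b_n}{b_{n+1}} = \frac{\left(\frac{n+1}{n}\right)^{n+1}}{\left(\frac{n+2}{n+1}\right)^{n+2}} = \frac{n}{n+1}\left(\frac{n^2 + 2n + 1}{n(n+2)}\right)^{n+2} = \frac{n}{n+1}\left(1 + \frac{1}{n(n+2)}\right)^{n+2} > \frac{n}{n+1}\left(1 + \frac{n+2}{n(n+2)}\right) = 1.$$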
Thus the sequence an is increasing and bounded from above, so the set of its terms has a supremum which equals the limit of the sequence. At the same time, this value is also the limit of the decreasing sequence bn because
$$\lim_{n\to\infty} b_n = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right) a_n = \lim_{n\to\infty} a_n.$$
with the fundamental theorem of algebra (12.2.8). We know that the polynomial $P(x)$ has five roots over the field of complex numbers and that the complex roots of a polynomial with real coefficients come in pairs of conjugate numbers. Therefore the polynomial has to have at least one real root.
If there were at least two real roots $a < b$, then according to the mean value theorem (Rolle's version suffices, cf. 5.3.8) there would be a $c \in (a, b)$ such that $P'(c) = 0$. But $P'(x) = 5x^4 - 4x^3 + 6x^2 - 2x + 1 = 2x^2(x - 1)^2 + 3x^4 + 3x^2 + (x - 1)^2 > 0$. Thus the polynomial has exactly one real root.
□
5.F.12. Let $f : \mathbb{R} \to (0, \infty)$ be a continuously differentiable function. Prove that there exists $\xi \in (0, 1)$ such that
$$e^{f'(\xi)} f(0)^{f(\xi)} = f(1)^{f(\xi)}.$$
Solution. We can equivalently transform the equation:
$$e^{f'(\xi)} = \left(\frac{f(1)}{f(0)}\right)^{f(\xi)}, \qquad \frac{f'(\xi)}{f(\xi)} = \ln f(1) - \ln f(0).$$
The existence of such a $\xi$ is guaranteed by Lagrange's mean value theorem (cf. 5.3.9) for the function $g(x) = \ln(f(x))$ (here $g'(x) = f'(x)/f(x)$).
□
G. L'Hospital's rule
5.G.1. Verify that the limit
(a) $\lim_{x\to 0} \dfrac{\sin(2x) - 2\sin x}{2e^x - x^2 - 2x - 2}$ is of the type $\frac{0}{0}$;
(b) $\lim_{x\to 0+} \dfrac{\ln x}{\cot x}$ is of the type $\frac{\infty}{\infty}$;
(c) $\lim_{x\to 1+} \left(\dfrac{x}{x-1} - \dfrac{1}{\ln x}\right)$ is of the type $\infty - \infty$;
This limit determines one of the most important numbers in mathematics (besides the numbers 0, 1, and $\pi$), namely Euler's number$^{14}$ $e$. Thus
$$e = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^n.$$
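The squeeze between the two sequences can be observed numerically. The following is a minimal sketch, not part of the original text; the function names `a` and `b` are chosen here only to mirror the sequences $a_n$ and $b_n$ above.

```python
import math


def a(n: int) -> float:
    """Increasing sequence a_n = (1 + 1/n)^n."""
    return (1.0 + 1.0 / n) ** n


def b(n: int) -> float:
    """Decreasing sequence b_n = (1 + 1/n)^(n+1)."""
    return (1.0 + 1.0 / n) ** (n + 1)


if __name__ == "__main__":
    # a_n approaches e from below, b_n from above; the gap is a_n / n.
    for n in (1, 10, 100, 1000, 10000):
        print(n, a(n), b(n), b(n) - a(n))
    print("math.e =", math.e)
```

Since $b_n - a_n = a_n / n$, the bracket around $e$ shrinks only like $1/n$, which suggests why the power series of the next paragraph is preferred for actual computation.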
5.4.2. Power series for $e^x$. The exponential function has been defined as the only continuous function satisfying
$f(1) = e$ and $f(x + y) = f(x) \cdot f(y)$. The base $e$ is now expressed as the limit of the sequence $a_n$, thus necessarily
$$e^x = \lim_{n\to\infty} (a_n)^x.$$
Fix a real number $x \neq 0$. If we replace $n$ with $n/x$ in the numbers $a_n$ from the previous paragraph, we arrive again at the same limit. (Think this out in detail!) Hence
$$e = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^{n/x}, \qquad e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n.$$
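Denote the $n$-th term of this sequence by $u_n(x) = (1 + x/n)^n$ and express it by the binomial theorem:
$$u_n(x) = 1 + n\frac{x}{n} + \frac{n(n-1)}{2!}\frac{x^2}{n^2} + \dots + \frac{n!}{n!}\frac{x^n}{n^n} = 1 + x + \frac{x^2}{2!}\left(1 - \frac{1}{n}\right) + \frac{x^3}{3!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) + \dots + \frac{x^n}{n!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{n-1}{n}\right). \tag{1}$$
Look at $u_n(x)$ for very large $n$. It seems that many of the first summands of $u_n(x)$ will be fairly close to the values $\frac{1}{k!}x^k$, $k = 0, 1, \dots$. Thus it is plausible that the numbers $u_n(x)$ should be very close to $v_n(x) = \sum_{k=0}^{n} \frac{1}{k!}x^k$ and thus both these sequences should have the same limit.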
The following theorem is perhaps one of the most important results of Mathematics:
The power series for $e^x$
Theorem. The exponential function $e^x$ equals, for each $x \in \mathbb{R}$, the limit $\lim_{k\to\infty} v_k(x)$ of the partial sums in the expression
$$1 + x + \frac{1}{2}x^2 + \dots + \frac{1}{n!}x^n + \dots = \sum_{n=0}^{\infty} \frac{1}{n!}x^n.$$
The function $e^x$ is differentiable at each point $x$ and its derivative is $(e^x)' = e^x$.
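The claim of the theorem can be checked numerically. Here is a minimal sketch, assuming the names `v` and `u` for the partial sums $v_n(x)$ and the compound-interest terms $u_n(x)$ from the text; the comparison against `math.exp` is an illustration, not a proof.

```python
import math


def v(n: int, x: float) -> float:
    """Partial sum 1 + x + x^2/2! + ... + x^n/n! of the power series."""
    term = 1.0   # current summand x^k / k!
    total = 1.0
    for k in range(1, n + 1):
        term *= x / k        # x^k/k! obtained from x^(k-1)/(k-1)!
        total += term
    return total


def u(n: int, x: float) -> float:
    """Compound-interest approximation (1 + x/n)^n."""
    return (1.0 + x / n) ** n


if __name__ == "__main__":
    x = 2.0
    for n in (5, 10, 20):
        print(n, v(n, x), u(n, x), math.exp(x))
```

The partial sums $v_n(x)$ reach machine precision after a few dozen terms, whereas $u_n(x)$ needs a very large $n$ for even a few correct digits.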
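(d) $\lim_{x\to 1+} \left(\ln(x - 1) \cdot \ln x\right)$ is of the type $0 \cdot \infty$;
(e) $\lim_{x\to 0+} (\cot x)^{\frac{1}{\ln x}}$ is of the type $\infty^0$;
(f) $\lim_{x\to 0} \left(\dfrac{\sin x}{x}\right)^{\frac{1}{x^2}}$ is of the type $1^\infty$;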
Proof. The technical proof makes the above idea precise. Fix $x$ and recall that $v_n(x)$ is a sequence defined as the sums of the first $n$ terms of the formal infinite expression
$^{14}$ The ingenious Swiss mathematician, physicist, astronomer, logician and engineer Leonhard Euler (1707-1783) was behind a great many inventions, including original mathematical techniques and tools.
(g) $\lim_{x\to 1-} \left(\cos\dfrac{\pi x}{2}\right)^{\ln x}$ is of the type $0^0$.
Then calculate it using l'Hospital's rule.
Solution. We can immediately assert that
(a) $\lim_{x\to 0} (\sin(2x) - 2\sin x) = 0 - 0 = 0$, $\quad \lim_{x\to 0} (2e^x - x^2 - 2x - 2) = 2 - 0 - 0 - 2 = 0$;
(b) $\lim_{x\to 0+} \ln x = -\infty$, $\quad \lim_{x\to 0+} \cot x = +\infty$;
(c) $\lim_{x\to 1+} \dfrac{x}{x-1} = +\infty$, $\quad \lim_{x\to 1+} \dfrac{1}{\ln x} = +\infty$;
(d) $\lim_{x\to 1+} \ln x = 0$, $\quad \lim_{x\to 1+} \ln(x - 1) = -\infty$;
(e) $\lim_{x\to 0+} \cot x = +\infty$, $\quad \lim_{x\to 0+} \dfrac{1}{\ln x} = 0$;
(f) $\lim_{x\to 0} \dfrac{\sin x}{x} = 1$, $\quad \lim_{x\to 0} \dfrac{1}{x^2} = +\infty$;
(g) $\lim_{x\to 1-} \cos\dfrac{\pi x}{2} = 0$, $\quad \lim_{x\to 1-} \ln x = 0$.
The case (a). Applying l'Hospital's rule transforms the limit
$$\lim_{x\to 0} \frac{\sin(2x) - 2\sin x}{2e^x - x^2 - 2x - 2}$$
into the limit
$$\lim_{x\to 0} \frac{2\cos(2x) - 2\cos x}{2e^x - 2x - 2},$$
which is of the type 0/0. Two more applications of the rule lead to
$$\lim_{x\to 0} \frac{-4\sin(2x) + 2\sin x}{2e^x - 2}$$
and (the above limit is also of the type 0/0)
$$\lim_{x\to 0} \frac{-8\cos(2x) + 2\cos x}{2e^x} = \frac{-8 + 2}{2} = -3.$$
Altogether, we have (returning to the original limit)
$$\lim_{x\to 0} \frac{\sin(2x) - 2\sin x}{2e^x - x^2 - 2x - 2} = -3.$$
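A quick numerical sanity check of the value $-3$ is possible. The sketch below is an illustration only; the function name `quotient` is an assumption introduced here. Note that both numerator and denominator behave like $x^3$ near zero, so in floating-point arithmetic $x$ must not be taken too small, otherwise cancellation destroys the result.

```python
import math


def quotient(x: float) -> float:
    """Value of (sin 2x - 2 sin x) / (2 e^x - x^2 - 2x - 2) for x != 0."""
    numerator = math.sin(2 * x) - 2 * math.sin(x)
    denominator = 2 * math.exp(x) - x * x - 2 * x - 2
    return numerator / denominator


if __name__ == "__main__":
    # Moderately small arguments: small enough to be near the limit,
    # large enough to avoid catastrophic cancellation.
    for x in (0.5, 0.1, 0.01, -0.01, -0.1):
        print(x, quotient(x))
```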
Let us remark that multiple application of l'Hospital's rule in one exercise is quite common.
From now on, we will set the limits of quotients of derivatives obtained by l'Hospital's rule equal to the limits of the original quotients. We can do this if the limits obtained on the right-hand sides really exist, i.e. we actually verify only afterwards that what we write makes sense.
$$\sum_{j=0}^{\infty} c_j = \sum_{j=0}^{\infty} \frac{1}{j!}x^j.$$
The quotient of adjacent terms in the series is $c_{j+1}/c_j = x/(j + 1)$. Thus for every fixed $x$, there is a number $N \in \mathbb{N}$ such that $|c_{j+1}/c_j| < 1/2$ for all $j \ge N$. Such large indices $j$ therefore satisfy $|c_{j+1}| < \frac{1}{2}|c_j| \le 2^{-(j-N+1)}|c_N|$.
Recall that the sum of a geometric series is computed from the equality $(1 - q)(1 + q + \dots + q^k) = 1 - q^{k+1}$. This means that the partial sums of the first $n > N$ terms of our formal sum can be estimated as follows:
$$|v_n(x)| \le \sum_{j=0}^{N-1} \frac{1}{j!}|x|^j + \frac{1}{N!}|x|^N \sum_{j=0}^{n-N} \frac{1}{2^j}.$$
In particular, if $x > 0$ then the limit of the expressions on the right-hand side for $n$ approaching infinity surely exists, and so the limit of the increasing sequence $v_n$ also exists. Moreover, for any $x$ and $m > n$,
$$|v_m(x) - v_n(x)| \le \sum_{k=n+1}^{m} \frac{1}{k!}|x|^k = v_m(|x|) - v_n(|x|),$$
so $v_n(x)$ is a Cauchy sequence and thus is convergent.
Now examine the sequence of numbers $u_n$, whose limit is $e^x$. Consider $n > N$ for some fixed $N$ (imagine that $N$ is already very large) and choose a fixed number $k < N$.
Write $u_{n,k}$ for the sum of the first $k$ summands in the expression (1) for $u_n$. Having fixed $x$ and some $\varepsilon > 0$, we can choose $k$ big enough to ensure $|u_{n,k} - u_n| < \varepsilon$. Indeed, the absolute values of the omitted terms are less than those of the corresponding terms of the convergent series $\sum_j \frac{1}{j!}|x|^j$.
If $N$ is large enough, $|u_{n,k} - v_k| < \varepsilon$ for all $n > N$. Indeed, there is only a fixed number of brackets in the summands of $u_{n,k}$ and they will all be arbitrarily close to 1 if $n$ is large enough.
Summarizing, for each $\varepsilon > 0$ the choices of $k$ and $n$ lead to the estimate $|v_k - u_n| < 2\varepsilon$. Choosing a sequence $\varepsilon_i = 1/(2i)$, we find subsequences $v_{k_i}$ and $u_{n_i}$ satisfying $|v_{k_i} - u_{n_i}| < \frac{1}{i}$. Thus the two convergent sequences must both have the same limit:
$$\lim_{k\to\infty} v_k = \lim_{n\to\infty} u_n.$$
This is the first claim we had to prove.
It remains to compute the derivative of $e^x$ at the origin. We need to consider
$$\lim_{x\to 0} \frac{e^x - 1}{x} = \lim_{x\to 0} \frac{(1 + x + \frac{1}{2}x^2 + \dots) - 1}{x}.$$
This seems to be tricky, since there are two limit expressions there (notice that an $x$ may be cancelled since it is a constant in the inner limit):
$$\lim_{x\to 0}\left(\lim_{n\to\infty} \sum_{k=1}^{n} \frac{1}{k!}x^{k-1}\right) = 1 + \lim_{x\to 0}\left(\lim_{n\to\infty} \sum_{k=2}^{n} \frac{1}{k!}x^{k-1}\right).$$
The case (b). This time, differentiation of the numerator and the denominator gives
$$\lim_{x\to 0+} \frac{\ln x}{\cot x} = \lim_{x\to 0+} \frac{\frac{1}{x}}{-\frac{1}{\sin^2 x}} = \lim_{x\to 0+} \frac{-\sin^2 x}{x}.$$
The last limit can be determined easily (we even know it). From
$$\lim_{x\to 0+} (-\sin x) = 0, \qquad \lim_{x\to 0+} \frac{\sin x}{x} = 1,$$
the result $0 = 0 \cdot 1$ follows. We could also have used l'Hospital's rule again (now for the expression 0/0), obtaining the result
$$\lim_{x\to 0+} \frac{-\sin^2 x}{x} = \lim_{x\to 0+} \frac{-2 \cdot \sin x \cdot \cos x}{1} = \frac{-2 \cdot 0 \cdot 1}{1} = 0.$$
The case (c). By merely transforming to a common denominator,
$$\lim_{x\to 1+} \left(\frac{x}{x-1} - \frac{1}{\ln x}\right) = \lim_{x\to 1+} \frac{x \ln x - (x - 1)}{(x - 1)\ln x},$$
we have obtained the type 0/0. We have that
$$\lim_{x\to 1+} \frac{x \ln x - (x - 1)}{(x - 1)\ln x} = \lim_{x\to 1+} \frac{\ln x + \frac{x}{x} - 1}{\frac{x-1}{x} + \ln x} = \lim_{x\to 1+} \frac{\ln x}{1 - \frac{1}{x} + \ln x}.$$
We have the quotient 0/0, which (again by l'Hospital's rule) satisfies
$$\lim_{x\to 1+} \frac{\ln x}{1 - \frac{1}{x} + \ln x} = \lim_{x\to 1+} \frac{\frac{1}{x}}{\frac{1}{x^2} + \frac{1}{x}} = \frac{1}{1 + 1} = \frac{1}{2}.$$
Returning to the original limit, we write the result
$$\lim_{x\to 1+} \left(\frac{x}{x-1} - \frac{1}{\ln x}\right) = \frac{1}{2}.$$
The case (d). We transform the assigned expression into the type $\infty/\infty$ (to be precise, into the type $-\infty/\infty$) by creating the fraction
$$\lim_{x\to 1+} \ln(x - 1) \cdot \ln x = \lim_{x\to 1+} \frac{\ln(x - 1)}{\frac{1}{\ln x}}.$$
By l'Hospital's rule,
$$\lim_{x\to 1+} \frac{\ln(x - 1)}{\frac{1}{\ln x}} = \lim_{x\to 1+} \frac{\frac{1}{x-1}}{-\frac{1}{\ln^2 x} \cdot \frac{1}{x}} = \lim_{x\to 1+} \frac{-x \ln^2 x}{x - 1}.$$
This indeterminate form (of the type 0/0) can once again be determined by l'Hospital's rule:
$$\lim_{x\to 1+} \frac{-x \ln^2 x}{x - 1} = \lim_{x\to 1+} \frac{-\ln^2 x - 2x \ln x \cdot \frac{1}{x}}{1} = \frac{0 + 0}{1} = 0.$$
Next, for each $\varepsilon > 0$ we can find $N$ such that $\left|\lim_{n\to\infty} \sum_{k=N}^{n} \frac{1}{k!}x^{k-1}\right| < \varepsilon$ for all $x \in [-1, 1]$. Next we restrict the interval for $x$ enough to ensure that the remaining first terms yield $\left|\sum_{k=2}^{N-1} \frac{1}{k!}x^{k-1}\right| < \varepsilon$, too. This shows that the limit expression on the right-hand side must be zero. Thus the derivative equals one, as expected. □
Readers who skipped the preceding paragraphs (it doesn't matter whether on purpose or in need) can stay calm: we deduce all the results on the exponential function again later, using more general tools. In particular, we will see that all power series are differentiable and can be differentiated term by term. We will also see later that the conditions $f'(x) = f(x)$ and $f(0) = 1$ determine a function uniquely.
5.4.3. Number series. When deriving the previous theorems about the function $e^x$, we have automatically used several extraordinarily useful concepts and tools. Now, we come back to them in detail:
Infinite number series
Definition. An infinite series of numbers is an expression
$$\sum_{n=0}^{\infty} a_n = a_0 + a_1 + a_2 + \dots + a_k + \dots,$$
where the $a_n$'s are real or complex numbers. The sequence of partial sums is given by the terms $s_k = \sum_{n=0}^{k} a_n$. The series converges and equals $s$ if the limit
$$s = \lim_{n\to\infty} s_n$$
of the partial sums exists and is finite.
If the sequence of partial sums has an improper limit, the series diverges to $\infty$ or $-\infty$. If the limit of the partial sums does not exist, the series oscillates.
5.4.4. Properties of series. For the sequence of partial sums $s_n$ to converge, it is necessary and sufficient that it is a Cauchy sequence; that is,
$$|s_m - s_n| = |a_{n+1} + \dots + a_m|$$
must be arbitrarily small for sufficiently large $m > n$. Since
$$|a_{n+1}| + \dots + |a_m| \ge |a_{n+1} + \dots + a_m|,$$
the convergence of the series $\sum_{k=0}^{\infty} |a_k|$ implies the convergence of the series $\sum_{k=0}^{\infty} a_k$.
Absolutely convergent series
A series $\sum_{n=0}^{\infty} a_n$ is absolutely convergent if the series $\sum_{n=0}^{\infty} |a_n|$ converges.
Absolute convergence is introduced because it is often much easier to verify. The following theorem shows that all simple algebraic operations behave "very well" for series that converge absolutely.
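The cases (e), (f), (g). Since
$$\lim_{x\to 0+} (\cot x)^{\frac{1}{\ln x}} = e^{\lim_{x\to 0+} \frac{\ln(\cot x)}{\ln x}}, \qquad \lim_{x\to 0} \left(\frac{\sin x}{x}\right)^{\frac{1}{x^2}} = e^{\lim_{x\to 0} \frac{\ln\frac{\sin x}{x}}{x^2}},$$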
Properties of series
Theorem. Let $S = \sum_{n=0}^{\infty} a_n$ and $T = \sum_{n=0}^{\infty} b_n$ be two absolutely convergent series. Then
(1) their sum converges absolutely to the sum
$$\lim_{x\to 1-} \left(\cos\frac{\pi x}{2}\right)^{\ln x} = e^{\lim_{x\to 1-} \left(\ln x \cdot \ln\left(\cos\frac{\pi x}{2}\right)\right)},$$
it suffices to calculate the limits given in the argument of the exponential function. By l'Hospital's rule and simple rearrangements, we get
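$$\lim_{x\to 0+} \frac{\ln(\cot x)}{\ln x} \;\left[\text{type } \tfrac{+\infty}{-\infty}\right] = \lim_{x\to 0+} \frac{\frac{1}{\cot x} \cdot \frac{-1}{\sin^2 x}}{\frac{1}{x}} = \lim_{x\to 0+} \frac{-x}{\cos x \cdot \sin x} \;\left[\text{type } \tfrac{0}{0}\right] = \lim_{x\to 0+} \frac{-1}{\cos^2 x - \sin^2 x} = \frac{-1}{1 - 0} = -1;$$
$$\lim_{x\to 0} \frac{\ln\frac{\sin x}{x}}{x^2} \;\left[\text{type } \tfrac{0}{0}\right] = \lim_{x\to 0} \frac{\frac{x}{\sin x} \cdot \frac{x \cos x - \sin x}{x^2}}{2x} = \lim_{x\to 0} \frac{x \cos x - \sin x}{2x^2 \sin x} \;\left[\text{type } \tfrac{0}{0}\right]$$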
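$$S + T = \sum_{n=0}^{\infty} a_n + \sum_{n=0}^{\infty} b_n = \sum_{n=0}^{\infty} (a_n + b_n),$$
(2) their difference converges absolutely to the difference
$$S - T = \sum_{n=0}^{\infty} a_n - \sum_{n=0}^{\infty} b_n = \sum_{n=0}^{\infty} (a_n - b_n),$$
(3) their product converges absolutely to the product
$$S \cdot T = \left(\sum_{n=0}^{\infty} a_n\right) \cdot \left(\sum_{n=0}^{\infty} b_n\right) = \sum_{n=0}^{\infty} \left(\sum_{k=0}^{n} a_{n-k} b_k\right),$$
(4) the value $S$ of the sum does not depend on any rearrangement of the series, i.e., $\sum_{n=0}^{\infty} a_{\sigma(n)} = S$ for any permutation $\sigma : \mathbb{N} \to \mathbb{N}$ of the indices.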
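Statement (3) can be illustrated on the exponential series, where the product formula reproduces $e^{x} \cdot e^{y} = e^{x+y}$. The following is a minimal sketch under that assumption; the helper names `exp_coeffs` and `cauchy_product` are hypothetical and serve only this illustration.

```python
import math


def exp_coeffs(x: float, n_terms: int) -> list[float]:
    """First n_terms summands x^k / k! of the series for e^x."""
    coeffs = []
    term = 1.0
    for k in range(n_terms):
        coeffs.append(term)
        term *= x / (k + 1)
    return coeffs


def cauchy_product(a: list[float], b: list[float]) -> list[float]:
    """Terms c_n = sum_{k=0}^{n} a_{n-k} b_k for n below the common length."""
    n_terms = min(len(a), len(b))
    return [
        sum(a[n - k] * b[k] for k in range(n + 1))
        for n in range(n_terms)
    ]


if __name__ == "__main__":
    a = exp_coeffs(1.5, 40)
    b = exp_coeffs(-0.7, 40)
    c = cauchy_product(a, b)
    # All three numbers should agree up to rounding and truncation error.
    print(sum(c), sum(a) * sum(b), math.exp(1.5 - 0.7))
```

Here each $c_n$ equals $(x + y)^n / n!$ by the binomial theorem, so the product series is again an exponential series.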
Proof. Both the first and the second statements are a straightforward consequence of the corresponding properties of limits. The third statement is not so simple. Write
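$$= \lim_{x\to 0} \frac{\cos x - x \sin x - \cos x}{4x \sin x + 2x^2 \cos x} = \lim_{x\to 0} \frac{-\sin x}{4 \sin x + 2x \cos x} \;\left[\text{type } \tfrac{0}{0}\right] = \lim_{x\to 0} \frac{-\cos x}{4\cos x + 2\cos x - 2x \sin x} = \frac{-1}{4 + 2 - 0} = -\frac{1}{6},$$
hence
$$\lim_{x\to 0+} (\cot x)^{\frac{1}{\ln x}} = e^{-1} = \frac{1}{e}, \qquad \lim_{x\to 0} \left(\frac{\sin x}{x}\right)^{\frac{1}{x^2}} = e^{-\frac{1}{6}} = \frac{1}{\sqrt[6]{e}}.$$
We can proceed similarly when determining the last limit. We have that
$$\lim_{x\to 1-} (\ln x) \cdot \ln\left(\cos\frac{\pi x}{2}\right) = \lim_{x\to 1-} \frac{\ln\left(\cos\frac{\pi x}{2}\right)}{\frac{1}{\ln x}} \;\left[\text{type } \tfrac{-\infty}{-\infty}\right] = \lim_{x\to 1-} \frac{-\frac{\pi}{2} \cdot \frac{\sin\frac{\pi x}{2}}{\cos\frac{\pi x}{2}}}{-\frac{1}{\ln^2 x} \cdot \frac{1}{x}} = \frac{\pi}{2} \lim_{x\to 1-} \frac{x \sin\frac{\pi x}{2} \cdot \ln^2 x}{\cos\frac{\pi x}{2}}.$$
Since this form is of the type 0/0, we could continue by using l'Hospital's rule; instead, we will go from
$$\lim_{x\to 1-} \frac{x \sin\frac{\pi x}{2} \cdot \ln^2 x}{\cos\frac{\pi x}{2}}$$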
$$c_n = \sum_{k=0}^{n} a_{n-k} b_k.$$
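From the assumptions and from the rule for the limit of a product,
$$\left(\sum_{n=0}^{k} a_n\right) \cdot \left(\sum_{n=0}^{k} b_n\right) \to \left(\sum_{n=0}^{\infty} a_n\right) \cdot \left(\sum_{n=0}^{\infty} b_n\right).$$
Thus it suffices to prove that
$$\lim_{k\to\infty} \left(\left(\sum_{n=0}^{k} a_n\right) \cdot \left(\sum_{n=0}^{k} b_n\right) - \sum_{n=0}^{k} c_n\right) = 0.$$
Consider the expressions
$$\left(\sum_{n=0}^{k} a_n\right) \cdot \left(\sum_{n=0}^{k} b_n\right) = \sum_{0 \le i,j \le k} a_i b_j, \qquad c_n = \sum_{\substack{i+j=n \\ 0 \le i,j \le k}} a_i b_j, \qquad \sum_{n=0}^{k} c_n = \sum_{\substack{i+j \le k \\ 0 \le i,j \le k}} a_i b_j,$$
along with the bound
$$\left|\left(\sum_{n=0}^{k} a_n\right) \cdot \left(\sum_{n=0}^{k} b_n\right) - \sum_{n=0}^{k} c_n\right| = \left|\sum_{\substack{i+j > k \\ 0 \le i,j \le k}} a_i b_j\right| \le \sum_{\substack{i+j > k \\ 0 \le i,j \le k}} |a_i b_j|.$$
If the sum of the indices is to be larger than $k$, then at least one of them must be larger than $k/2$. The expression does not decrease if more terms are added into it. Take all terms as in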
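over to the product of limits
$$\lim_{x\to 1-} \left(x \sin\frac{\pi x}{2}\right) \cdot \lim_{x\to 1-} \frac{\ln^2 x}{\cos\frac{\pi x}{2}} = 1 \cdot \lim_{x\to 1-} \frac{\ln^2 x}{\cos\frac{\pi x}{2}}.$$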
the product and remove only those whose indices are both at most k/2.
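Only now do we apply l'Hospital's rule for
$$\lim_{x\to 1-} \frac{\ln^2 x}{\cos\frac{\pi x}{2}} \;\left[\text{type } \tfrac{0}{0}\right] = \lim_{x\to 1-} \frac{2 \ln x \cdot \frac{1}{x}}{\left(-\frac{\pi}{2}\right) \sin\frac{\pi x}{2}} = 0.$$
Altogether, we have
$$\lim_{x\to 1-} \left(\ln x \cdot \ln\left(\cos\frac{\pi x}{2}\right)\right) = \frac{\pi}{2} \cdot 1 \cdot 0 = 0,$$
i.e.
$$\lim_{x\to 1-} \left(\cos\frac{\pi x}{2}\right)^{\ln x} = e^0 = 1.$$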
□
5.G.2. As we have implicitly mentioned, using l'Hospital's rule can lead to a non-existing limit even though the original limit exists: Determine the limit
$$\lim_{x\to\infty} \frac{x + \sin x}{x}.$$
Solution. The limit is of the type $\frac{\infty}{\infty}$. L'Hospital's rule leads us to consider the limit
$$\lim_{x\to\infty} \frac{1 + \cos x}{1},$$
and since the limit $\lim_{x\to\infty} \cos x$ does not exist, neither does the limit $\lim_{x\to\infty} (1 + \cos x)$, so the rule gives no answer. However, the original limit exists because
$$\frac{x - 1}{x} \le \frac{x + \sin x}{x} \le \frac{x + 1}{x},$$
and by the squeeze theorem,
$$1 = \lim_{x\to\infty} \frac{x - 1}{x} \le \lim_{x\to\infty} \frac{x + \sin x}{x} \le \lim_{x\to\infty} \frac{x + 1}{x} = 1.$$
□
5.G.3. Determine
lnz lim -,
x—S-+00 x
lim x In —,
I-S-0+ X
lim x e x ,
X-S-0+
$$\sum_{\substack{i+j > k \\ 0 \le i,j \le k}} |a_i b_j| \le \sum_{0 \le i,j \le k} |a_i b_j| - \sum_{0 \le i,j \le k/2} |a_i b_j|.$$
However, both expressions in the difference are partial sums for the product of the series of absolute values of $S$ and $T$. Therefore, they share the same limit and their difference goes to zero.
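The last claim seems to be a little tricky. Notice that for each small $\varepsilon > 0$ we can find a common bound $\alpha$ such that for all $N \ge \alpha$ both estimates are true:
$$\sum_{n=N}^{\infty} |a_n| < \varepsilon, \qquad \left|\sum_{n=0}^{N} a_n - S\right| < \varepsilon.$$
Now, consider any permutation $\sigma$ of the indices and write $I_\alpha = \{\sigma^{-1}(0), \dots, \sigma^{-1}(\alpha)\}$. Then, for each $N \ge \max I_\alpha$, clearly
$$\left|\sum_{n=0}^{N} a_{\sigma(n)} - S\right| = \left|\sum_{n \in I_\alpha} a_{\sigma(n)} - S + \sum_{\substack{n \notin I_\alpha \\ n \le N}} a_{\sigma(n)}\right| \le \left|\sum_{n=0}^{\alpha} a_n - S\right| + \sum_{\substack{n \notin I_\alpha \\ n \le N}} |a_{\sigma(n)}|.$$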
lim x e
x-s-O-
lim
x-*0 X
100
lim   (In x — x)
x—s+oo
lim
lim
lim
x-s-+oo x + In X ■ COS X    x-t+oo fyx + 3    X-S-+00 yjx2 + 1
Solution. It can easily be shown (for instance, by $n$-fold use of l'Hospital's rule) that for any $n \in \mathbb{N}$, it holds that
$$\lim_{x\to+\infty} \frac{x^n}{e^x} = 0, \quad \text{i.e.} \quad \lim_{x\to+\infty} \frac{e^x}{x^n} = +\infty.$$
The squeeze theorem implies the following generalization for real numbers $a > 0$:
$$\lim_{x\to+\infty} \frac{x^a}{e^x} = 0, \quad \text{i.e.} \quad \lim_{x\to+\infty} \frac{e^x}{x^a} = +\infty.$$
Next, notice that $n \notin I_\alpha$ means $\sigma(n) > \alpha$. Thus, the latter term is at most equal to $\sum_{n=\alpha+1}^{\infty} |a_n|$, and the entire expression is bounded by $2\varepsilon$. This shows that the rearranged series converges to the same value $S$ again. □
5.4.5. Simple tests. The following theorem collects some useful conditions for deciding on the convergence of series.
Taking into account that the graphs of the functions $y = e^x$ and $y = \ln x$ (the inverse function to $y = e^x$) are symmetric with regard to the line $y = x$, we further see that
$$\lim_{x\to+\infty} \frac{\ln x}{x} = 0, \quad \text{i.e.} \quad \lim_{x\to+\infty} \frac{x}{\ln x} = +\infty.$$
Thus we have obtained the first result. That could also be derived from l'Hospital's rule because
Convergence tests
$$\lim_{x\to+\infty} \frac{\ln x}{x} = \lim_{x\to+\infty} \frac{\frac{1}{x}}{1} = \lim_{x\to+\infty} \frac{1}{x} = 0.$$
Let us point out that l'Hospital's rule can be used to calculate all of the following five limits. However, it is possible to determine these limits by much simpler means. For instance, the substitution $y = 1/x$ leads to
$$\lim_{x\to 0+} x \ln\frac{1}{x} = \lim_{y\to+\infty} \frac{\ln y}{y} = 0;$$
$$\lim_{x\to 0+} x\, e^{\frac{1}{x}} = \lim_{y\to+\infty} \frac{e^y}{y} = +\infty.$$
Of course, $x \to 0+$ gives $y = 1/x \to +\infty$ (we write $1/{+0} = +\infty$).
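By the substitutions $u = -1/x$, $v = 1/x^2$ we get that, respectively,
$$\lim_{x\to 0-} x\, e^{\frac{1}{x}} = \lim_{u\to+\infty} \frac{-e^{-u}}{u} = 0, \qquad \lim_{x\to 0} \frac{e^{-\frac{1}{x^2}}}{x^{100}} = \lim_{v\to+\infty} \frac{v^{50}}{e^v} = 0,$$
where $x \to 0-$ corresponds to $u = -1/x \to +\infty$ (we write $-1/{-0} = +\infty$) and $x \to 0$ to $v = 1/x^2 \to +\infty$ (again $1/{+0} = +\infty$). We have also clarified that
$$\lim_{x\to+\infty} (\ln x - x) = \lim_{x\to+\infty} (-x) = -\infty.$$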
Potential doubts can be dispelled by the limit
$$\lim_{x\to+\infty} \frac{\ln x - x}{\ln x} = \lim_{x\to+\infty} \left(1 - \frac{x}{\ln x}\right) = -\infty,$$
which proves that even when decreasing the absolute value of the considered expression (without changing the sign), the absolute value of the expression remains unbounded. We can equally easily determine
lim
=   lim  — = 1;
i-s-+oo a + In a ■ cos a    x-t+oo a sYx~TT &x~
lim
lim
i-s-+oo + 3       x-s-+oo -tyx
X X
lim
x^+00 ^x2 _|_ I x
= lim
+00; = 1.
We have seen that the l'Hospital's rule may not be the best method for calculating limits of types 0/0, 00/00. The three preceding exercises illustrate that it even cannot be applied in all cases (for indeterminate forms). If we had applied it
Theorem. Let $S = \sum_{n=0}^{\infty} a_n$ be an infinite series of real or complex numbers. Let $T = \sum_{n=0}^{\infty} b_n$ be another series with all $b_n \ge 0$ real.
(1) If the series $S$ converges, then $\lim_{n\to\infty} a_n = 0$.
(2) (The comparison test) If $T$ converges and $|a_n| \le b_n$, then $S$ converges absolutely. If $b_n \le |a_n|$ and $T$ diverges, then $S$ does not converge absolutely.
(3) (The limit comparison test) If both $a_n$ and $b_n$ are positive real numbers and the finite limit $\lim_{n\to\infty} \frac{a_n}{b_n} = r > 0$ exists, then $S$ converges if and only if $T$ converges.
(4) (The ratio test) Suppose that the limit of the quotients of adjacent terms of the series exists and
$$\lim_{n\to\infty} \left|\frac{a_{n+1}}{a_n}\right| = q.$$
Then the series $S$ converges absolutely for $q < 1$ and does not converge for $q > 1$. If $q = 1$, the series may or may not converge.
(5) (The root test) If the limit
$$\lim_{n\to\infty} \sqrt[n]{|a_n|} = q$$
exists, then the series converges absolutely for $q < 1$. It does not converge for $q > 1$. If $q = 1$, the series may or may not converge.
Proof. (1) The existence and the potential value of the limit of a sequence of complex numbers $a_n$ is given by the limits of the real parts and the imaginary parts. Thus it suffices to prove the first proposition for sequences of real numbers. If $\lim_{n\to\infty} a_n$ does not exist or is non-zero, then for a sufficiently small number $\varepsilon > 0$, there are infinitely many terms $a_k$ with $|a_k| > \varepsilon$. There are either infinitely many positive terms or infinitely many negative terms among them. But then, adding any one of them into the partial sum, the difference of the adjacent terms $s_n$ and $s_{n+1}$ is at least $\varepsilon$. Thus the sequence of partial sums cannot be a Cauchy sequence and, therefore, it cannot be convergent.
(2) The result is a straightforward consequence of the squeeze theorem, cf. 5.2.12.
(3) Since the limit $r = \lim_{n\to\infty} \frac{a_n}{b_n}$ exists, for any given $\varepsilon > 0$ and sufficiently big $n \ge N_\varepsilon$,
$$(r - \varepsilon) b_n < a_n < (r + \varepsilon) b_n.$$
Thus, after choosing $\varepsilon < r$ it follows that $a_n < (r + \varepsilon) b_n$ and $b_n < \frac{1}{r - \varepsilon} a_n$. The result follows from the previous claim (2).
(4) To prove absolute convergence, it can be assumed that the terms of the series are real numbers $a_n > 0$. Suppose $q < r < 1$ for a real number $r$. From the existence of the limit of the quotients, for every $j$ greater than a sufficiently large $N$,
$$a_{j+1} < r \cdot a_j < r^{\,j-N+1} a_N,$$
to the first problem, we would have obtained, for x > 0, the quotient
1 x
where the last equality follows from the general equality (1 —
$$\frac{1}{1 + \frac{\cos x}{x} - \ln x \cdot \sin x} = \frac{x}{x + \cos x - x \ln x \cdot \sin x},$$
which is more complicated than the original one. Its limit for $x \to +\infty$ does not even exist, so one of the prerequisites of l'Hospital's rule is not satisfied. In the second case, any number of repeated uses of l'Hospital's rule leads to indeterminate forms. For the last problem, l'Hospital's rule sends us back to the original limit: first it gives the fraction
$$\frac{1}{\frac{x}{\sqrt{x^2 + 1}}} = \frac{\sqrt{x^2 + 1}}{x}$$
and then
$$\frac{\frac{x}{\sqrt{x^2 + 1}}}{1} = \frac{x}{\sqrt{x^2 + 1}}.$$
From here, we can deduce that the limit equals 1 (we are looking for a non-negative real number $a \in \mathbb{R}$ such that $a = a^{-1}$) only if we have already shown that it exists at all. □ Other examples concerning the calculation of limits by l'Hospital's rule can be found at page 355.
H. Infinite series
Infinite series naturally appear in a series (of problems).
5.H.1. Sierpinski carpet. The unit square is divided into nine equal squares and the middle one is removed. Each of the eight remaining squares is again divided into nine equal subsquares and the middle subsquare (of each of the eight squares) is removed again. Having applied this procedure ad infinitum, determine the area of the resulting figure.
Solution. In the first step, a square having the area of $1/9$ is removed. In the second step, eight squares (each having the area of $9^{-2}$, i.e. totaling $8 \cdot 9^{-2}$) are removed. Every further iteration removes eight times more squares than the previous step, but the squares are nine times smaller. The sum of the areas of all the removed squares is
$$\frac{1}{9} + \frac{8}{9^2} + \frac{8^2}{9^3} + \dots = \sum_{n=0}^{\infty} \frac{8^n}{9^{n+1}}.$$
The area of the remaining figure (known as Sierpinski carpet) thus equals
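$$1 - \sum_{n=0}^{\infty} \frac{8^n}{9^{n+1}} = 1 - \frac{1}{9} \sum_{n=0}^{\infty} \left(\frac{8}{9}\right)^n = 1 - \frac{1}{9} \cdot \frac{1}{1 - \frac{8}{9}} = 0.$$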
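The removed area can be followed step by step with exact rational arithmetic. This is a minimal sketch; the function name `removed_area` is an assumption made for the illustration.

```python
from fractions import Fraction


def removed_area(steps: int) -> Fraction:
    """Total area removed from the unit square after the given number of steps.

    Step n + 1 removes 8^n squares, each of area 1 / 9^(n+1).
    """
    return sum(
        (Fraction(8**n, 9 ** (n + 1)) for n in range(steps)),
        Fraction(0),
    )


if __name__ == "__main__":
    for steps in (1, 2, 5, 20, 100):
        remaining = 1 - removed_area(steps)
        print(steps, float(remaining))
```

After $k$ steps the remaining area is exactly $(8/9)^k$, which tends to zero, in agreement with the computation above.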
r)(1 + r + r^2 + ... + r^k) = 1 - r^{k+1}. But this means that the partial sums $s_n$ are, for large $n > N$, bounded from above by the sums
$$s_n < \sum_{j=0}^{N} a_j + a_N \sum_{j=0}^{n-N} r^j = \sum_{j=0}^{N} a_j + a_N \frac{1 - r^{n-N+1}}{1 - r}.$$
□
Since $0 < r < 1$, the partial sums form an increasing sequence bounded from above, and thus its limit equals its supremum.
In the case $q > r > 1$, a similar technique can be used. However, this time, from the existence of the limit of the quotients,
$$a_{j+1} > r \cdot a_j > r^{\,j-N+1} a_N > 0.$$
This implies that the absolute values of the particular terms of the series do not converge to zero, and thus the series cannot be convergent, by the already proved part (1) of the theorem.
(5) The proof is similar to the previous case. From the existence of the limit $q < 1$, it follows that for any $r$, $q < r < 1$, there is an $N$ such that for all $n > N$, $\sqrt[n]{|a_n|} < r$ holds. Exponentiation then gives $|a_n| < r^n$, so there is a comparison with a geometric series. Thus the proof can be finished in the same way as in the case of the ratio test. □
In the proofs of the last two statements of the theorem, a much weaker assumption is used than the existence of the limit. It is only necessary to know that the examined sequences of non-negative terms are, from a given index on, either all larger or all less than a given number.
For this purpose however, it suffices to consider, for a given sequence of terms $b_n$, the supremum of the terms with index higher than $n$. These suprema always exist and create a non-increasing sequence. Its infimum is then called the upper limit of the sequence and denoted by
$$\limsup_{n\to\infty} b_n.$$
The advantage is that the upper limit always exists. Therefore, we can reformulate the previous result (without having to change the proof) in a stronger form:
Corollary. Let $S = \sum_{n=0}^{\infty} a_n$ be an infinite series of real or complex numbers.
(1) If
$$q = \limsup_{n\to\infty} \left|\frac{a_{n+1}}{a_n}\right|,$$
then the series $S$ converges absolutely for $q < 1$. It does not converge if the quotients $|a_{n+1}/a_n|$ are at least 1 from some index on (the upper limit alone being larger than 1 does not suffice for this conclusion). For $q = 1$, it may or may not converge.
(2) If
$$q = \limsup_{n\to\infty} \sqrt[n]{|a_n|},$$
the series converges absolutely for $q < 1$ while it does not converge for $q > 1$. For $q = 1$, it may or may not converge.
5.H.2. Koch snowflake, 1904. Create a "snowflake" by the following procedure: At the beginning, consider an equilateral triangle with sides of length 1. With each of its three sides, do the following: Cut it into three equally long parts, build another equilateral triangle above (i. e. pointing out from, not into, the original triangle) the middle part and remove the middle part. This transforms the original equilateral triangle into a six-pointed star. Once again, repeat this step ad infinitum, thus obtaining the desired snowflake. Prove that the created figure has infinite perimeter. Then determine its area.
Solution. The perimeter of the original triangle is equal to 3. In each step, the perimeter increases by one third since three parts of every line segment are replaced with four equally long ones. Hence the perimeter $d_n$ after $n$ steps and the snowflake's perimeter satisfy
$$d_n = 3\left(\frac{4}{3}\right)^n \quad \text{and} \quad \lim_{n\to\infty} d_n = +\infty.$$
The figure's area is apparently increasing during the construction. To determine it, it thus suffices to catch the rise between two consecutive steps. The number of the figure's sides is four times higher every step (the line segments are divided into thirds and one of them is doubled) and the new sides are three times shorter. The figure's area thus grows exactly by the equilateral triangles glued to each side (so there is the same number of them as of the sides). In the first iteration (when creating the six-pointed star from the original triangle), the area grows by the three equilateral triangles with sides of length 1/3 (one third of the original sides' length). Let us denote the area of the original equilateral triangle by $S_0$. If we realize that shortening an equilateral triangle's sides three times makes its area decrease nine times, we get
$$S_0 + 3 \cdot \frac{S_0}{9}$$
for the area of the six-pointed star. Similarly, in the next step we obtain the area of the figure as
$$S_0 + 3 \cdot \frac{S_0}{9} + 4 \cdot 3 \cdot \frac{S_0}{9^2}.$$
Now it is easy to deduce that the area of the resulting snowflake equals the limit
$$\lim_{n\to\infty} \left(S_0 + 3 \cdot \frac{S_0}{9} + 4 \cdot 3 \cdot \frac{S_0}{9^2} + \dots + 4^n \cdot 3 \cdot \frac{S_0}{9^{n+1}}\right) = S_0 \lim_{n\to\infty} \left(1 + \frac{1}{3} + \frac{1}{3} \cdot \frac{4}{9} + \dots + \frac{1}{3}\left(\frac{4}{9}\right)^n\right)$$
5.4.6. Alternating series. The condition $a_n \to 0$ is a necessary but not sufficient condition for the convergence of the series $\sum_{n=0}^{\infty} a_n$. However, there is the Leibniz criterion of convergence.
Leibniz criterion for alternating series
The series $\sum_{n=0}^{\infty} (-1)^n a_n$, where $a_n$ is a non-increasing sequence of non-negative real numbers, is called an alternating series.
Theorem. An alternating series converges if and only if $\lim_{n\to\infty} a_n = 0$. Its value $a = \sum_{n=0}^{\infty} (-1)^n a_n$ differs from the partial sum $s_{2k}$ by at most $a_{2k+1}$.
Proof. By the definition, the partial sums $s_k$ of an alternating series satisfy
$$s_{2(k+1)+1} = s_{2k+1} + a_{2k+2} - a_{2k+3} \ge s_{2k+1},$$
$$s_{2(k+1)} = s_{2k} - a_{2k+1} + a_{2k+2} \le s_{2k},$$
$$s_{2k+1} - s_{2k} = -a_{2k+1},$$
$$s_0 \ge s_{2k} \ge s_{2k+1} \ge s_1.$$
Thus, the odd partial sums are a non-decreasing sequence, while the even ones are non-increasing. The last line reveals that both sequences are bounded, so the odd partial sums converge to their supremum, while the even ones converge to their infimum. The previous line says that these two limits coincide if and only if $\lim_{n\to\infty} a_n = 0$, which proves the first claim.
At the same time, the limit value $a$ of the series is always at least $s_{2k+1}$ and at most $s_{2k}$. Thus $a$ cannot differ from $s_{2k}$ by more than $a_{2k+1}$. □
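The error bound of the theorem can be observed on the alternating harmonic series $\sum_{n=0}^{\infty} (-1)^n \frac{1}{n+1}$, whose value is known to be $\ln 2$. The following is a minimal sketch; the choice of this particular series and the function name `partial_sum` are assumptions made for the illustration.

```python
import math


def partial_sum(n: int) -> float:
    """s_n = sum_{j=0}^{n} (-1)^j a_j with a_j = 1 / (j + 1)."""
    return sum((-1) ** j / (j + 1) for j in range(n + 1))


if __name__ == "__main__":
    limit = math.log(2.0)
    for k in (1, 10, 100, 1000):
        even, odd = partial_sum(2 * k), partial_sum(2 * k + 1)
        bound = 1.0 / (2 * k + 2)          # a_{2k+1}
        # odd <= limit <= even, and the distance from s_{2k} is below a_{2k+1}
        print(k, odd, limit, even, abs(limit - even) <= bound)
```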
Remark. As is obvious from the latter theorem, convergent alternating series often do not converge absolutely. Such series are called conditionally convergent. In contrast to the independence of the sum of an absolutely convergent series on the order of summation (cf. (4) of Theorem 5.4.4), the famous Riemann series theorem says that a conditionally convergent series can be brought to any finite or infinite value by an appropriate rearrangement of the terms in the sum. We shall not go into the proof here.
5.4.7. Convergence rate. The proofs of the tests derived in the previous two paragraphs also allow for straightforward estimates of the speed of convergence. Indeed, both the tests for absolute convergence are based on the comparison with the geometric series, either for $q = \limsup_{n\to\infty} \left|\frac{a_{n+1}}{a_n}\right|$ or $q = \limsup_{n\to\infty} \sqrt[n]{|a_n|}$, and $0 \le q < 1$. The error of approximation of the limit $s$ by the $n$-th partial sum $s_n$ can be estimated as
$$|s - s_n| \le \sum_{j=n+1}^{\infty} |a_j| \le |a_N|\, r^{\,n-N+1} \sum_{j=0}^{\infty} r^j = \frac{|a_N|\, r^{\,n-N+1}}{1 - r} = C r^n,$$
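$$= S_0 \left[1 + \frac{1}{3} \lim_{n\to\infty} \left(1 + \frac{4}{9} + \dots + \left(\frac{4}{9}\right)^n\right)\right]$$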
where $N$ and $q < r < 1$ are the two related choices from the proof of the test and $C$ is the resulting constant independent of $n$. Thus the convergence rate is quite fast, in particular if
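$$= S_0 \left(1 + \frac{1}{3} \sum_{k=0}^{\infty} \left(\frac{4}{9}\right)^k\right) = S_0 \left(1 + \frac{1}{3} \cdot \frac{1}{1 - \frac{4}{9}}\right) = \frac{8}{5} S_0.$$
The snowflake's area is thus equal to 8/5 of the area of the original triangle, i.e.
$$\frac{8}{5} S_0 = \frac{8}{5} \cdot \frac{\sqrt{3}}{4} = \frac{2\sqrt{3}}{5}.$$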
Let us notice that this snowflake is an example of an infinitely long curve which encloses a finite area. □
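Both observations, the unbounded perimeter and the finite area, can be reproduced with exact fractions. The sketch below is illustrative; the function names `perimeter` and `area_ratio` are assumptions, and the area is expressed as a multiple of $S_0$.

```python
from fractions import Fraction


def perimeter(n: int) -> Fraction:
    """Perimeter d_n = 3 * (4/3)^n after n construction steps."""
    return 3 * Fraction(4, 3) ** n


def area_ratio(n: int) -> Fraction:
    """Area after n steps divided by the area S_0 of the initial triangle."""
    ratio = Fraction(1)
    for k in range(n):
        # Step k + 1 glues 3 * 4^k triangles, each of area S_0 / 9^(k+1).
        ratio += Fraction(3 * 4**k, 9 ** (k + 1))
    return ratio


if __name__ == "__main__":
    for n in (0, 1, 2, 5, 10, 50):
        print(n, float(perimeter(n)), float(area_ratio(n)))
```

The printed perimeters grow without bound while the area ratios stay below $8/5$ and approach it.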
5.H.3. Show that the so-called harmonic series
$$\sum_{n=1}^{\infty} \frac{1}{n}$$
diverges.
Solution. For any natural number $k$, the sum of the first $2^k$ terms of this series is greater than $k/2$:
$$1 + \frac12 + \left(\frac13 + \frac14\right) + \left(\frac15 + \frac16 + \frac17 + \frac18\right) + \cdots,$$
as the sum of the terms from $\frac{1}{2^l+1}$ to $\frac{1}{2^{l+1}}$ is always at least $2^l$-times (their number) $\frac{1}{2^{l+1}}$ (the least one of them), which sums to 1/2. □
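The grouping argument can be checked on a computer. The following is only a small sketch with exact rational arithmetic; the function names are our own and are not part of the text.

```python
from fractions import Fraction


def harmonic_partial_sum(m):
    """Exact value of 1 + 1/2 + ... + 1/m."""
    return sum(Fraction(1, n) for n in range(1, m + 1))


def check_bound(k_max=10):
    """For k = 1..k_max, compare the sum of the first 2**k terms with k/2."""
    rows = []
    for k in range(1, k_max + 1):
        s = harmonic_partial_sum(2 ** k)
        rows.append((k, float(s), s > Fraction(k, 2)))
    return rows


if __name__ == "__main__":
    for k, value, exceeds in check_bound():
        print(k, round(value, 4), exceeds)
```

The partial sums grow without bound, but only logarithmically, which is why the divergence is invisible in a naive numerical experiment.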
5.H.4. Determine whether the following series converge, or diverge:
i) $\sum_{n=1}^{\infty} \frac{2^n}{n}$;
ii) $\sum_{n=1}^{\infty} \frac{1}{\sqrt{n}}$;
iii) $\sum_{n=1}^{\infty} \frac{1}{100000\,n}$;
iv) $\sum_{n=1}^{\infty} \frac{1}{(1+i)^n}$.
Solution.
i) We will examine the convergence by the ratio test:
$$\lim_{n\to\infty} \frac{a_{n+1}}{a_n} = \lim_{n\to\infty} \frac{2^{n+1}/(n+1)}{2^n/n} = \lim_{n\to\infty} \frac{2n}{n+1} = 2 > 1,$$
so the series diverges.
ii) We will bound the series from below: we know that $\frac{1}{n} \le \frac{1}{\sqrt{n}}$ for any natural number $n$. Thus the sequence of the partial sums $s_n$ of the examined series and the sequence of the partial sums $s'_n$ of the harmonic series satisfy:
$$s_n = \sum_{i=1}^{n} \frac{1}{\sqrt{i}} \ge \sum_{i=1}^{n} \frac{1}{i} = s'_n.$$
Since the harmonic series diverges (see the previous exercise), by definition, the sequence of its partial sums $\{s'_n\}_{n=1}^{\infty}$ diverges as well. Therefore the sequence of its
$r$ is much smaller than 1 (and we can choose $r$ as close to $q$ as necessary).
On the other hand, the proof of the alternating series test shows that the convergence rate is at least as fast as the convergence of the terms an.
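As an illustration of the last remark, consider the alternating harmonic series $1 - \frac12 + \frac13 - \dots$. We assume here the standard fact (not derived in this paragraph) that its sum is $\ln 2$. The sketch below, with function names of our own choosing, compares the actual error of the $n$-th partial sum with the size of the first omitted term.

```python
import math


def alternating_harmonic_partial_sum(n):
    """s_n = 1 - 1/2 + 1/3 - ... with n terms."""
    return sum((-1) ** (k + 1) / k for k in range(1, n + 1))


def error_and_bound(n):
    """Return (|ln 2 - s_n|, 1/(n+1)): actual error and first omitted term."""
    error = abs(math.log(2) - alternating_harmonic_partial_sum(n))
    return error, 1 / (n + 1)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, error_and_bound(n))
```

The error shrinks only about as fast as the terms themselves, in contrast with the geometric rate obtained from the ratio and root tests.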
5.4.8. Power series. If we consider not a sequence of numbers $a_n$, but rather a sequence of functions $f_n(x)$ sharing the same domain $A$, we can use the definition of addition of series "point-wise", thereby obtaining the concept of the series of functions
$$S(x) = \sum_{n=0}^{\infty} f_n(x).$$
Power series
A power series is a series of functions given by the expression
$$S(x) = \sum_{n=0}^{\infty} a_n x^n$$
with coefficients $a_n \in \mathbb{C}$, $n = 0, 1, \dots$.
$S(x)$ has the radius of convergence $\rho \ge 0$ if and only if $S(x)$ converges for every $x$ satisfying $|x| < \rho$ and does not converge for $|x| > \rho$.
5.4.9. Properties of power series. Although a significant part of the proof of the following theorem will have to be postponed until the end of the following chapter, the basic properties of power series can be formulated now. Notice that the upper limit $r = \limsup_{n\to\infty} \sqrt[n]{|a_n|}$ equals the limit $\lim_{n\to\infty} \sqrt[n]{|a_n|}$, whenever this limit exists.
Convergence and differentiation
Theorem. Let $S(x) = \sum_{n=0}^{\infty} a_n x^n$ be a power series and let
$$r = \limsup_{n\to\infty} \sqrt[n]{|a_n|}.$$
Then the radius of convergence of the series $S$ is $\rho = r^{-1}$.
The power series $S(x)$ converges absolutely on the whole interval of convergence and is continuous on it (including the boundary points, supposing it is convergent there). Moreover, the derivative exists on this interval, and
$$S'(x) = \sum_{n=1}^{\infty} n\, a_n x^{n-1}.$$
Proof. To verify the absolute convergence of the series, use the root test from Theorem 5.4.5(3), for every value of $x$. Calculate (if the limit exists)
$$\lim_{n\to\infty} \sqrt[n]{|a_n x^n|} = r|x|.$$
The series converges absolutely if this limit is less than 1, and it does not converge if the limit is greater than 1. It follows that it converges for
partial sums $\{s_n\}_{n=1}^{\infty}$ also diverges, and so does the examined series.
iii) This series is divergent since it is a multiple of the harmonic series.
iv) The examined series is geometric, with common ratio $\frac{1}{1+i}$. Such a series is convergent if and only if the absolute value of the common ratio is less than one. We know that
$$\left|\frac{1}{1+i}\right| = \left|\frac12 - \frac12 i\right| = \sqrt{\frac14 + \frac14} = \frac{\sqrt2}{2} < 1,$$
hence the series converges, and we are even able to calculate its sum:
$$\sum_{n=1}^{\infty} \frac{1}{(1+i)^n} = \frac{1}{1+i}\cdot\frac{1}{1-\frac{1}{1+i}} = \frac{1}{i} = -i.$$
□
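5.H.5. Calculate the series
(a) $\sum_{n=1}^{\infty} \left(\frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n+1}}\right)$;
(b) $\sum_{n=0}^{\infty} \frac{5}{3^n}$;
(c) $\sum_{n=1}^{\infty} \left(\frac{3}{4^{2n-1}} + \frac{2}{4^{2n}}\right)$;
(d) $\sum_{n=1}^{\infty} \frac{n}{3^n}$;
(e) $\sum_{n=0}^{\infty} \frac{1}{(3n+1)(3n+4)}$.
Solution. The case (a). From the definition, the series is equal to
$$\lim_{n\to\infty} \left(\left(\frac{1}{\sqrt1} - \frac{1}{\sqrt2}\right) + \left(\frac{1}{\sqrt2} - \frac{1}{\sqrt3}\right) + \dots + \left(\frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n+1}}\right)\right) = \lim_{n\to\infty}\left(1 - \frac{1}{\sqrt{n+1}}\right) = 1.$$
The case (b). Apparently, this series is a quintuple of the standard geometric series with the common ratio $q = 1/3$, hence
$$\sum_{n=0}^{\infty} \frac{5}{3^n} = 5\sum_{n=0}^{\infty} \left(\frac13\right)^n = 5\cdot\frac{1}{1-\frac13} = \frac{15}{2}.$$
The case (c). We have that (with the substitution $m = n-1$)
$$\sum_{n=1}^{\infty} \left(\frac{3}{4^{2n-1}} + \frac{2}{4^{2n}}\right) = \frac34 \sum_{n=1}^{\infty} \frac{1}{4^{2n-2}} + \frac{2}{16} \sum_{n=1}^{\infty} \frac{1}{4^{2n-2}} = \left(\frac34 + \frac{2}{16}\right) \sum_{m=0}^{\infty} \frac{1}{4^{2m}} = \frac{14}{16} \sum_{m=0}^{\infty} \left(\frac{1}{16}\right)^m = \frac{14}{16}\cdot\frac{1}{1-\frac{1}{16}} = \frac{14}{15}.$$
The series of linear combinations was expressed as a linear combination of series (to be more precise, as a sum of series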
$|x| < \rho$ and diverges for $|x| > \rho$. If the limit does not exist, use the upper limit in the same way.
The statements about continuity and the derivatives are proved later in a more general context, see 6.3.7-6.3.9. □
5.4.10. Remarks. If the coefficients of the series increase rapidly enough (for example $a_n = n^n$), then $r = \infty$. Then the radius of convergence is zero, and the series converges only at $x = 0$.
Here are some examples of convergence of power series (including the boundary points of the corresponding interval): Consider
$$S(x) = \sum_{n=0}^{\infty} x^n, \qquad T(x) = \sum_{n=1}^{\infty} \frac{1}{n} x^n.$$
The former example is the geometric series, which has already been discussed. Its sum is, for every $x$ with $|x| < 1$,
$$S(x) = \frac{1}{1-x},$$
while $|x| > 1$ guarantees that the series diverges. For $x = 1$, we obtain the series $1 + 1 + 1 + \dots$, which is divergent. For $x = -1$, the series is $1 - 1 + 1 - \dots$, whose partial sums do not have a limit. The series oscillates.
Theorem 5.4.5(2) shows that the radius of convergence of the series $T(x)$ is 1 because
$$\lim_{n\to\infty} \frac{\frac{1}{n+1}}{\frac{1}{n}} = \lim_{n\to\infty} \frac{n}{n+1} = 1.$$
For $x = 1$, the series $1 + \frac12 + \frac13 + \dots$ is divergent: By summing up the $2^{k-1}$ adjacent terms $1/2^{k-1}, \dots, 1/(2^k - 1)$ and replacing each of them by $2^{-k}$ (thus they total up to 1/2), the partial sums are bounded from below by the sum of these 1/2's. Since the bound from below diverges to infinity, so does the original series.
On the other hand, the series
$$T(-1) = -1 + \frac12 - \frac13 + \dots$$
converges, although it cannot converge absolutely. The convergence follows since we deal here with an alternating series.
Notice that the convergence of a power series is relatively fast near x = 0. It is slower near the boundary of the convergence interval.
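The difference in speed is easy to observe for the geometric series $S(x)$ from the examples above. The following sketch (the helper name and the tolerance are our own choices, not taken from the text) counts how many terms are needed before the partial sum is within a given tolerance of $1/(1-x)$.

```python
def terms_needed(x, tol=1e-10, max_terms=100000):
    """Number of terms of 1 + x + x^2 + ... needed to get within tol of 1/(1-x).

    Returns None if max_terms is not enough. Intended for 0 <= x < 1.
    """
    target = 1 / (1 - x)
    partial, power = 0.0, 1.0
    for n in range(1, max_terms + 1):
        partial += power      # add the term x^(n-1)
        power *= x
        if abs(target - partial) < tol:
            return n
    return None


if __name__ == "__main__":
    for x in (0.1, 0.5, 0.9):
        print(x, terms_needed(x))
```

The count grows sharply as $x$ approaches the boundary point 1 of the interval of convergence.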
5.4.11. Trigonometric functions. Another important observation is that a power series is a series of numbers for each fixed $x$ and the individual terms make sense for complex numbers $x \in \mathbb{C}$. Thus the domain of convergence of a power series is always a disc in the complex plane $\mathbb{C}$ centered at the origin.
More generally, we can write power series centered at an arbitrary (complex) point $x_0$,
$$S(x) = \sum_{n=0}^{\infty} a_n (x - x_0)^n,$$
which converge absolutely again on the disc of radius $\rho$, $\rho^{-1} = \limsup_{n\to\infty} \sqrt[n]{|a_n|}$, but this time centered at $x_0$.
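with factoring out the constants), which is a valid modification supposing the obtained series are absolutely convergent.
The case (d). From the partial sum
$$s_n = \frac13 + \frac{2}{3^2} + \frac{3}{3^3} + \dots + \frac{n}{3^n}, \quad n \in \mathbb{N},$$
we immediately get that
$$\frac{s_n}{3} = \frac{1}{3^2} + \frac{2}{3^3} + \dots + \frac{n-1}{3^n} + \frac{n}{3^{n+1}}, \quad n \in \mathbb{N}.$$
Therefore,
$$s_n - \frac{s_n}{3} = \frac13 + \frac{1}{3^2} + \frac{1}{3^3} + \dots + \frac{1}{3^n} - \frac{n}{3^{n+1}}, \quad n \in \mathbb{N}.$$
Since $\lim_{n\to\infty} \frac{n}{3^{n+1}} = 0$, we get that
$$\sum_{n=1}^{\infty} \frac{n}{3^n} = \lim_{n\to\infty} \frac32\left(s_n - \frac{s_n}{3}\right) = \frac32 \sum_{k=1}^{\infty} \frac{1}{3^k} = \frac32\cdot\frac{\frac13}{1-\frac13} = \frac34.$$
The case (e). It suffices to use the form (this is the so-called partial fraction decomposition)
$$\frac{1}{(3n+1)(3n+4)} = \frac13\cdot\frac{1}{3n+1} - \frac13\cdot\frac{1}{3n+4}, \quad n \in \mathbb{N} \cup \{0\},$$
which gives
$$\sum_{n=0}^{\infty} \frac{1}{(3n+1)(3n+4)} = \lim_{n\to\infty} \frac13\left(1 - \frac14 + \frac14 - \frac17 + \frac17 - \frac{1}{10} + \dots + \frac{1}{3n+1} - \frac{1}{3n+4}\right) = \lim_{n\to\infty} \frac13\left(1 - \frac{1}{3n+4}\right) = \frac13.$$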
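The five sums of 5.H.5 can be cross-checked numerically. The sketch below is ours (the dictionary layout and the function name are illustrative only); it compares partial sums with the values 1, 15/2, 14/15, 3/4 and 1/3 obtained above.

```python
import math

# (general term, first index, value derived in the solution) for (a)-(e)
SERIES = {
    "a": (lambda n: 1 / math.sqrt(n) - 1 / math.sqrt(n + 1), 1, 1.0),
    "b": (lambda n: 5 / 3 ** n, 0, 15 / 2),
    "c": (lambda n: 3 / 4 ** (2 * n - 1) + 2 / 4 ** (2 * n), 1, 14 / 15),
    "d": (lambda n: n / 3 ** n, 1, 3 / 4),
    "e": (lambda n: 1 / ((3 * n + 1) * (3 * n + 4)), 0, 1 / 3),
}


def partial_sum(name, count):
    """Sum of the first `count` terms of the series labelled `name`."""
    term, start, _ = SERIES[name]
    return sum(term(n) for n in range(start, start + count))


if __name__ == "__main__":
    for name, (_, _, value) in SERIES.items():
        print(name, partial_sum(name, 10000), value)
```

The geometric-type series (b), (c), (d) settle after a few dozen terms, while the telescoping series (a) and (e) approach their sums much more slowly.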
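□
5.H.6. Verify that
$$\sum_{n=1}^{\infty} \frac{1}{n^2} < \sum_{n=0}^{\infty} \frac{1}{2^n}.$$
Solution. We can immediately see that
$$1 \le 1, \qquad \frac{1}{2^2} + \frac{1}{3^2} < 2\cdot\frac{1}{2^2} = \frac12, \qquad \frac{1}{4^2} + \frac{1}{5^2} + \frac{1}{6^2} + \frac{1}{7^2} < 4\cdot\frac{1}{4^2} = \frac14,$$
or, in general:
$$\frac{1}{(2^n)^2} + \dots + \frac{1}{(2^{n+1}-1)^2} < 2^n\cdot\frac{1}{(2^n)^2} = \frac{1}{2^n}.$$
Hence (by comparing the terms of both of the series) we get the wanted inequality, from which, by the way, it follows that the series $\sum_{n=1}^{\infty} \frac{1}{n^2}$ converges absolutely.
Eventually, let us specify that
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < 2 = \sum_{n=0}^{\infty} \frac{1}{2^n}.$$
□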
5.H.7. Examine convergence of the series
$$\sum_{n=1}^{\infty} \ln\frac{n+1}{n}.$$
Earlier we proved explicitly (by a simple application of the ratio test) that the exponential function series converges everywhere. Thus this defines a function for all complex numbers x.
Its values are the limits of values of (complex) polynomials with real coefficients and each polynomial is completely determined by finitely many of its values. In particular, the values of each series are completely determined on the complex domain by their values at the real input values $x$. Therefore, the complex exponential must also satisfy the usual formulae which we have already derived for the real values $x$. In particular, for all $x, y \in \mathbb{C}$,
$$e^{x+y} = e^x \cdot e^y,$$
see (2) and Theorem 5.4.4(3).
Substitute the values $x = it$, where $i \in \mathbb{C}$ is the imaginary unit and $t \in \mathbb{R}$ is arbitrary:
$$e^{it} = 1 + it - \frac{1}{2}t^2 - i\frac{1}{3!}t^3 + \frac{1}{4!}t^4 + i\frac{1}{5!}t^5 - \dots$$
The conjugate number to $z = e^{it}$ is the number $\bar z = e^{-it}$. Hence
$$|z|^2 = z\cdot\bar z = e^{it}\cdot e^{-it} = e^0 = 1.$$
All the values $z = e^{it}$ lie on the unit circle centered at the origin, in the complex plane.
The real and imaginary parts of the points lying on the unit circle are named as the trigonometric functions $\cos\theta$ and $\sin\theta$, where $\theta$ is the corresponding angle.
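The claim that $e^{it}$ lies on the unit circle can be observed directly from partial sums of the exponential series. This is merely an illustrative sketch; the function name and the number of terms are our own choices.

```python
def exp_series(z, terms=40):
    """Partial sum of the exponential series sum z^n / n! for complex z."""
    total, term = 0j, 1 + 0j
    for n in range(terms):
        total += term
        term *= z / (n + 1)   # turns z^n/n! into z^(n+1)/(n+1)!
    return total


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0, 3.0):
        z = exp_series(1j * t)
        print(t, z, abs(z))
```

For real $t$ the computed modulus stays at 1 up to rounding, and replacing $t$ by $-t$ yields the complex conjugate, in agreement with the formulae above.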
Differentiating the parametric description of the points of the circle $t \mapsto e^{it}$ gives the vectors of "velocities", which are easily computed. Differentiating the real and imaginary parts separately (assuming that the real power series can be differentiated term by term) gives:
$$(e^{it})' = \left(1 - \frac12 t^2 + \frac{1}{4!}t^4 - \dots\right)' + i\left(t - \frac{1}{3!}t^3 + \frac{1}{5!}t^5 - \dots\right)' = -\left(t - \frac{1}{3!}t^3 + \frac{1}{5!}t^5 - \dots\right) + i\left(1 - \frac{1}{2!}t^2 + \frac{1}{4!}t^4 - \dots\right),$$
which means $(e^{it})' = i\cdot e^{it}$. So the velocity vectors all have unit length. Hence the entire circle is parametrized if $t$ is moved through the interval $[0, 2\pi]$, where $2\pi$ stands for the length of the circle (a thorough definition of the length of a curve needs integral calculus, which we will develop in the next chapter). In particular, this procedure of parameterizing the circle can be used to define the number $\pi$, also called
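Solution. Let us try to add up the terms of this series. We have that
$$\sum_{n=1}^{\infty} \ln\frac{n+1}{n} = \lim_{n\to\infty}\left(\ln\frac21 + \ln\frac32 + \ln\frac43 + \dots + \ln\frac{n+1}{n}\right) = \lim_{n\to\infty} \ln\frac{2\cdot3\cdot4\cdots(n+1)}{1\cdot2\cdot3\cdots n} = \lim_{n\to\infty} \ln(n+1) = +\infty.$$
Thus the series diverges to $+\infty$.
□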
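5.H.8. Prove that the series
$$\sum_{n=0}^{\infty} \operatorname{arctg}\frac{n^2+2n+3\sqrt{n}+4}{n+1}, \qquad \sum_{n=1}^{\infty} \frac{3^n+1}{n^3+n^2-n}$$
do not converge.
Solution. Since
$$\lim_{n\to\infty} \operatorname{arctg}\frac{n^2+2n+3\sqrt{n}+4}{n+1} = \lim_{y\to+\infty} \operatorname{arctg} y = \frac{\pi}{2}$$
and
$$\lim_{n\to\infty} \frac{3^n+1}{n^3+n^2-n} = \lim_{n\to\infty} \frac{3^n}{n^3} = +\infty,$$
the necessary condition $\lim_{n\to\infty} a_n = 0$ for the series $\sum_{n=n_0}^{\infty} a_n$ to converge does not hold in either case. □
5.H.9. What is the sum of the series
$$\sum_{n=2}^{\infty} \frac{1}{\sqrt[n]{\ln n}}\,?$$
Solution. From the inequalities (consider the graph of the natural logarithm)
$$1 < \ln n < n, \quad n \ge 3,\ n \in \mathbb{N},$$
it follows that
$$\sqrt[n]{1} < \sqrt[n]{\ln n} < \sqrt[n]{n}, \quad n \ge 3,\ n \in \mathbb{N}.$$
By the squeeze theorem,
$$\lim_{n\to\infty} \sqrt[n]{\ln n} = 1, \quad\text{i.e.}\quad \lim_{n\to\infty} \frac{1}{\sqrt[n]{\ln n}} = 1.$$
Thus the series does not converge. As its terms are non-negative, it must diverge to $+\infty$. □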
5.H.10.   Determine whether the series
oo
(a) E (n+l)-3" >
77 = 0
oo „
(b) E
n=l
oo
rc) _-_
V   ' 77 —In 77
77=1
converge.
Solution. All three listed series consist of non-negative terms only, so each sum is either finite (i.e. the series converges), or the series diverges to $+\infty$. We have that
oo oo
(a) E < E (if = T^x < +°°;
n=0 n=0 3
OO        2 OO        2 OO
(b) E ^ > E     = E £ = +oo;
(c) E —k— > E - = +
V   '     t-^   77 —In 77 - t-^ 77
77 = 1
OO.
Archimedes' constant or the Ludolphian number¹⁵, as half the length of the unit circle in the Euclidean plane $\mathbb{R}^2$. It can be found by computing the first positive zero point of the imaginary part of $e^{it}$.
For example, use the 10th order approximation $\cos t \approx 1 - \frac12 t^2 + \frac{1}{24}t^4 - \frac{1}{720}t^6 + \frac{1}{40320}t^8 - \frac{1}{3628800}t^{10}$. Ask Maple to find the first positive root, which approximates $\pi/2$. Doubling it gives $\pi \approx 3.14159172323226$, for which the first 5 decimal places are correct. Compare the result 3.184900868 from the approximation of order 4.
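The same experiment can be repeated without Maple. The sketch below uses plain bisection on the truncated cosine series; the function names, the bracketing interval $[1, 2]$ and the number of bisection steps are our assumptions, not part of the text.

```python
import math


def cos_poly(t, order):
    """Truncated cosine series 1 - t^2/2! + t^4/4! - ... up to t**order."""
    return sum((-1) ** k * t ** (2 * k) / math.factorial(2 * k)
               for k in range(order // 2 + 1))


def first_root(order, lo=1.0, hi=2.0, steps=60):
    """Bisection for a root of cos_poly on [lo, hi] (a sign change is assumed)."""
    f_lo = cos_poly(lo, order)
    for _ in range(steps):
        mid = (lo + hi) / 2
        f_mid = cos_poly(mid, order)
        if f_lo * f_mid <= 0:
            hi = mid                 # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid    # root lies in [mid, hi]
    return (lo + hi) / 2


if __name__ == "__main__":
    for order in (4, 10):
        print(order, 2 * first_root(order))
```

Twice the computed root reproduces the two approximations of $\pi$ quoted above to the digits that a double-precision computation can be expected to share with them.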
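The explicit representation of trigonometric functions in terms of the power series is now apparent:
$$\cos t = \operatorname{Re} e^{it} = 1 - \frac{1}{2}t^2 + \frac{1}{4!}t^4 - \frac{1}{6!}t^6 + \dots + (-1)^k \frac{1}{(2k)!}t^{2k} + \dots$$
$$\sin t = \operatorname{Im} e^{it} = t - \frac{1}{3!}t^3 + \frac{1}{5!}t^5 - \frac{1}{7!}t^7 + \dots + (-1)^k \frac{1}{(2k+1)!}t^{2k+1} + \dots$$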
The following diagram illustrates the convergence of the series for the cosine function. It is the graph of the corresponding polynomial of degree 68. Drawing partial sums shows that the approximation near zero is very good. As the order increases, the approximation is better further away from the origin as well.
The well-known formula
$$e^{it}\cdot e^{-it} = \sin^2 t + \cos^2 t = 1$$
is immediate. From the derivative $(e^{it})' = i e^{it}$ it follows that
$$(\sin t)' = \cos t, \qquad (\cos t)' = -\sin t$$
15This number describes the ratio of the circumference to the diameter of an (arbitrary) circle. It was known to the Babylonians and the Greeks in ancient times. The term Ludolphian number is derived from the name of German mathematician Ludolph van Ceulen of the 16th century, who produced 35 digits of the decimal expansion of the number, using the method of inscribed and circumscribed regular polygons, invented by Archimedes.
Hence it follows that the series (a) converges; (b) diverges to $+\infty$; (c) diverges to $+\infty$. □
More interesting exercises concerning series can be found on page 356.
I. Power series
In the previous chapter, we examined whether it makes sense to assign a value to a sum of infinitely many numbers. Now we will turn our attention to the question of what meaning the sum of infinitely many functions may have.
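5.I.1. Determine the radius of convergence of the following power series:
i) $\sum_{n=1}^{\infty} \frac{2^n}{n} x^n$;
ii) $\sum_{n=1}^{\infty} \frac{1}{(1+i)^n} x^n$.
Solution.
i) From
$$\limsup_{n\to\infty} \sqrt[n]{\frac{2^n}{n}} = 2$$
we get that the radius of convergence is $\frac12$. Thus the power series converges for the real numbers $x \in (-\frac12, \frac12)$ (alternatively, for the complex numbers $|x| < \frac12$) and diverges for $|x| > \frac12$. Let us notice that the series diverges for $x = \frac12$ (it is harmonic), but on the other hand, it converges for $x = -\frac12$ (alternating harmonic series). To determine the convergence for any $x$ lying in the complex plane on the circle of radius $\frac12$ is a much harder question which goes beyond our lectures.
ii)
$$\limsup_{n\to\infty} \sqrt[n]{\left|\frac{1}{(1+i)^n}\right|} = \limsup_{n\to\infty} \frac{1}{|1+i|} = \frac{\sqrt2}{2},$$
hence the radius of convergence is $\sqrt2$.
□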
5.1.2. Determine the radius r of convergence of the power series
(a) E ^7
71=1
OO
(b) E {-Anfxn;
71=1
OO 2
(O E (i + á)n
d) E (2+pry
. Ith
(a)  lim y'l an \ = lim = í-
n—yoo 77—yoo v n ° 0
77=1
Solution. It holds that
by considering real and imaginary parts. Let $t_0$ denote the smallest positive number for which $e^{-it_0} = -e^{it_0}$. Then $t_0$ is the first positive zero point of the function $\cos t$. According to the definition of $\pi$, $t_0 = \frac12\pi$.
Squaring yields
$$e^{i\pi} = (e^{it_0})^2 = -1.$$
So $\pi$ is a zero point of the function $\sin t$. Of course, for any $t$,
$$e^{i(4kt_0+t)} = (e^{i2\pi})^k\cdot e^{it} = e^{it}.$$
Therefore, both trigonometric functions sin and cos are periodic, with period $2\pi$. This is their prime period.
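Now the usual formulae connecting the trigonometric functions are easily derived. For illustration, we introduce some of them. First, the definition says that
$$\sin t = \frac{1}{2i}\left(e^{it} - e^{-it}\right), \tag{1}$$
$$\cos t = \frac{1}{2}\left(e^{it} + e^{-it}\right). \tag{2}$$
Thus the product of these functions can be expressed as
$$\sin t\cos t = \frac{1}{4i}\left(e^{it} - e^{-it}\right)\left(e^{it} + e^{-it}\right) = \frac{1}{4i}\left(e^{i2t} - e^{-i2t}\right) = \frac12\sin 2t.$$
Further, by utilizing our knowledge of derivatives:
$$\cos 2t = \left(\frac12\sin 2t\right)' = (\sin t\cos t)' = \cos^2 t - \sin^2 t.$$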
The properties of other trigonometric functions
$$\tan t = \frac{\sin t}{\cos t}, \qquad \cot t = (\tan t)^{-1}$$
can easily be derived from their definitions and the formulae for derivatives. The graphs of the functions sine, cosine, tangent, and cotangent are displayed on the diagrams (they are the red one and the green one on the left, and the red one and the green one on the right, respectively).
Cyclometric functions are the functions inverse to trigonometric functions. Since the trigonometric functions all have period $2\pi$, their inverses can be defined only inside one period, and further, only on the part where the given function is either increasing or decreasing. Two inverse trigonometric functions are
$$\arcsin = \sin^{-1}$$
with domain $[-1, 1]$ and range $[-\pi/2, \pi/2]$ and
$$\arccos = \cos^{-1}$$
with domain $[-1, 1]$ and range $[0, \pi]$. See the left-hand illustration.
(b) $\lim_{n\to\infty} \sqrt[n]{|a_n|} = \lim_{n\to\infty} 4n = +\infty$;
(c) $\lim_{n\to\infty} \sqrt[n]{|a_n|} = \lim_{n\to\infty} \left(1 + \frac1n\right)^n = e$;
(d) lim sup \f\a~n~\ = lim sup 2+T^L
lim SUp
7h-00
1.
Therefore, the radius of convergence is (a) r = 8, (b) r = 0, (c)r = 1/e, (d)r = 1. □
5.1.3. Calculate the radius r of convergence of the power series
00 3/—3-
E„277 Vre3+re-3" / _ n\n e     ^n4+2n3+1.in ^      z) ■
77=1
Solution. The radius of convergence of any power series does not change if we move its center or alter its coefficients while keeping their absolute values. Therefore, let us determine the radius of convergence of the series
00 3/_
EV773+77-371 \?774+2773+l-7r™
Since
$$\lim_{n\to\infty} \sqrt[n]{n^a} = \left(\lim_{n\to\infty} \sqrt[n]{n}\right)^a = 1 \quad\text{for } a > 0,$$
we can move to the series
00
l—i
77=1
with the same radius of convergence r = tt/3.
□
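5.I.4. Give an example of a power series centered at the origin which, on the interval $(-3, 3)$, determines the function
$$\frac{1}{x^2-x-12}.$$
Solution. As
$$\frac{1}{x^2-x-12} = \frac{1}{(x-4)(x+3)} = \frac17\left(\frac{1}{x-4} - \frac{1}{x+3}\right)$$
and
$$\frac{1}{x-4} = -\frac14\cdot\frac{1}{1-\frac{x}{4}} = -\frac14\left(1 + \frac{x}{4} + \frac{x^2}{4^2} + \dots + \frac{x^n}{4^n} + \dots\right),$$
$$\frac{1}{x+3} = \frac13\cdot\frac{1}{1-\left(-\frac{x}{3}\right)} = \frac13\left(1 - \frac{x}{3} + \frac{x^2}{3^2} - \dots + \frac{(-x)^n}{3^n} + \dots\right),$$
we get
$$\frac{1}{x^2-x-12} = -\frac{1}{28}\sum_{n=0}^{\infty} \frac{x^n}{4^n} - \frac{1}{21}\sum_{n=0}^{\infty} \left(-\frac{x}{3}\right)^n = \sum_{n=0}^{\infty} \left(\frac{(-1)^{n+1}}{21\cdot3^n} - \frac{1}{28\cdot4^n}\right)x^n.$$
□
5.I.5. Approximate the number $\sin 1^\circ$ with error less than $10^{-10}$.
Solution. We know that
$$\sin x = x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 - \frac{1}{7!}x^7 + \dots = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}x^{2n+1}, \quad x \in \mathbb{R}.$$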
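The approximation requested in 5.I.5 can be checked numerically. In the sketch below (function names are ours), the partial sum of the sine series at $x = \pi/180$ is compared with the library value of the sine, and the first omitted term is evaluated as an error bound for this alternating series.

```python
import math


def sin_degree_estimate(degrees=1.0, terms=2):
    """Partial sum of the sine series at x = degrees*pi/180 with `terms` terms."""
    x = math.radians(degrees)
    return sum((-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
               for n in range(terms))


def first_omitted_term(degrees=1.0, terms=2):
    """Absolute value of the first term left out of the partial sum."""
    x = math.radians(degrees)
    return x ** (2 * terms + 1) / math.factorial(2 * terms + 1)


if __name__ == "__main__":
    print(sin_degree_estimate(), math.sin(math.radians(1.0)))
    print(first_omitted_term())
```

With two terms the first omitted term is already below $10^{-10}$, so two terms of the series suffice for the required accuracy.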
The remaining functions are (displayed in the diagram on the right)
$$\arctan = \tan^{-1}$$
with domain $\mathbb{R}$ and range $(-\pi/2, \pi/2)$, and finally
$$\operatorname{arccot} = \cot^{-1}$$
with domain $\mathbb{R}$ and range $(0, \pi)$.
The hyperbolic functions are also of some importance. Two basic ones are
$$\sinh x = \frac12\left(e^x - e^{-x}\right),$$
$$\cosh x = \frac12\left(e^x + e^{-x}\right).$$
The name indicates that they should have something in common with a hyperbola. From the definition,
$$(\cosh x)^2 - (\sinh x)^2 = 2\cdot\frac12\left(e^x e^{-x}\right) = 1.$$
The points $[\cosh t, \sinh t] \in \mathbb{R}^2$ parametrically describe a hyperbola in the plane. For hyperbolic functions, one can easily derive identities similar to the ones for trigonometric functions. By substituting into (1) and (2), one can obtain for example
$$\cosh x = \cos(ix), \qquad i\sinh x = \sin(ix).$$
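These identities are easy to test numerically. The following sketch defines the hyperbolic functions directly from the exponential, as in the text, and evaluates both sides using Python's standard complex trigonometric functions; the helper names are our own.

```python
import cmath
import math


def sinh_def(x):
    """sinh x = (e^x - e^(-x)) / 2, straight from the definition."""
    return (math.exp(x) - math.exp(-x)) / 2


def cosh_def(x):
    """cosh x = (e^x + e^(-x)) / 2, straight from the definition."""
    return (math.exp(x) + math.exp(-x)) / 2


def identity_residuals(x):
    """Differences that vanish if the three identities hold at x."""
    return (
        cosh_def(x) ** 2 - sinh_def(x) ** 2 - 1,
        cmath.cos(1j * x) - cosh_def(x),
        cmath.sin(1j * x) - 1j * sinh_def(x),
    )


if __name__ == "__main__":
    for x in (0.3, 1.0, 2.5):
        print(x, identity_residuals(x))
```

All three residuals are zero up to rounding errors for every real $x$ tried.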
5.4.12. Notes. (1) If a power series $S(x)$ is expressed with the variable $x$ moved by a constant offset $x_0$, we arrive at the function $T(x) = S(x - x_0)$. If $\rho$ is the radius of convergence of $S$, then $T$ will be well-defined on the interval $(x_0 - \rho, x_0 + \rho)$. We say that $T$ is a power series centered at $x_0$.
The power series can be defined in the following way:
$$S(x) = \sum_{n=0}^{\infty} a_n (x - x_0)^n,$$
where $x_0$ is an arbitrary fixed real number. All of the previous reasonings are still valid. It is only necessary to be aware of the fact that they relate to the point $x_0$. In particular, such a power series converges on the interval $(x_0 - \rho, x_0 + \rho)$, where $\rho$ is its radius of convergence.
Further, if a power series $y = T(x)$ has its values in an interval where a power series $S(y)$ is well-defined, then the values of the function $S \circ T$ are also described by a power series which can be obtained by formal substitution of $y = T(x)$ for $y$ into $S(y)$.
(2) As soon as a power series with a suitable center is available, the coefficients of the power series for inverse functions can be calculated. We do not introduce a list of formulae here. It is easily obtained in Maple, for instance, by the procedure "series". For illustration, here are two examples:
Substituting $x = \pi/180$ gives us that the partial sums on the right side will approximate $\sin 1^\circ$. It remains to determine the sufficient number of terms to add up in order to provably get the error below $10^{-10}$. The series
$$\frac{\pi}{180} - \frac{1}{3!}\left(\frac{\pi}{180}\right)^3 + \frac{1}{5!}\left(\frac{\pi}{180}\right)^5 - \frac{1}{7!}\left(\frac{\pi}{180}\right)^7 + \dots = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}\left(\frac{\pi}{180}\right)^{2n+1}$$
is alternating with the property that the sequence of the absolute values of its terms is decreasing. If we replace any such convergent series with its partial sum, the error we thus make will be less than the absolute value of the first term not included in the partial sum. (We do not give a proof of this theorem.) The error of the approximation
$$\sin 1^\circ \approx \frac{\pi}{180} - \frac{\pi^3}{180^3\cdot3!}$$
is thus less than
$$\frac{\pi^5}{180^5\cdot5!} < 10^{-10}.$$
□
Begin with
$$e^x = 1 + x + \frac12 x^2 + \frac16 x^3 + \frac{1}{24}x^4 + \dots$$
Since $e^0 = 1$, we search for a power series centered at $x = 1$ for the inverse function $\ln x$. So assume
$$\ln x = a_0 + a_1(x-1) + a_2(x-1)^2 + a_3(x-1)^3 + a_4(x-1)^4 + \dots$$
Apply the equality $x = \ln e^x$, regroup the coefficients by the powers of $x$ and substitute. The result is:
$$x = a_0 + a_1\left(x + \frac12 x^2 + \frac16 x^3 + \frac{1}{24}x^4 + \dots\right) + a_2\left(x + \frac12 x^2 + \dots\right)^2 + a_3\left(x + \frac12 x^2 + \dots\right)^3 + \dots$$
$$= a_0 + a_1 x + \left(\frac12 a_1 + a_2\right)x^2 + \left(\frac16 a_1 + a_2 + a_3\right)x^3 + \left(\frac{1}{24}a_1 + \left(\frac14 + \frac13\right)a_2 + \frac32 a_3 + a_4\right)x^4 + \dots$$
Comparing the coefficients of the corresponding powers on both sides, gives
D 111 5.7.6. Determme the radius r of convergence of the power 2 3 4
series      22n-n\ O '        corresPonds to the valid expression (to be verified
n=o (2n-|! later):
(x-ir.
n
71=1
5.1.7. Calculate the radius of convergence for Y^=i      xn- \nx _      (— 1)™ 1
5.1.8. Without calculation determine the radius of conver-        Similarly, we can begin with the series
gence of the power series ]T a;™"1. O sin t = t - -t3 + -t5 - -t7 + ...
n=i 3!       5! 7!
5.1.9. Find the domain of convergence of the power series and the (unknown so far) power series for its inverse centered
00 _
£ Mlj", q at zero (since sin 0 = 0)
77=1
5.1.10. Determine for which x   e   K the power series
arcsini = ag + ait + a2t2 + a3t3 + a^ř +
(-3)"      /     „,„ r-.    Substitution gives
E ^+2^+111    - 2)  converges. O
t = a0 + ai ( t - Irt3 + -jU5 + ... I +
5.1.11. Is the radius of convergence of the power series 0     1 y     31 5]
00 00
Eon*", ,J,_1,^1
77 = 0 77=1
common to all sequences {an}%L0 of real numbers? O
5.1.12. Decide whether the following implications hold: = a° + ait + fl2f2 + ^~ Qai + a'3)t3+ (a) If the limit lim 3\/öJ exists and is finite, then the power (2 \ 4    / 1 3
77^00    v r \--a2 + a4 )t   + ( 77-^ai - -03 + a5 ]t   + ■
series ^
00
J2 an(x - x0)n 77=1
6 J       \120 6
hence
1 3     3 ,
arcsini = t -\—t H--tb + .
converges absolutely at at least two distinct points a;. 6 40
(b) Conditional convergence of series £~=i a„, E^=i &n (3) Notice that if lt is assumed mat me function ex can be
....      .           ,r.       n \ expressed as a power series centered at zero, and that power
implies that the series >      (oan — 5bn converges as .        .                  ,       .           . .
a^77_i\             / series can be differentiated term by term, then the differen-
we^- tial equation for the coefficients an is easily obtained since
(c) If a series J2^=o a™ satisfies (xn+1)' = (n + l)xn. Therefore, from the condition that
then it is convergent.
lim a2 = 0, the exponential function has its derivative equal to its value
at every point,
77—soo
(d) If a series J2n°=i al converges, then the series an+1 ~ n+ia™'      a0 - 1
y <_*
^ n n=l
converges absolutely.
o
5.1.13. Approximate cos    with error less than 10~5. O
5.1.14. For the convergent series
E Jn-
n \AT+100'
77 = 0
bound the error of its approximation by the partial sum sg ggg.
o
5.1.15. Express the function y = ex, defined on the whole real line, as an infinite polynomial whose terms are of the form an(x — \)n. Then express the function y = 2X defined on R as an infinite polynomial with terms anxn. O
5.1.16. Find the function / to which, for 168, the sequence of functions
converges. Is this convergence uniform on R? O
5.1.17. Does the series
oo
E kde x e K,
77=1
converge uniformly on the real line? O
5.1.18. Approximate
(a) the cosine of ten degrees with accuracy of at least 10~5;
(b) the definite integral J^2 pq_y with accuracy of at least 10"3.
o
5.1.19. Determine the power series centered at x0 = 0 of the function
X
/(_) = / e*2 dt, i6R.
o O
5.1.20. Using the integral test, find the values a > 0 for which the series
oo
E 4r
77=1
converges. O
5.1.21. Determine for which i£l the series
oo 1
V--—
^ 2n ■ n ■ ln(n)
x3n
and hence it is clear that an = .
converges. O
5.1.22. Determine all x G R for which the power series
oo 2n
£ ^2- is convergent. O
i=l
5.1.23. For which i£l does the series
y~ ln(n!) n=l
converge? O
5.1.24. Determine whether the series
oo
E (-l)n_1tan -±=
n=l
converges absolutely, converges conditionally, diverges to +oo, diverges to —oo, or none of the above, (such a series is sometimes said to be oscillating). O
5.1.25. Calculate the series
oo
71=1
with the help of an appropriate power series. O
5.1.26. For a; e (-1,1), add
x - Ax2 + 9x3 - 16x4 H----
1 ™3n+l 2^  2"-n! X
o
o
51.27. Supposing \ x\ < 1, determine the series
oo
(a) E 27^t s2-1;
n=l
oo
(b) E n2xn~x.
n=l
5.1.28. Calculate
oo
E2ra-l (-2)—1
n=l
using the power series
oo
E (-1)™ (2n + l)x2n
71=0
for some x e (—1,1). O
5.1.29. For 168, calculate the series
2"-7l!
71=0
o
J. Additional exercises for the whole chapter
5.J.1. Determine a polynomial $P(x)$ of the least degree possible satisfying the conditions $P(1) = 1$, $P(2) = 28$, $P(0) = 2$, $P'(0) = 1$, $P'(1) = 9$. ○
5.J.2. Determine a polynomial $P(x)$ of the least degree possible satisfying the conditions $P(0) = 0$, $P(1) = 4$, $P(-1) = -2$, $P'(0) = 1$, $P'(1) = 7$. ○
5.J.3. Determine a polynomial $P(x)$ of the least degree possible satisfying the conditions $P(0) = -1$, $P(1) = -1$, $P'(-1) = 10$, $P'(0) = -1$, $P'(1) = 6$. ○
5.J.4. From the definition of a limit, prove that
$$\lim_{x\to-\infty} (x^3 - 2) = -\infty.$$
5.J.7. Determine both one-sided limits
$$\lim_{x\to0+} \arctan\left(\frac1x\right), \qquad \lim_{x\to0-} \arctan\left(\frac1x\right).$$
Knowing the result, decide the existence of the limit $\lim_{x\to0} \arctan\left(\frac1x\right)$.
5.J.8. Do the following limits exist?
sin a;               5a;4 + 1 lim —lim-
i-s-0   X6 x-*0 X
5.J.9. Calculate the limit
,. tan x — sin x lim —
5.J.10. Determine
$$\lim_{x\to\pi/6} \frac{2\sin^3 x + 7\sin^2 x + 2\sin x - 3}{2\sin^3 x + 3\sin^2 x - 8\sin x + 3}.$$
5.J.11. For any $m, n \in \mathbb{N}$, determine
$$\lim_{x\to1} \frac{x^m - 1}{x^n - 1}.$$
o
5.J.5. From the definition of a limit, determine
lim (1±D^1, x^-i 2
i. e. write the S(e)-formula as in the previous exercise. O 5.J. 6. From the definition of a limit, show that
,. 3(a;-2)4
lim--- = +00.
o
o
o
o
o
o
5.J.12. Calculate
lim   ( v x2 + x — :
x—>-+oo V
5.J.13. Determine
lim   ( x \J\ + x2 — x2 )
x—>+oo V /
5.J.14. Calculate
\/2 — \/l + cos x
lim
5.J.15. Determine
sin (Ax)
lim _
y/x + 1 - 1
5.J.16. Calculate
.. VI + tana; — y/1 — tana; lim -.
x-s-o- sin a;
5.J.17. Calculate
2X + Vl + x2 - a;9 - 7a;5 + 44a;2 lim - =-.
x^_oo 3x + V6a;6 + a;2 - 18a;5 - 592a;4
5.J.18. Let lima;_t._0o /(a;) = 0. Is it true that   limx^_00(/(a;) ■ g(x)) = 0    for every increasing function g
o
5.J.19. Determine the limit
/ x 2n-l
,.     /    n > lim
n + 5
5J.20. Calculate
sin a; — a; lim -5-.
5 J.21. For a; > e, determine the sign of the derivative of the function
In 2 1+1
f(x) = arctan
5.J.22. Determine all local extrema of the function
$$y = x\ln^2 x$$
o
o
o
o
o
o
re?
o
o
o
defined on the interval $(0, +\infty)$. ○
5.J.23. Is there a real number $a$ such that the function $f(x) = ax + \sin x$ has a global minimum on the interval $[0, 2\pi]$ at $x_0 = 5\pi/4$? ○
5 J.24. Find the absolute minimum of the function
y = & x — In 2,    x > 0
on its domain. O 5.J.25. Determine the maximum value of the function
y=sY3xe~x, xeR.
o
5.J.26. Find the absolute extrema of the polynomial $p(x) = x^3 - 3x + 2$ on the interval $[-3, 2]$. ○
5.J.27. Let a moving object's position in time be given as follows:
$$s(t) = -(t-3)^2 + 16, \quad t \in [0, 7],$$
where $t$ is the time in seconds, and the position $s$ is measured in meters. Determine
(a) the initial (i. e. at the time t = 0 s) velocity of the object;
(b) the time and position at which its velocity is zero;
(c) its velocity and acceleration at time t = 4 s.
Note. The object's velocity is the derivative of its position and acceleration is the derivative of its velocity. O 5.J.28. From the definition of a derivative /' of a function / at the point x0, calculate /' for j(x) = ^fx at any point x0 > 0.
o
5 J.29. Determine whether the derivative of the function
f(x) = a;arctan(±),   i£l\{0},       /(0) = 0 exists at 0. O 5.J.30. Does the derivative of the function
y = sin ^arctan ^| 12a;21 + 11 | • e _^1"^^.i2      + sin(sin(sin(sina;))),   x £ R at the point x0 = it3 + 3*" exist? O 5.J.31. Determine whether the derivative of the function
f(x) = (x2 -ljsin^),   ^-l(ieR),       /(-1) = 0 at the point x0 = — 1 exists. O
5 J.32. Give an example of a function / : R —> R which is continuous on the whole real axis, but does not have derivatives at the points x\ = 5, x2 = 9. O
5.J.33. Find functions / and g which are not differentiable anywhere, yet their composition / o g is differentiable everywhere on the real line. O
5 J.34. Using basic formulae, calculate the derivative of the function
(a) y = (2 — a;2) cos x + 2x sin x,   x £ R;
(b) y = sin (sin x),   x £ R;
(c) y = sin (in (a;3 + 2x)) ,    x £ (0, +00);
(d) y = i±f^,   x £ R.
5.J.35. Determine the derivative of the function
(a) y = \JxsJx ^/x,   x £ (0, +00);
(b) y = In I tan § I , \ {nir; n £ Z}.
5.J.36. Write down the derivative of the function
y = sin (sin (shirr)) ,    x £
5.J.37. For the function
j(x) = arccos (^-#) + va?
cvTTž2" + ex (z2 - 2x + 2)
5J.40. Calculate/'(l) if
f(x) = (x-l)(x-2)2(x-3)3, xe
5.].41. Determine the derivative of the function
I a; I ^ 1 (a; e :
5 J.42. Differentiate (with respect to the real variable x)
O
o
o
V2
with maximum possible domain, calculate /' on the largest subset of R where this derivative exists. Q
5.J.38. At any point x £ {nir; n £ Z}, determine the first derivative of the function y = \/sinx. O 5.J.39. For x £ R, differentiate
O
o
o
x In2 (x + VTTx2) - 2Vl + z2 In (a; + Vl + z2) + 2a; at all points where the derivative exists. Simplify the obtained expression. O 5 J.43. Determine /' on a maximal set if j(x) = \ogx e. O
5 J.44. Express the derivative of the product of four functions
if(x)g(x)h(x)k(x)]'
as a sum of products of their derivatives and themselves, supposing all of these functions are differentiable. O 5.J.45. Determine the derivative of the function
y — (x+3)2
for x > 0. O
5.J.46. The rainbow. Why is the rainbow circular?
Solution. In the exercise called Snell's law we explained what causes a rainbow. It is created by sunlight being refracted while entering a droplet of water. We continue with this problem, examining how the rays behave when going through the droplets. (See the illustration.) The ray falling onto a droplet's surface at the point $A$ "splits". Some part of the light reflects (at the angle $\varphi_i$ from the normal line) and the other part refracts inside the droplet at the marked angle $\varphi_r$. The ray, inside the droplet, reflects off the droplet's surface at the point $B$. Since $|OA| = |OB|$, the angle of reflection equals $\varphi_r$. Part of the light refracts out of the droplet. The reflected ray then meets the droplet's surface again at the point $C$ and refracts towards the observer at the angle $\varphi_i$ from the normal line. We omit the case of the secondary rainbow arc, which occurs when the ray reflects twice inside the droplet before refracting out of it.
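Write $\alpha := \angle AIC$. Since $\angle OAI = \varphi_i$ and $\angle OAB = \varphi_r$, it follows that $\angle BAI = \varphi_i - \varphi_r$. Then
$$\angle BIA = \pi - \angle ABI - \angle BAI = \pi - (\pi - \varphi_r) - (\varphi_i - \varphi_r) = 2\varphi_r - \varphi_i$$
and moreover,
$$\alpha = 2\cdot\angle BIA = 4\varphi_r - 2\varphi_i.$$
By Snell's law,
$$\frac{\sin\varphi_i}{\sin\varphi_r} = n,$$
where $n$ stands for the refractive index of water (it is assumed that the index of refraction for air is 1). Thus,
$$\varphi_r = \arcsin\left(\frac{\sin\varphi_i}{n}\right),$$
whence it follows that
$$\alpha = 4\arcsin\left(\frac{\sin\varphi_i}{n}\right) - 2\varphi_i. \tag{1}$$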
For different rays leaving the droplet, the value of $\alpha$ is different. The admissible values of $\alpha$ are not distributed uniformly. If $R$ is the droplet's radius and $y$ is the distance of the point $A$ from the horizontal plane going through the centre of the droplet, then
$$\sin\varphi_i = \frac{y}{R} \quad \text{for } y \in [0, R]. \tag{2}$$
Because of the huge distance of the Sun from the Earth, it is assumed that the amount of energy coming from the Sun for $y \in [a - \delta, a + \delta]$ is independent of $a \in [\delta, R - \delta]$ and depends only on the range of the values of $y$, for sufficiently small $\delta > 0$. Thus it makes sense to analyze the function (see (1) and (2))
$$\alpha(y) = 4\arcsin\left(\frac{y}{nR}\right) - 2\arcsin\left(\frac{y}{R}\right), \qquad y \in [0, R].$$
Select an appropriate unit of length so that $R = 1$. Consider the function
$$\alpha(x) = 4\arcsin\left(\frac{x}{n}\right) - 2\arcsin x, \qquad x \in [0, 1].$$
The derivative is
$$\alpha'(x) = \frac{4}{\sqrt{n^2 - x^2}} - \frac{2}{\sqrt{1 - x^2}}, \qquad x \in (0, 1).$$
The equation $\alpha'(x) = 0$ has a unique solution
$$x_0 = \sqrt{\frac{4 - n^2}{3}} \in (0, 1), \quad \text{if } n^2 \in (1, 4).$$
Set $n = 4/3$ (which is approximately the refractive index of water). Further,
$$\alpha'(x) > 0, \quad x \in (0, x_0), \qquad \alpha'(x) < 0, \quad x \in (x_0, 1).$$
At the point
$$x_0 = \frac{2}{3}\sqrt{\frac{5}{3}} \approx 0.86,$$
the function $\alpha$ has a global maximum
$$\alpha(x_0) = 4\arcsin\left(\frac{x_0}{n}\right) - 2\arcsin(x_0) \approx 0.734\ \text{rad} \approx 42^\circ.$$
It follows that the peak of the rainbow cannot be higher than approximately $42^\circ$ with regard to the observer. The values
$$\alpha(0.74) \approx 39.4^\circ, \quad \alpha(0.94) \approx 39.2^\circ, \quad \alpha(0.8) \approx 41.2^\circ, \quad \alpha(0.9) \approx 41.5^\circ$$
suggest ($\alpha$ is increasing on the interval $[0, x_0]$ and decreasing on the interval $[x_0, 1]$) that more than 20 % of the values of $\alpha$ lie in the band from around $39^\circ$ to around $42^\circ$, and that 10 % lie in a band thinner than $1^\circ$. Furthermore, since
$$\alpha(0.84) \approx 41.9^\circ, \qquad \alpha(0.88) \approx 41.9^\circ,$$
the rays for which $\alpha$ is close to $42^\circ$ have the greatest intensity. This is an instance of the principle of minimum deviation: the highest concentration of diffused light occurs at the rays with minimum deviation, since the total angular deviation of the ray equals $\delta = \pi - \alpha$.
The droplets that send the rainbow-forming rays to the observer lie on the surface of a cone having central angle equal to $2\alpha(x_0)$. The part of this cone which is above ground then appears as the rainbow arc to the observer (see the illustration). Thus when the sun is setting, the rainbow has the shape of a semicircle. The rainbow exists only with regard to its observer: it is not anchored in space. The circular shape of the rainbow was examined as early as 1635-1637 by René Descartes. □
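The numbers in this solution can be checked with a short numerical sketch. It assumes $R = 1$ and $n = 4/3$ as in the text; the function names `rainbow_angle` and `critical_point` are illustrative only and do not come from the book.

```python
import math

def rainbow_angle(x, n=4 / 3):
    """alpha(x) = 4*arcsin(x/n) - 2*arcsin(x), in radians, for x in [0, 1]."""
    return 4 * math.asin(x / n) - 2 * math.asin(x)

def critical_point(n=4 / 3):
    """Zero of alpha'(x) on (0, 1); the formula is valid for 1 < n**2 < 4."""
    return math.sqrt((4 - n * n) / 3)

x0 = critical_point()
alpha_max_deg = math.degrees(rainbow_angle(x0))

# Independent check: a coarse grid search for the maximum on [0, 1].
grid = [k / 10000 for k in range(10001)]
x_best = max(grid, key=rainbow_angle)

print(x0, alpha_max_deg, x_best)
```

The grid search does not use the derivative at all, so agreement between `x_best` and `x0` is a modest sanity check of the closed form.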
5.J.47.   L'Hospital's pulley.
A rope of length r is tied at one of its ends to the ceiling at a point A. A small pulley is attached to its other end.
A point $B$ is also on the ceiling at distance $d$, $d > r$, from $A$. Another rope of length $l > \sqrt{d^2 + r^2}$ is tied to $B$ at one end, passes over the pulley, and has a weight attached to its other end. Neglect the mass and the size of the ropes and the pulley. In what position does the weight stabilize so that the system is in a stationary position? See the illustration.
Solution. The system is in stationary position when its potential energy is minimized. This is when the distance between the weight and the ceiling is maximal.
Let $x$ be the distance between $A$ and the point $P$ on the ceiling vertically above the weight and the pulley. Then by the Pythagorean theorem, the distance between the pulley and the ceiling is $\sqrt{r^2 - x^2}$. Similarly, the distance between the weight and the pulley is $l - \sqrt{(d - x)^2 + r^2 - x^2}$. Hence if $f(x)$ is the distance between the weight and the ceiling, then
$$f(x) = \sqrt{r^2 - x^2} + l - \sqrt{(d - x)^2 + r^2 - x^2}.$$
The state of the system is fully determined by the value $x \in [0, r]$ (see the illustration). So it suffices to find the global maximum of the function $f$ on the interval $[0, r]$. The derivative is
$$f'(x) = \frac{-x}{\sqrt{r^2 - x^2}} - \frac{-(d - x) - x}{\sqrt{(d - x)^2 + r^2 - x^2}} = \frac{-x}{\sqrt{r^2 - x^2}} + \frac{d}{\sqrt{(d - x)^2 + r^2 - x^2}}, \qquad x \in (0, r).$$
Square the equation $f'(x) = 0$ to obtain
$$\frac{x^2}{r^2 - x^2} = \frac{d^2}{(d - x)^2 + r^2 - x^2},$$
and hence
$$2dx^3 - (2d^2 + r^2)\,x^2 + d^2r^2 = 0, \qquad x \in (0, r).$$
One solution to this equation is $x = d$, hence the polynomial on the left side factors into
$$(x - d)\left(2dx^2 - r^2x - dr^2\right),$$
or
$$2d\,(x - d)\left(x - \frac{r^2 + r\sqrt{r^2 + 8d^2}}{4d}\right)\left(x - \frac{r^2 - r\sqrt{r^2 + 8d^2}}{4d}\right).$$
Hence the cubic equation has three solutions. The solution $x = d$ is outside the interval $[0, r]$ since $d > r$. The solution
$$x = x_1 = \frac{r^2 - r\sqrt{r^2 + 8d^2}}{4d}$$
is outside the interval $[0, r]$ since $x_1 < 0$. The solution
$$x = x_0 = \frac{r^2 + r\sqrt{r^2 + 8d^2}}{4d}$$
is positive, and furthermore, $x_0 < \frac{r}{4} + \frac{3r}{4} = r$, since $r < d$. Since $f'$ is continuous on the interval $(0, r)$, it can change sign only at $x_0$. From the limits
$$\lim_{x \to 0+} f'(x) = \frac{d}{\sqrt{d^2 + r^2}}, \qquad \lim_{x \to r-} f'(x) = -\infty$$
it follows that
$$f'(x) > 0, \quad x \in (0, x_0), \qquad f'(x) < 0, \quad x \in (x_0, r).$$
Thus the function $f$ has its global maximum on the interval $[0, r]$ at $x_0$. □
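The closed form for $x_0$ can be compared with a brute-force maximization of $f$. The sketch below uses hypothetical sample values $d = 5$, $r = 3$, $l = 7$ (chosen only so that $d > r$ and $l > \sqrt{d^2 + r^2}$ hold); the helper names are illustrative.

```python
import math

def weight_depth(x, d, r, l):
    """f(x): distance between the weight and the ceiling."""
    return math.sqrt(r * r - x * x) + l - math.sqrt((d - x) ** 2 + r * r - x * x)

def stationary_x(d, r):
    """x0 = (r^2 + r*sqrt(r^2 + 8 d^2)) / (4 d), the root lying in (0, r)."""
    return (r * r + r * math.sqrt(r * r + 8 * d * d)) / (4 * d)

d, r, l = 5.0, 3.0, 7.0  # hypothetical sample data, not from the text
x0 = stationary_x(d, r)

# Residual of the cubic 2d x^3 - (2d^2 + r^2) x^2 + d^2 r^2 at x0.
cubic_residual = 2 * d * x0**3 - (2 * d * d + r * r) * x0**2 + d * d * r * r

# Brute-force maximum of f on a fine grid of [0, r].
grid = [r * k / 100000 for k in range(100001)]
x_best = max(grid, key=lambda x: weight_depth(x, d, r, l))

print(x0, x_best, cubic_residual)
```

Note that the rope length $l$ only shifts $f$ by a constant, so it does not influence the position of the maximum.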
5.J.48. A nameless mail company can only transport parcels whose length does not exceed 108 inches, and for which the sum of the length and the maximal perimeter is at most 165 inches. Find the largest volume a parcel can have to be transported by this company.
Solution. Let $M$ denote the bound on the length plus perimeter (here $M = 165$ inches), $p$ the perimeter, and $x$ the parcel's length. Suppose the perimeter $p$ is constant.
The wanted parcel has a shape such that for any $t \in (0, x)$, its cross section has constant perimeter $p$ (the maximal one).
The parcel is to have the greatest volume, so the cross section of a given perimeter must have the greatest possible area. The largest planar figure of a given perimeter is a disc. Thus the desired parcel has the shape of a cylinder with height equal to $x$ and radius $r = p/(2\pi)$.
Its volume is
$$V = \pi r^2 x = \frac{p^2 x}{4\pi}.$$
The constraints are $p + x \le M$ and $x \le 108$. Since the volume grows with $p$, we consider the parcel for which $p + x = M$. Its volume is
$$V(x) = \frac{(M - x)^2 x}{4\pi} = \frac{x^3 - 2Mx^2 + M^2 x}{4\pi}, \quad \text{where } x \in (0, 108].$$
Having calculated the derivative
$$V'(x) = \frac{3x^2 - 4Mx + M^2}{4\pi} = \frac{3\,(x - M)\left(x - \frac{M}{3}\right)}{4\pi}, \qquad x \in (0, 108),$$
we find that the function $V$ is increasing on the interval $(0, 55] = (0, M/3]$ and decreasing on the interval $[55, 108] = [M/3, \min\{108, M\}]$. The greatest volume is thus obtained for $x = M/3$, where
$$V\left(\frac{M}{3}\right) = \frac{M^3}{27\pi} \approx 0.011\,789\,M^3 \approx 0.8678\ \text{m}^3.$$
If the company also required that the parcel have the shape of a rectangular cuboid (or more generally a right prism of a given number of faces), we can repeat the previous reasoning for a given cross section of area $S$ without specifying what the cross section looks like. Necessarily $S = kp^2$ for some $k > 0$ which is determined by the shape of the cross section. (If we change only the size of the sides of the polygon which is the cross section, then its perimeter will change by the same ratio. However, its area will change by the square of the ratio.) Thus the parcel's volume is the function
$$V(x) = Sx = kp^2x = k\,(M - x)^2 x, \qquad x \in (0, 108].$$
The constant $k$ does not affect the point of the global maximum of the function $V$, so the maximum is again at the point $x = M/3$. For instance, for the largest right prism having a square base, we have $p = M - x = 2M/3$, i.e. the length of the square's sides is $a = M/6$ and the volume is then
$$V = a^2 x = \frac{M^3}{108} \approx 0.009\,259\,M^3 \approx 0.6816\ \text{m}^3.$$
For a parcel in the shape of a ball (when $x$ is the diameter), the condition $p + x \le M$ can be expressed as $\pi x + x \le M$, i.e. $x \le M/(\pi + 1) < 108$. Thus for $x = M/(\pi + 1)$, the maximal volume is
$$V = \frac{4}{3}\pi\left(\frac{x}{2}\right)^3 = \frac{\pi M^3}{6(\pi + 1)^3} \approx 0.007\,370\,M^3 \approx 0.5426\ \text{m}^3.$$
Similarly, for a parcel in the shape of a cube (when $x$ is the length of the cube's edges), the condition $p + x \le M$ means $x \le M/5 < 108$. Thus for $x = M/5$ the maximal volume is
$$V = x^3 = \left(\frac{M}{5}\right)^3 = 0.008\,M^3 \approx 0.5889\ \text{m}^3.$$
The length of the edges of the cube which has the same volume as the cylinder is
$$a = \frac{M}{\sqrt[3]{27\pi}} \approx 0.227\,595\,M \approx 0.953\,849\ \text{m}.$$
Its length and perimeter sum to $5a \approx 1.138\,M$. This is more than the company's limit by around 14 %. □
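The four volumes quoted in this solution can be recomputed directly. The sketch below takes $M = 165$ inches and uses the exact conversion 1 inch = 0.0254 m; the dictionary keys are illustrative labels only.

```python
import math

M_INCHES = 165.0
INCH_IN_M = 0.0254            # exact definition of the inch in metres
M = M_INCHES * INCH_IN_M      # the limit M expressed in metres

volumes = {
    "cylinder": M**3 / (27 * math.pi),
    "square prism": M**3 / 108,
    "cube": (M / 5) ** 3,
    "ball": math.pi * M**3 / (6 * (math.pi + 1) ** 3),
}

# Grid check that x = M/3 maximizes (M - x)^2 * x on (0, M).
grid = [M * k / 30000 for k in range(1, 30000)]
x_best = max(grid, key=lambda x: (M - x) ** 2 * x)

for shape, volume in volumes.items():
    print(shape, round(volume, 4))
```

The common factor $(M - x)^2 x$ is the reason every prism-like shape attains its maximum at the same length $x = M/3$.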
5.J.49. A large military area (denoted by MA) has the shape of a square with an area of 100 km². It is bounded along its perimeter by a narrow path. From the starting point in one corner of MA, a target point inside MA is reached by going 5 km along the (boundary) path and then 2 km perpendicularly to it. One can travel along the path at 5 kph, but directly through the MA only at 3 kph. What distance should you travel along the path if you wish to get there as soon as possible?
Solution. To travel $x$ km along the path (where $x \in [0, 5]$), $x/5$ hours is needed. The remaining way through MA is then
$$\sqrt{2^2 + (5 - x)^2} = \sqrt{x^2 - 10x + 29}$$
kilometers long. This takes $\sqrt{x^2 - 10x + 29}/3$ hours. Altogether, the journey takes
$$f(x) = \frac{1}{5}x + \frac{1}{3}\sqrt{x^2 - 10x + 29}$$
hours. The only zero of the function
$$f'(x) = \frac{1}{5} + \frac{x - 5}{3\sqrt{x^2 - 10x + 29}}$$
is $x = 7/2$. Since the derivative $f'$ exists at every point of the interval $[0, 5]$ and since
$$f\left(\tfrac{7}{2}\right) = \tfrac{23}{15} < f(5) = \tfrac{5}{3} < f(0) = \tfrac{\sqrt{29}}{3},$$
the function $f$ has its absolute minimum at the point $x = 7/2$. Thus you should go 3.5 km along the path. □
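A quick numerical check of this result, as a sketch: the function `travel_time` below is the total time $f(x) = x/5 + \sqrt{x^2 - 10x + 29}/3$ from the solution, and the grid search is an assumption-free way to confirm the minimizer.

```python
import math

def travel_time(x):
    """Total time in hours: x km on the path at 5 kph, the rest through MA at 3 kph."""
    return x / 5 + math.sqrt(x * x - 10 * x + 29) / 3

# Grid search over [0, 5] with step 0.001 km.
grid = [5 * k / 5000 for k in range(5001)]
x_best = min(grid, key=travel_time)

print(x_best, travel_time(x_best), travel_time(0), travel_time(5))
```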
5.J.50. You are in a boat on a lake at distance $d$ km from the shore. You want to get to a given place on the shore whose straight-line distance from you is $\sqrt{d^2 + l^2}$ (see the diagram). What path will you take if you want to be there as soon as possible, supposing you can row at $v_1$ kph and run along the shore at $v_2$ kph? How long will the journey take?
Solution. The optimal strategy is to row first in a straight line to the shore at some point $[0, x]$ for $x \in [0, l]$, and then run along the shore to the target point $[0, l]$ (see the diagram). So the trajectory consists of two line segments (or only one segment, in the case when $x = l$). The voyage to the point $[0, x]$ on the shore takes
$$\frac{\sqrt{d^2 + x^2}}{v_1}$$
hours.
The final run takes
$$\frac{l - x}{v_2}$$
hours.
The total time is given by the function
$$t(x) = \frac{\sqrt{d^2 + x^2}}{v_1} + \frac{l - x}{v_2}$$
on the interval $[0, l]$. It can be assumed that $v_1 < v_2$, for if $v_1 \ge v_2$, the optimal strategy is to row straight to the target point, which corresponds to $x = l$. The first derivative is
$$t'(x) = \frac{x}{v_1\sqrt{d^2 + x^2}} - \frac{1}{v_2}, \qquad x \in (0, l),$$
and the second derivative is
$$t''(x) = \frac{d^2}{v_1\sqrt{(d^2 + x^2)^3}}, \qquad x \in (0, l).$$
Solve the equation $t'(x) = 0$, or
$$\frac{x}{\sqrt{d^2 + x^2}} = \frac{v_1}{v_2}.$$
Squaring and rearranging gives
$$x_0 = \frac{d\,v_1}{\sqrt{v_2^2 - v_1^2}}.$$
If $x_0 < l$, then $t(x)$ has a global minimum at $x_0$ on the interval $[0, l]$, since $\lim_{x \to 0+} t'(x) < 0$, $t'(l) > 0$, and indeed $t''(x) > 0$ on the same interval.
If $x_0 \ge l$, then $t'(x) < 0$ on all of the interval $(0, l)$ and so the global minimum of $t(x)$ occurs at $l$.
In the former case, the fastest journey (in hours) takes
$$t(x_0) = \frac{\sqrt{d^2 + x_0^2}}{v_1} + \frac{l - x_0}{v_2} = \frac{d\sqrt{v_2^2 - v_1^2}}{v_1 v_2} + \frac{l}{v_2}.$$
In the latter case, the fastest journey takes
$$t(l) = \frac{\sqrt{d^2 + l^2}}{v_1}$$
hours.
Note. An alternative, simpler approach to the calculations is to use the variable $\theta$ instead of the variable $x$, where $x = d\tan\theta$. The fastest journey occurs when $\sin\theta = v_1/v_2$. This is a limiting case of Snell's law.
□
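The two cases of the solution can be put together in a small sketch. The sample values $d = 1$, $l = 3$, $v_1 = 3$, $v_2 = 5$ are hypothetical and were chosen only so that $x_0 < l$; the function names are illustrative.

```python
import math

def total_time(x, d, l, v1, v2):
    """t(x): row to the shore point [0, x], then run the remaining l - x."""
    return math.sqrt(d * d + x * x) / v1 + (l - x) / v2

def best_landing(d, l, v1, v2):
    """Landing point minimizing t on [0, l], following the case analysis above."""
    if v1 >= v2:
        return l                      # rowing is at least as fast: go straight
    x0 = d * v1 / math.sqrt(v2 * v2 - v1 * v1)
    return min(x0, l)                 # x0 >= l falls back to the endpoint l

d, l, v1, v2 = 1.0, 3.0, 3.0, 5.0     # hypothetical sample data
x_opt = best_landing(d, l, v1, v2)
t_opt = total_time(x_opt, d, l, v1, v2)

# Closed form from the solution, valid in the case x0 < l.
t_closed = d * math.sqrt(v2 * v2 - v1 * v1) / (v1 * v2) + l / v2

print(x_opt, t_opt, t_closed)
```

With these values $\sin\theta = x_0/\sqrt{d^2 + x_0^2} = 0.6 = v_1/v_2$, in agreement with the Note.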
5.J.51. A company is looking for a rectangular patch of land with sides of lengths $5a$ and $b$. The company wants to enclose it with a fence and then split it into 5 equal parts (each being a rectangle with sides $a$, $b$) by further fences. For which values of $a$, $b$ will the area $S = 5ab$ of the patch be maximal if the total length of the used fences is 2 400 m?
Solution. Reformulate the statement of the problem: Maximize $5ab$ while satisfying the condition
$$6b + 10a = 2400, \qquad a, b > 0. \tag{1}$$
The function
$$a \mapsto 5a \cdot \frac{2400 - 10a}{6}$$
defined for $a \in [0, 240]$ takes its maximal value at the point $a = 120$. Hence the result is
$$a = 120\ \text{m}, \qquad b = 200\ \text{m}.$$
The value of b follows immediately from (1). □
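As a sanity check, the maximum can be found by brute force over integer values of $a$. This is only a sketch; the function name `area` is illustrative.

```python
def area(a):
    """Area 5ab with b eliminated through the constraint 6b + 10a = 2400."""
    b = (2400 - 10 * a) / 6
    return 5 * a * b

best_a = max(range(0, 241), key=area)
best_b = (2400 - 10 * best_a) / 6

print(best_a, best_b, area(best_a))
```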
5.J.52. A rectangle is inscribed into an equilateral triangle with sides of length a so that one of its sides lies on one of the triangle's sides and the other two of the rectangle's vertices lie on the remaining sides of the triangle. What is the maximum possible area of the rectangle?
5.J.53. Determine the dimensions of an (open) swimming pool whose volume is 32 m³ and whose bottom has the shape of a square, so that one would use the least amount of paint possible to prime its bottom and its walls. O
5.J.54. Express 28 as a sum of two non-negative numbers such that the sum of the first summand squared and the second summand cubed is as small as possible. O
5.J.55. With the help of the first derivative, find the real number a > 0 for which the sum a + 1/a is minimal. Now solve this problem without using the differential calculus. O
5.J.56. Inscribe a rectangle with the greatest perimeter possible into a semidisc with radius r. Determine the rectangle's perimeter. O
5.J.57. Among the rectangles with perimeter 4c, find the one having the greatest area (if such one exists) and determine the lengths of its sides. O
5.J.58. Find the height h and the radius r of the largest (greatest volume) cone which fits into a ball of radius R. O
5.J.59. From all triangles with given perimeter p, select the one with the greatest area. O
5.J.60. A parabola is given by the equation $2x^2 - 2y = 9$. Find the points of the parabola which are closest to the origin. O
5.J.61. Your task is to create a one-litre tin having the "usual" shape of a cylinder so that the minimal amount of material would be used. Determine the proper ratio between its height h and radius r. O
5.J.62. Determine the distance from the point $[3, -1] \in \mathbb{R}^2$ to the parabola $y = x^2 - x + 1$. O
5.J.63. Determine the distance of the point $[-4, -2] \in \mathbb{R}^2$ from the parabola $y = x^2 + x + 1$. O
5.J.64. At time t = 0, a car left the point A = [5, 0] at the speed of 4 units per second in the direction (—1, 0). At the same time, another car left B = [—2, —1] at the speed of 2 units per second in the direction (0,1). When will the cars be closest to each other, and what will the distance between them be at that moment? O
5.J.65. At the time t = 0, a car left the point A = [0,0] at 2 units per second in the direction (1,0). At the same time, another car left the point B = [1, —1] at 3 units per second in the direction (0,1). When will they be closest to each other and what will the distance be? O
5.J.66. If a cone has a base of radius $r$ and height $h$, then its surface area (not including the base) is $\pi r h$ and its volume is $V = \frac{1}{3}\pi r^2 h$. Determine the maximum possible volume of a cone with total surface area (including the base) $3\pi$ cm². O
5.J.67. Suppose you own an excess of funds without the possibility of investing outside your own factory. This acts as a regulated market with a nearly unlimited demand and a limited access to some key raw materials, which allows you to produce at most 10 000 products per day. You know that the raw profit p and the expenses e, as functions of a variable x which determines the average number of products per day, satisfy
$$p(x) = 9x, \qquad e(x) = x^3 - 6x^2 + 15x, \qquad x \in [0, 10].$$
At what production will you profit the most from your factory? O
5.J.68. Determine
$$\lim_{x \to 0}\left(\cot x - \frac{1}{x}\right).$$
Solution. Since
$$\lim_{x \to 0+} \cot x = +\infty, \qquad \lim_{x \to 0+} \frac{1}{x} = +\infty,$$
$$\lim_{x \to 0-} \cot x = -\infty, \qquad \lim_{x \to 0-} \frac{1}{x} = -\infty,$$
both one-sided limits are of the type $\infty - \infty$. Thus we consider the (two-sided) limit.
We write the cotangent function as the ratio of the cosine and the sine and convert the fractions to a common denominator:
$$\lim_{x \to 0}\left(\cot x - \frac{1}{x}\right) = \lim_{x \to 0}\frac{x\cos x - \sin x}{x \sin x}.$$
Thus we obtain an expression of the type 0/0 for which we get (by l'Hospital's rule)
$$\lim_{x \to 0}\frac{x\cos x - \sin x}{x\sin x} = \lim_{x \to 0}\frac{\cos x - x\sin x - \cos x}{\sin x + x\cos x} = \lim_{x \to 0}\frac{-x\sin x}{\sin x + x\cos x}.$$
By one more use of l'Hospital's rule for the type 0/0, we get
$$\lim_{x \to 0}\frac{-x\sin x}{\sin x + x\cos x} = \lim_{x \to 0}\frac{-\sin x - x\cos x}{\cos x + \cos x - x\sin x} = \frac{0 - 0}{1 + 1 - 0} = 0.$$
□
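The value of the limit can be made plausible numerically by evaluating $\cot x - 1/x$ at points approaching 0 from both sides. This is only an illustration, not a proof; the name `g` is chosen here for convenience.

```python
import math

def g(x):
    """cot(x) - 1/x for x != 0."""
    return math.cos(x) / math.sin(x) - 1 / x

right = [g(10.0 ** (-k)) for k in range(1, 5)]    # x = 0.1, 0.01, 0.001, 0.0001
left = [g(-(10.0 ** (-k))) for k in range(1, 5)]  # mirrored points

for k, (a, b) in enumerate(zip(right, left), start=1):
    print(f"x = ±1e-{k}: {a:.3e} {b:.3e}")
```

The values shrink roughly in proportion to $x$, which agrees with the limit being 0. For much smaller $|x|$ the subtraction of two large numbers loses accuracy in floating point, so the sketch stops at $10^{-4}$.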
5 J.69. Determine the limit
7tx
lim (1 — x) tan —.
o
5.J.70. Calculate
lim  ( — — atari a;
o
5.J.71. Using l'Hospital's rule, determine
lim ((z*-2i\x\.
I-S- + 00 V V / /
O
5 J.72. Calculate
lim [---7,-        I .
x-n \21na     x2 - 1 /
O
5.J.73. By l'Hospital's rule, calculate the limit
r  ( 2V2
lim     cos —
i-s-+oo y    x J
o
5.J.74. Determine
lim (1 — cos
i-s-0
l)smI
o
5.J.75. Determine the following limits
lim x1^, lim  x1^,
x—>0+ x—>-+oo
where aeRis arbitrary. O 5.J.76. By any means, verify that
lim- = 1.
i-s-o a
o
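5.J.77. By applying the ratio test (also called D'Alembert's criterion; see 5.4.5), determine whether the infinite series
(a) $\sum_{n=1}^{\infty} \frac{2^n (n+1)^3}{3^n}$;
(b) $\sum_{n=1}^{\infty} \frac{6^n}{n!}$;
(c) $\sum_{n=1}^{\infty} \frac{n^n}{n^2 \cdot n!}$
converge.
Solution. Since ($a_n > 0$ for all $n$)
(a) $\lim_{n \to \infty} \frac{a_{n+1}}{a_n} = \lim_{n \to \infty} \frac{2^{n+1}(n+2)^3 \cdot 3^n}{3^{n+1} \cdot 2^n (n+1)^3} = \lim_{n \to \infty} \frac{2(n+2)^3}{3(n+1)^3} = \lim_{n \to \infty} \frac{2n^3}{3n^3} = \frac{2}{3} < 1$;
(b) $\lim_{n \to \infty} \frac{a_{n+1}}{a_n} = \lim_{n \to \infty} \left(\frac{6^{n+1}}{(n+1)!} \cdot \frac{n!}{6^n}\right) = \lim_{n \to \infty} \frac{6}{n+1} = 0 < 1$;
(c) $\lim_{n \to \infty} \frac{a_{n+1}}{a_n} = \lim_{n \to \infty} \left(\frac{(n+1)^{n+1}}{(n+1)^2 (n+1)!} \cdot \frac{n^2 \cdot n!}{n^n}\right) = \lim_{n \to \infty} \frac{n^2}{(n+1)^2} \cdot \lim_{n \to \infty} \frac{(n+1)^n}{n^n} = 1 \cdot \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = 1 \cdot e > 1$,
the series (a) converges; (b) converges; (c) does not converge (it diverges to $+\infty$). □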
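The ratios in cases (a) and (c) can be evaluated exactly for a large index. The sketch assumes the terms $a_n = 2^n (n+1)^3 / 3^n$ for (a) and $a_n = n^n / (n^2 \cdot n!)$ for (c), as read off from the ratios computed in the solution; the function names are illustrative.

```python
from fractions import Fraction
from math import e, factorial

def term_a(n):
    """Assumed term of series (a): 2^n (n+1)^3 / 3^n."""
    return Fraction(2**n * (n + 1) ** 3, 3**n)

def term_c(n):
    """Assumed term of series (c): n^n / (n^2 * n!)."""
    return Fraction(n**n, n**2 * factorial(n))

def ratio(term, n):
    """Exact quotient a_(n+1) / a_n."""
    return term(n + 1) / term(n)

ratio_a = float(ratio(term_a, 1000))   # should be close to 2/3
ratio_c = float(ratio(term_c, 1000))   # should be close to e

print(ratio_a, ratio_c)
```

Exact rational arithmetic avoids overflow: the individual terms are astronomically large or small, while their quotient is a moderate number.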
5.J.78.   By applying the root test (Cauchy's criterion), determine whether or not the infinite series
oo
(a) E
ln"(n+l) '
77=1 oo     (üilV
(b) E
(c) E arcsin™ f£
71=1
converge.
Solution. Consider the series with non-negative terms only, where
(a) lim $/ö^ = lim . ,1   - = 0< 1;
77—soo 77—soo mV"Ti;
(mV lim (1+i)"
(b) lim ^/ö~ = lim ^ " 7   = - = f < 1;
(c) lim        = lim aresin     = aresin 0 = 0 < 1.
77—soo 77—soo
So each of the above series converges. □
5.J.79.   Determine whether or not the series
00
(a) yj(-l)"ln(l + i);
77=1
(b) E
77=1
(C)   E (6+(-l)")" 77=1
converge.
Solution. The case (a). By l'Hospital's rule,
m(l + -L) 1+j^ (1+2^) ,
lim  — jj*   =   lim —2X, 1 - =   lim = 1,
x—s+oo       2x x—s+oo ) x—s+oo l-"- 2^
hence
0<ln(l + i)
for all sufficiently large n eN. However, the series E^Li    is convergent. So
E In (1 + ^r) < +00.
77=1
The series converges (absolutely). The case (b). The ratio test gives
$$\lim_{n \to \infty} \left|\frac{a_{n+1}}{a_n}\right| = \lim_{n \to \infty} \frac{2^{(n+1)^2} \cdot n!}{(n+1)! \cdot 2^{n^2}} = \lim_{n \to \infty} \frac{2^{2n+1}}{n+1} = +\infty.$$
Thus the series does not converge.
The case (c). Use the general version of the root test
lim sup \/\ an = lim sup g+Aw = f < 1-
77—soo 77—soo
It follows that the series is (absolutely) convergent. □ 5.J.80.   Determine whether or not the following alternating series converge:
72+377-1 . (3t7-2)2 '
4-3773
(5t73-2)-4™
(a) E +3"~1'
^2 (—i)™"1   3774-3t73+977-1
77=1
Solution. Case (a). Since
lim "2+3_"721 = lim ^4 = k ^ 0,
77—s-oo    w™    Z> 77—s-oo an a
it follows that the limit
does not exist. Therefore, the series does not converge since a necessary condition for the convergence is not satisfied.
The case (b). When applying the ratio (or root) test, the polynomials in the numerator or in the denominator do not affect the value of the considered limit. Consider the series
oo
71=1
for which
lim p^.   = ± < 1.
This means that the original series is also (absolutely) convergent. □ 5.J.81.   Does the following series converge?
$$\sum_{n=1}^{\infty} (-1)^{n+1} \arctan\frac{2}{\sqrt{3n}}$$
Solution. The sequence $\{2/\sqrt{3n}\}_{n \in \mathbb{N}}$ is decreasing and the function $y = \arctan x$ is increasing (on the whole real axis). So the sequence $\{\arctan(2/\sqrt{3n})\}_{n \in \mathbb{N}}$ is decreasing. Thus it is an alternating series such that the sequence of the absolute values of its terms is decreasing. Such an alternating series converges if and only if the sequence of its terms converges to zero (the Leibniz criterion), and this is satisfied:
$$\lim_{n \to \infty} \arctan\frac{2}{\sqrt{3n}} = \arctan 0 = 0, \quad \text{i.e.} \quad \lim_{n \to \infty} \left((-1)^{n+1}\arctan\frac{2}{\sqrt{3n}}\right) = 0.$$
□
5. J.82.   Determine whether the series
oo
(a) e
77=1
OO
ru\ cosO")
71=1
converges absolutely, converges conditionally, or does not converge at all. Solution. The case (a). This series converges absolutely. For instance,
oo oo oo
e I W I < e £ < e £ = 2,
77=1 77 = 1 77 = 0
and the second inequality is already proven.
The case (b). cos (im) = (—1)™, n G N. So it is an alternating series such that the sequence of the absolute values of its terms is decreasing. Therefore, from the limit
lim -4=^ = 0
it follows that the series is convergent. On the other hand,
OO    I ,   I oo oo
Ecos im v—v l       ^ v—v   l .
77=1 1 1 77 = 1 77=1
The series converges conditionally. □ 5.J.83.   Calculate the series
(a) e
(A___i—
(b) e £;
71 = 0
OO
(c) e       + 4^
77=1
(d) e
77=1
(e) E -1
(3n+l)(3n+4) ' n=0
Solution. The case (a). By the definition,
OO /
(_i_____
n=l v
lim (f i       i Uf4.-4.H-"+^
n—¥oo
^o(1 + (-^ + ^) + -" + (-^ + ^)-7sit) = 1-
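The case (b). This is a convergent geometric series with the common ratio $q = 1/3$, hence
$$\sum_{n=0}^{\infty} \frac{5}{3^n} = 5 \sum_{n=0}^{\infty} \left(\frac{1}{3}\right)^n = 5 \cdot \frac{1}{1 - \frac{1}{3}} = \frac{15}{2}.$$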
The case (c). By substituting m = n—l,
OO OO OO
E  (42"-1 + I2™") = 4  £  (42^2") + ig  E  (42^2) = n=l n=l n=l
00 00
U "T" 16/  2_<  42m       16  2_<   \16/ 16 ' 1--L 15'
m=0 m=0 16
The series of linear combinations was expressed as a linear combination of a series (to be more precise, as a sum of series factoring out the constants). This is a valid modification supposing the obtained series are absolutely convergent. The case (d). From the partial sum
sn = k + w + w + ••• + £-, neN,
we obtain
Thus
Since lim ö_ft = 0,
n—too
^ + ^ + ■■■ + ^+3^, neK
§ + ^ + ^ + ■■■+3^-3^,
12(^ = 1(^-1) = !-
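The case (e). It suffices to use the partial fraction decomposition
$$\frac{1}{(3n+1)(3n+4)} = \frac{1}{3} \cdot \frac{1}{3n+1} - \frac{1}{3} \cdot \frac{1}{3n+4}, \qquad n \in \mathbb{N} \cup \{0\},$$
which gives
$$\sum_{n=0}^{\infty} \frac{1}{(3n+1)(3n+4)} = \lim_{n \to \infty} \frac{1}{3}\left(1 - \frac{1}{4} + \frac{1}{4} - \frac{1}{7} + \frac{1}{7} - \frac{1}{10} + \cdots + \frac{1}{3n+1} - \frac{1}{3n+4}\right) = \lim_{n \to \infty} \frac{1}{3}\left(1 - \frac{1}{3n+4}\right) = \frac{1}{3}.$$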
□
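The telescoping in case (e) can be verified with exact rational arithmetic. In this sketch, `partial_sum` adds the terms $1/((3n+1)(3n+4))$ directly, while `closed_form` is the expression $\frac{1}{3}\left(1 - \frac{1}{3N+4}\right)$ from the solution; both names are illustrative.

```python
from fractions import Fraction

def partial_sum(N):
    """Sum of 1/((3n+1)(3n+4)) for n = 0, ..., N, computed term by term."""
    return sum(Fraction(1, (3 * n + 1) * (3 * n + 4)) for n in range(N + 1))

def closed_form(N):
    """Telescoped value (1/3) * (1 - 1/(3N+4))."""
    return Fraction(1, 3) * (1 - Fraction(1, 3 * N + 4))

for N in (0, 1, 10, 100):
    print(N, partial_sum(N), closed_form(N))
```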
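5.J.84. Verify that
$$\sum_{n=1}^{\infty} \frac{1}{n^2} \le \sum_{n=0}^{\infty} \frac{1}{2^n}.$$
Solution. We have
$$1 \le 1, \qquad \frac{1}{2^2} + \frac{1}{3^2} < 2 \cdot \frac{1}{2^2} = \frac{1}{2}, \qquad \frac{1}{4^2} + \frac{1}{5^2} + \frac{1}{6^2} + \frac{1}{7^2} < 4 \cdot \frac{1}{4^2} = \frac{1}{4},$$
or the general bound
$$\frac{1}{(2^n)^2} + \cdots + \frac{1}{(2^{n+1} - 1)^2} < 2^n \cdot \frac{1}{(2^n)^2} = \frac{1}{2^n}, \qquad n \in \mathbb{N}.$$
By comparing the terms of both series, we get the desired inequality. It follows that the series $\sum_{n=1}^{\infty} \frac{1}{n^2}$ is absolutely convergent.
Note that
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < 2 = \sum_{n=0}^{\infty} \frac{1}{2^n}.$$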
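The block-by-block comparison behind this bound (the terms $1/m^2$ with $2^k \le m < 2^{k+1}$ are compared with the single term $1/2^k$) can be checked exactly for the first few blocks. This is a sketch; `block_sum` is an illustrative name.

```python
from fractions import Fraction

def block_sum(k):
    """Sum of 1/m^2 over m = 2^k, ..., 2^(k+1) - 1."""
    return sum(Fraction(1, m * m) for m in range(2**k, 2 ** (k + 1)))

blocks = [block_sum(k) for k in range(11)]
bounds = [Fraction(1, 2**k) for k in range(11)]

# Sum of 1/m^2 for m = 1, ..., 2^11 - 1, assembled from the blocks.
partial = sum(blocks)

for k in range(4):
    print(k, float(blocks[k]), float(bounds[k]))
print(float(partial))
```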
5.J.85.   Examine the convergence of the series
Solution. Add the terms of this series.
Y ln-±I.
77
77=1
Vln2±l= lim (In I + In § + In \ + ■ ■ ■ + In 2±I)
n—yoo
71=1
lim In 2-3-4---(n+i)      lim ln (n + 1) __ +00.
„    x 1-2-3---77 „    , \        1 /
77—SOO 77—SOO
do not converge. Solution. Since
and
__, arctan       n+1      ,    __ n3+n2_n
71=0 71=1
lim arctan "2+2"+3v^+4 __ iim ajctan ^ = f
77—S-OO 77—SOO 71 Z
lim = lim |^ = +00,
5.J.87.   Determine whether or not the series
00
y  1 °
Win n. '
77^+1 . 3 7
77=1
□
Thus the series diverges to +00. □ 5.J.86.   Prove that the series
77^+77^—77
the necessary condition lim an = 0 for the series yT=„ an to converge does not hold. □
converges.
Solution. From the inequalities (consider the graph of the natural logarithm)
1 < Inn < n,   71 > 3, 71 G N
it follows that
\/T < \/ln7i < ^/n,   71 > 3, 71 G N.
By the squeeze theorem (5.2.12),
lim A/Inn = 1,   i. e.     lim   -,1    = 1.
77—SOO 77—SOO Vln77
Thus the series is not convergent. Since all its terms are non-negative, it diverges to +00. □
5.J.88.   Determine whether or not the series
00
(a) y (n+i).3™; 77=0
00 2
0» E ^
(c) _1_
v ' 77 —In 77
77=1
converges.
Solution. All of the three enlisted series consist of non-negative terms only. So the series either converge, or diverge to +00.
(a) E Tn+k^ < E (i-r = T_i < +00;
77=0
77=0
(b) E     > E £ = E i = +00;
n=l n=l n=l
00 00
(c) E       > E 1 = +oo-
v 7 77 — Inn —  t—t n
n=l n=l
It follows that (a) converges; (b) diverges to +00; (c) diverges to +00. □
5.J.89. Begin with a square with sides of length a > 0. Construct a sequence of squares, each of which has as vertices the midpoints of the preceding square. Determine the sum of the areas and the sum of the perimeters of all these (infinitely many) squares. O
5.J.90. Let a sequence of rows of semidiscs be given, such that for each $n \in \mathbb{N}$, the $n$-th row contains $2^n$ semidiscs, each with radius $2^{-n}$. What is the area of an arbitrary figure consisting of all these semidiscs, supposing the semicircles do not overlap? O
5.J.91. Solve the equation
1 - tana; + tan2 x - tan3 x + tan4 x - tan5 x H----= ta^2+1 ■
5.J.92. Determine
EG^t + ä
o
o
5.J.93. Calculate
J2 Vn2 + 2n + l.
71=1
5.J.95. Calculate the series
(a) E ^
71=1
OO
(b) E
5.J.96. Sum the series
1-3 ^ 3-5 ^ 5-7 ^
V _1_
4^ (2t7-1)(2t7+1
5.J.97. Using the partial fraction decomposition, calculate
(a) E 7^1'
71 = 2
o
5.J.94. Prove the convergence of the series
and find its value.
E
3"+2"
o
o
o
(b) E
n=l
O
5.J.98. Determine the value of the convergent series
E 4n2-l' n=0
o
5.J.99. Calculate the series
V 1
n2+3n'
n=l
O
5J./00. In terms of
* ■ 77 1      2^3      4^5      6^7      8 ^ '
77=1
express the following two series
(!-§-*) + (§-*-§) + ■■■; (! + *-*) + (* + *-*) + ■■■
(both the series contain the same elements as the first one, only in a different order). O 5 J.101. Determine whether the series
~   2" + (-2)"
77 = 0
converges. O 5.J.102. Prove the following statement: If a series E^Lo a77 converges, then lim sin (3a„ + it) = 0. O
77—soo
oo      _ oo oo
5 J.103. For which a G R;   /3 G Z;   7 G R \ {0} do the series  E  ^ir1;        E  ^r1;        E -^converge?
77=120    n 77=240   n 77=360 7
o
5 J.104. Determine whether the series
E ( x^""g~5"°+2"
2
77=21
converges absolutely, converges conditionally, or does not converge at all. O 5 J.105. Determine whether or not the limit
lim (A + A + '-' + ^Sr)
n—too
is finite. Notice that one cannot use the sums
n2 —   6 '       2^    n2    — "h°°-n=l n=2
O
5.J.106. Find all real numbers A > 0 for which the series
oo
E(-l)nln(l+A2n)
77 = 1
is convergent. O 5.J.107. Recall that the harmonic series diverges.
Determine whether or not the series
1+       +9 + 11+       +19+21+       +29 +
..._|_J__|_..._|_J_J--1--1- ... J--1--1--1--u . . .
+ 91 +       + 99 + 111 +       + 119 + 121 +
is also divergent. O
5.J.108. Give an example of two divergent series E^ii a«> E^ii with positive numbers for which the series E^ii (3fln — 2b„) converges absolutely. O
5.J.109. Determine whether the two series
00 I    P\2 00 7 4
El    -\\n .       r»f    -i \n n —n+n
V    L)    (2n)!' L> n8+2ne+n
71=1 71=1
converge absolutely, converge conditionally, or do not converge at all. O 5 J.110. Does the series
OO ,—      c —
El _ 1 \n+l xfn+^n+1
n=l
converge? O 5 Jill. Find the values of the parameter p G R for which the series
OO
E (-l)"sin"£
71=1
converges. O 5 J.112. Determine the maximal subset of R where the function
. cos(^-21+cos *)+*-256*3
y = arctg [xZL + sin a;) ■ $-2+x^-
can be defined. O
5.J.113. Write the maximal domain of the function y = aicT|2_1  • O
_ arccos (In x)
5 J.114. Determine the domain and the range of the function
y ~ 2-3x'
Then determine the inverse function. O
5.J.115. Is the function	
(a) y =	cos X . X3 '
(b) y =	cos X    I    -1 . x3    + X'
(c) y =	cos X . X4 '
(d) y =	cos X     1    -| . x4   + 1'
(e) y =	sin a; + tan 1;
(f) y =	In f±2i; 1—X '
(g) y =	sinha; = e ~2e
(h) y =	cosh a; = S—t|-
with the maximal domain odd, even, or neither? O 5.J.116. Is the function
(a) y = sep;
(b) y = ^ + 1;
(c) y = ^;
(d) y =       + 1;
(e) y ■
(f) y = in £f;
(g) y = sinha; = eX~e ';
(h) y = coshx = eX+2e
with the maximal domain even, odd or neither? O 5.J.117. Determine whether the function
(a) y = sin x ■ In | x |;
(b) y = arccotga;;
(c) y = xs - {/3x6 + 3a;2 - 6;
(d) y = cos (tt — x);
(e) V = tanx+x ^ '  "      3+7 cos x
with the maximal domain is odd or even. O 5 J.118. Is the function
(a) y = In (cos x);
(b) y = tan (3a;) + 2 sin (6a;)
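with maximal domain periodic? O
5.J.119. Draw the graphs of the functions $f(x) = e^{|x|}$, $x \in \mathbb{R}$, and $g(x) = \ln|x|$, $x \in \mathbb{R} \setminus \{0\}$. O
5.J.120. Draw the graph of the function $y = 2^{-|x|}$, $x \in \mathbb{R}$. O
5.J.121. The functions
$$\sinh x = \frac{e^x - e^{-x}}{2},\ x \in \mathbb{R}; \qquad \cosh x = \frac{e^x + e^{-x}}{2},\ x \in \mathbb{R};$$
$$\tanh x = \frac{\sinh x}{\cosh x},\ x \in \mathbb{R}; \qquad \coth x = \frac{\cosh x}{\sinh x},\ x \in \mathbb{R} \setminus \{0\}$$
are called hyperbolic functions. Determine the derivatives of these functions on their domains. O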
5.J.122. At any point $x \in \mathbb{R}$, calculate the derivative of the area hyperbolic sine (denoted arsinh), the function inverse to the hyperbolic sine $y = \sinh x$ on $\mathbb{R}$. O
Remark. The inverse functions to the hyperbolic functions $y = \cosh x$, $x \in [0, +\infty)$, $y = \tanh x$, $x \in \mathbb{R}$, and $y = \coth x$, $x \in (-\infty, 0) \cup (0, +\infty)$, are also called area hyperbolic functions ($y = \operatorname{arsinh} x$ belongs to them, too). They are denoted arcosh, artanh, arcoth, respectively, and are defined for $x \in [1, +\infty)$, $x \in (-1, 1)$, and $x \in (-\infty, -1) \cup (1, +\infty)$, respectively. Let us add that
$$(\operatorname{arcosh} x)' = \frac{1}{\sqrt{x^2 - 1}},\ x > 1, \qquad (\operatorname{artanh} x)' = \frac{1}{1 - x^2},\ |x| < 1, \qquad (\operatorname{arcoth} x)' = \frac{1}{1 - x^2},\ |x| > 1.$$
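The three derivative formulas of the Remark, namely $1/\sqrt{x^2-1}$ for arcosh and $1/(1-x^2)$ for artanh and arcoth, can be compared with central difference quotients. The sketch uses `math.acosh` and `math.atanh` from the Python standard library; since the library has no arcoth, it is expressed here through the identity $\operatorname{arcoth} x = \operatorname{artanh}(1/x)$ for $|x| > 1$. The sample points are arbitrary.

```python
import math

def central_diff(f, x, h=1e-6):
    """Symmetric difference quotient approximating f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def arcoth(x):
    """Inverse hyperbolic cotangent for |x| > 1, via artanh(1/x)."""
    return math.atanh(1 / x)

# (name, function, claimed derivative, sample point inside the domain)
checks = [
    ("arcosh", math.acosh, lambda x: 1 / math.sqrt(x * x - 1), 2.0),
    ("artanh", math.atanh, lambda x: 1 / (1 - x * x), 0.5),
    ("arcoth", arcoth, lambda x: 1 / (1 - x * x), 2.0),
]

errors = {
    name: abs(central_diff(f, x) - derivative(x))
    for name, f, derivative, x in checks
}
print(errors)
```

Although artanh and arcoth share the same formula for the derivative, they live on disjoint domains, which is why the sketch samples them at different points.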
5.J.123.   Calculate the sum of the series:
2     12 12 2 + 1+2! + 3!+4! + 5! + 6!+'"
Solutions to the exercises
SA.7. P(x) = (-§ - f *)x2 + (2 + 3t)x -I-ft.
5A.15. Sought spline differs from the one in 5.A. 14 only in the values of the derivatives at the points —1 and 1. Similarly to the previous task, we get that the parts Si and S2 of our spline have the forms Si (x) = ax3 + bx2 + 1 and S2 (x) = —ax3 + bx2 + 1, respectively, where a, b, c, d are unknown real parameters. Confronting this with the conditions Si(—1) = 0, Si(—1) = 1, &(1) =0, and S2(l) = 1 yields the system
-a + 6+1 = 0, 3a - 2b = -1,
having the solution a = —3, b = —4. Hence, the wanted spline is the function
$$S(x) = \begin{cases} -3x^3 - 4x^2 + 1 & \text{for } x \in [-1, 0], \\ 3x^3 - 4x^2 + 1 & \text{for } x \in [0, 1]. \end{cases}$$
■ 2x - 4.
5A.17. (2x2 - 5) /3; eg. (fx2 - |)3. 5A.18. a = 1, b = -2, c = 0, d = 1. 5A.19. x3 + x2 - x + 2. 5A.20. Infinitely many.
5A.21. P(x) = x3 - 2x2 + 5x - 3; Q(x) = x3 - 2x2 + 3x - 3. SA.22. x5 - 2x4 -5x + 2. 5A.23. x2.
SA.24. x3 - 2x + 5; x3 - x + 6. 5A.25. Infinitely many. 5A.26. Eg. x2 - 3x + 6.
5A.27. Si(x) = \ (x + 1)3 - f (x + 1) + 1, x e [-l,0];S2(x) = -\x3 + f x2,x e [0,1]. 5^4.2«. Si(x) = i (x + 1)3 - f (x + 1) + 1, x e [-l,0];S2(x) = -ii3+|i2,x£ [0,1]. 5^4.2P. Si (x) = x; S2 (x) = x. 5A.30.Si{x) = l;S2(x) = 1.
5^4.3/. Si(x) = x + 3, x e [-3 + i - 1, -3 + i\\i e {1, 2}.
5.A.32. Si(x) = 1 - § x + j-0 x3; S2(x) = \ - f (x - 1) + £ (x - l)2 - ^ (x - l)3. 5.B.2.
sap A = 6, inf A = —3;
sup B = i, inf £> = — 1;
supC = 9, infC = -9.
5.B.3. It can easily be shown that
3
sup A = -,       inf A = 0.
5.B.4. Clearly
inf'N = l,    supM = Q,    inf .7 = 0,   supj = b. 5.B.5. We can, for instance, set
M := Z \ N;       iV := N. 5.B.6. Consider any singleton (one-element set) Xcl
5.B.7. The set C must be a singleton. Thus, let us choose C = {0}, for example. Now we can take A= (—1,0), B = (0,1). 5.C.5. We have
/ 1       2              n-2     n-l\      ,.     /l + n-1   n-l\ 1 lim    — + — + • • • +-— +-—    = km ' 1
n^oo v n2     n2 n2 n2   j     n^oo v      n2 2/2
5.C.6. It can easily be shown that
lim
II >'X:
y/n3 - lln2 + 2 + \/ri
- 2n5
■ n + sin n
f/5n4 + 2n3 + 5
5.C.7. The limit is equal to 1. S.C.8. We can, for instance, set
y-n := —n + 1,
n e N.
5.C.9. The answer is ±1. 5.C./0. The result is
lim sup an =
n—>oa
5.C.11. We have
11m inf ((-!)" (l +
lim inf i
+ sin •
72 2
5D.5. The examined function is continuous on the whole R.
5.D.6. The function is continuous at the points $-\pi$, $0$, $\pi$; only right-continuous at the point 2; only left-continuous at the point 3; and continuous from neither side at 1.
5.D.7. It is necessary to set $f(0) := 0$.
5.D.8. The function is continuous iff $p = 2$.
5.D.9. The correct answer is $a = 4$.
5.D.10. It holds that
lim
lim
- 0.
5.D.13. The only solution is $x = -1$.
5.D.14. It even has two roots, because $P(-1) > 0$, $P(0) < 0$ and $P(1) > 0$, so by 5.2.19 there is one root in $(-1, 0)$ and one in $(0, 1)$.
5.E.4.f'{x) = 2xlnx-1 -Inx.
5.E.5. (sin x)1+cos x (cot2 x - In (sin x)).
S£.7. f -^0.003.
5.E.8. a ss f + 0.01; b sa 4.125.
j.c.y. a; 2     360 , d; 2 -i- 360 .
5.E.13. (a) 12 ft/s; (b) $-59.5$ ft²/s; (c) $-1$ rad/s.
5.E.14. The slope of the tangent line of the polynomial P is given by the derivative of the polynomial. Consider P(0) and P(2). Yes, there is.
■ ■ -   , _ rcn x xo ' = 2x.
5.E.15. 5.E.16. 5.E.17.
5.E.18. y-lJf-5.E.19.[^2\\.
5.E.20. t:y=f + |;n:j/ = -6x + 15
yZ(x + l);y = -&(x + l).
¥ = (if-¥)(--D;j/-
In n
2
j(x-l).
3 |+ 11
5.E.21. $\pi/4$.
5.E.22. $y = 2 - x$; $y = x$.
5.E.23. The inequalities follow, for instance, from the mean value theorem (attributed to Lagrange) applied to the function $y = \ln(1 + t)$, $t \in [0, x]$.
5.I.6. $r = +\infty$.
5.I.7. 1.
5.I.8. 3.
5.I.9. $[-1, 1]$.
5.1.10. x e [2-i,2 + i].
5.I.11. It is.
5.I.12.
(a) True.
(b) False.
(c) False.
(d) True.
5.I.14. The error lies in the interval $(0, 1/200)$.
5.U5. E°° n      -!)"; E°° n ^
5.1.16. f{x) = x, x e R;itis. 5.7.77. It does not.
5.1.18. (a) 1 - + j§^/, (b) | - ^r-
5 7/9 r™ _1_x2n+1
o.i.±y. l^n=0 (2n+i) „! x
5.1.20. a > 1.
5.7.2/. [-^2, ^2).
5.I.22. For $x \in [-1, 1]$.
5.I.23. $x > 2$.
5.I.24. The series is absolutely convergent.
5.I.25. $\ln(3/2)$.
5.7.26. ^.
5.7.27. (a)ilni±f;(b)^. 5.7.2S. 2/9.
5.7.29. xe^.
5.J.1. $x^4 + 2x^3 - x^2 + x - 2$. 5.J.2. $x^4 + 2x^3 - 2x^2 + x + 2$. 5.J.3. $x^4 + 3x^3 - 3x^2 - x - 1$.
5.7.4. For every e > 0, it suffices to assign to the e-neighborhood of the point —2 the 5-neighborhood of the point 0 given by
s n- 8,      8 = s,
and without loss of generality, we can assume that $\varepsilon < 1$, since for $\varepsilon \ge 1$ we can set $\delta = 1$. 5.J.5. Existence of the limit and the equality
$$\lim_{x\to -1} \frac{(1+x)^2 - 3}{2} = -\frac{3}{2}$$
follows from the choice $\delta := \varepsilon$ for $\varepsilon \in (0, 1)$.
5.J.6. Since - (x - 2)4 < x for x < 0, we get 3 (x - 2)4/2 > -x for x < 0. 5 J. 7. As
$$\lim_{x\to 0+} \arctan\frac{1}{x} = \frac{\pi}{2}, \qquad \lim_{x\to 0-} \arctan\frac{1}{x} = -\frac{\pi}{2},$$
the considered limit does not exist.
5.J.8. The former limit equals $+\infty$, the latter does not exist.
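5.J.9. The limit can be determined in many ways. For instance:
$$\lim_{x\to 0}\frac{\tan x - \sin x}{\sin^3 x} = \lim_{x\to 0}\left(\frac{\tan x - \sin x}{\sin^3 x}\cdot\frac{\cot x}{\cot x}\right) = \lim_{x\to 0}\frac{1 - \cos x}{\cos x\cdot\sin^2 x} = \lim_{x\to 0}\frac{1 - \cos x}{\cos x\,(1 - \cos^2 x)} = \lim_{x\to 0}\frac{1}{\cos x\,(1 + \cos x)} = \frac{1}{2}.$$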
5.J.10. We have
$$\lim_{x\to\pi/6} \frac{2\sin^3 x + 7\sin^2 x + 2\sin x - 3}{2\sin^3 x + 3\sin^2 x - 8\sin x + 3} = \lim_{x\to\pi/6}\frac{\sin x + 1}{\sin x - 1} = -3.$$
5.J.11. We have
$$\lim_{x\to 1}\frac{x^m - 1}{x^n - 1} = \frac{m}{n}.$$
5.J.12. After multiplying by the fraction
$$\frac{\sqrt{x^2 + x} + x}{\sqrt{x^2 + x} + x}$$
it follows that
$$\lim_{x\to+\infty}\left(\sqrt{x^2 + x} - x\right) = \frac{1}{2}.$$
5.J.13. We have
$$\lim_{x\to+\infty}\left(x\sqrt{1 + x^2} - x^2\right) = \frac{1}{2}.$$
5.J.14. We have
$$\lim_{x\to 0}\frac{\sqrt{2} - \sqrt{1+\cos x}}{\sin^2 x} = \frac{\sqrt{2}}{8}.$$
5.J.15. By extending the given fraction, we obtain
$$\lim_{x\to 0}\frac{\sin(4x)}{\sqrt{x+1} - 1} = 8.$$
5.J.16. We have
$$\lim_{x\to 0}\frac{\sqrt{1+\tan x} - \sqrt{1 - \tan x}}{\sin x} = 1.$$
5J.17.
2X + y/l + x2 - x9 - 7x5 + AAx2 7
lim - —- = —.
3X + V6x6 + x2 - 18x5 - 592x4 18
5J.18. The statement is false. For example, consider
f(x):=—,    x£ (-oo,0);       g(x) := x, lEl.
5.J.19.
(   n   \2,1-1
lim (-- I        = e~10.
moo \n + o J
5J.20.-|.
5J.2/. / (x) < 0, x > e.
5J.22. The function has a local maximum at the point xi = e~2. It has a local minimum at the point x2 = 1. 5 J.23. No: if a = V2/2, there is only a local extrémům at the point. 5J.2„ 2 = e i - In ±. 5J.25. ^.
5J.26. 4 = p (-1) = p (2), -16 = p (-3).
5.J.27. (a) $v(0) = 6$ m/s; (b) $t = 3$ s, $s(3) = 16$ m; (c) $v(4) = -2$ m/s, $a(4) = -2$ m/s².
5.J.28. f (xo) =
5.J.29. It does not, because the one-sided derivatives differ (specifically: $\pi/2$ from the right and $-\pi/2$ from the left).
5.J.30. It does.
5.J.31. It does not.
5.J.32. $f(x) := |x - 5| + |x - 9|$.
5.J.33. Let $f = g = 1$ at the rational numbers and $f = g = -1$ at the irrational numbers.
5.J.34. (a) x2 sinx; (b) cos (sins) • cosx; (c) ff^f cos (in (x3 + 2s)); (d) ■
5J.35. (a) | x~ s; (b) cosecx = -4—.
5.J.36. cos x • cos (sin x) • cos (sin (sin x)).
5J.37. f (x) = Vl+l_x, H,i6 (1-V2.1 + 4
5J.3S.
3v sinz x
5J-39.^+x2^. 5J.40. -8.
In2 (x + \/l + s2), x e R. 5J.4J. / (x) = -I (log, e)2, x > 0, x + 1.
$[f(x)g(x)h(x)k(x)]' = f'(x)g(x)h(x)k(x) + f(x)g'(x)h(x)k(x) + f(x)g(x)h'(x)k(x) + f(x)g(x)h(x)k'(x)$.
r J Ac   'j? o+l)2 / 3    ,   _J_   ,   _^___2_\
J-J-*J- ^l-1        3(s+2)        x+3 j-
5.J.52. The inscribed rectangle has sides of lengths $x$ and $(\sqrt{3}/2)(a - x)$, thus its area is $(\sqrt{3}/2)(a - x)x$. The maximum occurs for $x = a/2$,
hence the greatest possible area is $(\sqrt{3}/8)a^2$.
5.J.53. 4 m × 4 m × 2 m.
5.J.54. $28 = 24 + 4$.
5.J.55. $a = 1$.
5.J.56. $2\sqrt{5}\,r$.
5 J.57. It is the square with sides of length c). 5J.58.h= ffl, r = 2^1 R
5 J.59. It is the equilateral triangle (with area 5J.60. [2,-1/2], [-2,-1/2]. 5J.6/. /i = 2r.
5.J.62. The closest point is $[1, 1]$, the distance is then $2\sqrt{2}$.
5.J.63. The closest point is $[-1, 1]$, the distance is $3\sqrt{2}$.
5.J.64. $t = 1.5$ s, the distance between them will be $\sqrt{5}$ units.
5J.65. It will happen at the time t = ^ s, the distance being -^p units.
5.J.66. P = irrv + 7rr2 =>■ u = p^.r   =>■ V = |r (P — 7rr2). The extremum is at r = ^J~£;, the substitution gives V = cm3.
5.J.67. At about 3 414 products per day. 5.J.68. Triple use of l'Hospital's rule gives
$$\lim_{x\to 0}\frac{\sin x - x}{x^3} = -\frac{1}{6}.$$
5.J.69. $2/\pi$. 5.J.70.
$$\lim_{x\to\frac{\pi}{2}-}\left(\frac{\pi}{2} - x\right)\tan x = 1.$$
5.J.71.
lim
X—S- + OO
(0
3- - 2*
)
5J.72.1/2. 5 J. 73. We have
lim
X—S- + OO
cos —
X
2
)
= e
-2
5.J.74. By applying l'Hospital's rule twice, one obtains
$$\lim_{x\to 0}(1 - \cos x)^{\sin x} = e^0 = 1.$$
5 J. 75. In both cases, the result is e°.
5J.76. The limit is easily calculated by l'Hospital's rule, for instance. 5J.89. 2a2; la (2 + v/2). 5J.90. tt/2.
5.J.91. x = f + kir, x =     + kir, k e Z.
5.J.92. 5.
5.J.93. $+\infty$.
5.J.94. 3/2.
5.J.95. (a) 3; (b) 9/4.
5.J.96. 1/2.
5.J.97. (a) 3/4; (b) 1/4. 5.J.98. $-1/2$. 5.J.99. 11/18.
5.J.100. $s/2$; $3s/2$ ($s = \ln 2$). 5.J.101. It does.
5.J.102. It suffices to consider the necessary condition for convergence, namely $\lim_{n\to\infty} a_n = 0$. 5.J.103. $\alpha > 0$; $\beta \in \{-2, -1, 0, 1, 2\}$; $\gamma \in (-\infty, -1) \cup (1, +\infty)$. 5.J.104. It is absolutely convergent. 5.J.105. The limit is 1/2. 5.J.106. $A \in [0, 1)$.
5.J.107. The value of the given series is finite, i.e. the series converges. 5.J.108. For example: $a_n = n/3$, $b_n = n/2$, $n \in \mathbb{N}$.
5.J.109. The former series converges absolutely; the latter converges conditionally. 5.J.110. It does. 5.J.111. $p \in \mathbb{R}$. 5.J.112. $\mathbb{R}$. 5.J.113. $(1, e]$.
5.J.115. (a) yes; (b) no; (c) no; (d) no; (e) yes; (f) yes; (g) yes; (h) no. 5.J.116. (a) no; (b) no; (c) yes; (d) yes; (e) no; (f) no; (g) no; (h) yes. 5.J.117. The functions (a), (e) are odd; the functions (c), (d) are even. 5.J.118. It is periodic. The prime period is (a) $2\pi$; (b) $\pi/3$.
5.J.119. The functions $f$ and $g$ are even, so it suffices to consider the graphs of the functions $y = e^x$, $x \in [0, +\infty)$ and $y = \ln x$, $x \in (0, +\infty)$.
5.J.120. The given function is even, so to draw its graph, it suffices to know the graph of the function $y = 2^x$, $x \in (-\infty, 0]$.
5.J.121. $(\sinh x)' = \cosh x$; $(\cosh x)' = \sinh x$; $(\tanh x)' = \frac{1}{\cosh^2 x}$; $(\coth x)' = -\frac{1}{\sinh^2 x}$.
5.J.114. (-00, |) U (f ,+00'
i)u(-i,+oo);i,
3x + l'      ' 3'
5.J.122.
Vi+x2 ■
5.J.123. Consider the series with the expansions of the functions sinh and cosh into power series. The result is
sinh(l) + 2cosh(l).
CHAPTER 6
Differential and integral calculus
we already have the menagerie, but what shall we do with it? - we'll learn to control it...
A. Derivatives of higher orders
First we'll introduce a convention for denoting the derivatives of higher orders: we'll denote the second derivative of a function $f$ of one variable by $f''$ or $f^{(2)}$, and derivatives of third or higher order only by $f^{(3)}, f^{(4)}, \dots$. As a reminder, we'll start with a slightly cunning problem using "only" first derivatives.
6.A.I.   Determine the following derivatives:
i) $(x^2 \cdot \sin x)''$,
ii) $(x^x)''$,
m (^)(3),
iv) $(x^n)^{(n)}$,
v) $(\sin x)^{(n)}$.
Solution. (a) $(x^2 \cdot \sin x)'' = (2x \sin x + x^2 \cos x)' = 2\sin x + 4x\cos x - x^2 \sin x$.
(b) $(x^x)'' = [(1 + \ln x)\,x^x]' = x^{x-1} + x^x (1 + \ln x)^2$.
(c) (-?L-\W _ _1___6_
In the previous chapter, we were working either with an extremely large class of functions (for example, all continuous or all differentiable functions), or with only particular functions (for example exponential, trigonometric, polynomial). However, we had very few tools. We indicated how to discuss the local behaviour of functions near a given point by linear approximation. We learned how to measure instantaneous changes by differentiation.
Now we derive several results that will allow us to work with functions more easily when modeling real problems. We also deal with the task of summing infinitely many "infinitely small" changes, in particular, how to "integrate". In the last part of the chapter we come back to series of functions and complete several missing steps in the argumentation so far. We also add useful techniques for dealing with extra parameters in the functions, and we briefly introduce some further integration concepts.
1. Differentiation
6.1.1. Higher order derivatives. If the first derivative $f'(x)$ of a function $f$ of one real variable has a derivative $(f')'(x_0)$ at the point $x_0$, we say that the second derivative of the function $f$ (or second order derivative) exists. Then we write $f''(x_0) = (f')'(x_0)$ or $f^{(2)}(x_0)$. A function $f$ is two times differentiable on some interval if it has a second derivative at each of its points. Derivatives of higher orders are defined inductively:
k times differentiable functions
A function $f$ of one real variable is differentiable $k$ times at the point $x_0$ for some natural number $k > 1$, if it is differentiable $(k-1)$ times on some neighbourhood of the point $x_0$ and its $(k-1)$-st derivative has a derivative at the point $x_0$. We write $f^{(k)}(x)$ for the $k$-th derivative of the function $f(x)$.
If derivatives of all orders exist on an interval $A$, we say the function $f$ is smooth or infinitely differentiable on $A$.
We use the notation $C^k(A)$ for the class of functions with continuous $k$-th derivative on the interval $A$, where $k$ can attain values $0, 1, \dots, \infty$. Often we write only $C^k$, if the domain is known from the context. When $k = 0$, $C^0$ means continuous functions.
CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS
(d) $(x^n)^{(n)} = [(x^n)']^{(n-1)} = (n x^{n-1})^{(n-1)} = \cdots = n!$.
(e) $(\sin x)^{(n)} = \operatorname{Re}(i^n \sin x) + \operatorname{Im}(i^n \cos x)$. □
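The formula for the $n$-th derivative of the sine, $(\sin x)^{(n)} = \operatorname{Re}(i^n \sin x) + \operatorname{Im}(i^n \cos x)$, can be sanity-checked numerically against the classical identity $(\sin x)^{(n)} = \sin(x + n\pi/2)$. The following is only a minimal Python sketch; the function name `sin_nth_derivative` is our own and not part of the text.

```python
import math

def sin_nth_derivative(n, x):
    """n-th derivative of sin at x via Re(i^n sin x) + Im(i^n cos x)."""
    i_n = 1j ** n  # i^n cycles through 1, i, -1, -i
    return (i_n * math.sin(x)).real + (i_n * math.cos(x)).imag

# Compare with the shifted-sine form of the same derivative.
x = 0.7
for n in range(6):
    print(n, sin_nth_derivative(n, x), math.sin(x + n * math.pi / 2))
```

Both printed columns should agree up to rounding for every order $n$.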
6.A.2. Differentiate the expression
$$\frac{\sqrt[4]{x-1}\cdot(x+2)^3}{e^x (x+132)^2}$$
of the variable $x > 1$.
Solution. We'll solve this problem using the so-called logarithmic differentiation. Let $f$ be an arbitrary positive function. We know that
$$[\ln f(x)]' = \frac{f'(x)}{f(x)}, \quad\text{i.e.}\quad f'(x) = f(x)\cdot[\ln f(x)]',$$
if the derivative $f'(x)$ exists. This formula is useful because for some functions it is easier to differentiate their logarithm than the functions themselves. Such is the expression in our problem. We obtain
$$\begin{aligned}
\left(\frac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}\right)'
&= \frac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}\left[\ln\frac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}\right]' \\
&= \frac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}\left[3\ln(x+2) + \tfrac{1}{4}\ln(x-1) - x\ln e - 2\ln(x+132)\right]' \\
&= \frac{\sqrt[4]{x-1}\,(x+2)^3}{e^x(x+132)^2}\left[\frac{3}{x+2} + \frac{1}{4(x-1)} - 1 - \frac{2}{x+132}\right].
\end{aligned}$$
□
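The result of the logarithmic differentiation, $f'(x) = f(x)\left[\frac{3}{x+2} + \frac{1}{4(x-1)} - 1 - \frac{2}{x+132}\right]$ for $f(x) = \sqrt[4]{x-1}\,(x+2)^3 / (e^x (x+132)^2)$, can be compared with a finite-difference approximation. The sketch below is only an illustration; the helper names and the step size `h` are our own assumptions.

```python
import math

def f(x):
    """f(x) = (x-1)^(1/4) * (x+2)^3 / (e^x * (x+132)^2), defined for x > 1."""
    return (x - 1) ** 0.25 * (x + 2) ** 3 / (math.exp(x) * (x + 132) ** 2)

def f_prime_log(x):
    """Derivative obtained by logarithmic differentiation: f' = f * (ln f)'."""
    log_derivative = 3 / (x + 2) + 1 / (4 * (x - 1)) - 1 - 2 / (x + 132)
    return f(x) * log_derivative

def central_difference(func, x, h=1e-6):
    """Symmetric difference quotient, a numerical stand-in for the derivative."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (1.5, 2.0, 5.0):
    print(x, f_prime_log(x), central_difference(f, x))
```

The two printed values in each row should agree to many digits, which supports the closed form without proving it.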
6.A.3. Let $n \in \mathbb{N}$ be arbitrary. Find the $n$-th derivative of the function
$$y = \ln\frac{1+x}{1-x}, \quad x \in (-1, 1).$$
Solution. With respect to the equality
$$\ln\frac{1+x}{1-x} = \ln(1+x) - \ln(1-x),$$
we'll define an auxiliary function
$$f(x) := \ln(ax + 1), \quad x \in (-1, 1),\ a = \pm 1.$$
For $x \in (-1, 1)$ we can easily (sequentially) compute
$$f'(x) = \frac{a}{ax+1}, \quad f''(x) = \frac{-a^2}{(ax+1)^2}, \quad f^{(3)}(x) = \frac{2a^3}{(ax+1)^3}, \quad f^{(4)}(x) = \frac{-6a^4}{(ax+1)^4}.$$
Based on these results we can conjecture that
$$f^{(n)}(x) = \frac{(-1)^{n-1}(n-1)!\,a^n}{(ax+1)^n}. \tag{1}$$
We'll verify the validity of this formula by mathematical induction. It holds for $n = 1, 2, 3, 4$, so it suffices to show that its validity for $k \in \mathbb{N}$ implies its validity for $k + 1$. Because the direct computation yields
We illustrate the concept of higher order derivatives with polynomials. Because the derivative of a polynomial is a polynomial of degree one less than the original one, after a finite number of differentiations we obtain the zero polynomial. If $k$ is the degree of the polynomial, then exactly $k + 1$ differentiations yield zero. Then derivatives of all orders exist, hence $f \in C^\infty(\mathbb{R})$.
In the spline construction, see 5.1.9, we took care that the resulting functions belong to the class $C^2(\mathbb{R})$. Their third derivatives are piecewise constant functions. That is why the splines do not belong to $C^3(\mathbb{R})$, even though all their higher order derivatives are zero at all of the inner points of all single intervals in the interpolation. Think this example through in detail!
The next assertion is a combinatorial corollary of Leibniz's rule for differentiation of a product of two functions:
Lemma. If two functions $f$ and $g$ have derivatives of order $k$ at the point $x_0$, then their product also has a derivative of order $k$ and the following equality holds:
$$(f\cdot g)^{(k)}(x_0) = \sum_{i=0}^{k} \binom{k}{i} f^{(i)}(x_0)\,g^{(k-i)}(x_0).$$
Proof. For $k = 0$, the statement is trivial. For $k = 1$ it is Leibniz's product rule. Suppose the equality holds for some $k$. Differentiate the right hand side and use Leibniz's rule to obtain the expression
$$\sum_{i=0}^{k} \binom{k}{i}\left(f^{(i+1)}(x_0)\,g^{(k-i)}(x_0) + f^{(i)}(x_0)\,g^{(k-i+1)}(x_0)\right).$$
In this new sum, the sum of orders of the derivatives of the factors in all summands is $k + 1$ and the coefficients of $f^{(j)}(x_0)\,g^{(k+1-j)}(x_0)$ are the sums of binomial coefficients
$$\binom{k}{j-1} + \binom{k}{j} = \binom{k+1}{j}.$$
□
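The higher order Leibniz rule $(f\cdot g)^{(k)}(x_0) = \sum_{i=0}^{k}\binom{k}{i} f^{(i)}(x_0)\,g^{(k-i)}(x_0)$ can be checked exactly on polynomials with integer coefficients. The following Python sketch is an illustration only; the coefficient-list representation and all helper names are our own choices, and the two sample polynomials are arbitrary.

```python
from math import comb

def poly_mul(p, q):
    """Product of polynomials given as coefficient lists [c0, c1, ...]."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def poly_diff(p, order=1):
    """Derivative of the given order of a coefficient list."""
    for _ in range(order):
        p = [i * c for i, c in enumerate(p)][1:] or [0]
    return p

def poly_eval(p, x):
    """Horner evaluation of the polynomial at x."""
    result = 0
    for c in reversed(p):
        result = result * x + c
    return result

def leibniz_rhs(f, g, k, x0):
    """Right hand side of the Leibniz formula of order k at x0."""
    return sum(
        comb(k, i) * poly_eval(poly_diff(f, i), x0) * poly_eval(poly_diff(g, k - i), x0)
        for i in range(k + 1)
    )

f = [1, -2, 0, 3]      # 1 - 2x + 3x^3 (arbitrary example)
g = [4, 0, 5, 0, -1]   # 4 + 5x^2 - x^4 (arbitrary example)
for k in range(5):
    lhs = poly_eval(poly_diff(poly_mul(f, g), k), 2)
    print(k, lhs, leibniz_rhs(f, g, k, 2))
```

Because everything is computed in integers, the two columns must match exactly, not just approximately.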
6.1.2. The meaning of second derivative. We have already seen that the first derivative of a function determines its linear approximation in the neighbourhood of a given point. The sign of a nonzero derivative determines whether the function is increasing or decreasing at the point $x_0$. The points where the first derivative is zero are called the critical points or stationary points of the given function.
If $x_0$ is a critical point of a function $f$, there are several possibilities for the behaviour of the function $f$ in the neighbourhood of $x_0$. Consider the behaviour of the function $f(x) = x^n$ in the neighbourhood of zero for different
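$$\left(\frac{(-1)^{k-1}(k-1)!\,a^k}{(ax+1)^k}\right)' = \frac{(-1)^{k-1}(k-1)!\,a^k\,(-k)\,a}{(ax+1)^{k+1}} = \frac{(-1)^{k}\,k!\,a^{k+1}}{(ax+1)^{k+1}},$$
(1) holds for all $n \in \mathbb{N}$. Then
$$(\ln(1+x))^{(n)} = \frac{(-1)^{n-1}(n-1)!}{(1+x)^n}, \qquad (\ln(1-x))^{(n)} = -\frac{(n-1)!}{(1-x)^n}, \quad x \in (-1, 1).$$
From here we obtain the result
$$\left(\ln\frac{1+x}{1-x}\right)^{(n)} = (n-1)!\left(\frac{(-1)^{n-1}}{(1+x)^n} + \frac{1}{(1-x)^n}\right)$$
for $x \in (-1, 1)$ and $n \in \mathbb{N}$.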
□
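The closed formula $\left(\ln\frac{1+x}{1-x}\right)^{(n)} = (n-1)!\left(\frac{(-1)^{n-1}}{(1+x)^n} + \frac{1}{(1-x)^n}\right)$ can be spot-checked with exact rational arithmetic. For $n = 1$ and $n = 2$ it should reproduce the hand-computed derivatives $\frac{2}{1-x^2}$ and $\frac{4x}{(1-x^2)^2}$. This is a minimal sketch; the function name `nth_derivative` is an assumption of ours.

```python
from fractions import Fraction
from math import factorial

def nth_derivative(n, x):
    """n-th derivative of ln((1+x)/(1-x)) from the closed formula, for |x| < 1."""
    x = Fraction(x)
    return factorial(n - 1) * (
        Fraction((-1) ** (n - 1)) / (1 + x) ** n + 1 / (1 - x) ** n
    )

x = Fraction(1, 2)
print(nth_derivative(1, x), 2 / (1 - x ** 2))            # first derivative
print(nth_derivative(2, x), 4 * x / (1 - x ** 2) ** 2)   # second derivative
```

At $x = 0$ the formula gives $2\,(n-1)!$ for odd $n$ and $0$ for even $n$, in agreement with the fact that the function is odd.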
6.A.4. Determine the second derivative of the function $y = \operatorname{tg} x$ on its whole domain, i.e. for $\cos x \neq 0$. ○
6.A.5. Determine the fifth and the sixth derivative of the polynomial $p(x) = (3x^2 + 2x + 1)\cdot(2x - 6)\cdot(2x^2 - 5x + 9)$, $x \in \mathbb{R}$. ○
6.A.6. Using shortcuts, determine the 12th derivative of the function
$y = e^{2x} + \cos x + x^{10} - 5x^7 + 6x^3 - 7x + 3$, $x \in \mathbb{R}$. ○
6.A.7. Write the 26th derivative of the function
$f(x) = \sin x + x^{23} - x^{18} + 5x^{11} - 3x^8 + e^{2x}$, $x \in \mathbb{R}$. ○
6.A.8. Let $f$ be a given function and let $z$ be a point such that
$f(z) = 0$, $f'(z) = 0$, $f''(z) = 0$, $f^{(3)}(z) = 1$. Which of the following statements:
(a) the tangent line to the graph of the function $f$ at the point $[z, f(z)]$ is the $x$ axis;
(b) the function $f$ is not a polynomial of degree two;
(c) the function $f$ is increasing at the point $z$;
(d) the function $f$ does not have a strict local minimum at the point $z$;
(e) the point $z$ is an inflection point of the function $f$
are necessarily true? ○
6.A.9. Let the movement of a solid (the trajectory of a mass point) be described by the function
$s(t) = -(t - 3)^2 + 16$, $t \in [0, 7]$
in units m, s. Determine
(a) the initial (i.e. at time t = 0 s) velocity of the solid;
(b) the time and location at which the solid has zero velocity;
(c) the velocity and the acceleration of the solid at time t = 4 s.
Recall that velocity is the derivative of trajectory and acceleration is the derivative of velocity. O
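For the motion $s(t) = -(t-3)^2 + 16$ from 6.A.9, the velocity and the acceleration follow by differentiating by hand: $v(t) = s'(t) = -2(t-3)$ and $a(t) = v'(t) = -2$. The short Python sketch below only evaluates these hand-derived formulas at the requested times; the function names are our own.

```python
def s(t):
    """Position in metres at time t (seconds), t in [0, 7]."""
    return -(t - 3) ** 2 + 16

def v(t):
    """Velocity s'(t) in m/s, differentiated by hand."""
    return -2 * (t - 3)

def a(t):
    """Acceleration v'(t) in m/s^2, constant for this motion."""
    return -2

print("initial velocity:", v(0))
print("zero velocity at t = 3, position:", s(3))
print("at t = 4: velocity", v(4), "acceleration", a(4))
```

The values can be compared with the answer listed for 5.J.27 in the previous chapter, which concerns the same motion.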
$n$. For odd $n > 0$, $f(x)$ will be increasing on all $\mathbb{R}$, while for even $n$ it will be decreasing for $x < 0$ and increasing for $x > 0$. In the latter case, the function will attain its minimal value among points in the (sufficiently small) neighbourhood
of $x_0 = 0$.
The same argument applies to the function $f'$. If the second derivative is nonzero, its sign determines the behaviour of the first derivative. At the critical point $x_0$ the derivative $f'(x)$ is increasing if the second derivative is positive and decreasing if the second derivative is negative. If increasing, it is necessarily negative to the left of the critical point and positive to the right of it. In that case, $f$ is decreasing to the left of the critical point and increasing to the right of it. So $f$ attains its minimal value among all points from a (sufficiently small) neighbourhood of $x_0$ at $x_0$.
On the other hand, if the second derivative is negative at $x_0$, the first derivative is decreasing. Thus the first derivative is positive to the left of $x_0$ and negative to the right of it. Then $f$ attains its maximal value at $x_0$ among all values in a neighbourhood of $x_0$.
A function which is differentiable on $(a, b)$ and continuous on $[a, b]$ has an absolute maximum and minimum on this interval. Both can be attained only at the boundary of the interval or at a point where the derivative is zero. Thus it suffices to examine the critical points and the boundary points when looking for extremes. Second derivatives help to determine the type of extreme, if nonzero.
For a more precise discussion of the latter phenomena we consider higher order polynomial approximations of the functions. We return to the qualitative study of the behaviour of functions later on.
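The recipe for extremes on a closed interval can be tried on a small example. The sketch below uses the hypothetical function $f(x) = x^3 - 3x$ on $[-2, 3]$ (our own choice, not taken from the text); its critical points $\pm 1$ are found by hand from $f'(x) = 3x^2 - 3 = 0$, and the sign of $f''(x) = 6x$ classifies them.

```python
def f(x):
    return x ** 3 - 3 * x

def f_second(x):
    """Second derivative of f, computed by hand."""
    return 6 * x

a, b = -2.0, 3.0
critical_points = [-1.0, 1.0]  # roots of f'(x) = 3x^2 - 3

# Absolute extremes can only occur at the endpoints or at critical points.
candidates = [a, b] + [c for c in critical_points if a < c < b]
values = {x: f(x) for x in candidates}
x_max = max(values, key=values.get)
x_min = min(values, key=values.get)

def classify(c):
    """Type of a critical point according to the sign of the second derivative."""
    second = f_second(c)
    if second > 0:
        return "local minimum"
    if second < 0:
        return "local maximum"
    return "undecided"

print("absolute maximum", values[x_max], "at", x_max)
print("absolute minimum", values[x_min], "at", x_min)
for c in critical_points:
    print(c, classify(c))
```

Note that the absolute maximum here sits at the boundary point, not at the local maximum, which is why the endpoints must always be included among the candidates.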
6.1.3. Taylor expansion. As a surprisingly easy use of Rolle's theorem we derive an extremely important result. It is called the Taylor expansion with a remainder.¹
Consider the power series centered at $a$,
$$S(x) = \sum_{n=0}^{\infty} a_n (x - a)^n.$$
Differentiate it repeatedly to get the power series
$$S^{(k)}(x) = \sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1)\,a_n (x-a)^{n-k}.$$
Put $x = a$. Then $S^{(k)}(a) = k!\,a_k$. We can read the last statement as an equation for $a_k$ and rewrite the original series as
$$S(x) = \sum_{n=0}^{\infty} \frac{1}{n!}\,S^{(n)}(a)\,(x-a)^n.$$
(NB. A power series can be differentiated term by term. This is proved later.)
¹ Brook Taylor (1685-1731) was an English mathematician best known for his formalization of the polynomial approximations of functions, recognized by Lagrange as the "main foundation of differential calculus".
Taylor expansions. We need the derivatives of higher orders to determine the Taylor expansion of a given function.
6.A.10. Determine the Taylor expansions $T_x^k$ (of $k$-th order at the point $x$) of the following functions:
i) $T_0^3$ of the function $\sin x$,
ii) $T_1^3$ of the function $\frac{e^x}{x}$.
Solution. (i) We'll compute the values of the first, second and third derivative of the function $f = \sin$ at the point 0: $f'(0) = \cos(0) = 1$, $f^{(2)}(0) = -\sin(0) = 0$, $f^{(3)}(0) = -\cos(0) = -1$, also $f(0) = 0$. Thus the Taylor expansion of the third order of the function $\sin(x)$ at the point 0 is
$$T_0^3(\sin(x)) = x - \frac{1}{6}x^3.$$
(ii) Again $f(1) = e$,
$$f'(x) = \frac{e^x}{x} - \frac{e^x}{x^2}, \quad f'(1) = 0, \qquad f^{(2)}(x) = \frac{e^x}{x} - \frac{2e^x}{x^2} + \frac{2e^x}{x^3}, \quad f^{(2)}(1) = e,$$
Suppose $f$ is a smooth function instead of a power series. We search for a good approximation by polynomials in the neighbourhood of a given point $a$.
Taylor polynomial of function $f$
For a $k$ times differentiable (real or complex valued) function $f$ of one real variable, define its Taylor polynomial of $k$-th degree centered at $a$ by the formula
$$T_{k,a}f(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2 + \frac{1}{6}f^{(3)}(a)(x-a)^3 + \cdots + \frac{1}{k!}f^{(k)}(a)(x-a)^k.$$
The mean value theorem is used to show how good an approximation of $f$ it is.
Taylor expansion with a remainder
Theorem. Let $f(x)$ be a function that is $k$ times differentiable on the interval $(a, b)$ and continuous on $[a, b]$. Then for all $x \in (a, b)$ there exists a number $c \in (a, x)$ such that
$$f(x) = f(a) + f'(a)(x-a) + \cdots + \frac{1}{(k-1)!}f^{(k-1)}(a)(x-a)^{k-1} + \frac{1}{k!}f^{(k)}(c)(x-a)^k = T_{k-1,a}f(x) + \frac{1}{k!}f^{(k)}(c)(x-a)^k.$$
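$$f^{(3)}(x) = \frac{e^x}{x} - \frac{3e^x}{x^2} + \frac{6e^x}{x^3} - \frac{6e^x}{x^4}, \quad f^{(3)}(1) = -2e.$$
Thus we get the Taylor expansion of third order of the function $\frac{e^x}{x}$ at the point 1:
$$T_1^3\left(\frac{e^x}{x}\right) = e + \frac{e}{2}(x-1)^2 - \frac{e}{3}(x-1)^3 = e\left(-\frac{x^3}{3} + \frac{3x^2}{2} - 2x + \frac{11}{6}\right).$$
□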
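The third order Taylor polynomial of $e^x/x$ at the point 1, namely $e + \frac{e}{2}(x-1)^2 - \frac{e}{3}(x-1)^3$, can be compared numerically with the function itself near $x = 1$. The following Python sketch is only an illustration with names of our own; it also checks that the expanded form $e\left(-\frac{x^3}{3} + \frac{3x^2}{2} - 2x + \frac{11}{6}\right)$ is the same polynomial.

```python
import math

def f(x):
    return math.exp(x) / x

def t3(x):
    """Taylor polynomial of order 3 of e^x / x centered at 1."""
    u = x - 1
    return math.e * (1 + u ** 2 / 2 - u ** 3 / 3)

def t3_expanded(x):
    """The same polynomial written in powers of x."""
    return math.e * (-x ** 3 / 3 + 3 * x ** 2 / 2 - 2 * x + 11 / 6)

for x in (0.9, 1.0, 1.01, 1.1, 1.5):
    print(x, f(x), t3(x), abs(f(x) - t3(x)))
```

The error should shrink roughly like $(x-1)^4$ as $x$ approaches 1, as expected for a polynomial of third order.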
Proof. For fixed $x$ define the remainder $R$ by
$$f(x) = T_{k-1,a}f(x) + R.$$
Then $R = \frac{1}{k!}\,r\,(x - a)^k$ for a suitable number $r$ (dependent on $x$). Consider the function defined by
$$F(\xi) = \sum_{j=0}^{k-1}\frac{1}{j!}f^{(j)}(\xi)(x-\xi)^j + \frac{1}{k!}\,r\,(x-\xi)^k.$$
By the Leibniz rule, its derivative (here $x$ is considered as a constant parameter) is
$$F'(\xi) = \sum_{j=0}^{k-1}\frac{1}{j!}f^{(j+1)}(\xi)(x-\xi)^j - \sum_{j=1}^{k-1}\frac{1}{(j-1)!}f^{(j)}(\xi)(x-\xi)^{j-1} - \frac{1}{(k-1)!}\,r\,(x-\xi)^{k-1} = \frac{1}{(k-1)!}(x-\xi)^{k-1}\left(f^{(k)}(\xi) - r\right),$$
because the expressions in the sum cancel each other out sequentially. Now it suffices to notice that $F(a) = F(x) = f(x)$ (recall that $x$ is an arbitrarily chosen but fixed number from the interval $(a, b)$). According to Rolle's theorem there
6.A.11. Determine the Taylor polynomial $T_0^6$ of the function $\sin$ and, using theorem 6.1.3, estimate the error of the polynomial at the point $\pi/4$.
Solution. Analogously to the previous example, we compute
$$T_0^6(\sin(x)) = x - \frac{1}{6}x^3 + \frac{1}{120}x^5.$$
Using theorem 6.1.3, we then estimate the size of the remainder (error) $R$. According to the theorem, there exists $c \in (0, \frac{\pi}{4})$ such that
$$|R(\pi/4)| = \left|\frac{\cos(c)\,\pi^7}{7!\,4^7}\right| < \frac{1}{7!} \approx 0.0002.$$
□
6.A.12. Find the Taylor polynomial of third order of the function $y = \operatorname{arctg} x$, $x \in \mathbb{R}$, at the point $x_0 = 1$. ○
6.A.13. Determine the Taylor expansion of third order at the point $x_0 = 0$ of the function
(a) $y = \frac{1}{\cos x}$;
(c) y = sin (sin a);
(d) y = tga;
(e) y = &x sin a;
denned in a certain neighbourhood of point x0.
o
6.A.14. Determine the Taylor expansion of fourth order of function y = In a2, x G (0, 2) at point a0 = 1. O
6.A.15. Find the estimation of the error of the approximation
In (1 + a) « a — ^ for x e (-1,0). O
Write the Taylor polynomial of fourth degree of function y = sin a, a G R centered at the origin. Using this polynomial, approximately compute sin 1° and determine the
limit
lim x slnx~x
X-S-0+ x
o
6.A.17. Determine the Taylor polynomial centered at the origin of degree at least 8 of the function $y = e^{2x}$, $x \in \mathbb{R}$. ○
6.A.18. Express the polynomial $x^3 - 2x + 5$ as a polynomial in the variable $u = x - 1$. ○
We'll show some more interesting examples of using the differential calculus. First though, we'll mention the Jensen inequality, which concerns convex and concave functions and which we'll use later.
exists a number $c$, $a < c < x$, such that $F'(c) = 0$, that is, $r = f^{(k)}(c)$. That is the desired relation. □
A special case of the last theorem is the mean value theorem, as an approximation by the Taylor polynomial of degree zero. See (1).
6.1.4. Estimations for Taylor expansions. A simple case of a Taylor expansion is when $f$ is a polynomial:
$$f(x) = a_n x^n + a_{n-1}x^{n-1} + \cdots + a_1 x + a_0, \quad a_n \neq 0.$$
Because the $(n+1)$-th derivative of $f$ is identically zero, the Taylor polynomial of degree $n$ has zero remainder, therefore for each $x_0 \in \mathbb{R}$
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{1}{n!}f^{(n)}(x_0)(x - x_0)^n.$$
We can compute all the derivatives easily (for example the last term is always of the form $a_n(x - x_0)^n$).
This result is a very special case of error estimation in Taylor expansion with the remainder. We know in advance that the remainder can be estimated by the size of the derivative, and for polynomials this is identically zero from some order onwards.
More generally, the estimation of the size of the $k$-th derivative on some interval can be used to estimate the error on the same interval.
Good examples of an expansion of an arbitrary degree are provided by the trigonometric functions $\sin$ and $\cos$. By iterating the differentiation of the function $\sin x$ we always obtain either sine or cosine with some sign. The absolute values do not exceed one. Thus we obtain a direct estimation of the speed of convergence of the power series
$$|\sin x - (T_{k,0}\sin)(x)| \le \frac{|x|^{k+1}}{(k+1)!}.$$
6.A.19. Jensen inequality. For a strictly convex function $f$ on an interval $I$, arbitrary points $x_1, \dots, x_n \in I$ and real numbers $c_1, \dots, c_n > 0$ such that $c_1 + \cdots + c_n = 1$, the inequality
$$f\left(\sum_{i=1}^{n} c_i x_i\right) \le \sum_{i=1}^{n} c_i f(x_i)$$
holds, with equality occurring if and only if $x_1 = \cdots = x_n$.
Solution. This can be proven easily by induction: for $n = 2$ it is just the definition of a convex function, for the induction
This shows that for $|x|$ much smaller than $k$ the error is small, but for $|x|$ comparable with $k$ or bigger it may be large. Compare the figure with the approximation of the function $\cos x$ by a Taylor polynomial of degree 68 in paragraph 5.4.11 on page 338.
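The estimate $|\sin x - (T_{k,0}\sin)(x)| \le \frac{|x|^{k+1}}{(k+1)!}$ is easy to observe numerically, including its poor behaviour when $|x|$ is comparable with $k$. The following Python sketch is an illustration under our own naming; it also evaluates the bound at $x = \pi/4$ for $k = 6$, the situation of exercise 6.A.11.

```python
import math

def taylor_sin(x, k):
    """Taylor polynomial of sin of degree k centered at 0."""
    total = 0.0
    for n in range(1, k + 1, 2):  # only odd powers appear
        total += (-1) ** ((n - 1) // 2) * x ** n / math.factorial(n)
    return total

def remainder_bound(x, k):
    """Upper bound |x|^(k+1) / (k+1)! for the error of taylor_sin(x, k)."""
    return abs(x) ** (k + 1) / math.factorial(k + 1)

for x in (math.pi / 4, 3.0, 10.0):
    for k in (2, 6, 12, 20):
        error = abs(math.sin(x) - taylor_sin(x, k))
        print(f"x={x:.4f} k={k:2d} error={error:.3e} bound={remainder_bound(x, k):.3e}")
```

For $x = 10$ and small $k$ both the error and the bound are huge, while for $k = 20$ they are already small; this is the effect described in the text.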
As mentioned in the introduction of the discussion of Taylor expansion of functions, if we start with a power series $f(x)$ centered at $a$, then its partial sums coincide with the Taylor polynomials $T_{k,a}f(x)$. The next statement is one of the simple formulations of the converse implication: it says when the given function $f(x)$ is actually a power series on some neighbourhood of the given point $a$.
step
Taylor's theorem
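$$f\left(\sum_{i=1}^{k+1} c_i x_i\right) = f\left(c_1 x_1 + (1 - c_1)\sum_{i=2}^{k+1}\frac{c_i}{1-c_1}x_i\right) \le c_1 f(x_1) + (1-c_1)\,f\left(\sum_{i=2}^{k+1}\frac{c_i}{1-c_1}x_i\right) \le c_1 f(x_1) + (1-c_1)\sum_{i=2}^{k+1}\frac{c_i}{1-c_1}f(x_i) = \sum_{i=1}^{k+1} c_i f(x_i),$$
where we used the inequality first for $n = 2$ and then for $n = k$.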
□
Remark.
The Jensen inequality can also be formulated in a more intuitive way: the centroid of mass points placed upon the graph of a strictly convex function lies above this graph.
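The Jensen inequality $f\left(\sum c_i x_i\right) \le \sum c_i f(x_i)$ can be observed on a concrete strictly convex function. The sketch below uses the exponential function with points and weights chosen by us purely for illustration; the helper name `jensen_sides` is hypothetical.

```python
import math

def jensen_sides(f, points, weights):
    """Return (f(sum c_i x_i), sum c_i f(x_i)) for positive weights summing to 1."""
    assert all(w > 0 for w in weights) and abs(sum(weights) - 1) < 1e-12
    left = f(sum(w * x for w, x in zip(weights, points)))
    right = sum(w * f(x) for w, x in zip(weights, points))
    return left, right

# Distinct points: strict inequality is expected for a strictly convex f.
print(jensen_sides(math.exp, [0.0, 1.0, 3.0], [0.5, 0.3, 0.2]))
# Equal points: both sides coincide.
print(jensen_sides(math.exp, [2.0, 2.0, 2.0], [0.2, 0.3, 0.5]))
```

In the centroid picture, the first number is the height of the graph below the centroid and the second is the height of the centroid itself.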
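6.A.20. Prove that among all (convex) $n$-gons inscribed into a circle, the regular $n$-gon has the largest area (for arbitrary $n \ge 3$).
Solution. Clearly it suffices to consider the $n$-gons inside of which lies the center of the circle. We'll divide each such $n$-gon inscribed into a circle with radius $r$ into $n$ triangles with areas $S_i$, $i \in \{1, \dots, n\}$, according to the figure (each triangle has its apex at the center of the circle, central angle $\varphi_i$, base $x_i$ on the $n$-gon and height $h_i$). With regard to the fact that
$$\sin\frac{\varphi_i}{2} = \frac{x_i}{2r}, \quad \cos\frac{\varphi_i}{2} = \frac{h_i}{r}, \quad i \in \{1, \dots, n\},$$
we have
$$S_i = \frac{1}{2}x_i h_i = r^2\sin\frac{\varphi_i}{2}\cos\frac{\varphi_i}{2} = \frac{1}{2}r^2\sin\varphi_i, \quad i \in \{1, \dots, n\}.$$
This implies that the area of the whole $n$-gon is
$$S = \sum_{i=1}^{n} S_i = \frac{1}{2}r^2\sum_{i=1}^{n}\sin\varphi_i.$$
Thus we want to maximize the sum $\sum_{i=1}^{n}\sin\varphi_i$, while for the values $\varphi_i \in (0, \pi)$ we clearly have
$$\varphi_1 + \cdots + \varphi_n = \sum_{i=1}^{n}\varphi_i = 2\pi. \tag{1}$$
The function $y = \sin x$ is strictly concave on the interval $(0, \pi)$, which means that the function $y = -\sin x$ is strictly convex on this interval. Then according to Jensen's inequality
for $c_i = 1/n$ and $x_i = \varphi_i$, we have
$$-\sin\left(\sum_{i=1}^{n}\frac{1}{n}\varphi_i\right) \le -\sum_{i=1}^{n}\frac{1}{n}\sin\varphi_i, \quad\text{i.e.}\quad \sin\left(\sum_{i=1}^{n}\frac{1}{n}\varphi_i\right) \ge \sum_{i=1}^{n}\frac{1}{n}\sin\varphi_i.$$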
Theorem. Assume that the function $f(x)$ is smooth on the interval $(a - b, a + b)$ and all of its derivatives are bounded uniformly by a constant $M > 0$, so that
$$|f^{(k)}(x)| \le M, \quad k = 0, 1, \dots, \quad x \in (a - b, a + b).$$
Then the power series $S(x) = \sum_{n=0}^{\infty}\frac{1}{n!}f^{(n)}(a)(x - a)^n$ converges on the interval $(a - b, a + b)$ to $f(x)$.
Proof. The proof is identical with the special case of the function $\sin x$ above, except that the universal bound by 1 is replaced by $M$, and thus the estimates of the remainders are
$$|f(x) - (T_{k,a}f)(x)| \le M\,\frac{|x - a|^{k+1}}{(k+1)!}.$$
□
6.1.5. Analytic and smooth functions. If $f$ is smooth at $a$, the formal power series can be written
$$S(x) = \sum_{n=0}^{\infty}\frac{1}{n!}f^{(n)}(a)(x - a)^n.$$
If this power series has a nonzero radius of convergence and simultaneously $S(x) = f(x)$ on the respective interval, we say that $f$ is an analytic function at $a$. A function is analytic on an interval if it is analytic at every point of the interval.
Not all smooth functions are analytic. It can be proven that for every sequence of numbers $a_k$ there is a smooth function whose derivatives of order $k$ are these numbers $a_k$.² To show the essence of the problem, we introduce a function which has all its derivatives vanishing at zero, but is nonzero at every other point. We see later how useful this function is. Consider the function defined by $f(x) = e^{-1/x^2}$. It is a well defined smooth function at all points $x \neq 0$. Its limit at $x = 0$ exists, and $\lim_{x\to 0} f(x) = 0$. By defining $f(0) = 0$, $f$ is a continuous function for all real $x$.
By a direct computation based on L'Hopital's rule we compute the derivative of $f$ (the first three ones are at the
² This is a special case of the Whitney extension theorem, which says that there is a smooth function on a Euclidean space with prescribed derivatives in all points of a closed set $A$ if and only if the Taylor theorem estimates are true for the prescription. In the case of one single point $A$, the condition is empty. This is relevant for the Taylor theorem for functions of more than one real variable, as in Chapter 8. Hassler Whitney (1907-1989) was a very influential American mathematician.
Moreover, we know that equality occurs exactly for $\varphi_1 = \cdots = \varphi_n$. If we express (using (1))
$$S = \frac{r^2}{2}\sum_{i=1}^{n}\sin\varphi_i \le \frac{n r^2}{2}\sin\left(\frac{1}{n}\sum_{i=1}^{n}\varphi_i\right) = \frac{n r^2}{2}\sin\frac{2\pi}{n},$$
we can see that $S$ can attain at most the value on the right
hand side. But that happens if and only if $\varphi_1 = \cdots = \varphi_n$ (we
chose $x_i = \varphi_i$). Hence the regular $n$-gon is the one with the
maximum area, because it satisfies $\varphi_1 = \cdots = \varphi_n = 2\pi/n$.
□
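The area formula $S = \frac{1}{2}r^2\sum_{i}\sin\varphi_i$ for an $n$-gon inscribed in a circle, with central angles $\varphi_i$ summing to $2\pi$, makes the claim easy to test numerically. The following Python sketch compares one irregular pentagon (angles picked arbitrarily by us) with the regular one; the function names are our own.

```python
import math

def inscribed_area(central_angles, r=1.0):
    """Area of an inscribed polygon from its central angles (sum must be 2*pi)."""
    assert abs(sum(central_angles) - 2 * math.pi) < 1e-9
    return 0.5 * r ** 2 * sum(math.sin(angle) for angle in central_angles)

def regular_area(n, r=1.0):
    """Area of the regular n-gon inscribed in a circle of radius r."""
    return 0.5 * n * r ** 2 * math.sin(2 * math.pi / n)

irregular = [1.0, 1.2, 1.3, 1.4, 2 * math.pi - 4.9]  # all angles lie in (0, pi)
print("irregular pentagon:", inscribed_area(irregular))
print("regular pentagon:  ", regular_area(5))
```

A single comparison is of course no proof; it only illustrates the inequality that Jensen's inequality establishes in general.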
6.A.21. Isoperimetric quotient. For a closed curve in the plane enclosing a planar region, we define its isoperimetric quotient as the number
$$IQ := \frac{4\pi S}{o^2},$$
where $S$ denotes the area of the region and $o$ its perimeter (i.e. the length of the curve). Hence the isoperimetric quotient is the ratio of the area of the region and the area of a circle with the same perimeter as the given region. The notation IQ is therefore not only an English abbreviation for the isoperimetric quotient, but can also be thought of as the "intelligence of the region", with which it uses its perimeter for attaining as big an area as possible. The isoperimetric theorem then states that for every closed curve, $IQ \le 1$, with equality occurring only for a circle ("the circle is the smartest").
Determine IQ for a regular polygon and a circle and find the sector of a circle for which its boundary has the largest IQ.
Solution. First notice that the value of IQ doesn't change with a change of scale on the axes (the same on both). Indeed, when the proportions of the region get $a$ times bigger (for arbitrary $a > 0$), the perimeter also gets $a$ times bigger and the area $a^2$ times bigger (it's a square measure). Hence IQ doesn't depend on the size of the region, but only on its shape. Thus we can consider a regular $n$-gon inscribed into a unit circle. According to the figure,
$$h = \cos\varphi = \cos\frac{\pi}{n}, \qquad \frac{x}{2} = \sin\varphi = \sin\frac{\pi}{n},$$
which yields
$$o_n = n\cdot x = 2n\sin\frac{\pi}{n}$$
and
$$S_n = n\cdot\frac{1}{2}hx = n\cos\frac{\pi}{n}\sin\frac{\pi}{n}.$$
Thus for a regular $n$-gon, we have
$$IQ = \frac{4\pi n\cos\frac{\pi}{n}\sin\frac{\pi}{n}}{4n^2\sin^2\frac{\pi}{n}} = \frac{\pi}{n}\operatorname{cotg}\frac{\pi}{n},$$
which we can verify for example for a square ($n = 4$) with a side of length $a$, where
$$IQ = \frac{4\pi a^2}{(4a)^2} = \frac{\pi}{4}.$$
picture, guess which line is which!). It suffices to consider only the right derivative, since the function is even.
$$f'(0) = \lim_{x\to 0+}\frac{e^{-1/x^2}}{x} = \lim_{x\to 0+}\frac{x^{-1}}{e^{1/x^2}} = \lim_{x\to 0+}\frac{-x^{-2}}{e^{1/x^2}\cdot(-2x^{-3})} = \frac{1}{2}\lim_{x\to 0+}\frac{x}{e^{1/x^2}} = 0.$$
By differentiating $f(x)$ at an arbitrary point $x \neq 0$, $f'(x) = e^{-1/x^2}\cdot 2x^{-3}$. By repeated differentiation of the results, there is always a sum of finitely many terms of the form
$$C\cdot e^{-1/x^2}\cdot x^{-j},$$
where $C$ is an integer and $j$ is a natural number.
Next, assume it is already proven that the derivative of order $k$ of $f(x)$ exists and vanishes at zero. Compute the limit of the expression $f^{(k)}(x)/x$ for $x \to 0+$. This is a finite sum of limits of the expressions $e^{-1/x^2}\cdot x^{-j} = x^{-j}/e^{1/x^2}$. All these expressions are of type $\infty/\infty$, so L'Hopital's rule can be used repeatedly on them. After several differentiations of both the numerator and denominator (and a similar adjustment as above) there remains the same expression in the denominator, while in the numerator the power is non-negative. Thus the expression necessarily has a zero limit at zero, just as in the case of the first derivative above. The same holds for a finite sum of such expressions. So each derivative $f^{(k)}(x)$ at zero exists with value zero.
In summary, $f(x)$ is smooth on the whole of $\mathbb{R}$. It is strictly positive everywhere except for $x = 0$. All its derivatives at this point are zero.
It cannot be analytic at $x_0 = 0$, because its Taylor series at zero is identically zero while $f$ itself is not. The limit of the function at the improper points $\pm\infty$ is 1, while all its derivatives converge quickly to zero.
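How flat the function $f(x) = e^{-1/x^2}$ (with $f(0) = 0$) is at the origin can be seen numerically: the quotients $f(x)/x^n$ tend to zero for every fixed power $n$, which is what makes all derivatives at zero vanish. A minimal Python sketch, with sample points and exponents chosen by us:

```python
import math

def f(x):
    """f(x) = exp(-1/x^2) for x != 0, extended by f(0) = 0."""
    return 0.0 if x == 0 else math.exp(-1.0 / x ** 2)

# f(x) / x^n for shrinking x and several powers n.
for x in (0.5, 0.2, 0.1):
    print(x, [f(x) / x ** n for n in (1, 5, 10)])

# Every Taylor polynomial of f at 0 is the zero polynomial, yet f(1) is not 0.
print("f(1) =", f(1.0))
```

Each row of quotients shrinks rapidly as $x$ decreases, even for the tenth power in the denominator.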
One of the most important functions in Mathematics and Physics is the Gaussian function
$$g(x) = f(x^{-1}) = e^{-x^2}.$$
We see why in Chapter 10 when dealing with probability and statistics.
Replace $x$ with $-x^2$ in the power series for the exponential. It follows that $g$ is an analytic function on the entire $\mathbb{R}$. Its derivatives are easily computed. In particular, $x = 0$ is the only critical point ($g'(0) = 0$) and $g''(0) = -2$. Since the limits at $\pm\infty$ are zero, the number $g(0) = 1$ is the global maximum of the Gaussian function. Both functions are depicted on the diagram: $f(x)$ is the solid line, while the Gaussian function is dashed.
[Figure: graphs of $f(x)$ (solid line) and of the Gaussian function $g(x)$ (dashed line).]
CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS
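Using the limit transition for $n \to \infty$ and the limit
$$\lim_{x\to 0}\frac{\sin x}{x} = 1,$$
we get the isoperimetric quotient for a circle:
$$IQ = \lim_{n\to\infty}\frac{\pi}{n}\operatorname{cotg}\frac{\pi}{n} = \lim_{n\to\infty}\frac{\cos\frac{\pi}{n}}{\sin\frac{\pi}{n}\big/\frac{\pi}{n}} = \frac{1}{1} = 1.$$
Of course, for a circle with radius $r$, we could have also directly computed
$$IQ = \frac{4\pi S}{o^2} = \frac{4\pi(\pi r^2)}{(2\pi r)^2} = 1.$$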
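For the boundary $o$ of a sector of a circle with radius $r$ and central angle $\varphi \in (0, 2\pi)$, we have
$$IQ = \frac{4\pi S}{o^2} = \frac{4\pi\cdot\frac{\varphi r^2}{2}}{(2r + r\varphi)^2} = \frac{2\pi\varphi}{(2+\varphi)^2}.$$
Hence we're looking for a maximum of the function
$$f(\varphi) := \frac{2\pi\varphi}{(2+\varphi)^2}, \qquad \varphi \in (0, 2\pi).$$
By computing
$$f'(\varphi) = 2\pi\,\frac{(2+\varphi)^2 - 2\varphi(2+\varphi)}{(2+\varphi)^4} = 2\pi\,\frac{2-\varphi}{(2+\varphi)^3}, \qquad \varphi \in (0, 2\pi),$$
we easily obtain that
$$f'(\varphi) > 0,\ \varphi \in (0, 2), \qquad f'(\varphi) < 0,\ \varphi \in (2, 2\pi).$$
Hence the function $f$ attains its maximal value for $\varphi_0 = 2$, and for a central angle $\varphi_0 = 2$ (radians) we get the largest
$$IQ = \frac{2\pi\varphi_0}{(2+\varphi_0)^2} = \frac{\pi}{4}.$$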
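The formulas above are easy to check numerically. The following Python sketch evaluates the isoperimetric quotient $4\pi S/o^2$ for a regular $n$-gon inscribed into the unit circle and for a circular sector; the function names `iq_regular_polygon` and `iq_sector` are illustrative choices and do not come from the text.

```python
import math


def iq_regular_polygon(n: int) -> float:
    """Isoperimetric quotient 4*pi*S/o**2 of a regular n-gon in the unit circle."""
    side = 2 * math.sin(math.pi / n)                           # x = 2 sin(pi/n)
    perimeter = n * side                                       # o_n = n * x
    area = n * math.cos(math.pi / n) * math.sin(math.pi / n)   # S_n
    return 4 * math.pi * area / perimeter ** 2


def iq_sector(phi: float, r: float = 1.0) -> float:
    """Isoperimetric quotient of a circular sector with central angle phi."""
    area = 0.5 * phi * r ** 2
    perimeter = 2 * r + r * phi        # two radii plus the arc
    return 4 * math.pi * area / perimeter ** 2


if __name__ == "__main__":
    # Compare with the closed form (pi/n) * cotg(pi/n); values approach 1.
    for n in (3, 4, 6, 100):
        closed_form = (math.pi / n) / math.tan(math.pi / n)
        print(n, iq_regular_polygon(n), closed_form)
    # The sector quotient should peak at phi = 2 with value pi/4.
    for phi in (1.0, 2.0, 3.0):
        print(phi, iq_sector(phi))
```

The radius $r$ cancels in `iq_sector`, which reflects the fact that IQ depends only on the shape.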
For the sake of completeness, for a solid in three-dimensional space (more precisely, for the closed surface which is its boundary), we define
IQ :=
where $V$ is the volume and $S$ the surface of the solid. Thus we compare the volume of the solid with a given surface with the volume of the ball with the same surface. □
6.A.22. A string of length $l$ is given. The task is to cut it into $n$ parts so that it's possible to create boundaries of geometric figures given in advance (for example a square, a triangle, a circle, a half-circle) from the $n$ smaller strings with the least sum of areas.
Solution. To solve this problem, we'll use the isoperimetric quotient of curves and Jensen's inequality (stated in previous examples). For the geometric figures given in advance, denote the reciprocal values of their isoperimetric quotients as
$$\lambda_i := \frac{o_i^2}{4\pi S_i}, \qquad i \in \{1, \dots, n\},$$
where $S_i$ is the area and $o_i$ the perimeter of the $i$-th figure. We'll also use the notation
Notice how the function $f$ touches the $x$ axis due to the vanishing of all derivatives at zero.
6.1.6. Useful non-analytic smooth functions. The smooth functions are very "elastic" — from a local behaviour around one point we cannot deduce anything at all about the global behavior of such function. On the other hand, analytic functions are completely determined just by derivatives at one point. In particular they are completely determined by their behaviour on an arbitrarily small neighbourhood of a single point from their domain. In this sense, analytic functions are very "rigid".
In particular, the smooth functions allow for joining different constant values on disjoint open intervals in a differentiable way.
Let us look at such functions more closely now. We can modify $f(x)$ from the previous paragraph in this way:
$$g(x) = \begin{cases} 0 & \text{if } x \le 0, \\ e^{-1/x^2} & \text{if } x > 0. \end{cases}$$
Again it is a smooth function on all of $\mathbb{R}$. By another modification there is another function $h$, which is nonzero at all inner points of the interval $[-a, a]$, $a > 0$, and zero elsewhere:
$$h(x) = \begin{cases} 0 & \text{if } |x| \ge a, \\ e^{\frac{-1}{(x-a)^2} + \frac{-1}{(x+a)^2}} & \text{if } |x| < a. \end{cases}$$
This function is again smooth on all of $\mathbb{R}$. The last two functions are in the two figures. On the right, the parameter $a = 1/2$ is used.
Finally we show how to get smooth analogies of the Heaviside functions. For two fixed real numbers $a < b$, define the function $f(x)$ exploiting the above function $g$ as follows:
$$f(x) = \frac{g(x-a)}{g(x-a) + g(b-x)}.$$
For all $x \in \mathbb{R}$ the denominator of the fraction is positive (because $g$ is non-negative and, for each of the three intervals determined by the numbers $a$ and $b$, at least one of the summands of the denominator is nonzero). Thus the definition yields a smooth function $f(x)$ on all of $\mathbb{R}$. For $x < a$ the numerator of the fraction is zero according to the definition of $g$. For $x > b$ the numerator and denominator are equal. In the next two figures there are functions $f(x)$ with parameters $a = 1 - \alpha$, $b = 1 + \alpha$. On the left $\alpha = 0.8$, and on the right $\alpha = 0.4$.
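The construction of the smooth step can be tried out directly. The sketch below is a minimal Python version, assuming $g(x) = e^{-1/x^2}$ for $x > 0$ and $g(x) = 0$ otherwise; the name `smooth_step` is only illustrative. It also assumes that $b - a$ is not tiny (roughly above 0.1), since otherwise both summands of the denominator underflow to zero in floating point arithmetic.

```python
import math


def g(x: float) -> float:
    """g(x) = exp(-1/x**2) for x > 0 and 0 otherwise (smooth on all of R)."""
    if x <= 0:
        return 0.0
    square = x * x
    if square == 0.0:          # x so small that x*x underflows
        return 0.0
    return math.exp(-1.0 / square)


def smooth_step(x: float, a: float, b: float) -> float:
    """Smooth analogue of the Heaviside function: 0 for x <= a, 1 for x >= b.

    Assumes a < b with b - a large enough that the denominator does not
    underflow to zero in floating point arithmetic.
    """
    numerator = g(x - a)
    return numerator / (numerator + g(b - x))


if __name__ == "__main__":
    alpha = 0.4                         # parameters a = 1 - alpha, b = 1 + alpha
    a, b = 1 - alpha, 1 + alpha
    for x in (0.0, 0.6, 0.8, 1.0, 1.2, 1.4, 2.0):
        print(x, smooth_step(x, a, b))
```

The values stay at zero up to $a$, rise through $1/2$ at the midpoint of $[a, b]$, and equal one from $b$ on.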
$$\Lambda := \sum_{i=1}^{n}\lambda_i.$$
Recall that the isoperimetric quotient is given only by the shape of the figure and doesn't depend on its size. In particular, the value $\Lambda$ is constant (it's determined by the shapes of the given figures).
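Our task is to minimize the sum $\sum_{i=1}^{n} S_i$ subject to $\sum_{i=1}^{n} o_i = l$. Because
$$S_i = \frac{o_i^2}{4\pi\lambda_i}, \qquad i \in \{1, \dots, n\},$$
we need to minimize the expression
$$S := \frac{1}{4\pi}\sum_{i=1}^{n}\frac{o_i^2}{\lambda_i}.$$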
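Using Jensen's inequality for the strictly convex function $y = x^2$ (on the whole real axis), we obtain
$$\left(\sum_{i=1}^{n} c_i x_i\right)^2 \le \sum_{i=1}^{n} c_i x_i^2$$
for $x_i \in \mathbb{R}$ and $c_i > 0$ with the property $c_1 + \dots + c_n = 1$. Moreover we know that the equality occurs if and only if $x_1 = \dots = x_n$. By choosing
$$c_i = \frac{\lambda_i}{\Lambda}, \qquad x_i = \frac{o_i}{\lambda_i}, \qquad i \in \{1, \dots, n\},$$
we then get
$$\left(\frac{1}{\Lambda}\sum_{i=1}^{n} o_i\right)^2 \le \frac{1}{\Lambda}\sum_{i=1}^{n}\frac{o_i^2}{\lambda_i}.$$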
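By several simplifications, we obtain the inequality
$$\frac{1}{\Lambda}\left(\sum_{i=1}^{n} o_i\right)^2 \le \sum_{i=1}^{n}\frac{o_i^2}{\lambda_i},$$
and then (notice that $\sum_{i=1}^{n} o_i = l$)
$$\frac{l^2}{\Lambda} \le \sum_{i=1}^{n}\frac{o_i^2}{\lambda_i},$$
with equality again occurring for
$$(1) \qquad x_1 = \dots = x_n, \quad \text{i.e.} \quad \frac{o_1}{\lambda_1} = \dots = \frac{o_n}{\lambda_n}.$$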
This implies that $S$ is the smallest if and only if (1) holds. This smallest value of $S$ is $l^2/(4\pi\Lambda)$. Now we only need to determine the lengths of the cut parts $o_i$. If (1) holds, then clearly $o_i = k\lambda_i$ for all $i \in \{1, \dots, n\}$ and a certain constant $k > 0$. From
$$\sum_{i=1}^{n} o_i = l \qquad \text{and simultaneously} \qquad \sum_{i=1}^{n} o_i = k\sum_{i=1}^{n}\lambda_i = k\Lambda,$$
we can immediately see that $k = l/\Lambda$, i.e.
$$o_i = \frac{\lambda_i}{\Lambda}\,l, \qquad i \in \{1, \dots, n\}.$$
Let's take a look at a specific situation where we are to cut a string of length 1 m into two smaller ones and then create a square and a circle from them so that the sum of their areas is the smallest possible. For a square and a circle (in order), we have (see the example called Isoperimetric quotient)
Finally, we can create a smooth analogue of the characteristic function of any interval [c, d].
Write $f_\varepsilon(x)$ for the latter function $f(x)$ with parameters $a = -\varepsilon$, $b = +\varepsilon$. For the interval $(c, d)$ with the length $d - c > 2\varepsilon$ define the function $h_\varepsilon(x) = f_\varepsilon(x - c)\cdot f_\varepsilon(d - x)$. This function is identically zero on the intervals $(-\infty, c - \varepsilon)$ and $(d + \varepsilon, \infty)$. It is identically one on the interval $(c + \varepsilon, d - \varepsilon)$. Moreover, it is smooth everywhere. Locally it is either constant or monotonic (you should verify the last claim yourself). The smaller the $\varepsilon > 0$, the faster $h_\varepsilon(x)$ jumps from zero to one around the beginning of the interval or back at the end of it.
The diagram shows the choices $[c, d] = [1, 2]$ and $\varepsilon = 0.6$, $\varepsilon = 0.3$.
6.1.7. Local behaviour of functions. It is time to return to the behaviour of real functions of one real variable. We have seen that the sign of the first derivative of a differentiable function determines whether it is increasing or decreasing on some neighbourhood of the given point. If the derivative is zero, it does not of itself say much about the behaviour of the function.
We encountered the importance of the second derivative when describing critical points. Now we generalize the discussion of critical points for all orders. First we deal with the local extremes of functions.
In the following we consider real functions with a sufficiently high number of continuous derivatives, without specifically stating this assumption.
The point $a$ in the domain of $f$ is a critical point of order $k$ if and only if
$$f'(a) = \dots = f^{(k)}(a) = 0, \qquad f^{(k+1)}(a) \neq 0.$$
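$$\lambda_1 = \frac{4}{\pi}, \qquad \lambda_2 = 1, \qquad \text{i.e.} \qquad \Lambda = \lambda_1 + \lambda_2 = \frac{4+\pi}{\pi}.$$
Then the lengths of the respective parts are (in metres)
$$o_1 = \frac{\lambda_1}{\Lambda}\cdot 1 = \frac{4}{4+\pi} \approx 0.56, \qquad o_2 = \frac{\lambda_2}{\Lambda}\cdot 1 = \frac{\pi}{4+\pi} \approx 0.44.$$
The area of a square with perimeter 0.56 m (with a side of length $a = 0.14$ m) is 0.0196 m² and the area of a circle with perimeter 0.44 m (and radius $r \approx 0.07$ m) is approximately 0.0154 m². We can verify that (in m²)
$$\frac{l^2}{4\pi\Lambda} = \frac{1}{4(4+\pi)} \approx 0.035 = 0.0196 + 0.0154.$$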
□
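The square-and-circle case can be replayed numerically. The sketch below is only an illustration: it assumes the convention $\lambda_i = o_i^2/(4\pi S_i)$ (so $\lambda = 4/\pi$ for a square and $\lambda = 1$ for a circle), and the helper names `optimal_cut` and `total_area` are hypothetical.

```python
import math


def optimal_cut(lambdas, length):
    """Lengths o_i = (lambda_i / Lambda) * l minimizing the total area.

    Assumes lambda_i = o_i**2 / (4*pi*S_i) for the i-th shape.
    """
    total = sum(lambdas)                      # Lambda
    return [lam / total * length for lam in lambdas]


def total_area(parts, lambdas):
    """Sum of the areas S_i = o_i**2 / (4*pi*lambda_i)."""
    return sum(o * o / (4 * math.pi * lam) for o, lam in zip(parts, lambdas))


if __name__ == "__main__":
    lambdas = [4 / math.pi, 1.0]              # square, circle
    parts = optimal_cut(lambdas, 1.0)
    print("lengths:", parts)
    print("minimal area:", total_area(parts, lambdas))
    # Any other split of the 1 m string gives a larger total area.
    print("equal halves:", total_area([0.5, 0.5], lambdas))
```

Comparing the optimal split with an arbitrary one, such as two equal halves, shows that the total area indeed grows.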
6.A.23. Expand the function
(a) $y = \ln\frac{1+x}{1-x}$, $x \in (-1, 1)$;
(b) $y = e^{x^2} + x^2e^{-2x}$, $x \in \mathbb{R}$
into a Taylor series centered at the origin.
Solution. If the function can be expressed as a sum of a power series (with a positive radius of convergence) on its domain of convergence, then this series is necessarily the Taylor series of the given function (its sum). This allows us to find the corresponding Taylor series easily. Case (a). We know that
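$$\ln(1+x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}x^n, \qquad x \in (-1, 1),$$
i.e. for $x \in (-1, 1)$ we have
$$\ln(1-x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}(-x)^n = -\sum_{n=1}^{\infty}\frac{1}{n}x^n.$$
In total, for $x \in (-1, 1)$ we have
$$\ln\frac{1+x}{1-x} = \ln(1+x) - \ln(1-x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}+1}{n}x^n = \sum_{n=1}^{\infty}\frac{2}{2n-1}x^{2n-1}.$$
Case (b). Similarly, the well known identity
$$e^x = \sum_{n=0}^{\infty}\frac{1}{n!}x^n, \qquad x \in \mathbb{R},$$
implies
$$e^{x^2} = \sum_{n=0}^{\infty}\frac{1}{n!}\left(x^2\right)^n = \sum_{n=0}^{\infty}\frac{1}{n!}x^{2n}, \qquad x \in \mathbb{R},$$
and
$$x^2e^{-2x} = x^2\sum_{n=0}^{\infty}\frac{1}{n!}(-2x)^n = \sum_{n=0}^{\infty}\frac{(-2)^n}{n!}x^{n+2}, \qquad x \in \mathbb{R}.$$
Hence
$$e^{x^2} + x^2e^{-2x} = \sum_{n=0}^{\infty}\left(\frac{1}{n!}x^{2n} + \frac{(-2)^n}{n!}x^{n+2}\right), \qquad x \in \mathbb{R}.$$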
□
Suppose $f^{(k+1)}(a) > 0$. Then this continuous derivative is positive on a certain neighbourhood $O(a)$ of the point $a$ as well. In that case, the Taylor expansion with the remainder gives
$$f(x) = f(a) + \frac{1}{(k+1)!}f^{(k+1)}(c)(x-a)^{k+1}$$
for all $x$ in $O(a)$. Because of that, the change of values of $f(x)$ in a neighbourhood of $a$ is given by the behaviour of $(x-a)^{k+1}$. Moreover, if $k+1$ is an even number, then the values of $f(x)$ in such a neighbourhood are necessarily larger than the value $f(a)$. So $a$ is a local minimum. But if $k$ is even then the values on the left are smaller, while those on the right are larger than $f(a)$. So an extreme does not occur even locally. On the other hand, the graph of the function $f(x)$ intersects its tangent $y = f(a)$ at the point $[a, f(a)]$ in the latter case.
Similarly, if $f^{(k+1)}(a) < 0$, then it is a local maximum for odd $k$, and there is no extreme for even $k$.
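The rule above can be illustrated on polynomials, where all derivatives are computed exactly. The following Python sketch is a minimal illustration under the assumption that the polynomial has integer coefficients given in increasing order of powers; the names `derivative_values` and `classify` are hypothetical.

```python
def derivative_values(coeffs, a, max_order):
    """Return [p(a), p'(a), ..., p^(max_order)(a)] for p(x) = sum coeffs[i] * x**i."""
    values = []
    current = list(coeffs)
    for _ in range(max_order + 1):
        values.append(sum(c * a ** i for i, c in enumerate(current)))
        # Differentiate: the coefficient of x**(i-1) becomes i * c_i.
        current = [i * c for i, c in enumerate(current)][1:]
    return values


def classify(coeffs, a):
    """Classify the point a by the first nonvanishing derivative of order >= 2."""
    derivs = derivative_values(coeffs, a, len(coeffs))
    if derivs[1] != 0:
        return "not critical"
    for order in range(2, len(derivs)):       # order plays the role of k + 1
        if derivs[order] != 0:
            if order % 2 == 1:
                return "no extreme"
            return "local minimum" if derivs[order] > 0 else "local maximum"
    return "undetermined"                     # constant polynomial


if __name__ == "__main__":
    print(classify([0, 0, 0, 0, 1], 0))   # x**4
    print(classify([0, 0, 0, 1], 0))      # x**3
    print(classify([0, 0, -1], 0))        # -x**2
```

Integer arithmetic is used so that the comparisons with zero are exact; with floating point data a tolerance would be needed.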
6.1.8. Convex and concave functions. The differentiable function $f$ is concave at $a$, if its graph lies completely below the tangent at the point $[a, f(a)]$ in a neighbourhood of $a$. That is,
$$f(x) < f(a) + f'(a)(x - a).$$
Similarly $f$ is convex at $a$, if its graph is above the tangent at the point $a$. That is,
$$f(x) > f(a) + f'(a)(x - a).$$
A function is convex or concave on an interval, if it has this property at all its points.
Suppose $f$ has continuous second derivatives in a neighbourhood of $a$. The Taylor expansion of second order with the remainder implies
$$f(x) = f(a) + f'(a)(x - a) + \frac{1}{2}f''(c)(x - a)^2.$$
Then the function is convex whenever $f''(a) > 0$, and concave whenever $f''(a) < 0$.
If the second derivative is zero, we can use derivatives of higher orders. But we can only make the same conclusion if the first nonzero derivative after the first derivative is of even order. If the first nonzero derivative is of odd order, the points of the graph of the function on opposite sides of some small neighbourhood of the studied point will lie on opposite sides of the tangent at this point.
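6.A.24. Determine the Taylor series centered at the origin of the function
(a) $y = \frac{1}{(1+x)^2}$, $x \in (-1, 1)$;
(b) $y = \operatorname{arctg} x$, $x \in (-1, 1)$.
Solution. Case (a). We'll use the formula
$$\frac{1}{1+x} = \sum_{n=0}^{\infty}(-x)^n = \sum_{n=0}^{\infty}(-1)^nx^n, \qquad x \in (-1, 1),$$
for the sum of a geometric series. By differentiating it, we obtain for $x \in (-1, 1)$
$$-\frac{1}{(1+x)^2} = \left(\sum_{n=0}^{\infty}(-1)^nx^n\right)' = \sum_{n=1}^{\infty}(-1)^n n x^{n-1},$$
with $(x^0)' = 0$, thus the lower index is $n = 1$. We can see that
$$\frac{1}{(1+x)^2} = \sum_{n=1}^{\infty}(-1)^{n+1}n x^{n-1}, \qquad x \in (-1, 1).$$
Case (b). We can express the derivative of the function $y = \operatorname{arctg} t$ for $t \in (-1, 1)$ as
$$(\operatorname{arctg} t)' = \frac{1}{1+t^2} = \sum_{n=0}^{\infty}\left(-t^2\right)^n = \sum_{n=0}^{\infty}(-1)^nt^{2n}.$$
Because for $x \in (-1, 1)$ we have
$$\int_0^x(\operatorname{arctg} t)'\,dt = \operatorname{arctg} x - \operatorname{arctg} 0 = \operatorname{arctg} x$$
and
$$\int_0^x\left(\sum_{n=0}^{\infty}(-1)^nt^{2n}\right)dt = \sum_{n=0}^{\infty}\left((-1)^n\int_0^x t^{2n}\,dt\right) = \sum_{n=0}^{\infty}\frac{(-1)^n}{2n+1}x^{2n+1},$$
we already have the result
$$\operatorname{arctg} x = \sum_{n=0}^{\infty}\frac{(-1)^n}{2n+1}x^{2n+1}, \qquad x \in (-1, 1).$$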
□
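Both series from this solution can be checked against library functions. The Python sketch below sums the first terms of each series; the function names `arctan_series` and `inv_square_series` and the number of terms are illustrative choices, and the comparison is meaningful only for $|x| < 1$.

```python
import math


def arctan_series(x: float, terms: int) -> float:
    """Partial sum of sum_{n>=0} (-1)**n * x**(2n+1) / (2n+1)."""
    return sum((-1) ** n * x ** (2 * n + 1) / (2 * n + 1) for n in range(terms))


def inv_square_series(x: float, terms: int) -> float:
    """Partial sum of sum_{n>=1} (-1)**(n+1) * n * x**(n-1) for 1/(1+x)**2."""
    return sum((-1) ** (n + 1) * n * x ** (n - 1) for n in range(1, terms + 1))


if __name__ == "__main__":
    x = 0.5
    for terms in (5, 10, 20, 60):
        print(terms,
              arctan_series(x, terms) - math.atan(x),
              inv_square_series(x, terms) - 1 / (1 + x) ** 2)
```

The printed differences shrink quickly for $x = 0.5$; closer to the endpoints of $(-1, 1)$ many more terms are needed.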
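6.A.25. Find the Taylor series centered at $x_0 = 0$ of the function
$$f(x) = \int_0^x u\cos u^2\,du, \qquad x \in \mathbb{R}.$$
Solution. The equality
$$\cos t = \sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}t^{2n}, \qquad t \in \mathbb{R},$$
implies
$$u\cos u^2 = \sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}u^{4n+1}, \qquad u \in \mathbb{R},$$
and then (for $x \in \mathbb{R}$)
$$f(x) = \int_0^x u\cos u^2\,du = \int_0^x\left(\sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}u^{4n+1}\right)du = \sum_{n=0}^{\infty}\left(\frac{(-1)^n}{(2n)!}\int_0^x u^{4n+1}\,du\right) = \sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!\,(4n+2)}x^{4n+2}.$$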
6.1.9. Inflection points. A point $a$ is called an inflection point of a differentiable function $f$, if the graph of $f$ crosses from one side of the tangent in the point $a$ to the other. The latter discussion on concave and convex functions shows that the inflections can appear only at points with vanishing second derivative.
Suppose $f$ has continuous third derivatives and write the Taylor expansion of third order with the remainder:
$$f(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2 + \frac{1}{6}f'''(c)(x-a)^3.$$
If $a$ is a zero point of the second derivative such that $f'''(a) \neq 0$, then the third derivative is nonzero on some neighbourhood of $a$. Then $a$ is an inflection point, since the second derivative changes the sign at $a$ and thus the tangent crosses the graph. In that case, the sign of the third derivative determines whether the graph of the function crosses the tangent from the top to the bottom or vice versa.
Moreover, if $a$ is an isolated zero point of the second derivative and simultaneously an inflection point, then on some small neighbourhood of $a$ the function is concave on one side and convex on the other. Thus the inflection points are points of the change between concave and convex behaviour of the graph of the function.
6.1.10. Asymptotes of the graph of the function. We introduce one more useful utility for sketching the graph of a function. We consider the asymptotes. These are lines in $\mathbb{R}^2$ whose distance from the graph of $f(x)$ converges to zero for $x \to x_0$.
Thus, an asymptote at the improper point $\infty$ is a line $y = ax + b$, which satisfies
$$\lim_{x\to\infty}\bigl(f(x) - ax - b\bigr) = 0.$$
Such a line is called an asymptote with a slope. If such an asymptote exists, it satisfies
$$\lim_{x\to\infty}\bigl(f(x) - ax\bigr) = b.$$
Consequently the limit
$$\lim_{x\to\infty}\frac{f(x)}{x} = a$$
also exists.
□
6.A.26. Approximately compute cos with an error less than $10^{-5}$. O
6.A.27. Without computing the derivatives, determine the Taylor polynomial of degree 4 centered at $x_0 = 0$ of the function
$$f(x) = \cos x - 2\sin x - \ln(1+x), \qquad x \in (-1, 1).$$
Then decide if the graph of the function $f$ in a neighbourhood of the point $[0, 1]$ is above or below the tangent line. O
Now we'll state several "classical" problems, in which we'll determine the course of various functions. By determining the course we mean
(a) the domain (it's given) and the range;
(b) possible parity and periodicity;
(c) discontinuities and their kind (including the corresponding one-sided limits);
(d) points of intersection with the axes $x$, $y$;
(e) the intervals where the function is positive and where it's negative;
(f) the limits $\lim_{x\to-\infty} f(x)$, $\lim_{x\to+\infty} f(x)$;
(g) the first and the second derivative;
(h) the critical and the so-called stationary points, at which the first derivative is zero (possibly also the points at which the first or the second derivative doesn't exist);
(i) the intervals of monotonicity;
(j) strict and nonstrict local and absolute extremes;
(k) the intervals where the function is convex and where it's concave;
(l) the points of inflection;
(m) the horizontal and inclined asymptotes;
(n) values of the function $f$ and its derivative $f'$ at "significant" points;
(o) the graph.
6.A.28.   Determine the range of function
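$$f(x) = \frac{e^x - 1}{e^x + 1}, \qquad x \in \mathbb{R}.$$
Solution. The line $y = 1$ is clearly an asymptote of the function $f$ at $+\infty$ and the line $y = -1$ is an asymptote at $-\infty$, because
$$\lim_{x\to+\infty}\frac{e^x - 1}{e^x + 1} = \lim_{x\to+\infty}\frac{1 - e^{-x}}{1 + e^{-x}} = 1, \qquad \lim_{x\to-\infty}\frac{e^x - 1}{e^x + 1} = \frac{0 - 1}{0 + 1} = -1.$$
The inequality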
Conversely, if the last two limits exist, the limit from the definition of the asymptote exists as well, thus these are sufficient conditions too.
The asymptote at the improper point $-\infty$ is defined similarly.
In this way we find all the lines satisfying the properties of asymptotes with slope. It remains to consider lines perpendicular to the $x$ axis:
The asymptotes at points $a \in \mathbb{R}$ are lines $x = a$ such that at least one of the one-sided limits of the function $f$ at $a$ is infinite. They are called asymptotes without slope.
The rational functions have asymptotes at all zero points of the denominator which are not zero points of the numerator as well.
We consider a simple illustrative example: Let
$$f(x) = x + \frac{1}{x}.$$
Then $f$ has two asymptotes $y = x$ and $x = 0$. Indeed, the one-sided limits from the right and left at zero are clearly $\pm\infty$, while the limit of $f(x)/x = 1 + 1/x^2$ is of course 1 at the improper points. Finally the limit of $f(x) - x = 1/x$ is zero at the improper points. By differentiating,
$$f'(x) = 1 - x^{-2}, \qquad f''(x) = 2x^{-3}.$$
The function $f'(x)$ has two zero points $\pm 1$. At $x = 1$, $f$ has a local minimum. At $x = -1$, $f$ has a local maximum. The second derivative has no zero points in all its domain $(-\infty, 0) \cup (0, \infty)$, so $f$ has no inflection points.
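The two limits that determine an asymptote with a slope can be watched numerically. The Python sketch below tabulates $f(x)/x$ and $f(x) - ax$ for growing $|x|$ for the example $f(x) = x + 1/x$; the helper names `slope_ratio` and `offset` are illustrative, and evaluating at large arguments is only a heuristic, not a proof that the limits exist.

```python
def slope_ratio(f, x: float) -> float:
    """The quotient f(x)/x, whose limit at infinity is the slope a."""
    return f(x) / x


def offset(f, a: float, x: float) -> float:
    """The difference f(x) - a*x, whose limit at infinity is the constant b."""
    return f(x) - a * x


def example(x: float) -> float:
    """The illustrative function f(x) = x + 1/x."""
    return x + 1.0 / x


if __name__ == "__main__":
    for x in (1e1, 1e2, 1e4, 1e6, -1e6):
        print(x, slope_ratio(example, x), offset(example, 1.0, x))
```

The first column of results tends to $a = 1$ and the second to $b = 0$, in agreement with the asymptote $y = x$.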
$$f'(x) = \frac{2e^x}{(e^x + 1)^2} > 0, \qquad x \in \mathbb{R},$$
6.1.11. Differential of a function. In practical use of differential calculus, we often work with dependencies between several variables, say $y$ and $x$. The choice of dependent and independent variable is not fixed. The explicit relation $y = f(x)$ with some function $f$ is then only one of possible options. Differentiation then expresses that the immediate change of $y = f(x)$ is proportional to the immediate change of $x$ with the proportion $f'(x) = \frac{df}{dx}(x)$.
then implies that $f$ is continuous and increasing on $\mathbb{R}$. Hence the range is the interval $(-1, 1)$. □
6.A.29. Find all intervals on which the function $y = e^{-x^2}$, $x \in \mathbb{R}$, is concave. O
6.A.30. Consider function
y = arctg^i, x^0(xeR).
Determine intervals on which this function is convex and concave and also all its asymptotes. O
6.A.31. Find all asymptotes of function
(a) y = x ex;
w y - (x-2Y
with maximal domain.
6.A.32. Find the asymptotes of function
y = 2arctg   ^ |, i/±l(iel).
6.A.33. Consider function
y = In 3e"+^ + 10
defined for all real $x$. Find its asymptotes.
6.A.34. Determine the course of the function
$$f(x) = \sqrt[3]{|x|^3 + 1}.$$
Solution. The domain is the whole real axis and $f$ has no discontinuities. For example, it suffices to consider that the function $y = \sqrt[3]{x}$ is continuous at every point $x \in \mathbb{R}$ (unlike even roots, which are defined only on the nonnegative axis). We can also immediately see that $f(x) \ge 1$ and $f(-x) = f(x)$ for all $x \in \mathbb{R}$, i.e. the function $f$ is positive and even. Thus, by substituting $x = 0$, we obtain the point $[0, 1]$ as the only intersection of the graph of $f$ with the axes. The limit behaviour of the function needs to be determined only at $\pm\infty$ (there are no discontinuities), where we can easily compute
$$(1) \qquad \lim_{x\to\pm\infty}\sqrt[3]{|x|^3 + 1} = \lim_{x\to\pm\infty}\sqrt[3]{|x|^3} = \lim_{x\to\pm\infty}|x| = +\infty.$$
Now we'll proceed to determining the course of the function by using its derivatives. For $x > 0$, we have
This relation is often written as
$$df(x) = \frac{df}{dx}(x)\,dx,$$
where we interpret $df(x)$ as a linear map defined on increments of $x$ at the given point, $df(x)(\delta x) = f'(x)\cdot\delta x$, while $dx(x)(\delta x) = \delta x$.
We talk about the differential of the function $f$ if the following approximation property is true:
$$\lim_{\delta x\to 0}\frac{f(x + \delta x) - f(x) - df(x)(\delta x)}{\delta x} = 0.$$
Taylor theorem then implies that a function with bounded derivative $f'$ has the differential $df$. In particular, this happens at the point $x$ if the first derivative $f'(x)$ exists and is continuous at $x$.
If the quantity $x$ is expressed by another quantity $t$, e.g. $x = g(t)$ and, moreover, $g$ has a continuous first derivative again, the chain rule for differentiating the composite functions says that $f \circ g$ has the differential too and
$$df(t) = d(f \circ g)(t) = \frac{df}{dx}(x)\,\frac{dx}{dt}(t)\,dt.$$
Therefore $df$ can be seen as a linear approximation of the given quantity dependent on the increments of the dependent variable, no matter how this dependence is given.
6.1.12. The curvature of the graph of a function. We shall conclude this section with two straightforward applications of differentials. First we discuss curves in the plane and space, starting with the graphs of functions. Then we provide a brief introduction to the numerical procedures for differentiation (jump to 6.1.16 if getting tired with the curvatures).
$$f(x) = \sqrt[3]{x^3 + 1} = \left(x^3 + 1\right)^{1/3},$$
hence
$$(2) \qquad f'(x) = \frac{1}{3}\left(x^3 + 1\right)^{-2/3}\cdot 3x^2 = \frac{x^2}{\sqrt[3]{(x^3 + 1)^2}} > 0, \qquad x > 0.$$
Imagine the graph of a function as a movement in the plane parametrized by the independent variable $x$. The vector $(1, f'(x)) \in \mathbb{R}^2$ represents the velocity at $x$ of such a movement. The tangent line through $[x, f(x)]$ parametrized by this directional vector then represents a linear approximation of the curve. The goal is to discuss how "curved" the graph is at $x$. This is a straightforward exercise working with differentials in the setup of elementary plane geometry. It might need some effort to keep the overview.
If $f''(x) = 0$ and simultaneously $f'''(x) \neq 0$, the graph of the function $f$ intersects its tangent line. In such a case, the tangent line is the best approximation of the curve at the point $x$ up to the second order as well. We describe this by saying that the graph of $f$ has zero curvature at the point $x$.
The nonzero values of the first derivative describe the speed of the growth. Intuitively we expect the second derivative to describe the acceleration, including how "curved" the graph is. As a matter of convention, we want the curvature to be positive if the graph of the function is above its tangent.
The tangent at a fixed point $P = [x, f(x)]$ is the limit of the secants, i.e. of the lines passing through the points $P$ and $Q = [x + \Delta x, f(x + \Delta x)]$. To approximate the second derivative, interpolate the points $P$ and $Q \neq P$ by the circle $C_Q$, whose
This implies that $f$ is increasing on the interval $(0, +\infty)$. With respect to its continuity at the origin, it must be increasing on $[0, +\infty)$. Because it's an even function, we know that on the interval $(-\infty, 0]$ it must be decreasing. Thus it has only one local minimum at the point $x_0 = 0$, which is also a (strict) global minimum. Because a nonconstant continuous function maps an interval to an interval, the range of $f$ is exactly $[1, +\infty)$ (consider $f(x_0) = 1$ and (1)). Notice that thanks to the even parity of the function, we didn't have to compute the derivative $f'$ on the negative half-axis, which can nevertheless be easily determined by substituting $|x|^3 = (-x)^3 = -x^3$, yielding
$$f'(x) = \frac{1}{3}\left(-x^3 + 1\right)^{-2/3}\left(-3x^2\right) = \frac{-x^2}{\sqrt[3]{(-x^3 + 1)^2}} < 0, \qquad x < 0.$$
When computing $f'(0)$, we can proceed according to the definition, or we can use the limits
$$\lim_{x\to 0+}\frac{x^2}{\sqrt[3]{(x^3 + 1)^2}} = 0 = \lim_{x\to 0-}\frac{-x^2}{\sqrt[3]{(-x^3 + 1)^2}}$$
to determine the one-sided derivatives and then $f'(0) = 0$. In fact, we didn't even have to compute the first derivative on the positive half-axis either. To obtain that $f$ is increasing on $(0, +\infty)$, we only needed to realize that both functions $y = \sqrt[3]{x}$ and $y = x^3 + 1$ are increasing on $\mathbb{R}$ and a composition of increasing functions is again an increasing function.
For $x > 0$, we can easily compute the second derivative using (2):
$$f''(x) = \frac{2x\sqrt[3]{(x^3 + 1)^2} - \frac{2}{3}x^2\left(x^3 + 1\right)^{-1/3}\left(3x^2\right)}{\sqrt[3]{(x^3 + 1)^4}},$$
i.e. after a simplification we have
$$(3) \qquad f''(x) = \frac{2x}{\sqrt[3]{(x^3 + 1)^5}} > 0, \qquad x > 0.$$
Similarly we can compute
$$f''(x) = -\frac{2x\sqrt[3]{(-x^3 + 1)^2} - \frac{2}{3}x^2\left(-x^3 + 1\right)^{-1/3}\left(-3x^2\right)}{\sqrt[3]{(-x^3 + 1)^4}} = \frac{-2x}{\sqrt[3]{(-x^3 + 1)^5}} > 0$$
for $x < 0$, and then $f''(0) = 0$, where we can use the limit transition
$$\lim_{x\to 0+}\frac{2x}{\sqrt[3]{(x^3 + 1)^5}} = 0 = \lim_{x\to 0-}\frac{-2x}{\sqrt[3]{(-x^3 + 1)^5}}.$$
According to the inequality (3), $f$ is strictly convex on the interval $(0, +\infty)$. Also $f$ must be strictly convex on $(-\infty, 0)$.
center is at the intersection of the perpendicular lines to the tangents at $P$ and $Q$.
It can be seen from the figure that if the angle between the tangent at the fixed point $P$ and the $x$ axis is $\alpha$ and the angle between the tangent at the chosen point $Q$ and the $x$ axis is $\alpha + \Delta\alpha$, then the angle of the latter perpendicular lines is $\Delta\alpha$ as well.
Denote the radius of the circle by $\rho$. Then the length of the arc between points $P$ and $Q$ is $\rho\,\Delta\alpha$. As $Q$ approaches the fixed point $P$, the length $\rho\,\Delta\alpha$ of the arc approaches the length $\Delta s$ of the curve between $P$ and $Q$, that is, of the graph of the function $f(x)$. At the same time the circle approaches some circle $C_P$. Thus we arrive at the basic relation for the expected radius $\rho$ of the circle $C_P$ in terms of the linear approximations of the quantities:
$$\rho = \lim_{\Delta\alpha\to 0}\frac{\Delta s}{\Delta\alpha} = \frac{ds}{d\alpha}.$$
Notice that the quantity on the right hand side is well defined (independently of its rather intuitive justification).
Define the curvature of the graph of the function $f$ at the point $P$ as the number $1/\rho$. Zero curvature then corresponds to an infinite radius $\rho$.
For computing the radius $\rho$ in terms of $f$ we need to express the length of the arc $s$ by the change of the angle $\alpha$ and express the derivative of this function by the derivative of $f$.
Notice that for an increasing angle $\alpha$ the length of the arc can either increase or decrease, depending on whether the circle $C_Q$ has its center above or below the graph of the function $f$. The sign of $\rho$ then reflects whether the function is concave or convex. There is also the special case when the center "runs off" to infinity in the limit. Instead of a circle there is the tangent line.
There is no direct tool to compute the derivative $\frac{ds}{d\alpha}$. However, $\operatorname{tg}\alpha = \frac{df}{dx}$. By differentiating this equality with respect to $x$ we obtain (using the chain rule for differentials)
$$\frac{1}{(\cos\alpha)^2}\,\frac{d\alpha}{dx} = f''.$$
To obtain this conclusion though, we again didn't have to compute the second derivative for $x < 0$; it sufficed to use the even parity of the function. In total, we obtained that $f$ is convex on its whole domain (it doesn't have any inflection points).
To be able to plot the graph of the function, we still need to find the asymptotes (we leave the computation of values of the function at certain points to the reader). Since $f$ is continuous on $\mathbb{R}$, it can't have any vertical asymptotes. A line $y = ax + b$ is an inclined asymptote for $x \to \infty$ if and only if both (proper) limits
$$\lim_{x\to\infty}\frac{f(x)}{x} = a, \qquad \lim_{x\to\infty}\bigl(f(x) - ax\bigr) = b$$
exist. An analogous statement holds for $x \to -\infty$. Hence the limits
$$\lim_{x\to\infty}\frac{f(x)}{x} = \lim_{x\to\infty}\frac{\sqrt[3]{x^3 + 1}}{x} = \lim_{x\to\infty}\sqrt[3]{1 + \frac{1}{x^3}} = 1,$$
$$\lim_{x\to\infty}\bigl(f(x) - 1\cdot x\bigr) = \lim_{x\to\infty}\left(\sqrt[3]{x^3 + 1} - x\right) = \lim_{x\to\infty}\frac{x^3 + 1 - x^3}{\sqrt[3]{(x^3 + 1)^2} + x\sqrt[3]{x^3 + 1} + x^2} = 0$$
imply that the line $y = x$ is an asymptote at $+\infty$. If we again consider the fact that $f$ is even, we'll immediately obtain the line $y = -x$ as an asymptote at $-\infty$. □
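6.A.35. Determine the course of the function
$$f(x) = \frac{\cos x}{\cos 2x}.$$
Solution. The domain consists of exactly those $x \in \mathbb{R}$ for which $\cos 2x \neq 0$. The equality $\cos 2x = 0$ is satisfied exactly for
$$2x = \frac{\pi}{2} + k\pi,\ k \in \mathbb{Z}, \qquad \text{i.e.} \qquad x = \frac{\pi}{4} + \frac{k\pi}{2},\ k \in \mathbb{Z}.$$
Hence the domain is
$$\mathbb{R} \setminus \left\{\frac{\pi}{4} + \frac{k\pi}{2};\ k \in \mathbb{Z}\right\}.$$
Clearly we have
$$f(-x) = \frac{\cos(-x)}{\cos(-2x)} = \frac{\cos x}{\cos 2x} = f(x)$$
for all $x$ in the domain, thus $f$ (with its domain symmetric with respect to the origin) is an even function, which follows from the even parity of the function $y = \cos x$. Moreover, because cosine is periodic with a period of $2\pi$ (i.e. $y = \cos 2x$ has a period of $\pi$), it suffices to consider the function $f$ for
$$x \in \mathcal{D} := [0, \pi] \setminus \left\{\frac{\pi}{4} + \frac{k\pi}{2};\ k \in \mathbb{Z}\right\} = \left[0, \frac{\pi}{4}\right) \cup \left(\frac{\pi}{4}, \frac{3\pi}{4}\right) \cup \left(\frac{3\pi}{4}, \pi\right],$$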
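On the left hand side we can substitute
$$\frac{1}{(\cos\alpha)^2} = 1 + (\operatorname{tg}\alpha)^2 = 1 + (f')^2,$$
which implies (see the rule for differentiating inverse functions)
$$\frac{dx}{d\alpha} = \frac{1 + (\operatorname{tg}\alpha)^2}{f''} = \frac{1 + (f')^2}{f''}.$$
Now, we are almost finished, because the increment of the length of the arc $s$ dependent on $x$ is given by the formula
$$\frac{ds}{dx} = \left(1 + (f')^2\right)^{1/2}.$$
Thus, by the chain rule,
$$\rho = \frac{ds}{d\alpha} = \frac{ds}{dx}\,\frac{dx}{d\alpha} = \frac{\left(1 + (f')^2\right)^{3/2}}{f''}.$$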
The result explains the relation between the curvature and the second derivative. The numerator of the fraction is always positive. It equals the third power of the length of the tangent vector of the given curve. The sign of the curvature is therefore given only by the sign of the second derivative, which confirms the ideas about concave and convex points of functions.
If the second derivative is zero, the curvature $1/\rho$ is also zero. If $f''$ is large, then the radius $\rho$ is small, thus the curvature is large as well.
The circle by which the curvature is defined is called the osculating circle.
Compute the curvature of simple functions yourself and use osculating circles while sketching their graphs. The computation at the critical points of the function / is easiest. The radius of the osculating circle is the reciprocal value of the second derivative with the corresponding sign.
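As a starting point for such computations, the radius formula $\rho = (1 + (f')^2)^{3/2}/f''$ can be evaluated directly once the values of the first and second derivative are known. The Python sketch below is a minimal illustration; the function name `curvature_radius` is hypothetical, and the derivative values are supplied by hand rather than computed symbolically.

```python
import math


def curvature_radius(df: float, d2f: float) -> float:
    """Signed radius rho = (1 + f'**2)**1.5 / f'' of the osculating circle.

    Returns infinity when f'' = 0, which corresponds to zero curvature.
    """
    if d2f == 0:
        return math.inf
    return (1 + df ** 2) ** 1.5 / d2f


if __name__ == "__main__":
    # Parabola f(x) = x**2 at its critical point x = 0: f' = 0, f'' = 2.
    print(curvature_radius(0.0, 2.0))

    # Lower unit half-circle f(x) = -sqrt(1 - x**2) at x = 0.6:
    # f' = x / sqrt(1 - x**2), f'' = (1 - x**2)**(-1.5).
    x = 0.6
    df = x / math.sqrt(1 - x ** 2)
    d2f = (1 - x ** 2) ** -1.5
    print(curvature_radius(df, d2f))
```

For the half-circle the radius comes out as 1 at every point, as expected, and at a critical point the radius reduces to $1/f''$.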
6.1.13. Vector differential calculus. As mentioned already in the introduction to chapter five, most considerations related to differentiation are based on the fact that the functions are defined on real numbers and that their values can be added and multiplied by real numbers. That is why functions $f : \mathbb{R} \to V$ need to have values in a vector space $V$. We call them vector functions of one real variable or more briefly vector functions.
To end this section, we digress to consider functions with values in the plane or in space. Thus, $f : \mathbb{R} \to \mathbb{R}^2$ and $f : \mathbb{R} \to \mathbb{R}^3$. We consider (parametrized) curves in the plane and space. We could work with values in $\mathbb{R}^n$ for any finite dimension $n$.
For simplification, we work with the fixed standard bases $e_i$ in $\mathbb{R}^2$ and $\mathbb{R}^3$. So curves are given by pairs or triples of real functions of one real variable, respectively. The vector function $r$ in plane or space, respectively, is given by
$$r(t) = x(t)e_1 + y(t)e_2, \qquad r(t) = x(t)e_1 + y(t)e_2 + z(t)e_3.$$
The derivative of such a vector function is a vector, which approximates the map r by a linear map of the real line to the plane or to the space.
since the course of the function on its whole domain can be derived using its even parity and periodicity with a period of $2\pi$.
Hence we'll only be concerned with the discontinuities $x_1 = \pi/4$ and $x_2 = 3\pi/4$. We'll determine the corresponding one-sided limits
$$\lim_{x\to\frac{\pi}{4}-}\frac{\cos x}{\cos 2x} = +\infty, \qquad \lim_{x\to\frac{\pi}{4}+}\frac{\cos x}{\cos 2x} = -\infty,$$
$$\lim_{x\to\frac{3\pi}{4}-}\frac{\cos x}{\cos 2x} = +\infty, \qquad \lim_{x\to\frac{3\pi}{4}+}\frac{\cos x}{\cos 2x} = -\infty.$$
Taking into account the continuity of $f$ on the interval $(\pi/4, 3\pi/4)$, we can see that $f$ attains all real values on this interval. Hence the range of $f$ is the whole $\mathbb{R}$. We also found out that the discontinuities are of the second kind, where at least one of the one-sided limits is improper (or doesn't exist). By that, we simultaneously proved that the lines $x = \pi/4$ and $x = 3\pi/4$ are vertical asymptotes. If we want to formulate the previous results without the restriction to the interval $[0, \pi]$, we can say that at all points
$$x_k = \frac{\pi}{4} + \frac{k\pi}{2}, \qquad k \in \mathbb{Z},$$
$f$ has a discontinuity of the second kind and every line
$$x = \frac{\pi}{4} + \frac{k\pi}{2}, \qquad k \in \mathbb{Z},$$
is a vertical asymptote. Also the periodicity of $f$ implies that no other asymptotes exist. In particular, it cannot have any inclined asymptotes, nor can the (improper) limits $\lim_{x\to+\infty} f(x)$, $\lim_{x\to-\infty} f(x)$ exist. Now we'll find the points of intersection with the axes. The point of intersection $[0, 1]$ with the $y$ axis can be found by computing $f(0) = 1$. When looking for the points of intersection with the $x$ axis, we consider the equation $\cos x = 0$, $x \in \mathcal{D}$, with the only solution being $x = \pi/2$. Then we can easily obtain the intervals $[0, \pi/4)$, $(\pi/2, 3\pi/4)$, on which $f$ is positive, and the intervals $(\pi/4, \pi/2)$, $(3\pi/4, \pi]$, where it's negative. Now we'll proceed to computing the derivative
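$$f'(x) = \frac{-\sin x\cos 2x - 2\cos x\,(-\sin 2x)}{\cos^2 2x} = \frac{-\sin x\left(\cos^2 x - \sin^2 x\right) + 2\cos x\,(2\sin x\cos x)}{\cos^2 2x}$$
$$= \frac{\sin^3 x + 3\cos^2 x\sin x}{\cos^2 2x} = \frac{\left(\sin^2 x + \cos^2 x + 2\cos^2 x\right)\sin x}{\cos^2 2x} = \frac{\left(2\cos^2 x + 1\right)\sin x}{\cos^2 2x}, \qquad x \in \mathcal{D}.$$
In the plane it is
$$\frac{dr}{dt}(t) = r'(t) = x'(t)e_1 + y'(t)e_2$$
and similarly in space.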
The differential of a vector function in this context is:
(dx       dy       dz \
where the expression on the right hand side is understood as "selecting" an increment of the scalar independent variable t and mapping it linearly by multiplying the vector of the three derivative components. Thus the corresponding increment of the vector quantity r is obtained (of course, only two components in the plane).
The notation $r(t)$ is a convenient way to describe curves in space. For example $r(t) = (a\cos t, a\sin t, bt)$ or $r(t) = a\cos t\,e_1 + a\sin t\,e_2 + bt\,e_3$ for fixed constants $a$, $b$ describes a circular helix. Here the parameter $t$ is related to a suitable angle measured around the $z$-axis. The derivative of $r(t)$ at $t = t_0$ determines the direction of the tangent line at $r(t_0)$. In Newtonian mechanics, the parameter $t$ can stand for time, measured in suitable units. In this case the derivative of $r(t)$ at time $t = t_0$ gives the velocity vector at the same time. The second derivative then represents the acceleration vector at the same time.
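As a small numerical illustration (a sketch, not part of the original text), the following Python snippet evaluates the helix together with its velocity and acceleration, and compares the velocity with a componentwise central difference. The function names and the sample constants a = 2, b = 0.5 are assumptions made only for this example.

```python
import math

A, B = 2.0, 0.5  # illustrative constants a, b of the helix (assumed values)

def helix(t):
    """Point r(t) = (a cos t, a sin t, b t) of the circular helix."""
    return (A * math.cos(t), A * math.sin(t), B * t)

def velocity(t):
    """First derivative r'(t), computed coordinate by coordinate."""
    return (-A * math.sin(t), A * math.cos(t), B)

def acceleration(t):
    """Second derivative r''(t)."""
    return (-A * math.cos(t), -A * math.sin(t), 0.0)

def central_difference(curve, t, h=1e-5):
    """Approximate curve'(t) by (curve(t+h) - curve(t-h)) / (2h) in each coordinate."""
    ahead, behind = curve(t + h), curve(t - h)
    return tuple((p - q) / (2 * h) for p, q in zip(ahead, behind))

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))

if __name__ == "__main__":
    t0 = 1.0
    print("velocity    :", velocity(t0))
    print("approximated:", central_difference(helix, t0))
    print("speed       :", norm(velocity(t0)))  # constant, equal to sqrt(a^2 + b^2)
    print("acceleration:", acceleration(t0))
```

For this particular curve the speed is constant and the acceleration is orthogonal to the velocity, which can be read off the formulas directly.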
6.1.14. Differentiating composite maps. In linear algebra and geometry there are very useful special maps called forms. They have one or more vectors as their arguments and they are linear in each of their arguments. In this way we defined the length of vectors (the dot product is a symmetric bilinear form) or the volume of a parallelepiped (this is an $n$-linear antisymmetric form, where $n$ is the dimension of the space), see for example the paragraphs 2.3.22 and 4.1.22.
Of course, we insert vectors r(t) dependent on a parameter as the arguments of these operations. By a straightforward usage of the Leibniz rule for differentiation of a product of functions, the following is verified:
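Theorem. (1) If $r(t) : \mathbb{R} \to \mathbb{R}^n$ is a differentiable vector and $\Psi : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, then the derivative of the map $\Psi \circ r$ satisfies
$$\frac{d(\Psi \circ r)}{dt} = \Psi \circ \frac{dr}{dt}.$$
(2) Let there be differentiable vectors $r_1, \dots, r_k : \mathbb{R} \to \mathbb{R}^n$ and a $k$-linear form $\Phi : \mathbb{R}^n \times \dots \times \mathbb{R}^n \to \mathbb{R}$ on the space $\mathbb{R}^n$. Then the derivative of the composed map
$$\varphi(t) = \Phi(r_1(t), \dots, r_k(t))$$
satisfies the (generalized) Leibniz rule
$$\frac{d\varphi}{dt} = \Phi\left(\frac{dr_1}{dt}, r_2, \dots, r_k\right) + \dots + \Phi\left(r_1, \dots, r_{k-1}, \frac{dr_k}{dt}\right).$$
(3) The previous statement remains valid even if $\Phi$ also has values in a vector space and is linear in all its $k$ arguments.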
CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS
The points at which $f'(x) = 0$ are clearly the solutions of the equation $\sin x = 0$, $x \in \mathcal{D}$, i.e. the derivative is zero at the points $x_3 = 0$, $x_4 = \pi$. The inequalities
$$2\cos^2 x + 1 \ge \cos^2 2x > 0, \qquad \sin x > 0, \qquad x \in \mathcal{D} \cap (0, \pi)$$
imply that $f$ is increasing at every inner point of the set $\mathcal{D}$, thus $f$ is increasing on every subinterval of $\mathcal{D}$. The even parity of $f$ then implies that it's decreasing at every point $x \in (-\pi, 0)$, $x \neq -3\pi/4$, $x \neq -\pi/4$. Hence the function has strict local extremes exactly at the points
$$x_k = k\pi, \qquad k \in \mathbb{Z}.$$
With respect to the periodicity of $f$, we uniquely describe these extremes by stating that for $x_3 = 0$ (i.e. $k = 0$) we get a local minimum (recall the value of the function $f(0) = 1$) and for $x_4 = \pi$ (i.e. $k = 1$) a local maximum with the value $f(\pi) = -1$. Let's compute the second derivative
$$f''(x) = \frac{[4\cos x(-\sin x)\sin x + (2\cos^2 x + 1)\cos x]\cos^2 2x - 4\cos 2x(-\sin 2x)(2\cos^2 x + 1)\sin x}{\cos^4 2x}$$
$$= \frac{[10\sin^2 x\cos^2 x + 2\cos^4 x + \cos^2 x + 4\sin^4 x + 7\sin^2 x]\cos x}{\cos^3 2x}, \qquad x \in \mathcal{D}.$$
Note that after a few simplifications, we can also express
$$f''(x) = \frac{(3 + 4\cos^2 x\sin^2 x + 8\sin^2 x)\cos x}{\cos^3 2x}, \qquad x \in \mathcal{D},$$
or
$$f''(x) = \frac{(11 - 4\cos^4 x - 4\cos^2 x)\cos x}{\cos^3 2x}, \qquad x \in \mathcal{D}.$$
Since
$$10\sin^2 x\cos^2 x + 2\cos^4 x + \cos^2 x + 4\sin^4 x + 7\sin^2 x > 0, \qquad x \in \mathbb{R},$$
or
$$3 + 4\cos^2 x\sin^2 x + 8\sin^2 x = 11 - 4\cos^4 x - 4\cos^2 x \ge 3, \qquad x \in \mathbb{R},$$
respectively, we have $f''(x) = 0$ for certain $x \in \mathcal{D}$ if and only if $\cos x = 0$. But that's satisfied only by $x_5 = \pi/2 \in \mathcal{D}$. It's clear that $f''$ changes its sign at this point, i.e. it's a point of inflection. No other points of inflection exist (the second derivative $f''$ is continuous on $\mathcal{D}$). Other changes of the sign of $f''$ occur at the zero points of the denominator, which we have already determined as the discontinuities $x_1 = \pi/4$ and $x_2 = 3\pi/4$. Hence the sign changes exactly at the points $x_1$, $x_2$, $x_5$, thus the inequality
Proof. (1) The linear maps are given by a constant matrix of scalars $A = (a_{ij})$ so that
$$\Psi \circ r(t) = \left(\sum_{i=1}^n a_{1i} r_i(t), \dots, \sum_{i=1}^n a_{mi} r_i(t)\right).$$
Carry out the differentiation separately for individual coordinates of the result. However, the derivative acts linearly with respect to scalar linear combinations, see Theorem 5.3.4. That is why the derivative is obtained simply by evaluating the original linear map $\Psi$ on the derivative $r'(t)$.
(2) The second statement is obtained analogously. Write out the evaluation of the $k$-linear form on the vectors $r_1, \dots, r_k$ in the coordinates in this way:
$$\Phi(r_1(t), \dots, r_k(t)) = \sum_{i_1, \dots, i_k = 1}^{n} B_{i_1 \dots i_k}\,(r_1)_{i_1}(t) \cdots (r_k)_{i_k}(t),$$
where the scalars $B_{i_1 \dots i_k}$ are given as the value $\Phi(e_{i_1}, \dots, e_{i_k})$ of the given form on the chosen $k$-tuple of basis vectors for every choice of indices. The rule for differentiating a product of scalar functions then yields the statement.
(3) If $\Phi$ has vector values, it is given by finitely many components and the previous result can be used for each of them. □
In the Euclidean space $\mathbb{R}^3$, the scalar product assigns a scalar to two vectors. There is also the vector product, which assigns the vector $u \times v \in \mathbb{R}^3$ to vectors $u$ and $v$, see 4.1.24. This vector $u \times v$ is orthogonal to both vectors $u$ and $v$, its length equals the area of the parallelogram determined by $u$ and $v$ (in this order) and the orientation is such that the triple $u$, $v$, $u \times v$ is a positively oriented basis.
The previous ideas immediately imply:
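Corollary. Consider the vectors $u(t)$ and $v(t)$ in the space $\mathbb{R}^3$. The derivatives of their scalar product $\langle u(t), v(t)\rangle$ and their vector product $u(t) \times v(t)$ satisfy
$$(1) \qquad \frac{d}{dt}\langle u(t), v(t)\rangle = \langle u'(t), v(t)\rangle + \langle u(t), v'(t)\rangle,$$
$$(2) \qquad \frac{d}{dt}\big(u(t) \times v(t)\big) = u'(t) \times v(t) + u(t) \times v'(t).$$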
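The two product rules of the corollary can be checked numerically. The sketch below is only an illustration: the two sample curves u, v and all helper names are assumptions made for this example, and the left hand sides are approximated by central differences.

```python
import math

def dot(u, v):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    """Vector product in R^3."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

# Two sample curves and their exact derivatives (chosen only for illustration).
def u(t):
    return (math.cos(t), math.sin(t), t)

def du(t):
    return (-math.sin(t), math.cos(t), 1.0)

def v(t):
    return (t * t, 1.0, math.exp(t))

def dv(t):
    return (2 * t, 0.0, math.exp(t))

def diff_scalar(g, t, h=1e-5):
    """Central difference for a scalar function."""
    return (g(t + h) - g(t - h)) / (2 * h)

def diff_vector(g, t, h=1e-5):
    """Central difference for a vector function, coordinate by coordinate."""
    return tuple((p - q) / (2 * h) for p, q in zip(g(t + h), g(t - h)))

def check(t):
    """Return both sides of rules (1) and (2) at the parameter value t."""
    lhs_dot = diff_scalar(lambda s: dot(u(s), v(s)), t)
    rhs_dot = dot(du(t), v(t)) + dot(u(t), dv(t))
    lhs_cross = diff_vector(lambda s: cross(u(s), v(s)), t)
    rhs_cross = add(cross(du(t), v(t)), cross(u(t), dv(t)))
    return lhs_dot, rhs_dot, lhs_cross, rhs_cross
```

Calling `check(0.8)` returns pairs of values that agree up to the discretization error of the central difference.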
6.1.15. The curvature of curves. We develop far more powerful tools for studying curves in a more systematic way than when we discussed the curvature of the graphs of functions. We proceed in dimension three. Plane curves are a special case in which the third component is the constant zero.
Let $r(t)$ be a curve in the Euclidean space $\mathbb{R}^3$. For theoretical purposes, it is convenient to choose the arc length $s$ as a parameter. Since $dr^2 = dx^2 + dy^2 + dz^2 = ds^2$, it follows that $\left|\frac{dr}{ds}\right| = 1$, so that the tangent vector has unit length. When $s$ is the parameter, the notation $'$ is used for differentiation.
So $\langle r'(s), r'(s)\rangle = 1$ for all $s$. The curve $r(s)$ is parametrized by the length $s$. By another differentiation of
$$f''(x) > 0 \quad \text{for } x \to 0^+$$
implies that $f$ is convex on the interval $[0, \pi/4)$, concave on $(\pi/4, \pi/2]$, convex on $[\pi/2, 3\pi/4)$ and concave on $(3\pi/4, \pi]$. The convexity and concavity of $f$ on other subintervals is given by its periodicity and a simple observation: if a function is even and convex on an interval $(a, b)$, where $0 < a < b$, then it's also convex on $(-b, -a)$.
All that's left is computing the derivative (to estimate the speed of the growth of the function) at the point of inflection, yielding $f'(\pi/2) = 1$. Based on all previous results, it's now easy to plot the graph of the function $f$. □
6.A.36.   Determine the course of the function
$$f(x) = \frac{x}{\ln(x)}$$
and plot its graph.
Solution. i) First we'll determine the domain of the function: $\mathbb{R}^+ \setminus \{1\}$.
ii) We'll find the intervals of monotonicity of the function: first we'll find the zero points of the derivative:
$$f'(x) = \frac{\ln(x) - 1}{\ln^2(x)} = 0.$$
The root of this equation is $e$. Next we can see that $f'(x)$ is negative on both intervals $(0, 1)$ and $(1, e)$, hence $f(x)$ is decreasing on both of them. Additionally, $f'(x)$ is positive on the interval $(e, \infty)$, thus $f(x)$ is increasing there. That means the function $f$ has its only extreme at the point $e$, and it is a minimum (we can also decide this using the sign of the second derivative of the function $f$ at the point $e$, because $f^{(2)}(e) > 0$).
iii) We'll find the points of inflection:
$$f^{(2)}(x) = \frac{2 - \ln(x)}{x\ln^3(x)} = 0.$$
The root of this equation is $e^2$, so it must be a point of inflection (it cannot be an extreme with regard to the previous point).
iv) The asymptotes. The line $x = 1$ is an asymptote of the function. Next, let's look for asymptotes with a finite slope $k$:
$$k = \lim_{x\to\infty} \frac{x/\ln(x)}{x} = \lim_{x\to\infty} \frac{1}{\ln(x)} = 0.$$
this unit vector $r'(s)$, we obtain the vector $r''(s)$, for which (using the symmetry of the dot product)
$$0 = \frac{d}{ds}\langle r'(s), r'(s)\rangle = 2\langle r''(s), r'(s)\rangle.$$
Thus the vector $r''(s)$ is always orthogonal to the vector $r'(s)$.
This corresponds to the idea that after the choice of a parametrization with a derivative of constant length, the second derivative in the direction of the movement vanishes. The second derivative lies in the plane orthogonal to the tangent vector.
If the second derivative is nonzero, the normed vector
$$n(s) = \frac{1}{\|r''(s)\|}\,r''(s)$$
is the (principal) normal of the curve $r(s)$. The scalar function $\kappa(s)$ satisfying (at the points where $r''(s) \neq 0$)
$$r''(s) = \kappa(s)\,n(s)$$
is called the curvature of the curve $r(s)$. At the zero points of the second derivative, $\kappa(s)$ is defined as 0.
At the nonzero points of the curvature, the unit vector $b(s) = r'(s) \times n(s)$ is well defined and is called the binormal of the curve $r(s)$. By direct computation
$$0 = \frac{d}{ds}\langle b(s), r'(s)\rangle = \langle b'(s), r'(s)\rangle + \langle b(s), r''(s)\rangle = \langle b'(s), r'(s)\rangle + \kappa(s)\langle b(s), n(s)\rangle = \langle b'(s), r'(s)\rangle,$$
which shows that the derivative of the binormal is orthogonal to $r'(s)$. The vector $b'(s)$ is also orthogonal to $b(s)$ (for the same reason as with $r'$ above). Therefore it is a multiple of the principal normal $n(s)$. We write
$$b'(s) = -\tau(s)\,n(s).$$
The scalar function $\tau(s)$ is called the torsion of the curve $r(s)$.
In the case of plane curves, the definitions of binormal and torsion do not make sense.
If the asymptote exists, its slope must be 0. Let's continue the computation:
$$\lim_{x\to\infty}\left(\frac{x}{\ln(x)} - 0\cdot x\right) = \lim_{x\to\infty}\frac{x}{\ln(x)} = \infty,$$
and because the limit isn't finite, an asymptote with a finite slope doesn't exist.
□
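The conclusions of 6.A.36 can be cross-checked numerically. The following sketch assumes the function is $f(x) = x/\ln(x)$ with the derivatives stated in the solution; the helper names are chosen here for illustration only.

```python
import math

def f(x):
    """The investigated function x / ln(x), defined for x > 0, x != 1."""
    return x / math.log(x)

def f1(x):
    """First derivative (ln x - 1) / ln^2 x."""
    return (math.log(x) - 1) / math.log(x) ** 2

def f2(x):
    """Second derivative (2 - ln x) / (x ln^3 x)."""
    return (2 - math.log(x)) / (x * math.log(x) ** 3)

def central_difference(g, x, h=1e-5):
    """Approximate g'(x) by a central difference."""
    return (g(x + h) - g(x - h)) / (2 * h)

if __name__ == "__main__":
    e = math.e
    print("f'(e)            =", f1(e))            # zero of the derivative
    print("f''(e)           =", f2(e))            # positive, so a minimum
    print("f'' around e^2   =", f2(e**2 - 0.1), f2(e**2 + 0.1))  # sign change
    print("f(x)/x for big x =", f(1e9) / 1e9)     # slope of a possible asymptote
```

The last printed value only suggests that the slope tends to zero; it decreases slowly, like $1/\ln(x)$.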
Now we move from determining the course of functions to other subjects connected to derivatives of functions. First we'll demonstrate the concept of curvature and the osculating circle on an ellipse.
6.A.37. Determine the curvature of the ellipse $x^2 + 2y^2 = 2$ at its vertices (4.C.9). Also determine the equations of the circles of osculation at these vertices.
Solution. Because the ellipse is already in the basic form at the given coordinates (there are no mixed or linear terms), the given basis is already a polar basis. Its axes are the coordinate axes $x$ and $y$, its vertices are the points $[\sqrt{2}, 0]$, $[-\sqrt{2}, 0]$, $[0, 1]$ and $[0, -1]$. Let's first compute the curvature at the vertex $[0, 1]$. If we consider the coordinate $y$ as a function of the coordinate $x$ (determined uniquely in a neighbourhood of $[0, 1]$), then differentiating the equation of the ellipse with respect to the variable $x$ yields $2x + 4yy' = 0$, hence $y' = -\frac{x}{2y}$ ($y'$ denotes the derivative of the function $y(x)$ with respect to the variable $x$; in fact it's nothing else than expressing the derivative of a function given implicitly, see ??). Differentiating this equation with respect to $x$ then yields $y'' = -\frac{1}{2}\left(\frac{1}{y} - \frac{xy'}{y^2}\right)$. At the point $[0, 1]$, we obtain $y' = 0$ and $y'' = -\frac{1}{2}$ (we'd receive the same results if we explicitly expressed $y = \sqrt{(2 - x^2)/2}$ from
We have not yet computed the rate of change of the principal normal, which can be written as $n(s) = b(s) \times r'(s)$:
$$n'(s) = b'(s) \times r'(s) + \kappa(s)\,b(s) \times n(s) = -\tau(s)\,n(s) \times r'(s) + \kappa(s)(-r'(s)) = \tau(s)\,b(s) - \kappa(s)\,r'(s).$$
Thus, at all points with nonzero second derivative of the curve $r(s)$ parametrized by the arc length, we have derived the important basis $(r'(s), n(s), b(s))$, called the Frenet frame in the classical literature. At the same time, this basis is used to express the derivatives of its components in the form of the Frenet-Serret formulas
$$\frac{dr'}{ds}(s) = \kappa(s)\,n(s), \qquad \frac{dn}{ds}(s) = \tau(s)\,b(s) - \kappa(s)\,r'(s), \qquad \frac{db}{ds}(s) = -\tau(s)\,n(s).$$
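To make the definitions concrete, here is a hedged Python sketch that builds the Frenet frame of a circular helix parametrized by arc length and recovers its curvature and torsion from the definitions above. The constants a = 2, b = 0.5 and all function names are assumptions for this illustration; the derivative of the binormal is approximated by a central difference.

```python
import math

A, B = 2.0, 0.5                 # illustrative helix constants (assumed values)
C = math.sqrt(A * A + B * B)    # speed of t -> (A cos t, A sin t, B t), so s = C t

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def tangent(s):
    """r'(s) for the helix parametrized by the arc length s."""
    t = s / C
    return (-A / C * math.sin(t), A / C * math.cos(t), B / C)

def second_derivative(s):
    """r''(s)."""
    t = s / C
    return (-A / C**2 * math.cos(t), -A / C**2 * math.sin(t), 0.0)

def curvature(s):
    """kappa(s) is the length of r''(s)."""
    return norm(second_derivative(s))

def normal(s):
    """Principal normal n(s) = r''(s) / |r''(s)|."""
    k = curvature(s)
    return tuple(c / k for c in second_derivative(s))

def binormal(s):
    """b(s) = r'(s) x n(s)."""
    return cross(tangent(s), normal(s))

def torsion(s, h=1e-5):
    """tau(s) from b'(s) = -tau(s) n(s), with b' approximated numerically."""
    db = tuple((p - q) / (2 * h) for p, q in zip(binormal(s + h), binormal(s - h)))
    return -dot(db, normal(s))
```

For the helix one expects the constant values $\kappa = a/(a^2+b^2)$ and $\tau = b/(a^2+b^2)$, which the functions reproduce up to rounding and discretization errors.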
The following theorem tells how crucial the curvature and torsion are. Notice that if the curve r (s) lies in one plane, then the torsion is identically zero. In fact, the converse is true as well. We shall not provide the proofs here.
Theorem. Two curves in space parametrized by the length of their arc can be mapped to each other by a Euclidean transformation if and only if their curvature functions and torsion functions coincide except for a constant shift of the parameter. Moreover, for every choice of smooth functions $\kappa$ and $\tau$ there exists a smooth curve with these parameters.
By a straightforward computation we can check that the curvature of the graph of the function $y = f(x)$ in the plane and the curvature $\kappa$ of this curve defined in this paragraph coincide. Indeed, comparing the differentials of the length of the arc for the graph of a function (as a curve with coordinates $x(s)$, $y(s) = f(x(s))$):
$$ds = (1 + (f_x)^2)^{1/2}\,dx, \qquad dx = (1 + (f_x)^2)^{-1/2}\,ds$$
(here we write $f_x = \frac{df}{dx}$) we obtain the following equality for the unit tangent vector of the graph:
$$r'(s) = (x'(s), y'(s)) = \left((1 + (f_x)^2)^{-1/2},\ f_x\,(1 + (f_x)^2)^{-1/2}\right).$$
A messy, but very similar computation for the second derivative and its length leads to
$$\|r''\|^2 = (f_{xx})^2\,(1 + (f_x)^2)^{-3}$$
as expected. If we write $r = (x, y)$, $y' = f_x x'$, $x' = (1 + f_x^2)^{-1/2}$, then
$$x'' = -\tfrac{1}{2}(1 + f_x^2)^{-3/2}\,2f_x f_{xx}\,x' = -f_x f_{xx}\,(x')^4,$$
$$y'' = f_{xx}(x')^2 + f_x x'' = f_{xx}(x')^2 - f_x^2 f_{xx}\,(x')^4.$$
the equation of the ellipse and performed differentiation; the computation would be only a little more complicated, as the reader can surely verify). According to 6.1.12, the radius of the osculation circle will be
Hence
$$(x'')^2 + (y'')^2 = f_{xx}^2\,(x')^8\left(f_x^2 + (1 + f_x^2)^2 + f_x^4 - 2f_x^2(1 + f_x^2)\right)$$
$$= f_{xx}^2\,(1 + f_x^2)^{-4}(f_x^2 + 1) = f_{xx}^2\,(1 + f_x^2)^{-3}.$$
$$\frac{(1 + (y')^2)^{3/2}}{y''} = -2,$$
or 2, respectively, and the sign tells us the circle will be "below" the graph of the function. The ideas in 6.1.12 and 6.1.15 imply that its center will be in the direction opposite to the normal line of this curve, i.e. on the $y$ axis (the function $y$ as a function of the variable $x$ has a zero derivative at the point $[0, 1]$, thus the tangent line to its graph at this point will be parallel to the $x$ axis, and because the normal is perpendicular to the tangent, it must be the $y$ axis at this point). The radius is 2, so the center will be at the point $[0, 1 - 2] = [0, -1]$. In total, the equation of the osculation circle of the ellipse $x^2 + 2y^2 = 2$ at the point $[0, 1]$ will be $x^2 + (y + 1)^2 = 4$. Analogously, we can determine the equation of the osculation circle at the point $[0, -1]$: $x^2 + (y - 1)^2 = 4$. The curvatures of the ellipse (as a curve) at these points then equal $\frac{1}{2}$ (the absolute value of the curvature of the graph of the function).
For determining the osculation circle at the point $[\sqrt{2}, 0]$, we'll consider the equation of the ellipse as a formula for the variable $x$ depending on the variable $y$, i.e. $x$ as a function of $y$ (in a neighbourhood of the point $[\sqrt{2}, 0]$, the variable $y$ as a function of $x$ isn't determined uniquely, so we cannot use the previous procedure; technically it would end up dividing by zero). Sequentially, we obtain: $2xx' + 4y = 0$, thus $x' = -\frac{2y}{x}$, and $x'' = -2\left(\frac{1}{x} - \frac{yx'}{x^2}\right)$. Hence at the point $[\sqrt{2}, 0]$, we have $x' = 0$ and $x'' = -\sqrt{2}$ and the radius of the circle of osculation is $\rho = \frac{1}{\sqrt{2}} = \frac{\sqrt{2}}{2}$ according to 6.1.12. The normal line is heading to $-\infty$ along the $x$ axis at the point $[\sqrt{2}, 0]$, thus the center of the osculation circle will be on the $x$ axis on the other side at distance $\frac{\sqrt{2}}{2}$, hence at the point $[\sqrt{2} - \frac{\sqrt{2}}{2}, 0] = [\frac{\sqrt{2}}{2}, 0]$. In total, the equation of the circle of osculation at the vertex $[\sqrt{2}, 0]$ will be $\left(x - \frac{\sqrt{2}}{2}\right)^2 + y^2 = \frac{1}{2}$. The curvature at both of these vertices equals $\sqrt{2}$.
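The implicit differentiation used in 6.A.37 translates directly into a short computation. The sketch below assumes the ellipse $x^2 + 2y^2 = 2$ and the signed radius formula $(1 + (y')^2)^{3/2}/y''$ quoted from 6.1.12; the function names are hypothetical helpers introduced for this example.

```python
import math

def curvature_with_y_of_x(x, y):
    """Treat y as a function of x near the point (x, y) of the ellipse (needs y != 0).

    Returns the curvature and the signed radius of the osculating circle.
    """
    d1 = -x / (2 * y)                        # y'  from 2x + 4yy' = 0
    d2 = -0.5 * (1 / y - x * d1 / y ** 2)    # y'' by differentiating once more
    radius = (1 + d1 ** 2) ** 1.5 / d2
    return abs(1 / radius), radius

def curvature_with_x_of_y(x, y):
    """Treat x as a function of y near the point (x, y) of the ellipse (needs x != 0)."""
    d1 = -2 * y / x                          # x'  from 2xx' + 4y = 0
    d2 = -2 * (1 / x - y * d1 / x ** 2)      # x''
    radius = (1 + d1 ** 2) ** 1.5 / d2
    return abs(1 / radius), radius

if __name__ == "__main__":
    print(curvature_with_y_of_x(0.0, 1.0))            # vertex [0, 1]
    print(curvature_with_x_of_y(math.sqrt(2), 0.0))   # vertex [sqrt(2), 0]
```

At a point of the ellipse where both coordinates are nonzero, the two functions describe the same curve locally and therefore return the same curvature.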
6.1.16. The numerical derivatives. In the beginning of this textbook we discussed how to describe the values in a sequence if its immediate differences are known (cf. paragraphs 1.1.5, 1.2.1). Before proceeding the same way with the derivatives we clarify the connections between derivatives and differences. The key to this is the Taylor expansion with remainder.
Suppose that for some (sufficiently) differentiable function $f(x)$ defined on the interval $[a, b]$, the values $f_i = f(x_i)$ at the points $x_0 = a, x_1, x_2, \dots, x_n = b$ are given, while $x_i - x_{i-1} = h$ for some constant $h > 0$ and all indices $i = 1, \dots, n$. Write the Taylor expansion of the function $f$ in the form
$$f(x_i \pm h) = f_i \pm h f'(x_i) + \frac{h^2}{2} f''(x_i) \pm \frac{h^3}{6} f^{(3)}(x_i) + \dots$$
Suppose the expansion is terminated at the term containing $h^k$, which is of order $k$ in $h$. Then the actual error is bounded by
$$\frac{h^{k+1}}{(k+1)!}\,\max\left|f^{(k+1)}(x)\right|,$$
where the maximum is taken over the interval $[x_i - h, x_i + h]$. If the $(k+1)$th derivative of $f$ is continuous, it can be approximated by a constant. Then for small $h$, the error of the approximation by the Taylor polynomial of order $k$ acts like $h^{k+1}$ except for a constant multiple. Such an estimation is called an asymptotic estimation.
Asymptotic estimates
Definition. The expression $G(h)$ is asymptotically equal to $F(h)$ for $h \to 0$, and we write $G(h) = O(F(h))$, if the finite limit
$$\lim_{h\to 0}\frac{G(h)}{F(h)} = a \in \mathbb{R}$$
exists.
Similarly, compare the expressions for h —> oo and use the same notation.
Denote the values of the derivatives of $f(x)$ at the points $x_i$ as $f_i^{(k)}$. Write the Taylor expansion as:
$$f_{i\pm 1} = f_i \pm f_i' h + \frac{f_i''}{2} h^2 \pm \frac{f_i'''}{6} h^3 + \dots$$
Considering combinations of the two expansions and $f_i$ itself, we can express the derivative $f_i'$ as follows
□
6.A.38. Remark. The vertices of an ellipse (more generally the vertices of a closed smooth curve in the plane) can be defined as the points at which the curvature function has an extreme. It is no coincidence that the ellipse has four vertices. The so-called "Four vertices theorem" states that a closed curve of the class $C^3$ has at least four vertices. (A curve of the class $C^3$ is locally given parametrically by points $[f(t), g(t)] \in \mathbb{R}^2$, $t \in (a, b) \subset \mathbb{R}$, where $f$ and $g$ are functions of the class $C^3(\mathbb{R})$.) Thus the curvature of the ellipse at any of its points is between its curvatures at its vertices, i.e. between $\frac{1}{2}$ and $\sqrt{2}$.
B. Integration
We start with an example testing the understanding of the concept of Riemann integration.
6.B.1. Let $y = |x|$ on the interval $I = [-1, 1]$ and let
-1,
,0,
,1
be a partition of the interval $I$ for arbitrary $n \in \mathbb{N}$. Determine $S_{n,\sup}$ and $S_{n,\inf}$ (the upper and lower Riemann sum corresponding to the given partition).
Based on this result decide if the function $y = |x|$ on $[-1, 1]$ is integrable (in the Riemann sense). O
And now some easy examples that everyone should handle.
6.B.2.   Using integration "by heart", express
(a) $\int e^{-x}\,dx$,
(b) $\int \frac{1}{\sqrt{4 - x^2}}\,dx$, $x \in (-2, 2)$;
(c) $\int \frac{1}{x^2 + 3}\,dx$,
(d) f^kdx,x^-l.
Solution. We can easily obtain
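$$\frac{f_{i+1} - f_{i-1}}{2h} = f_i' + \frac{h^2}{3!} f_i^{(3)} + \dots$$
$$\frac{f_{i+1} - f_i}{h} = f_i' + \frac{h}{2!} f_i'' + \dots$$
$$\frac{f_i - f_{i-1}}{h} = f_i' - \frac{h}{2!} f_i'' + \dots$$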
This suggests a basic numerical approximation for derivatives:
Central, forward, and backward differences
The central difference is defined as $f_i' \approx \frac{f_{i+1} - f_{i-1}}{2h}$, the forward difference is $f_i' \approx \frac{f_{i+1} - f_i}{h}$, and the backward difference is $f_i' \approx \frac{f_i - f_{i-1}}{h}$.
If we use the Taylor expansions with remainder of the appropriate order, we obtain an expression of the error of the approximation by the central difference in the form
$$\frac{h^2}{12}\left(f^{(3)}(x_i + \xi h) + f^{(3)}(x_i - \eta h)\right).$$
Here, $0 < \xi, \eta < 1$ are the values from the remainder expression of $f_{i+1}$ and $f_{i-1}$, respectively. The errors in the other two cases, which involve the second derivative, are obtained similarly. Thus, under the assumption of bounded derivatives of third or second order, the asymptotic estimates are computed:
Theorem. The asymptotic estimate of the error of the central difference is $O(h^2)$. The errors of the backward and forward differences are $O(h)$.
Surprisingly, the central difference is one order better than the other two. But of course, the constants in the asymptotic estimates are important, too. In the case of the central difference, the bound on the third derivative appears, while in the two other cases second derivatives show up instead.
We proceed the same way when approximating the second derivative. To compute $f''(x_i)$ from a suitable combination of the Taylor polynomials, we cancel both the first derivative and the value at $x_i$. The simplest combination cancels all the odd derivatives as well:
$$\frac{f_{i+1} - 2f_i + f_{i-1}}{h^2} = f_i^{(2)} + \frac{h^2}{12} f_i^{(4)} + \dots$$
This is called the second order difference. Just as in the central first order difference, the asymptotic estimate of the error is
$$f_i^{(2)} = \frac{f_{i+1} - 2f_i + f_{i-1}}{h^2} + O(h^2).$$
Notice that the actual bound depends on the fourth derivative of $f$.
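The asymptotic orders can be observed experimentally. The following Python sketch (an illustration under the assumption that the test function is smooth, here the sine) implements the three first order differences and the second order difference, and estimates the exponent $p$ in an error of the form $C h^p$ by halving the step. All names are chosen for this example.

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def backward_diff(f, x, h):
    return (f(x) - f(x - h)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

def second_diff(f, x, h):
    """Second order difference approximating f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

def observed_order(approx, exact, f, x, h):
    """Estimate p in error ~ C h^p by comparing the steps h and h/2."""
    e1 = abs(approx(f, x, h) - exact)
    e2 = abs(approx(f, x, h / 2) - exact)
    return math.log2(e1 / e2)

if __name__ == "__main__":
    x, h = 1.0, 1e-2
    print("forward :", observed_order(forward_diff, math.cos(x), math.sin, x, h))
    print("backward:", observed_order(backward_diff, math.cos(x), math.sin, x, h))
    print("central :", observed_order(central_diff, math.cos(x), math.sin, x, h))
    print("second  :", observed_order(second_diff, -math.sin(x), math.sin, x, h))
```

The step should not be taken too small in such experiments, because rounding errors in the subtraction of nearly equal values eventually dominate the truncation error.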
2. Integration
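(a) $\int e^{-x}\,dx = -\int -e^{-x}\,dx = -e^{-x} + C$;
(b) $\int \frac{1}{\sqrt{4 - x^2}}\,dx = \int \frac{1/2}{\sqrt{1 - (x/2)^2}}\,dx = \arcsin\frac{x}{2} + C$;
(c) $\int \frac{1}{x^2 + 3}\,dx = \frac{1}{3}\int \frac{1}{\frac{x^2}{3} + 1}\,dx = \frac{1}{3}\int \frac{1}{1 + (x/\sqrt{3})^2}\,dx = \frac{1}{\sqrt{3}}\operatorname{arctg}\frac{x}{\sqrt{3}} + C$;
(d) $\int \frac{3x^2 + 3}{x^3 + 3x + 2}\,dx = \ln|x^3 + 3x + 2| + C$,
where we used the formula $\int \frac{f'(x)}{f(x)}\,dx = \ln|f(x)| + C$.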
6.B.3.   Compute the indefinite integral
□
dx
f[7x +4ef - ^ + 9sin5a; + 2cos§ -for _ 7^ 3, x    -| + kir, k G Z.
Solution. Only by combining the earlier derived formulas, we obtain
J[7X + 4eT - ^ + 9 sin 5a;+ 2 cos §
cos2 x       3 —x
+ t.
dx
+ 6 e 3 + 2x|n2 - | cos 5x + 4 sin | - 3tg x
\n\3-x\ + C.
□
For expressing the following integrals, we'll use the method of integration by parts (see 6.2.3).
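6.B.4.   Compute $\int x\cos x\,dx$, $x \in \mathbb{R}$, and $\int \ln x\,dx$, $x > 0$.
Solution.
$$\int \ln x\,dx = \left|\begin{matrix} u = \ln x & u' = \frac{1}{x} \\ v' = 1 & v = x \end{matrix}\right| = x\ln x - \int 1\,dx = x\ln x - x + C,$$
$$\int x\cos x\,dx = \left|\begin{matrix} u = x & u' = 1 \\ v' = \cos x & v = \sin x \end{matrix}\right| = x\sin x - \int \sin x\,dx = x\sin x + \cos x + C.$$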
□
6.B.5.   Using integration by parts, compute
(a) $\int (x^2 + 1)\,e^{-x}\,dx$, $x \in \mathbb{R}$,
(b) $\int (2x - 1)\ln x\,dx$, $x > 0$,
(c) $\int \operatorname{arctg} x\,dx$, $x \in \mathbb{R}$,
(d) $\int e^x\sin(\beta x)\,dx$, $x, \beta \in \mathbb{R}$.
6.2.1. Indefinite integral. Now, we reverse the procedure of differentiation. We want to reconstruct the actual values of a function using its immediate changes. If we consider the given function $f(x)$ as the (say continuous) derivative of an unknown function $F(x)$, then at the level of differentials we can write $dF = f(x)\,dx$.
We call the function $F$ the primitive function or the indefinite integral of the function $f$. Traditionally we write
$$F(x) = \int f(x)\,dx.$$
Lemma. The primitive function $F(x)$ to the function $f(x)$ is determined uniquely on each interval $[a, b]$ up to an additive constant.
Proof. The statement follows immediately from Lagrange's mean value theorem, see 5.3.9. Indeed, if $F'(x) = G'(x) = f(x)$ on the whole interval $[a, b]$, then the derivative of the function $(F - G)(x)$ vanishes at all points $c$ of the interval $[a, b]$. The mean value theorem implies that for all points $x$ in this interval,
$$F(x) - G(x) = F(a) - G(a) + 0\cdot(x - a).$$
Thus the difference of the values of the functions $F$ and $G$ is constant on the interval $[a, b]$. □
The previous lemma supports another notation for the indefinite integral:
$$F(x) = \int f(x)\,dx + C$$
with an unknown constant $C$.
6.2.2. Newton integral. We consider the value of a real function $f(x)$ as an immediate increment of the region bounded by the graph of the function $f$ and the $x$ axis and try to find the area of this region between the boundary values $a$ and $b$ of some interval. We relate this idea with the indefinite integral.
Suppose we are given a real function $f$ and its indefinite integral $F(x)$, i.e. $F'(x) = f(x)$ on the interval $[a, b]$.
Divide the interval $[a, b]$ into $n$ parts by choosing the points
$$a = x_0 < x_1 < \dots < x_n = b.$$
Approximate the values of the derivatives at the points $x_i$ by the forward differences. That is, by the expressions
$$f(x_i) = F'(x_i) \approx \frac{F(x_{i+1}) - F(x_i)}{x_{i+1} - x_i}.$$
Finally the sum over all the intervals of our partition yields the approximation of the area:
$$\sum_{i=0}^{n-1} f(x_i)(x_{i+1} - x_i) \approx \sum_{i=0}^{n-1}\frac{F(x_{i+1}) - F(x_i)}{x_{i+1} - x_i}\,(x_{i+1} - x_i) = \sum_{i=0}^{n-1}\big(F(x_{i+1}) - F(x_i)\big) = F(b) - F(a).$$
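Solution. First we emphasise that by integration by parts, we can compute every integral of the form
$$\int P(x)\,a^{bx}\,dx, \quad \int P(x)\sin(bx)\,dx, \quad \int P(x)\cos(bx)\,dx,$$
$$\int P(x)\log_a^n x\,dx, \quad \int P(x)\arcsin(bx)\,dx, \quad \int P(x)\operatorname{arctg}(bx)\,dx,$$
$$\int x^b\log_a^n(kx)\,dx, \quad \int P(x)\arccos(bx)\,dx, \quad \int P(x)\operatorname{arccotg}(bx)\,dx,$$
$$\int a^{bx}\sin(cx)\,dx, \quad \int a^{bx}\cos(cx)\,dx,$$
where $P$ is an arbitrary polynomial and
$$a \in (0, 1)\cup(1, +\infty), \quad b, c \in \mathbb{R}\setminus\{0\}, \quad n \in \mathbb{N}, \quad k > 0.$$
Thus we know that (a)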
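$$\int (x^2 + 1)\,e^{-x}\,dx = \left|\begin{matrix} F(x) = x^2 + 1 & F'(x) = 2x \\ G'(x) = e^{-x} & G(x) = -e^{-x} \end{matrix}\right| = -(x^2 + 1)\,e^{-x} + \int 2x\,e^{-x}\,dx$$
$$= \left|\begin{matrix} F(x) = 2x & F'(x) = 2 \\ G'(x) = e^{-x} & G(x) = -e^{-x} \end{matrix}\right| = -(x^2 + 1)\,e^{-x} - 2x\,e^{-x} + \int 2\,e^{-x}\,dx$$
$$= -(x^2 + 1)\,e^{-x} - 2x\,e^{-x} - 2\,e^{-x} + C = -e^{-x}(x^2 + 2x + 3) + C;$$
(b)
$$\int (2x - 1)\ln x\,dx = \left|\begin{matrix} F(x) = \ln x & F'(x) = 1/x \\ G'(x) = 2x - 1 & G(x) = x^2 - x \end{matrix}\right| = (x^2 - x)\ln x - \int \frac{x^2 - x}{x}\,dx$$
$$= (x^2 - x)\ln x + \int (1 - x)\,dx = (x^2 - x)\ln x + x - \frac{x^2}{2} + C;$$
(c)
$$\int \operatorname{arctg} x\,dx = \left|\begin{matrix} F(x) = \operatorname{arctg} x & F'(x) = \frac{1}{1 + x^2} \\ G'(x) = 1 & G(x) = x \end{matrix}\right| = x\operatorname{arctg} x - \int \frac{x}{1 + x^2}\,dx$$
$$= x\operatorname{arctg} x - \frac{1}{2}\int \frac{2x}{1 + x^2}\,dx = x\operatorname{arctg} x - \frac{1}{2}\ln(1 + x^2) + C;$$
(d)
$$\int e^x\sin(\beta x)\,dx = \left|\begin{matrix} F(x) = e^x & F'(x) = e^x \\ G'(x) = \sin(\beta x) & G(x) = -\frac{1}{\beta}\cos(\beta x) \end{matrix}\right| = -\frac{1}{\beta}e^x\cos(\beta x) + \frac{1}{\beta}\int e^x\cos(\beta x)\,dx$$
$$= \left|\begin{matrix} F(x) = e^x & F'(x) = e^x \\ G'(x) = \cos(\beta x) & G(x) = \frac{1}{\beta}\sin(\beta x) \end{matrix}\right| = -\frac{1}{\beta}e^x\cos(\beta x) + \frac{1}{\beta^2}e^x\sin(\beta x) - \frac{1}{\beta^2}\int e^x\sin(\beta x)\,dx,$$
which implies
$$\int e^x\sin(\beta x)\,dx = \frac{1}{1 + \beta^2}\,e^x\big(\sin(\beta x) - \beta\cos(\beta x)\big) + C.$$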
□
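Results of integration by parts are easy to verify by differentiating them. The sketch below is a numerical check under stated assumptions: it takes the four primitive functions from 6.B.5 (with the sample value $\beta = 2$, chosen here), differentiates them by a central difference, and compares with the integrands. The names `PAIRS`, `derivative` and `max_defect` are introduced only for this example.

```python
import math

BETA = 2.0  # sample value of the parameter beta (an assumption of this sketch)

def derivative(F, x, h=1e-5):
    """Central difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

# (integrand, claimed primitive function) for the parts (a) to (d)
PAIRS = {
    "a": (lambda x: (x * x + 1) * math.exp(-x),
          lambda x: -math.exp(-x) * (x * x + 2 * x + 3)),
    "b": (lambda x: (2 * x - 1) * math.log(x),
          lambda x: (x * x - x) * math.log(x) + x - x * x / 2),
    "c": (lambda x: math.atan(x),
          lambda x: x * math.atan(x) - 0.5 * math.log(1 + x * x)),
    "d": (lambda x: math.exp(x) * math.sin(BETA * x),
          lambda x: math.exp(x) * (math.sin(BETA * x) - BETA * math.cos(BETA * x))
          / (1 + BETA ** 2)),
}

def max_defect(x_values=(0.5, 1.3, 2.0)):
    """Largest value of |F'(x) - f(x)| over all pairs and sample points."""
    return max(abs(derivative(F, x) - f(x))
               for f, F in PAIRS.values() for x in x_values)

if __name__ == "__main__":
    print("largest defect:", max_defect())
```

A small defect at a few sample points is of course not a proof, but it reliably detects sign errors and wrong constants.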
Therefore we expect that for "nice enough" functions $f(x)$, the area of the region bounded by the graph of the function and the $x$ axis (including the signs) can be calculated as a difference of the values of the primitive function at the boundary points of the interval. This procedure is called the Newton integration.
Newton integral
If $F$ is the primitive function to the function $f$ on the interval $[a, b]$, then we write
$$\int_a^b f(x)\,dx = [F(x)]_a^b = F(b) - F(a)$$
and call it the Newton (definite) integral with the bounds $a$ and $b$.
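The heuristic behind the Newton integral can be tried out numerically. In this sketch the sum $\sum f(x_i)(x_{i+1} - x_i)$ over an equidistant partition is compared with $F(b) - F(a)$ for the sample pair $f = \cos$, $F = \sin$; the choice of the function, of the interval, and the helper names are assumptions for illustration.

```python
import math

def left_sum(f, a, b, n):
    """Sum of f(x_i) (x_{i+1} - x_i) over an equidistant partition with n parts."""
    h = (b - a) / n
    return sum(f(a + i * h) * h for i in range(n))

def newton_integral(F, a, b):
    """F(b) - F(a) for a primitive function F."""
    return F(b) - F(a)

if __name__ == "__main__":
    a, b = 0.0, 1.5
    exact = newton_integral(math.sin, a, b)
    for n in (10, 100, 1000, 10000):
        approx = left_sum(math.cos, a, b, n)
        print(n, approx, abs(approx - exact))
```

The printed errors shrink roughly in proportion to $1/n$, in line with the $O(h)$ error of the forward difference.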
We prove later that for all continuous functions $f \in C^0(a, b)$ the Newton integral exists and computes the area as expected. This is one of the fascinating theorems in elementary calculus. Before going into this, we discuss how to compute these integrals.
The primitive functions are well defined for complex functions $f$, where the real and the imaginary part of the indefinite integral are real primitive functions to the real and the imaginary parts of $f$. Thus, with no loss of generality, we work only with real functions in the sequel.
6.2.3. Integration "by heart". We show several procedures for computing the Newton integral. We exploit the knowledge of differentiation, and look for primitive functions. The easiest case is the one where the given function is known as a derivative. To learn such cases, it suffices to read the tables for function derivatives in the menagerie the other way round. Hence:
Isaac Newton (1642-1726) was a phenomenal English physicist and mathematician. The principles of integration and differentiation were formulated independently by him and Gottfried Leibniz in the late 17th century. It took nearly another two centuries before Bernhard Riemann introduced the completely rigorous modern version of the integration process.
For expressing the following integrals, it's convenient to use the substitution method (see 6.2.5).
6.B.6.   Using a suitable substitution, determine
Integration table
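(a) $\int \sqrt{2x - 5}\,dx$, $x > \frac{5}{2}$;
(b) $\int \frac{(7 + \ln x)^7}{x}\,dx$, $x > 0$;
(c) $\int \frac{\cos x}{(1 + \sin x)^2}\,dx$, $x \neq -\frac{\pi}{2} + 2k\pi$, $k \in \mathbb{Z}$;
(d) $\int \frac{\cos x}{\sqrt{1 + \sin^2 x}}\,dx$, $x \in \mathbb{R}$.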
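Solution. We have
(a)
$$\int \sqrt{2x - 5}\,dx = \left|\begin{matrix} t = 2x - 5 \\ dt = 2\,dx \end{matrix}\right| = \frac{1}{2}\int \sqrt{t}\,dt = \frac{1}{2}\cdot\frac{t^{3/2}}{3/2} + C = \frac{1}{3}\sqrt{(2x - 5)^3} + C;$$
(b)
$$\int \frac{(7 + \ln x)^7}{x}\,dx = \left|\begin{matrix} t = 7 + \ln x \\ dt = \frac{1}{x}\,dx \end{matrix}\right| = \int t^7\,dt = \frac{t^8}{8} + C = \frac{(7 + \ln x)^8}{8} + C;$$
(c)
$$\int \frac{\cos x}{(1 + \sin x)^2}\,dx = \left|\begin{matrix} t = 1 + \sin x \\ dt = \cos x\,dx \end{matrix}\right| = \int \frac{dt}{t^2} = -\frac{1}{t} + C = -\frac{1}{1 + \sin x} + C;$$
(d)
$$\int \frac{\cos x}{\sqrt{1 + \sin^2 x}}\,dx = \left|\begin{matrix} t = \sin x \\ dt = \cos x\,dx \end{matrix}\right| = \int \frac{dt}{\sqrt{1 + t^2}} = \left|\begin{matrix} u = t + \sqrt{1 + t^2} > 0 \\ du = \left(1 + \frac{t}{\sqrt{1 + t^2}}\right)dt = \frac{t + \sqrt{1 + t^2}}{\sqrt{1 + t^2}}\,dt \end{matrix}\right|$$
$$= \int \frac{1}{u}\,du = \ln u + C = \ln\left(t + \sqrt{1 + t^2}\right) + C = \ln\left(\sin x + \sqrt{1 + \sin^2 x}\right) + C.$$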
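For arbitrary nonzero $a, b \in \mathbb{R}$ and $n \in \mathbb{Z}$, $n \neq -1$:
$$\int a\,dx = ax + C$$
$$\int ax^n\,dx = \frac{a}{n+1}\,x^{n+1} + C$$
$$\int a\,e^{bx}\,dx = \frac{a}{b}\,e^{bx} + C$$
$$\int \frac{a}{x}\,dx = a\ln x + C$$
$$\int a\cos(bx)\,dx = \frac{a}{b}\sin(bx) + C$$
$$\int a\sin(bx)\,dx = -\frac{a}{b}\cos(bx) + C$$
$$\int a\cos(bx)\sin^n(bx)\,dx = \frac{a}{b(n+1)}\sin^{n+1}(bx) + C$$
$$\int a\sin(bx)\cos^n(bx)\,dx = -\frac{a}{b(n+1)}\cos^{n+1}(bx) + C$$
$$\int a\operatorname{tg}(bx)\,dx = -\frac{a}{b}\ln(\cos(bx)) + C$$
$$\int \frac{a}{a^2 + x^2}\,dx = \operatorname{arctg}\left(\frac{x}{a}\right) + C$$
$$\int \frac{-1}{\sqrt{a^2 - x^2}}\,dx = \arccos\left(\frac{x}{a}\right) + C$$
$$\int \frac{1}{\sqrt{a^2 - x^2}}\,dx = \arcsin\left(\frac{x}{a}\right) + C.$$
In all the above formulae, it is necessary to clarify the domain on which the indefinite integral is well defined. We leave this to the reader.
Further rules can be added by observations of suitable structure of the given functions. For example,
$$\int \frac{f'(x)}{f(x)}\,dx = \ln|f(x)| + C$$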
□
for all continuously differentiable functions / on intervals where they are nonzero.
Of course, the rules for differentiating a sum of differentiable functions and constant multiples of differentiable functions yield analogous rules for the indefinite integral. So the sum of two indefinite integrals is the indefinite integral of the sum of the integrated functions, up to the freedom in the chosen constant, etc.
6.B.7. Determine the integrals
a) $\int \frac{dx}{\sin^2(x) - \cos^2(x)}$,
b) $\int x^2\sqrt{2x + 1}\,dx$.
Solution. For computing the first integral, we'll choose the substitution $t = \operatorname{tg} x$, which can often be used with an advantage. Then
$$dt = \frac{dx}{\cos^2(x)} = (1 + \operatorname{tg}^2(x))\,dx = (1 + t^2)\,dx, \qquad \sin^2(x) = \frac{\operatorname{tg}^2(x)}{1 + \operatorname{tg}^2(x)} = \frac{t^2}{1 + t^2}, \qquad \cos^2(x) = \frac{1}{1 + \operatorname{tg}^2(x)} = \frac{1}{1 + t^2},$$
so that
$$\int \frac{dx}{\sin^2(x) - \cos^2(x)} = \int \frac{1}{\frac{t^2}{1 + t^2} - \frac{1}{1 + t^2}}\cdot\frac{1}{1 + t^2}\,dt = \int \frac{dt}{t^2 - 1} = \frac{1}{2}\int \left(\frac{1}{t - 1} - \frac{1}{t + 1}\right)dt = \frac{1}{2}\ln\left|\frac{\operatorname{tg} x - 1}{\operatorname{tg} x + 1}\right| + C.$$
Now we'll compute the second integral:
$$\int x^2\sqrt{2x + 1}\,dx = \left|\begin{matrix} u = x^2 & u' = 2x \\ v' = \sqrt{2x + 1} & v = \frac{1}{3}(2x + 1)^{3/2} \end{matrix}\right| = \frac{1}{3}x^2(2x + 1)^{3/2} - \frac{2}{3}\int x(2x + 1)^{3/2}\,dx$$
$$= \frac{1}{3}x^2(2x + 1)^{3/2} - \frac{4}{3}\int x^2\sqrt{2x + 1}\,dx - \frac{2}{3}\int x\sqrt{2x + 1}\,dx,$$
which can be thought of as an equation in which the unknown is the integral. By putting it on one side, we obtain
$$\int x^2\sqrt{2x + 1}\,dx = \frac{1}{7}x^2(2x + 1)^{3/2} - \frac{2}{7}\int x\sqrt{2x + 1}\,dx = \left|\begin{matrix} u = x & u' = 1 \\ v' = \sqrt{2x + 1} & v = \frac{1}{3}(2x + 1)^{3/2} \end{matrix}\right|$$
$$= \frac{1}{7}x^2(2x + 1)^{3/2} - \frac{2}{7}\left(\frac{1}{3}x(2x + 1)^{3/2} - \frac{1}{3}\int (2x + 1)^{3/2}\,dx\right)$$
$$= \frac{1}{7}x^2(2x + 1)^{3/2} - \frac{2}{21}x(2x + 1)^{3/2} + \frac{2}{105}(2x + 1)^{5/2} + C.$$
□
6.2.4. Integration by parts. The Leibniz rule for derivatives
$$(F\cdot G)'(x) = F'(x)G(x) + F(x)G'(x)$$
can be interpreted in the realm of the primitive functions. This observation leads to the following very useful practical procedure. It also has theoretical consequences.
Integration by parts
The formula for computing the integral on the left hand side
$$\int F(x)G'(x)\,dx = F(x)G(x) - \int F'(x)G(x)\,dx$$
is called integration by parts.
The above formula is useful if we can compute $G$ and at the same time compute the integral on the right hand side. The principle is best shown on an example. Compute
$$I = \int x\sin x\,dx.$$
In this case the choice $F(x) = x$, $G'(x) = \sin x$ will help. Then $G(x) = -\cos x$ and therefore
$$I = x(-\cos x) - \int -\cos x\,dx = -x\cos x + \sin x + C.$$
Some integrals can be dealt with by inserting the factor 1, so that $G'(x) = 1$:
$$\int \ln x\,dx = \int 1\cdot\ln x\,dx = x\ln x - \int \frac{1}{x}\,x\,dx = x\ln x - x + C.$$
6.2.5. Integration by substitution. Another useful procedure is derived from the chain rule for differentiating composite functions. If
$$F'(y) = f(y), \qquad y = \varphi(x),$$
where $\varphi$ is a differentiable function with nonzero derivative, then
$$\frac{dF(\varphi(x))}{dx} = F'(y)\cdot\varphi'(x)$$
and thus $F(y) + C = \int f(y)\,dy$ can be computed as
$$F(\varphi(x)) + C = \int f(\varphi(x))\,\varphi'(x)\,dx.$$
By substituting $x = \varphi^{-1}(y)$, we obtain the originally desired primitive function. This is often written as follows:
Integration by substitution
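6.B.8.   Using the basic formulas, compute
(a) $\int \frac{1}{\sqrt[3]{x}}\,dx$, $x \neq 0$;
(b) $\int \operatorname{tg}^2 x\,dx$, $x \neq \frac{\pi}{2} + k\pi$, $k \in \mathbb{Z}$;
(d) $\int 6\sin 5x + \cos\frac{x}{2} + 2e^{\frac{2x}{3}}\,dx$, $x \in \mathbb{R}$.
Solution. Case (a). We can immediately determine
$$\int \frac{1}{\sqrt[3]{x}}\,dx = \int x^{-1/3}\,dx = \frac{x^{2/3}}{2/3} + C = \frac{3}{2}\sqrt[3]{x^2} + C,$$
where the notation in which we add $C \in \mathbb{R}$ has to be understood in a way that we can get all primitive functions exactly by a constant translation of an arbitrary primitive function.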
If $\varphi(x)$ is differentiable with a nowhere vanishing derivative, then
$$F(y) = \int f(y)\,dy = \int f(\varphi(x))\,\varphi'(x)\,dx = F(\varphi(x)).$$
We talk about substituting the variable $y$ by $y = \varphi(x)$.
On the level of differentials, the substitution can be easily understood in a way that (linearized) increments of the variables $y$ and $x$ are in a mutual relation formally described by
$$dy = \varphi'(x)\,dx,$$
which corresponds to the relation between the integrated quantities
$$f(y)\,dy = f(\varphi(x))\,\varphi'(x)\,dx.$$
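The substitution rule can be illustrated numerically for definite integrals. The sketch below uses a plain midpoint rule (a technique chosen here for simplicity, not taken from the text) and the sample substitution $y = \sin x$ with $f(y) = 1/\sqrt{1 - y^2}$; all names and the chosen interval are assumptions of this example.

```python
import math

def midpoint_rule(g, a, b, n=20000):
    """Approximate the integral of g over [a, b] by the composite midpoint rule."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

def f(y):
    """Integrand in the variable y (sample choice)."""
    return 1.0 / math.sqrt(1.0 - y * y)

def phi(x):
    """Substitution y = phi(x) = sin x."""
    return math.sin(x)

def dphi(x):
    """phi'(x) = cos x, nonzero on (-pi/2, pi/2)."""
    return math.cos(x)

def both_sides(a, b):
    """Integral of f(y) dy from phi(a) to phi(b), and of f(phi(x)) phi'(x) dx from a to b."""
    in_y = midpoint_rule(f, phi(a), phi(b))
    in_x = midpoint_rule(lambda x: f(phi(x)) * dphi(x), a, b)
    return in_y, in_x

if __name__ == "__main__":
    print(both_sides(0.1, 1.0))   # both values should be close to 0.9
```

Note how the bounds are recalculated: the integral in $y$ runs from $\varphi(a)$ to $\varphi(b)$, while the integral in $x$ runs from $a$ to $b$.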
But that's only true on an interval. In other words, the value $C$ is generally distinct for $x < 0$ and for $x > 0$. Thus we should consider the values $C_1$ and $C_2$. For the sake of simplicity though, we'll use the notation without indices and without stating the corresponding intervals. Furthermore, we'll help ourselves by letting $aC = C$ for $a \in \mathbb{R}\setminus\{0\}$ and $C + b = C$ for $b \in \mathbb{R}$, based on the fact that
$$\{C;\ C \in \mathbb{R}\} = \{aC;\ C \in \mathbb{R}\} = \{C + b;\ C \in \mathbb{R}\} = \mathbb{R}.$$
We could then obtain an entirely correct expression for example by the substitutions $\tilde{C} = aC$, $\tilde{C} = C + b$. These simplifications will prove their usefulness when computing more complicated problems, because they make the procedures and the simplifications more lucid.
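Case (b). Sequential simplifications of the integrated function lead to
$$\int \operatorname{tg}^2 x\,dx = \int \frac{\sin^2 x}{\cos^2 x}\,dx = \int \frac{1 - \cos^2 x}{\cos^2 x}\,dx = \int \frac{1}{\cos^2 x}\,dx - \int 1\,dx = \operatorname{tg} x - x + C,$$
where we helped ourselves by the knowledge of the derivative $(\operatorname{tg} x)' = \frac{1}{\cos^2 x}$, $x \neq \frac{\pi}{2} + k\pi$, $k \in \mathbb{Z}$.
Case (c). It suffices to realize that this is a special case of the formula
$$\int \frac{f'(x)}{f(x)}\,dx = \ln|f(x)| + C,$$
which can be verified directly by differentiation:
$$(\ln|f(x)| + C)' = (\ln[\pm f(x)])' + (C)' = \frac{\pm f'(x)}{\pm f(x)} = \frac{f'(x)}{f(x)}.$$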
Hence
Case (d). Because the integral of a sum is the sum of the integrals (if the separate integrals make sense) and a nonzero constant can be factored out of the integral at any time, we have
$$\int 6\sin 5x + \cos\frac{x}{2} + 2\,e^{\frac{2x}{3}}\,dx = -\frac{6}{5}\cos 5x + 2\sin\frac{x}{2} + 3\,e^{\frac{2x}{3}} + C.$$
6.B.9. Determine
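(a) $\int \frac{x}{\cos^2 x}\,dx$, $x \neq \frac{\pi}{2} + k\pi$, $k \in \mathbb{Z}$;
(b) $\int x^2 e^{-3x}\,dx$, $x \in \mathbb{R}$;
(c) $\int \cos^2 x\,dx$, $x \in \mathbb{R}$.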
Solution. Case (a). Using integration by parts, we obtain
As an illustration, we verify the last but one integral in the list in 6.2.3 using this method. To compute
$$I = \int \frac{1}{\sqrt{1 - x^2}}\,dx,$$
choose the substitution $x = \sin t$. Then $dx = \cos t\,dt$. So
$$I = \int \frac{1}{\sqrt{1 - \sin^2 t}}\cos t\,dt = \int \frac{1}{\sqrt{\cos^2 t}}\cos t\,dt = \int dt = t + C.$$
By substituting $t = \arcsin x$ into the result, $I = \arcsin x + C$.
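The substitution above can be checked numerically. The following is a minimal sketch, not part of the original text: it compares both sides of the substitution $x = \sin t$ on the interval $[0, 1/2]$, which is chosen here only for illustration, using a simple midpoint rule (the helper name `midpoint_integral` is hypothetical).

```python
import math

def midpoint_integral(f, a, b, n=20000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Left side: integrate 1/sqrt(1 - x^2) in the variable x over [0, 1/2].
lhs = midpoint_integral(lambda x: 1.0 / math.sqrt(1.0 - x * x), 0.0, 0.5)

# Right side: after x = sin t the integrand becomes 1 and the bounds
# become t = arcsin(0) and t = arcsin(1/2).
rhs = midpoint_integral(lambda t: 1.0, math.asin(0.0), math.asin(0.5))

# Both values should agree with arcsin(1/2) - arcsin(0).
print(lhs, rhs, math.asin(0.5))
```

Note that the bounds of integration were recalculated together with the variable, as is required for definite integrals.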
While substituting, the actual existence of the inverse function to $y = \varphi(x)$ is required. To evaluate a definite Newton integral, the bounds of integration must be recalculated correctly. Problems with the domains of the inverse functions can sometimes be avoided by dividing the integration into several intervals. We return to this point later.
6.2.6. Integration by reduction to recurrences. Often the use of substitutions and integration by parts leads to recurrence relations, from which the desired integrals can be evaluated. We illustrate by an example. Integrating by parts, to evaluate
$$I_m = \int \cos^m x\,dx = \int \cos^{m-1} x \cos x\,dx$$
$$= \cos^{m-1} x \sin x - (m-1)\int \cos^{m-2} x\,(-\sin x)\sin x\,dx$$
$$= \cos^{m-1} x \sin x + (m-1)\int \cos^{m-2} x \sin^2 x\,dx.$$
Using the formula $\sin^2 x = 1 - \cos^2 x$,
$$m I_m = \cos^{m-1} x \sin x + (m-1) I_{m-2}.$$
The initial values are
$$I_0 = x, \qquad I_1 = \sin x.$$
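Integrals in which the integrated function depends on expressions of the form $(x^2 + 1)$ can be reduced to these types of integrals using the substitution $x = \operatorname{tg} t$. For example, to compute
$$J_k = \int \frac{dx}{(x^2 + 1)^k},$$
the latter substitution yields (notice that $dx = \cos^{-2} t\,dt$)
$$J_k = \int \frac{dt}{\cos^2 t\left(\frac{\sin^2 t}{\cos^2 t} + 1\right)^k} = \int \cos^{2k-2} t\,dt.$$
For $k = 2$, the result is
$$J_2 = \frac12(\cos t \sin t + t) = \frac12\left(\frac{\operatorname{tg} t}{1 + \operatorname{tg}^2 t} + t\right).$$
After the reverse substitution $t = \operatorname{arctg} x$,
$$J_2 = \frac12\left(\frac{x}{1 + x^2} + \operatorname{arctg} x\right) + C.$$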
When evaluating definite integrals, we can compute the whole recurrence after evaluating with the given bounds. For
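$$\int \frac{x}{\cos^2 x}\,dx = x \operatorname{tg} x - \int \operatorname{tg} x\,dx = x \operatorname{tg} x + \int \frac{-\sin x}{\cos x}\,dx = x \operatorname{tg} x + \ln|\cos x| + C,$$
where we chose $F(x) = x$, $G'(x) = \frac{1}{\cos^2 x}$, so that $F'(x) = 1$, $G(x) = \operatorname{tg} x$.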
Case (b). This time we are clearly integrating a product of two functions. By applying the method of integration by parts, we reduce the integral to another integral in such a way that we differentiate one function and integrate the other. We can integrate both of them (and we can differentiate all elementary functions). Thus we must decide which of the two variants of the method to use (whether we'll integrate the function $y = x^2$, or $y = e^{-3x}$). Notice that we can use integration by parts repeatedly and that the $n$-th derivative of a polynomial of degree $n \in \mathbb{N}$ is a constant polynomial. That gives us a way to compute
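$$\int x^2 e^{-3x}\,dx = -\frac13 x^2 e^{-3x} + \frac23\int x\,e^{-3x}\,dx,$$
where $F(x) = x^2$, $G'(x) = e^{-3x}$, $F'(x) = 2x$, $G(x) = -\frac13 e^{-3x}$, and furthermore
$$\int x\,e^{-3x}\,dx = -\frac13 x\,e^{-3x} + \frac13\int e^{-3x}\,dx = -\frac13 x\,e^{-3x} - \frac19 e^{-3x} + C,$$
where $F(x) = x$, $G'(x) = e^{-3x}$, $F'(x) = 1$, $G(x) = -\frac13 e^{-3x}$.
In total, we have
$$\int x^2 e^{-3x}\,dx = -\frac13 x^2 e^{-3x} - \frac29 x\,e^{-3x} - \frac{2}{27} e^{-3x} + C = -\frac{e^{-3x}}{3}\left(x^2 + \frac23 x + \frac29\right) + C.$$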
Note that a repeated use of integration by parts within the computation of one integral is common (just like when computing limits by l'Hospital's rule).
Case (c). Again we apply integration by parts using
$F(x) = \cos x$, $G'(x) = \cos x$, $F'(x) = -\sin x$, $G(x) = \sin x$:
$$\int \cos^2 x\,dx = \int \cos x\cdot\cos x\,dx = \cos x\cdot\sin x + \int \sin^2 x\,dx = \cos x\cdot\sin x + \int 1 - \cos^2 x\,dx$$
$$= \cos x\cdot\sin x + \int 1\,dx - \int\cos^2 x\,dx = \cos x\cdot\sin x + x - \int\cos^2 x\,dx.$$
Although the return to the given integral might make the reader cast some doubts on it, the equality
$$\int\cos^2 x\,dx = \cos x\cdot\sin x + x - \int\cos^2 x\,dx$$
implies
$$2\int\cos^2 x\,dx = \cos x\cdot\sin x + x + C,$$
i.e.
$$\int\cos^2 x\,dx = \frac12\,(x + \sin x\cdot\cos x) + C. \qquad (1)$$
It suffices to remember that we put C/2 = C and that the indefinite integral (as an infinite set) can be represented by one specific function and its translations.
example, while integrating over the interval $[0, 2\pi]$, the integrals have these values:
$$I_0 = \int_0^{2\pi} dx = [x]_0^{2\pi} = 2\pi,$$
$$I_1 = \int_0^{2\pi} \cos x\,dx = [\sin x]_0^{2\pi} = 0,$$
$$I_m = \int_0^{2\pi} \cos^m x\,dx = \begin{cases} 0 & \text{for odd } m, \\ \frac{m-1}{m}\,I_{m-2} & \text{for even } m. \end{cases}$$
Thus for even $m = 2n$, the result is
$$\int_0^{2\pi} \cos^{2n} x\,dx = \frac{(2n-1)(2n-3)\cdots 3\cdot 1}{2n(2n-2)\cdots 2}\,2\pi.$$
For odd $m$ it is zero (as could be guessed from the graph of the function $\cos x$).
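The recurrence for the definite integrals over $[0, 2\pi]$ can be sketched in a few lines. This is only an illustration under the assumption that the boundary term $\cos^{m-1} x \sin x$ vanishes at both bounds, so that $I_m = \frac{m-1}{m} I_{m-2}$; the function names are hypothetical, and a midpoint rule serves as an independent numerical check.

```python
import math

def cos_power_integral(m):
    """Integral of cos(x)**m over [0, 2*pi] from the recurrence
    I_m = (m - 1)/m * I_{m-2}, with I_0 = 2*pi and I_1 = 0."""
    value = 2 * math.pi if m % 2 == 0 else 0.0
    # Climb from I_0 (even m) or I_1 (odd m) up to I_m in steps of two.
    for k in range(2 + m % 2, m + 1, 2):
        value *= (k - 1) / k
    return value

def midpoint_rule(f, a, b, n=2000):
    """Midpoint-rule approximation, used only as a numerical cross-check."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

for m in range(7):
    numeric = midpoint_rule(lambda x: math.cos(x) ** m, 0.0, 2 * math.pi)
    print(m, cos_power_integral(m), numeric)
```

For even $m$ the loop multiplies $2\pi$ by exactly the factors $\frac{1}{2}, \frac{3}{4}, \dots, \frac{m-1}{m}$ of the closed formula.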
6.2.7. Integration of rational functions. The next goal is the integration of quotients of two polynomials $f(x)/g(x)$. There are several simplifications to start with.
If the degree of the polynomial $f$ in the numerator is greater than or equal to the degree of the polynomial $g$ in the denominator, carry out the division with remainder (see paragraph 5.1.2). This reduces the integration to a sum of two integrals. The division provides
$$f = q\cdot g + h, \qquad \frac{f}{g} = q + \frac{h}{g}.$$
Thus $\int f(x)/g(x)\,dx = \int q(x)\,dx + \int h(x)/g(x)\,dx$, where the first integral is easy and the second one is again an expression of the type $h(x)/g(x)$, but with the degree of $g(x)$ strictly larger than the degree of $h(x)$ (such functions are called proper rational functions).
Thus we can assume that the degree of $g$ is strictly larger than the degree of $f$. We introduce the procedure to integrate proper rational functions by a simple example.
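The division with remainder $f = q\cdot g + h$ can be sketched as follows. This is an illustrative implementation only; the representation of polynomials as coefficient lists (highest power first), the function name `poly_divmod`, and the sample polynomials are assumptions made for this example.

```python
from fractions import Fraction

def poly_divmod(f, g):
    """Divide polynomials given as coefficient lists, highest power first.

    Returns (q, h) with f = q*g + h and deg h < deg g.
    The leading coefficient g[0] must be nonzero.
    """
    f = [Fraction(c) for c in f]
    g = [Fraction(c) for c in g]
    q = []
    while len(f) >= len(g):
        c = f[0] / g[0]            # next coefficient of the quotient
        q.append(c)
        for i in range(len(g)):    # subtract c * g aligned at the leading term
            f[i] -= c * g[i]
        f.pop(0)                   # the leading coefficient is now zero
    return q, f

# (x^3 + 2x^2 - 1) / (x^2 + 3x + 2) = x - 1 + (x + 1)/(x^2 + 3x + 2)
q, h = poly_divmod([1, 2, 0, -1], [1, 3, 2])
print(q, h)
```

Exact rational arithmetic is used so that the remainder is not polluted by rounding errors.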
Observe that we can integrate $(a + x)^{-n}$, $n > 1$, and
$$\int \frac{1}{a + x}\,dx = \ln|a + x| + C.$$
Summing such simple fractions yields more complicated ones:
$$\frac{-2}{x+1} + \frac{6}{x+2} = \frac{4x+2}{x^2+3x+2},$$
which can be integrated directly:
$$\int \frac{4x+2}{x^2+3x+2}\,dx = -2\ln|x+1| + 6\ln|x+2| + C.$$
This suggests looking for a procedure to express proper rational functions as a sum of simple ones. In the example, it is straightforward to compute the unknown coefficients $A$ and $B$, once the roots of the denominator are known:
$$\frac{4x+2}{x^2+3x+2} = \frac{4x+2}{(x+1)(x+2)} = \frac{A}{x+1} + \frac{B}{x+2}.$$
We emphasise that usually suitable simplifications or substitutions lead to the result faster than integration by parts. For example, by using the identity
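$$\cos^2 x = \frac12\,(1 + \cos 2x), \quad x \in \mathbb{R},$$
we easily obtain
$$\int\cos^2 x\,dx = \int\frac12\,dx + \int\frac12\cos 2x\,dx = \frac{x}{2} + \frac{\sin 2x}{4} + C = \frac{x}{2} + \frac{2\sin x\cos x}{4} + C = \frac12\,(x + \sin x\cdot\cos x) + C.$$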
□
6.B.10. Integrate
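(a) $\int \cos^5 x\cdot\sin x\,dx$,
(b) $\int \cos^5 x\cdot\sin^2 x\,dx$,
(c) $\int \operatorname{tg}^4 x\,dx$, $x \in \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$;
(d) $\int \frac{1 - \sqrt[3]{x} + \sqrt{x}}{x + \sqrt[6]{x^5}}\,dx$, $x > 0$.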
Solution. Case (a). This is a simple problem for the so-called first substitution method, whose essence is writing the integral in the form
$$\int f(\varphi(x))\,\varphi'(x)\,dx \qquad (1)$$
for certain functions $f$ and $\varphi$. Using the substitution $y = \varphi(x)$ (we also substitute $dy = \varphi'(x)\,dx$, which we get by differentiating $y = \varphi(x)$), such an integral can be reduced to the integral $\int f(y)\,dy$. By substituting $y = \cos x$, where $dy = -\sin x\,dx$, we then obtain
$$\int\cos^5 x\cdot\sin x\,dx = -\int\cos^5 x\,(-\sin x)\,dx = -\int y^5\,dy = -\frac{y^6}{6} + C = -\frac{\cos^6 x}{6} + C.$$
Case (b). Using the equality
$$\cos^5 x\cdot\sin^2 x\,dx = (\cos^2 x)^2\sin^2 x\cdot\cos x\,dx = (1 - \sin^2 x)^2\sin^2 x\cdot\cos x\,dx,$$
we're tempted to use the substitution $t = \sin x$, $dt = \cos x\,dx$, which yields
$$\int\cos^5 x\cdot\sin^2 x\,dx = \int (1 - t^2)^2\,t^2\,dt = \int t^6 - 2t^4 + t^2\,dt = \frac{t^7}{7} - \frac{2t^5}{5} + \frac{t^3}{3} + C = \frac{\sin^7 x}{7} - \frac{2\sin^5 x}{5} + \frac{\sin^3 x}{3} + C.$$
Case (c). Because both sine and cosine are contained in an even power, we cannot proceed as in the previous problem. Let's try to use the so-called second substitution method, which means a reduction of $\int f(y)\,dy$ to the form (1) for $y = \varphi(x)$. A situation in which we replace a simple expression by a more complicated one might seem surprising. But don't forget that this more complicated integral might have
Multiply both sides by the polynomial $x^2 + 3x + 2$ from the denominator and compare the coefficients of the individual powers of $x$ in the resulting polynomials:
$$4x + 2 = A(x+2) + B(x+1) \implies 2A + B = 2,\quad A + B = 4,$$
so $A = -2$ and $B = 6$.
This procedure is called decomposition into partial fractions.
This is a purely algebraic procedure based on properties of polynomials.
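For two distinct real roots, the comparison of coefficients reduces to a small linear system. The following sketch solves it in closed form; the function name and the parametrization of the numerator as $c_1 x + c_0$ are assumptions made for this illustration.

```python
from fractions import Fraction

def two_term_partial_fractions(c1, c0, r1, r2):
    """Return (A, B) such that
    (c1*x + c0) / ((x - r1)*(x - r2)) = A/(x - r1) + B/(x - r2), r1 != r2."""
    r1, r2 = Fraction(r1), Fraction(r2)
    # Multiplying by the denominator gives c1*x + c0 = A*(x - r2) + B*(x - r1).
    # Setting x = r1 isolates A, setting x = r2 isolates B.
    A = (c1 * r1 + c0) / (r1 - r2)
    B = (c1 * r2 + c0) / (r2 - r1)
    return A, B

# The example from the text: (4x + 2) / ((x + 1)(x + 2)), roots -1 and -2.
A, B = two_term_partial_fractions(4, 2, -1, -2)
print(A, B)
```

Evaluating at the roots is equivalent to comparing coefficients here, since both approaches determine the same unique pair $(A, B)$.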
Without loss of generality, suppose that the denominator $g(x)$ and the numerator $f(x)$ do not share any real or complex roots and that $g(x)$ has exactly $n$ distinct real roots $a_1, \dots, a_n$. Then the points $a_1, \dots, a_n$ are all the discontinuities of the function $f(x)/g(x)$.
Split the expression $\frac{f(x)}{g(x)}$ according to the factors of the denominator. Thus, assume $g(x)$ is the product
$$g(x) = p(x)\,q(x)$$
of two coprime polynomials. By the Bezout identity (see 12.2.9 on page 822), which is a corollary of polynomial division with remainder, there exist polynomials $a(x)$ and $b(x)$ of degrees strictly less than the degree of $g$ such that
$$a(x)\,p(x) + b(x)\,q(x) = 1.$$
Multiplying this equality by the quotient $f(x)/g(x)$ gives
$$\frac{f(x)}{g(x)} = \frac{a(x)\,f(x)}{q(x)} + \frac{b(x)\,f(x)}{p(x)}.$$
Thus, we may restrict our attention to cases where the denominator $g(x)$ cannot be decomposed further into two coprime polynomials.
Suppose that the polynomial $g(x)$ has only real roots. Then there is a unique decomposition into factors $(x - a_i)^{n_i}$, where $n_i$ are the multiplicities of the roots $a_i$, $i = 1, \dots, k$. By a sequential use of the latter procedure with coprime polynomials $p(x)$ and $q(x)$, we obtain a representation of $f(x)/g(x)$ as a sum of fractions of the form
$$\frac{f(x)}{g(x)} = \frac{r_1(x)}{(x - a_1)^{n_1}} + \dots + \frac{r_k(x)}{(x - a_k)^{n_k}},$$
where the degrees of the polynomials $r_i(x)$ are strictly smaller than the degrees of the denominators. Finally, each of them can be represented as a sum
$$\frac{r(x)}{(x - a)^n} = \frac{A_1}{x - a} + \frac{A_2}{(x - a)^2} + \dots + \frac{A_n}{(x - a)^n}.$$
Indeed, we multiply the equation by $(x - a)^n$, expand all the products, and compare the coefficients starting from the highest powers of the polynomial $r(x)$; this gives $A_1, A_2, \dots$ sequentially. This can be done faster by suitable additions and subtractions, starting with the highest orders. For example,
$$\frac{5x - 16}{(x-2)^2} = 5\,\frac{x - 2}{(x-2)^2} - 6\,\frac{1}{(x-2)^2} = \frac{5}{x-2} - \frac{6}{(x-2)^2}.$$
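The coefficients $A_1, \dots, A_n$ for a single repeated real root can also be obtained by expanding $r(x)$ in powers of $(x - a)$. The sketch below does this by repeated synthetic division (Horner's scheme); this is a swapped-in technique rather than the coefficient comparison described above, and the function names are hypothetical.

```python
from fractions import Fraction

def expand_about(r, a):
    """Coefficients b_0, b_1, ... with r(x) = sum of b_j * (x - a)**j.
    The polynomial r is a coefficient list, highest power first."""
    coeffs = [Fraction(c) for c in r]
    result = []
    while coeffs:
        # Synthetic division by (x - a): the last entry becomes the remainder.
        for i in range(1, len(coeffs)):
            coeffs[i] += a * coeffs[i - 1]
        result.append(coeffs.pop())
    return result

def repeated_root_coefficients(r, a, n):
    """Return [A_1, ..., A_n] with r(x)/(x - a)**n = sum of A_k/(x - a)**k.
    Assumes deg r < n."""
    b = expand_about(r, a)
    b += [Fraction(0)] * (n - len(b))
    # The term b_j * (x - a)**j divided by (x - a)**n contributes to A_{n-j}.
    return [b[n - k] for k in range(1, n + 1)]

# The example from the text: (5x - 16)/(x - 2)^2 = 5/(x - 2) - 6/(x - 2)^2.
print(repeated_root_coefficients([5, -16], 2, 2))
```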
Finally, we have to handle the case, where there are not enough real roots. There always exists a factorization of g(x) into linear factors with complex roots (see the fundamental theorem of Algebra in 12.2.8 on page 820). The non-real
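such a form that we may just be able to compute it. We want to determine the primitive function of the function $f(x) = \operatorname{tg}^4 x$. Thus it's sensible to consider the substitution $u = \operatorname{tg} x$, i.e. $x = \operatorname{arctg} u$, $dx = \frac{du}{1 + u^2}$. We obtain
$$\int \operatorname{tg}^4 x\,dx = \int \frac{\sin^4 x}{\cos^4 x}\,dx = \int \frac{u^4}{1 + u^2}\,du = \int u^2 - 1 + \frac{1}{u^2 + 1}\,du = \frac{u^3}{3} - u + \operatorname{arctg} u + C$$
$$= \frac{\operatorname{tg}^3 x}{3} - \operatorname{tg} x + \operatorname{arctg}(\operatorname{tg} x) + C = \frac{\operatorname{tg}^3 x}{3} - \operatorname{tg} x + x + C.$$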
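Case (d). We have (with $z^6 = x$, $6z^5\,dz = dx$)
$$\int \frac{1 - \sqrt[3]{x} + \sqrt{x}}{x + \sqrt[6]{x^5}}\,dx = 6\int \frac{z^3 - z^2 + 1}{z + 1}\,dz = 6\int z^2 - 2z + 2 - \frac{1}{z+1}\,dz$$
$$= 6\left(\frac{z^3}{3} - z^2 + 2z - \ln|z + 1|\right) + C = 2\sqrt{x} - 6\sqrt[3]{x} + 12\sqrt[6]{x} - 6\ln\left(\sqrt[6]{x} + 1\right) + C,$$
where we again easily determined by substitution (for $z \neq -1$, with $v = z + 1$, $dv = dz$)
$$\int \frac{dz}{z+1} = \int \frac{dv}{v} = \ln|v| + C = \ln|z + 1| + C.$$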
□
6.B.11. By combining integration by parts and the substitution method, determine
(a) $\int x^3 e^{-x^2}\,dx$;
(b) $\int x\arcsin x^2\,dx$, $x \in (-1, 1)$.
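Solution. Case (a). The substitution method (with $t = -x^2$, $dt = -2x\,dx$) leads to the integral
$$\int x^3 e^{-x^2}\,dx = \frac12\int t\,e^t\,dt,$$
which can be easily computed by integrating by parts (with $F(t) = t$, $G'(t) = e^t$, $F'(t) = 1$, $G(t) = e^t$), yielding
$$\frac12\int t\,e^t\,dt = \frac12\,t\,e^t - \frac12\int e^t\,dt = \frac12\,t\,e^t - \frac12\,e^t + C = -\frac12\,e^{-x^2}\left(x^2 + 1\right) + C.$$
Case (b). Similarly, we obtain (with $t = x^2$, $dt = 2x\,dx$, then $F(t) = \arcsin t$, $G'(t) = 1$, $F'(t) = \frac{1}{\sqrt{1 - t^2}}$, $G(t) = t$, and finally $u = 1 - t^2$, $du = -2t\,dt$)
$$\int x\arcsin x^2\,dx = \frac12\int\arcsin t\,dt = \frac12\,t\arcsin t - \frac12\int\frac{t}{\sqrt{1 - t^2}}\,dt = \frac12\,t\arcsin t + \frac14\int\frac{du}{\sqrt{u}} = \frac12\,t\arcsin t + \frac12\sqrt{u} + C$$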
roots always appear in conjugate pairs, since $g(\bar z) = \overline{g(z)}$ for a polynomial with real coefficients.
Repeating the above procedure for ratios of complex polynomials gives the same result, but with complex coefficients. If we insist on having real expressions only, we may collect the conjugate pairs together and get quadratic factors expressed as sums of squares $(x - a)^2 + b^2$ and their powers.
The procedure works well and guarantees that it is possible to find summands of the form
$$\frac{Bx + C}{\left((x - a)^2 + b^2\right)^n}.$$
As in the case of real roots, there is always a corresponding decomposition into partial fractions of the form
$$\frac{A_1 x + B_1}{(x - a)^2 + b^2} + \dots + \frac{A_n x + B_n}{\left((x - a)^2 + b^2\right)^n}$$
in the case of a power $\left((x - a)^2 + b^2\right)^n$ of such a quadratic (irreducible) factor as well.
The factorization of the polynomials and the further computations might be quite time consuming. The reader could prefer to experiment with computer algebra software instead. This works well in Maple by calling the procedure convert (h, parfrac, x) that decomposes the expression h rationally dependent on the variable x into partial fractions.
The important point is that we can already integrate all of the above partial fractions. The last mentioned ones lead to integrals discussed in example 6.2.6.
In summary, the rational functions $f(x)/g(x)$ can be integrated easily if the corresponding decomposition of the polynomial $g(x)$ in the denominator is known. The reality is not that simple when computing (definite) Newton integrals. Although we find the primitive functions, the problematic points are the discontinuities of rational functions, in whose neighbourhood these functions are unbounded. We return to this problem later (see paragraph 6.2.16 below).
6.2.8. Riemann integral. We return to the idea of defining the integral as a tool for computing the area of the region bounded by the graph of a function and the x axis. This is our next goal. We prove that for all continuous functions on a closed bounded interval, this definition yields the same result as the Newton integral.
Consider a real function $f$ defined on the interval $[a, b]$. Choose a partition of this interval along with the choice of representatives $\xi_i$ of the respective parts, i.e. $a = x_0 < x_1 < \dots < x_n = b$ and $\xi_i \in [x_{i-1}, x_i]$, $i = 1, \dots, n$. The number $\delta = \max_i\{x_i - x_{i-1}\}$ is called the norm of the partition. Define the Riemann sum corresponding to the chosen partition along with the chosen representatives
$\Xi = (x_0, \dots, x_n;\ \xi_1, \dots, \xi_n)$ as
$$= \frac12\,t\arcsin t + \frac12\sqrt{1 - t^2} + C = \frac12\,x^2\arcsin x^2 + \frac12\sqrt{1 - x^4} + C.$$
$$S_\Xi = \sum_{i=1}^n f(\xi_i)\cdot(x_i - x_{i-1}).$$
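A Riemann sum is straightforward to compute once the partition points and the representatives are given. The following sketch is only an illustration: the function names, the test function $x^2$, and the uniform partition with midpoints as representatives are all assumptions chosen for the example.

```python
def riemann_sum(f, points, representatives):
    """Riemann sum of f for partition points x_0 < ... < x_n and
    representatives xi_i taken from [x_{i-1}, x_i]."""
    assert len(representatives) == len(points) - 1
    return sum(f(xi) * (x1 - x0)
               for xi, x0, x1 in zip(representatives, points, points[1:]))

def partition_norm(points):
    """Norm of the partition: the length of its longest subinterval."""
    return max(x1 - x0 for x0, x1 in zip(points, points[1:]))

# Uniform partition of [0, 1] into n parts with midpoints as representatives.
n = 1000
points = [i / n for i in range(n + 1)]
midpoints = [(i + 0.5) / n for i in range(n)]
approximation = riemann_sum(lambda x: x * x, points, midpoints)

# The sum should be close to the area under x^2 on [0, 1], which is 1/3.
print(partition_norm(points), approximation)
```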
□
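6.B.12. Compute the integral
$$\int\sqrt{1 - x^2}\,dx$$
in two different ways.
Solution. Integration by parts (with $F(x) = \sqrt{1 - x^2}$, $G'(x) = 1$, $F'(x) = \frac{-x}{\sqrt{1 - x^2}}$, $G(x) = x$) yields
$$\int\sqrt{1 - x^2}\,dx = x\sqrt{1 - x^2} + \int\frac{x^2}{\sqrt{1 - x^2}}\,dx = x\sqrt{1 - x^2} - \int\frac{1 - x^2 - 1}{\sqrt{1 - x^2}}\,dx$$
$$= x\sqrt{1 - x^2} - \int\sqrt{1 - x^2}\,dx + \int\frac{dx}{\sqrt{1 - x^2}} = x\sqrt{1 - x^2} - \int\sqrt{1 - x^2}\,dx + \arcsin x,$$
which implies
$$2\int\sqrt{1 - x^2}\,dx = x\sqrt{1 - x^2} + \arcsin x + C,$$
$$\int\sqrt{1 - x^2}\,dx = \frac12\left(x\sqrt{1 - x^2} + \arcsin x\right) + C.$$
The substitution method (with $x = \sin y$, $dx = \cos y\,dy$) along with (1) then yields
$$\int\sqrt{1 - x^2}\,dx = \int\sqrt{1 - \sin^2 y}\cdot\cos y\,dy = \int\cos^2 y\,dy = \frac12\,(y + \sin y\cdot\cos y) + C$$
$$= \frac12\left(\sin y\cdot\sqrt{1 - \sin^2 y} + y\right) + C = \frac12\left(x\sqrt{1 - x^2} + \arcsin x\right) + C,$$
where $y \in (-\pi/2, \pi/2)$ for $x \in (-1, 1)$; thus, among other things, we have
$$0 < \cos y = |\cos y| = \sqrt{\cos^2 y} = \sqrt{1 - \sin^2 y}.$$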
□
6.B.13. Determine
$$\int e^{\sqrt{x}}\,dx, \quad x > 0.$$
Solution. This problem illustrates the possibility of combining the substitution method and integration by parts within one problem. First we'll use the substitution $y = \sqrt{x}$ (i.e. $y^2 = x$, $2y\,dy = dx$) to get rid of the root in the argument of the exponential function. That leads to the integral
$$\int e^{\sqrt{x}}\,dx = 2\int y\,e^y\,dy.$$
Now by using integration by parts (with $F(y) = y$, $G'(y) = e^y$, $F'(y) = 1$, $G(y) = e^y$), we'll compute
$$\int y\,e^y\,dy = y\,e^y - \int e^y\,dy = y\,e^y - e^y + C.$$
Thus in total, we have
$$\int e^{\sqrt{x}}\,dx = 2y\,e^y - 2e^y + C = 2\,e^{\sqrt{x}}\left(\sqrt{x} - 1\right) + C.$$
Riemann integral⁴
Definition. The Riemann integral of the function $f$ on the interval $[a, b]$ exists, if for every sequence of partitions with representatives $(\Xi_k)_{k=0}^{\infty}$ with norms of the partitions $\delta_k$ approaching zero, the limit
$$\lim_{k\to\infty} S_{\Xi_k} = S$$
exists and its value does not depend on the choice of the sequence of partitions and their representatives. Then we write
$$S = \int_a^b f(x)\,dx.$$
This definition does not look very practical, but nonetheless it allows us to formulate and prove several simple properties of the Riemann integral:
Theorem. (1) Suppose $f$ is a bounded real function defined on the interval $[a, b]$, and $c \in [a, b]$ is an inner point of this interval. Then the integral $\int_a^b f(x)\,dx$ exists if and only if both of the integrals $\int_a^c f(x)\,dx$ and $\int_c^b f(x)\,dx$ exist. In that case
$$\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx.$$
(2) Suppose $f$ and $g$ are two real functions defined on the interval $[a, b]$, and that both of the integrals $\int_a^b f(x)\,dx$ and $\int_a^b g(x)\,dx$ exist. Then the integral of their sum also exists and
$$\int_a^b (f(x) + g(x))\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx.$$
(3) Suppose $f$ is a real function defined on the interval $[a, b]$, $C \in \mathbb{R}$ is a constant, and the integral $\int_a^b f(x)\,dx$ exists.
⁴ Bernhard Riemann (1826-1866) was an extremely influential German mathematician with many contributions to infinitesimal analysis, differential geometry, and in particular complex analysis and analytic number theory.
□
6.B.14. Prove that
$$\frac12\sin^4 x = -\frac14\cos(2x) + \frac{1}{16}\cos(4x) + \frac{3}{16}.$$
Solution. It is easier to show that the functions on the left-hand and right-hand sides have the same derivatives than to compare the given expressions directly. Denoting the left-hand side by $L$ and the right-hand side by $P$, we have
$$L' = 2\cos x\sin^3 x = \sin(2x)\sin^2 x,$$
$$P' = \frac12\sin(2x) - \frac14\sin(4x) = \sin(2x)\left(\frac12 - \frac12\cos(2x)\right) = \sin(2x)\sin^2 x.$$
Hence the left-hand and the right-hand sides differ by a constant. This constant can be determined by comparing the values at one point, for example 0. Both functions are zero at zero, thus they are equal. □
Integration of rational functions.
The key to the integration of rational functions lies in decomposing a rational function into a sum of simple rational functions which we know how to integrate. Let us decompose some rational functions:
6.B.15. Carry out the suggested division of polynomials
$$\frac{2x^5 - x^4 + 3x^2 - x + 1}{x^2 - 2x + 4}$$
for $x \in \mathbb{R}$.
6.B.16. Express the function
y:
3xl+2x3-x2 + l 3x+2
as a sum of a polynomial and a rational function. 6.B.17. Decompose the rational expression
( \ 4x2 + l3x-2 . W x3+3x2-4x-12>
(b)
o
o
2x°+5xJ-x^+2x-l xe+2x4+x2
into partial fractions. 6.B.18. Express the function
,, _ 2x3 +&x2 +3x-&
y ~ x*-2x3
in the form of partial fractions.
6.B.19. Decompose the expression
$$\frac{7x^2 - 10x + 37}{x^3 - 3x^2 + 9x + 13}$$
into partial fractions.
6.B.20. Express the rational function
$$y = \frac{-5x + 2}{x^4 - x^3 + 2x^2}$$
in the form of a sum of partial fractions.
6.B.21. Decompose the function
y= x3(x+i)
o
o
o
o
Then the integral $\int_a^b C\cdot f(x)\,dx$ also exists and
$$\int_a^b C\cdot f(x)\,dx = C\cdot\int_a^b f(x)\,dx.$$
Proof. (1) First suppose that the integral over the whole interval exists. When computing it, we can limit ourselves to limits of the Riemann sums whose partitions have the point $c$ among their partitioning points. Each such sum can be obtained as a sum of two partial Riemann sums. If these two partial sums depended on the chosen partitions and representatives in the limit, then the total sums could not be independent of the choices in the limit. (It suffices to keep the sequence of partitions of one subinterval the same, and change the other so that the limit would change.)
Conversely, if the Riemann integrals on both subintervals exist, they can be approximated with arbitrary precision by the Riemann sums, and moreover independently of their choice. If the partitioning point $c$ is added to the partitions in any sequence of Riemann sums over the whole interval $[a, b]$, the value of the whole sum, as well as the values of the partial sums over the intervals belonging to $[a, c]$ and $[c, b]$, change at most by a multiple of the norm of the partition and of the possible differences of the values of the bounded function $f$ on all of $[a, b]$. This is a number arbitrarily close to zero for a decreasing norm of the partition. Necessarily the partial Riemann sums of the function over the two parts of the interval also converge to the limits, whose sum is the Riemann integral over $[a, b]$.
(2) In every Riemann sum, the sum of the functions manifests as the sum of the values in the chosen representatives. Because multiplication of real numbers is distributive, each Riemann sum becomes the sum of the two Riemann sums with the same representatives for the two functions. The statement follows from the elementary properties of limits.
(3) Each of the Riemann sums is multiplied by the constant C. So the claim follows from the elementary properties of limits. □
6.2.9. The fundamental theorem. The following result is crucial for understanding the relation between the integral and the derivative. The complete proof of this theorem is somewhat longer, so it is broken into several subsections.
Fundamental theorem of integral calculus
Theorem. For every continuous function $f$ on a finite interval $[a, b]$ there exists its Riemann integral $\int_a^b f(x)\,dx$. Moreover, the function $F(x)$ given on the interval $[a, b]$ by the Riemann integral
$$F(x) = \int_a^x f(t)\,dt$$
is a primitive function to $f$ on this interval.
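The statement can be illustrated numerically, though of course not proved this way. In the sketch below, the choice $f = \cos$, the lower bound $a = 0$, the point $x = 1$, and the helper names are all assumptions made for the example: the function $F$ is approximated by a midpoint rule, and its central difference quotient is compared with $f(x)$.

```python
import math

def integral(f, a, b, n=2000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def F(x, a=0.0):
    """Numerical stand-in for F(x), the integral of cos(t) from a to x."""
    return integral(math.cos, a, x)

x, dx = 1.0, 1e-4
# The central difference quotient of F at x should be close to f(x) = cos(x).
quotient = (F(x + dx) - F(x - dx)) / (2 * dx)
print(quotient, math.cos(x))
```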
into partial fractions.
o
6.B.22. Determine the form of the decomposition of the rational function
_ 2x2-U4 y ~ (x-2)x2 (3x2+x+4)2
into partial fractions. Don't compute the undetermined coefficients! O
6.B.23. Express the function
_ xA+bx2+x-2
y ~ x4-2x3
as a sum of a polynomial and a proper rational function Q. Then express the obtained function Q in the form of a sum of partial fractions. O
6.B.24. Write the primitive function to the rational function
(a) y ■
(b) y-
i/2;
>-2):
x ^ 2.
o
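6.B.25. Integrate
(a) $\int \frac{6}{x - 2}\,dx$, $x \neq 2$;
(b) $\int \frac{6}{(x + 4)^3}\,dx$, $x \neq -4$;
(c) $\int \frac{3x + 7}{x^2 - 4x + 15}\,dx$, $x \in \mathbb{R}$;
(d) $\int \frac{30x - 77}{(x^2 - 6x + 13)^2}\,dx$, $x \in \mathbb{R}$.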
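Solution. Cases (a), (b). We have (with $y = x - 2$, $dy = dx$)
$$\int \frac{6}{x - 2}\,dx = \int \frac{6}{y}\,dy = 6\ln|y| + C = 6\ln|x - 2| + C$$
and similarly (with $y = x + 4$, $dy = dx$)
$$\int \frac{6}{(x + 4)^3}\,dx = \int \frac{6}{y^3}\,dy = \frac{6}{-2y^2} + C = -\frac{3}{(x + 4)^2} + C.$$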
We can see that integrating the partial fractions which correspond to real roots of the denominator of a rational function is very easy. Moreover, in full generality we obtain (with $y = x - x_0$, $dy = dx$)
$$\int \frac{A}{x - x_0}\,dx = \int \frac{A}{y}\,dy = A\ln|y| + C = A\ln|x - x_0| + C$$
and
$$\int \frac{A}{(x - x_0)^n}\,dx = \int \frac{A}{y^n}\,dy = \frac{A\,y^{-n+1}}{-n+1} + C = \frac{A}{(1 - n)(x - x_0)^{n-1}} + C$$
for all $A, x_0 \in \mathbb{R}$, $n \geq 2$, $n \in \mathbb{N}$.
Case (c). Now we are to integrate a partial fraction corresponding to a pair of complex conjugate roots. Thus in the
6.2.10. Upper and lower Riemann integral. In the first step for proving the existence of the integral, we use an alternative definition, in which the choice of representatives and the corresponding values $f(\xi_i)$ is replaced by the suprema $M_i$ of the values $f(x)$ in the corresponding subintervals $[x_{i-1}, x_i]$, or by the infima $m_i$ of the function $f(x)$ in the same subintervals, respectively. We speak of upper and lower Riemann sums, respectively (in the literature, this construction is also called the Darboux integral).
Because the function is continuous, it is bounded on a closed interval, hence all the above considered suprema and infima exist and are finite. Then the upper Riemann sum corresponding to the partition $\Xi = (x_0, \dots, x_n)$ is given by the expression
$$S_{\Xi,\sup} = \sum_{i=1}^n \sup_{x_{i-1} \le \xi \le x_i} f(\xi)\,(x_i - x_{i-1}) = \sum_{i=1}^n M_i\,(x_i - x_{i-1}).$$
The lower Riemann sum is
$$S_{\Xi,\inf} = \sum_{i=1}^n \inf_{x_{i-1} \le \xi \le x_i} f(\xi)\,(x_i - x_{i-1}) = \sum_{i=1}^n m_i\,(x_i - x_{i-1}).$$
For each partition $\Xi = (x_0, \dots, x_n;\ \xi_1, \dots, \xi_n)$ with representatives, there are the inequalities
$$S_{\Xi,\inf} \le S_\Xi \le S_{\Xi,\sup}. \qquad (1)$$
Moreover, the infima and suprema can be approximated with arbitrary precision by the actual values of terms in the sequences. Thus, we might suspect that the Riemann integral exists if and only if for all sequences of partitions with norms approaching zero, the limits of both the upper and lower sums exist and are equal. This is indeed true for all bounded functions:
Theorem. Let the function $f$ be bounded on a closed interval $[a, b]$. Then
$$S_{\sup} = \inf_\Xi S_{\Xi,\sup}, \qquad S_{\inf} = \sup_\Xi S_{\Xi,\inf}$$
are the limits of all sequences of upper and lower sums with norm approaching zero, respectively.
The Riemann integral of the function $f$ exists if and only if $S_{\sup} = S_{\inf}$.
Proof. First, notice that $S_{\sup}$ is well defined since it is the infimum of a set of real values bounded from below by any of the $S_{\Xi,\inf}$. Similarly for the value $S_{\inf}$, which is bounded from above by any of the $S_{\Xi,\sup}$.
Refine a partition $\Xi_1$ to $\Xi_2$ by adding new points. Then
$$S_{\Xi_1,\sup} \ge S_{\Xi_2,\sup}, \qquad S_{\Xi_1,\inf} \le S_{\Xi_2,\inf}.$$
By the definition of the infimum, there are sequences of partitions $\Xi_k$ for which $S_{\sup}$ is the limit of the sums $S_{\Xi_k,\sup}$.
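denominator there is a polynomial of degree 2 and in the numerator a polynomial of degree at most 1. If it's of degree 1, we'll write the partial fraction so that we have a multiple of the derivative of the denominator in the numerator and add to it a fraction whose numerator is only a constant. This way we'll obtain
$$\int \frac{3x + 7}{x^2 - 4x + 15}\,dx = \frac32\int \frac{2x - 4}{x^2 - 4x + 15}\,dx + 13\int \frac{dx}{x^2 - 4x + 15}$$
$$= \frac32\ln\left(x^2 - 4x + 15\right) + 13\int \frac{dx}{(x - 2)^2 + 11} = \frac32\ln\left(x^2 - 4x + 15\right) + \frac{13}{11}\int \frac{dx}{\left(\frac{x - 2}{\sqrt{11}}\right)^2 + 1}$$
and, with $y = \frac{x - 2}{\sqrt{11}}$, $dy = \frac{dx}{\sqrt{11}}$,
$$= \frac32\ln\left(x^2 - 4x + 15\right) + \frac{13}{\sqrt{11}}\int \frac{dy}{y^2 + 1} = \frac32\ln\left(x^2 - 4x + 15\right) + \frac{13}{\sqrt{11}}\operatorname{arctg} y + C$$
$$= \frac32\ln\left(x^2 - 4x + 15\right) + \frac{13}{\sqrt{11}}\operatorname{arctg}\frac{x - 2}{\sqrt{11}} + C.$$
Again, we can generally express
$$\int \frac{Ax + B}{(x - x_0)^2 + a^2}\,dx = \frac{A}{2}\int \frac{2(x - x_0)}{(x - x_0)^2 + a^2}\,dx + (B + Ax_0)\int \frac{dx}{(x - x_0)^2 + a^2}$$
and compute (with $y = (x - x_0)^2 + a^2$, $dy = 2(x - x_0)\,dx$)
$$\int \frac{2(x - x_0)}{(x - x_0)^2 + a^2}\,dx = \int \frac{dy}{y} = \ln|y| + C = \ln\left[(x - x_0)^2 + a^2\right] + C,$$
and (with $z = \frac{x - x_0}{a}$, $dz = \frac{dx}{a}$)
$$\int \frac{dx}{(x - x_0)^2 + a^2} = \frac{1}{a^2}\int \frac{dx}{\left(\frac{x - x_0}{a}\right)^2 + 1} = \frac1a\int \frac{dz}{z^2 + 1} = \frac1a\operatorname{arctg} z + C = \frac1a\operatorname{arctg}\frac{x - x_0}{a} + C,$$
i.e.
$$\int \frac{Ax + B}{(x - x_0)^2 + a^2}\,dx = \frac{A}{2}\ln\left((x - x_0)^2 + a^2\right) + \frac{B + Ax_0}{a}\operatorname{arctg}\frac{x - x_0}{a} + C,$$
where the values $A, B, x_0 \in \mathbb{R}$, $a > 0$ are arbitrary.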
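Case (d). All that is left are the partial fractions for multiple complex roots in the form of
$$\frac{Ax + B}{\left[(x - x_0)^2 + a^2\right]^n}, \quad A, B, x_0 \in \mathbb{R},\ a > 0,\ n \in \mathbb{N}\setminus\{1\},$$
which can be analogously simplified to
$$\frac{A}{2}\cdot\frac{2(x - x_0)}{\left[(x - x_0)^2 + a^2\right]^n} + \frac{B + Ax_0}{\left[(x - x_0)^2 + a^2\right]^n}.$$
Then we'll determine (with $y = (x - x_0)^2 + a^2$, $dy = 2(x - x_0)\,dx$)
$$\int \frac{2(x - x_0)}{\left[(x - x_0)^2 + a^2\right]^n}\,dx = \int \frac{dy}{y^n} = \frac{1}{(1 - n)\,y^{n-1}} + C = \frac{1}{(1 - n)\left[(x - x_0)^2 + a^2\right]^{n-1}} + C$$
and, integrating by parts with $F(x) = \frac{1}{\left[(x - x_0)^2 + a^2\right]^n}$, $G'(x) = 1$, $F'(x) = \frac{-2n(x - x_0)}{\left[(x - x_0)^2 + a^2\right]^{n+1}}$, $G(x) = x - x_0$,
$$K_n(x_0, a) = \int \frac{dx}{\left[(x - x_0)^2 + a^2\right]^n} = \frac{x - x_0}{\left[(x - x_0)^2 + a^2\right]^n} + 2n\int \frac{(x - x_0)^2}{\left[(x - x_0)^2 + a^2\right]^{n+1}}\,dx$$
$$= \frac{x - x_0}{\left[(x - x_0)^2 + a^2\right]^n} + 2n\int \frac{(x - x_0)^2 + a^2 - a^2}{\left[(x - x_0)^2 + a^2\right]^{n+1}}\,dx = \frac{x - x_0}{\left[(x - x_0)^2 + a^2\right]^n} + 2n\left(K_n(x_0, a) - a^2 K_{n+1}(x_0, a)\right),$$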
Moreover, every two partitions have a common refinement. Thus it may be assumed that $\Xi_k$ in the sequence is always a partition obtained by refining the previous one. Hence the sums $S_{\Xi_k,\sup}$ form a non-increasing sequence of real numbers converging to $S_{\sup}$.
A similar argument applies to $S_{\inf}$. Hence the values
$$S_{\sup} = \inf_\Xi S_{\Xi,\sup}, \qquad S_{\inf} = \sup_\Xi S_{\Xi,\inf}$$
are good candidates for the limits of upper and lower sums.
Next, consider a fixed partition $\Xi$ with $n$ inner partitioning points of the interval $[a, b]$, and another partition $\Xi_1$, whose norm is a small number $\delta$. In the common refinement $\Xi_2$, there will be only $n$ intervals contributing to the sum $S_{\Xi_2,\sup}$ by a possibly smaller contribution than in the case of $\Xi_1$. Now, $f$ is a bounded function on $[a, b]$ and thus each of these contributions will be bounded by a universal constant multiplied by the norm $\delta$ of the partition. Hence when choosing $\delta$ sufficiently small, the distance of $S_{\Xi_1,\sup}$ from $S_{\sup}$ will not be larger than twice the distance of $S_{\Xi,\sup}$ from $S_{\sup}$.
Finally, return to the sequence of partitions $\Xi_k$ as chosen above, and choose an $\varepsilon > 0$. Then there is some $m \in \mathbb{N}$ such that the distance of $S_{\Xi_k,\sup}$ from $S_{\sup}$ is less than $\varepsilon$ for all $k > m$. Hence for an arbitrary partition $\Xi$ with appropriately small norm $\delta > 0$ the distance of $S_{\Xi,\sup}$ from $S_{\sup}$ does not exceed $2\varepsilon$.
In summary, for arbitrary $2\varepsilon > 0$, there is $\delta > 0$ such that for all partitions $\Xi$ with norm at most $\delta$ the inequality
$$|S_{\Xi,\sup} - S_{\sup}| < 2\varepsilon$$
holds. This is exactly the statement that the number $S_{\sup}$ is the limit of all sequences of upper sums with norms of the partition approaching zero.
The statement for lower sums is proved in exactly the same way.
It remains to deal with the existence of the Riemann integral $\int_a^b f(x)\,dx$. If $S_{\sup} = S_{\inf}$, then all Riemann sums of sequences of the partitions have the same limit because of the inequalities (1).
If the Riemann integral does not exist, then there exist two sequences of partitions $\Xi_k$ and $\tilde\Xi_k$ and their representatives with different limits of Riemann sums. Suppose the first limit is larger than the other one. Then the upper Riemann sums can be selected for the first sequence and the lower Riemann sums for the second sequence. Their difference will then be at least as large. In particular, in view of the previous part of the proof, this implies $S_{\sup} > S_{\inf}$.
□
6.2.11. Uniform continuity. Until now, we have only used the continuity of the function $f$ to know that all such functions are bounded on a closed finite interval. It remains to show that $S_{\sup} = S_{\inf}$ for continuous functions.
From the definition of continuity, for every fixed point $x \in [a, b]$ and every neighbourhood $O_\varepsilon(f(x))$ there exists a neighbourhood $O_\delta(x)$ such that $f(O_\delta(x)) \subset O_\varepsilon(f(x))$. This statement can be rewritten in this way: for $y, z \in O_\delta(x)$, i.e.
$$|y - z| < 2\delta,$$
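which implies
$$K_{n+1}(x_0, a) = \frac{1}{a^2}\left(\frac{2n - 1}{2n}\,K_n(x_0, a) + \frac{1}{2n}\cdot\frac{x - x_0}{\left[(x - x_0)^2 + a^2\right]^n}\right),$$
which clearly also holds for $n = 1$. The last recurrence formula can be completed with the integral (derived in case (c))
$$K_1(x_0, a) = \frac1a\operatorname{arctg}\frac{x - x_0}{a} + C.$$
In the given problem we have
$$\int \frac{30x - 77}{(x^2 - 6x + 13)^2}\,dx = 15\int \frac{2x - 6}{(x^2 - 6x + 13)^2}\,dx + 13\int \frac{dx}{(x^2 - 6x + 13)^2}$$
and furthermore (with $y = x^2 - 6x + 13$, $dy = (2x - 6)\,dx$)
$$\int \frac{2x - 6}{(x^2 - 6x + 13)^2}\,dx = \int \frac{dy}{y^2} = -\frac1y + C = -\frac{1}{x^2 - 6x + 13} + C,$$
$$\int \frac{dx}{(x^2 - 6x + 13)^2} = \int \frac{dx}{\left[(x - 3)^2 + 2^2\right]^2} = K_2(3, 2) = \frac{1}{2^2}\left(\frac12\,K_1(3, 2) + \frac12\cdot\frac{x - 3}{(x - 3)^2 + 2^2}\right)$$
$$= \frac14\left(\frac12\cdot\frac12\operatorname{arctg}\frac{x - 3}{2} + \frac12\cdot\frac{x - 3}{x^2 - 6x + 13}\right) + C = \frac{1}{16}\operatorname{arctg}\frac{x - 3}{2} + \frac18\cdot\frac{x - 3}{x^2 - 6x + 13} + C.$$
In total, we have
$$\int \frac{30x - 77}{(x^2 - 6x + 13)^2}\,dx = -\frac{15}{x^2 - 6x + 13} + \frac{13}{8}\cdot\frac{x - 3}{x^2 - 6x + 13} + \frac{13}{16}\operatorname{arctg}\frac{x - 3}{2} + C$$
$$= \frac{13}{16}\operatorname{arctg}\frac{x - 3}{2} + \frac{13x - 159}{8\,(x^2 - 6x + 13)} + C.$$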
□
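A result of this kind is easy to sanity-check by differentiating the claimed primitive function numerically. The sketch below does this for the integral of $\frac{30x - 77}{(x^2 - 6x + 13)^2}$; the function names, the step size, and the sample points are assumptions chosen for the illustration.

```python
import math

def integrand(x):
    """The integrated rational function (30x - 77) / (x^2 - 6x + 13)^2."""
    return (30 * x - 77) / (x * x - 6 * x + 13) ** 2

def antiderivative(x):
    """Claimed primitive function (with the constant C = 0):
    13/16 * arctg((x - 3)/2) + (13x - 159) / (8 (x^2 - 6x + 13))."""
    q = x * x - 6 * x + 13
    return 13 / 16 * math.atan((x - 3) / 2) + (13 * x - 159) / (8 * q)

def derivative(func, x, h=1e-5):
    """Central difference approximation of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in [-2.0, 0.0, 1.5, 3.0, 7.0]:
    print(x, derivative(antiderivative, x), integrand(x))
```

Agreement at a handful of points does not prove the formula, but a sign or coefficient error in the recurrence would show up immediately.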
6.B.26.   Integrate the rational functions
(a) f^0Tf dx,x=£0,x=£l;
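(c) $\int \frac{dx}{(x - 4)(x - 2)(x^2 + 2x + 2)}$, $x \neq 2$, $x \neq 4$;
(d) $\int \frac{x}{x^4 - x^3 - x + 1}\,dx$, $x \neq 1$;
(e) $\int \frac{2x + 1}{(x^2 + 4x + 13)^2}\,dx$, $x \in \mathbb{R}$;
(f) $\int \frac{5x^2 - 12}{x^4 - 12x^3 + 62x^2 - 156x + 169}\,dx$, $x \in \mathbb{R}$.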
Solution. We'll compute all the given integrals in the way we can always use when integrating rational functions. We won't use any specific simplification or substitution. Even the recurrence formula for $K_{n+1}(x_0, a)$, which we derived in a general form, will be used only for $x_0 = 0$, $a = 1$ (and also when n = 0). Using the aforementioned procedures, we obtain
(a)
(b)
f -ptjW dx = 2f^T+f r-^w + 2 f ^
J — I)'3 J   X — 1        J   (x— l)z J   (x — 1
= 2 In I a: — 1 I
x-l 0-1)
In I a; I + C;
it is true that $f(y), f(z) \in O_\varepsilon(f(x))$, i.e.
$$|f(y) - f(z)| < 2\varepsilon.$$
A global variant of such a property is needed; it is called the uniform continuity of a function $f$:
Uniform continuity
Definition. Let $f$ be a function on a closed finite interval $[a, b]$. The function $f$ is uniformly continuous on $[a, b]$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that for all $z, y \in [a, b]$ satisfying $|y - z| < \delta$, the inequality $|f(y) - f(z)| < \varepsilon$ holds.
Theorem. Each continuous function on a finite closed interval $[a, b]$ is uniformly continuous.
Proof. Fixing some $\varepsilon > 0$, the definition of continuity of $f$ provides for each $x \in [a, b]$ a value $\delta(x)$ such that $f(y) \in O_\varepsilon(f(x))$ for all $y \in O_{2\delta(x)}(x)$. Since every finite closed interval is compact, it is covered by finitely many of the neighbourhoods $O_{\delta(x)}(x)$, determined by points $x_1, \dots, x_k$. Choose $\delta$ as the minimum of all the (finitely many) $\delta(x_i)$.
Choose any two points $y, z \in [a, b]$ with $|y - z| < \delta$; they both belong to one of the neighbourhoods $O_{2\delta(x_i)}(x_i)$. Thus $|f(y) - f(z)| \le |f(y) - f(x_i)| + |f(x_i) - f(z)| < 2\varepsilon$ and $f$ has the desired property. □
6.2.12. Finishing the proof of Theorem 6.2.9. Now we complete the proof of the existence of the Riemann integral. Choose $\varepsilon$ and $\delta$ as in the definition of the uniform continuity of $f$. Consider any partition $\Xi$ with $n$ intervals and norm at most $\delta$. Then, writing $J_i = [x_{i-1}, x_i]$,
$$S_{\Xi,\sup} - S_{\Xi,\inf} = \sum_{i=1}^n\left(\sup_{\xi \in J_i} f(\xi) - \inf_{\xi \in J_i} f(\xi)\right)(x_i - x_{i-1}) < \varepsilon\cdot(b - a).$$
CHAPTER 6. DIFFERENTIAL AND INTEGRAL CALCULUS
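$$\int \frac{x-4}{5x^2+6x+3}\,dx = \frac{1}{10}\int \frac{10x+6}{5x^2+6x+3}\,dx - \frac{23}{5}\int \frac{dx}{5x^2+6x+3} = \frac{1}{10}\ln(5x^2+6x+3) - \frac{23}{6}\int \frac{dx}{\left(\frac{5x+3}{\sqrt6}\right)^2+1}.$$
With the substitution $t = \frac{5x+3}{\sqrt6}$, $dt = \frac{5}{\sqrt6}\,dx$, this equals
$$\frac{1}{10}\ln(5x^2+6x+3) - \frac{23\sqrt6}{30}\int \frac{dt}{t^2+1} = \frac{1}{10}\ln(5x^2+6x+3) - \frac{23\sqrt6}{30}\operatorname{arctg} t + C = \frac{1}{10}\ln(5x^2+6x+3) - \frac{23\sqrt6}{30}\operatorname{arctg}\frac{5x+3}{\sqrt6} + C;$$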
(c)
$$\int \frac{dx}{(x-4)(x-2)(x^2+2x+2)} = \frac{1}{52}\int \frac{dx}{x-4} - \frac{1}{20}\int \frac{dx}{x-2} + \frac{1}{130}\int \frac{4x+11}{x^2+2x+2}\,dx$$
$$= \frac{1}{52}\ln|x-4| - \frac{1}{20}\ln|x-2| + \frac{1}{130}\left(2\int \frac{2x+2}{x^2+2x+2}\,dx + 7\int \frac{dx}{x^2+2x+2}\right)$$
$$= \frac{1}{260}\ln\frac{|x-4|^5}{|x-2|^{13}} + \frac{1}{65}\ln(x^2+2x+2) + \frac{7}{130}\int \frac{dx}{(x+1)^2+1}.$$
With $t = x+1$, $dt = dx$, we obtain
$$\frac{1}{260}\ln\frac{|x-4|^5\,(x^2+2x+2)^4}{|x-2|^{13}} + \frac{7}{130}\int \frac{dt}{t^2+1} = \frac{1}{260}\ln\frac{|x-4|^5\,(x^2+2x+2)^4}{|x-2|^{13}} + \frac{7}{130}\operatorname{arctg}(x+1) + C;$$
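(d)
$$\int \frac{x}{x^4-x^3-x+1}\,dx = \frac13\int \frac{dx}{(x-1)^2} - \frac13\int \frac{dx}{x^2+x+1} = -\frac{1}{3(x-1)} - \frac13\int \frac{dx}{\left(x+\frac12\right)^2+\frac34}.$$
With $t = \frac{2x+1}{\sqrt3}$, $dt = \frac{2}{\sqrt3}\,dx$, this equals
$$-\frac{1}{3(x-1)} - \frac{2\sqrt3}{9}\int \frac{dt}{t^2+1} = -\frac{1}{3(x-1)} - \frac{2\sqrt3}{9}\operatorname{arctg} t + C = -\frac{1}{3(x-1)} - \frac{2\sqrt3}{9}\operatorname{arctg}\frac{2x+1}{\sqrt3} + C;$$
(e)
$$\int \frac{2x+1}{(x^2+4x+13)^2}\,dx = \int \frac{2x+4}{(x^2+4x+13)^2}\,dx - 3\int \frac{dx}{\left[(x+2)^2+9\right]^2}.$$
With $t = x^2+4x+13$, $dt = (2x+4)\,dx$ in the first integral and $u = \frac{x+2}{3}$, $du = \frac13\,dx$ in the second one, we get
$$\int \frac{dt}{t^2} - \frac{3}{27}\int \frac{du}{(u^2+1)^2} = -\frac1t - \frac19\left(\frac12\operatorname{arctg} u + \frac12\,\frac{u}{u^2+1}\right) + C$$
$$= -\frac{1}{x^2+4x+13} - \frac{1}{18}\operatorname{arctg}\frac{x+2}{3} - \frac{1}{6}\,\frac{x+2}{x^2+4x+13} + C = -\frac{1}{18}\operatorname{arctg}\frac{x+2}{3} - \frac{x+8}{6\,(x^2+4x+13)} + C;$$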
(f)
$$\int \frac{5x^2-12}{x^4-12x^3+62x^2-156x+169}\,dx = \int \frac{5x^2-12}{(x^2-6x+13)^2}\,dx = 5\int \frac{dx}{x^2-6x+13} + \int \frac{30x-77}{(x^2-6x+13)^2}\,dx = 5\int \frac{dx}{(x-3)^2+4} +$$
For decreasing norm of the partition, the upper and lower sums are arbitrarily close to each other. In particular, the upper Riemann integral and the lower Riemann integral coincide.
To complete the proof of the fundamental theorem of integral calculus, it remains to verify the statement about the existence of a primitive function. For a continuous function $f$ on the interval $[a, b]$, the Riemann integral $F(x) = \int_a^x f(t)\,dt$ exists for every $x \in [a, b]$. As in the statement about uniform continuity, there is $\delta > 0$, dependent on a fixed small $\varepsilon > 0$, such that
$$|f(x + \Delta x) - f(x)| < \varepsilon$$
for all $0 < \Delta x < \delta$ on the interval $[a, b]$. The difference of the derivative of $F(x)$ and the integrated function $f(x)$ is expressed by the limit of the expressions
$$\frac{1}{\Delta x}\bigl(F(x + \Delta x) - F(x)\bigr) - f(x) = \frac{1}{\Delta x}\int_x^{x+\Delta x} f(t)\,dt - f(x)$$
for $\Delta x$ approaching zero.
Now choose $0 < \Delta x < \delta$ and replace the integrated function by the constant value $f(x)$. Then the values $f(\xi)$ at any point $\xi \in [x, x + \Delta x]$ differ from $f(x)$ by less than $\varepsilon$. Hence the Riemann integral in question cannot differ from $f(x)\,\Delta x$ by more than $\varepsilon\,\Delta x$. Thus, we arrive at the following estimate:
$$\left|\frac{1}{\Delta x}\int_x^{x+\Delta x} f(t)\,dt - f(x)\right| < \varepsilon.$$
But that means that at the point $x$, the one-sided right derivative of the function $F(x)$ exists and equals $f(x)$. The result for the left derivative is proved in the same way, just working with the interval $[x - \Delta x, x]$. This finishes the proof of Theorem 6.2.9.
6.2.13. Important remarks. (1) Theorems 6.2.9 and 6.2.8 claim that the Riemann integral is a linear map
$$C[a, b] \to \mathbb{R}$$
from the vector space of continuous functions on the interval $[a, b]$ to the real numbers. Hence it is a linear form on the space $C[a, b]$.
(2) We proved that every continuous function is a derivative of some function. Hence the concepts of the Newton and Riemann integrals coincide for continuous functions. Therefore the Riemann integral of a continuous function can be computed as the difference of values $F(b) - F(a)$ of the primitive function $F$.
(3) In the first step of the proof of Theorem 6.2.9 we proved the important statement that for bounded functions $f$ on finite intervals $[a, b]$ the limits of the upper and lower sums always exist. They are called the upper Riemann integral and the lower Riemann integral, respectively, and they are also denoted by $\overline{\int_a^b} f(x)\,dx$ and $\underline{\int_a^b} f(x)\,dx$.
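Remarks (2) and (3) can be illustrated numerically. The following minimal Python sketch is not part of the original text; the function name `riemann_sums` and the test function $x^2$ on $[0, 1]$ are chosen here purely for illustration. It assumes an increasing function, so that the infimum and supremum on each subinterval are attained at its endpoints, and compares the lower and upper sums with $F(b) - F(a)$ for the primitive function $F(x) = x^3/3$.

```python
def riemann_sums(f, a, b, n):
    """Lower and upper Riemann sums of an increasing function f on [a, b]
    for the partition into n intervals of equal length."""
    h = (b - a) / n
    # for an increasing f the infimum sits at the left endpoint and
    # the supremum at the right endpoint of each subinterval
    lower = h * sum(f(a + i * h) for i in range(n))
    upper = h * sum(f(a + (i + 1) * h) for i in range(n))
    return lower, upper


def square(x):
    return x * x


# F(x) = x**3 / 3 is a primitive function of x**2, so F(1) - F(0) = 1/3
exact = 1.0 / 3.0
for n in (10, 100, 1000):
    lower, upper = riemann_sums(square, 0.0, 1.0, n)
    print(n, lower, upper, upper - lower)
```

In this example the gap between the two sums equals $(f(b) - f(a))(b - a)/n$, so it shrinks together with the norm of the partition, while both sums enclose the value $1/3$.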
$$15\int \frac{2x-6}{(x^2-6x+13)^2}\,dx + 13\int \frac{dx}{\left[(x-3)^2+4\right]^2}.$$
With $t = \frac{x-3}{2}$, $dt = \frac12\,dx$ and $u = x^2-6x+13$, $du = (2x-6)\,dx$, this equals
$$\frac52\int \frac{dt}{t^2+1} + 15\int \frac{du}{u^2} + \frac{13}{8}\int \frac{dt}{(t^2+1)^2} = \frac52\operatorname{arctg} t - \frac{15}{u} + \frac{13}{8}\left(\frac12\operatorname{arctg} t + \frac12\,\frac{t}{t^2+1}\right) + C$$
$$= \frac{53}{16}\operatorname{arctg}\frac{x-3}{2} - \frac{15}{x^2-6x+13} + \frac{13\,(x-3)}{8\,(x^2-6x+13)} + C = \frac{53}{16}\operatorname{arctg}\frac{x-3}{2} + \frac{13x-159}{8\,(x^2-6x+13)} + C.$$
□
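The antiderivatives found in (e) and (f) can be sanity-checked by differentiating them numerically. The sketch below is an illustration added here, not part of the original solution; the helper names (`derivative`, `max_error`, `f_e`, `F_e`, `f_f`, `F_f`) and the sample points are assumptions, and the symmetric difference quotient only approximates the derivative.

```python
import math


def derivative(F, x, h=1e-5):
    # symmetric difference quotient approximating F'(x)
    return (F(x + h) - F(x - h)) / (2 * h)


def f_e(x):
    # integrand of (e)
    return (2 * x + 1) / (x * x + 4 * x + 13) ** 2


def F_e(x):
    # antiderivative obtained in (e), constant C omitted
    return -math.atan((x + 2) / 3) / 18 - (x + 8) / (6 * (x * x + 4 * x + 13))


def f_f(x):
    # integrand of (f), written with the factored denominator
    return (5 * x * x - 12) / (x * x - 6 * x + 13) ** 2


def F_f(x):
    # antiderivative obtained in (f), constant C omitted
    return 53 * math.atan((x - 3) / 2) / 16 + (13 * x - 159) / (8 * (x * x - 6 * x + 13))


def max_error(f, F, points):
    # largest deviation between the numerical derivative of F and f
    return max(abs(derivative(F, x) - f(x)) for x in points)


points = [-3.0, -1.0, 0.0, 0.5, 2.0, 7.0]
print(max_error(f_e, F_e, points), max_error(f_f, F_f, points))
```

Both printed deviations should be tiny compared with the values of the integrands, which supports (but of course does not prove) the computed results.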
6.B.27. Compute
$$\int \frac{x}{(x-1)^2(x^2+2x+2)}\,dx, \quad x \neq 1.$$
Solution. Because the degree of the numerator is lower than the degree of the denominator, these polynomials don't have a common root, and we know the representation of the denominator in the form of a product of root factors, we know the form of the decomposition of the integrated function into partial fractions
$$\frac{x}{(x-1)^2(x^2+2x+2)} = \frac{A}{x-1} + \frac{B}{(x-1)^2} + \frac{Cx+D}{x^2+2x+2}$$
for $A, B, C, D \in \mathbb{R}$. If we multiply this equation by the denominator of the left hand side, we'll obtain the identity
$$x = A(x-1)(x^2+2x+2) + B(x^2+2x+2) + (Cx+D)(x-1)^2,$$
which holds for all $x \in \mathbb{R} \setminus \{1\}$. But both sides are polynomials, so the equality must also hold for $x = 1$. By substituting this value we immediately get $1 = B(1+2+2)$, i.e. $B = 1/5$.
We could choose other real (or complex) numbers and substitute them into the given equation, but we cannot expect to directly determine another of the unknowns (unless we substitute a root of the denominator). Thus we'll rather compare the coefficients at the same powers of the polynomials
$$x - \tfrac15(x^2+2x+2) = -\tfrac15 x^2 + \tfrac35 x - \tfrac25$$
and
$$A(x-1)(x^2+2x+2) + (Cx+D)(x-1)^2 = (A+C)\,x^3 + (A-2C+D)\,x^2 + (C-2D)\,x - 2A + D,$$
In this way we can define the Riemann integral for continuous functions as in the above proof.
(4) We derived the important property of continuous functions called the uniform continuity on finite closed intervals $[a, b]$. Clearly every uniformly continuous function is continuous as well, but the converse is not true on open intervals. As an example, consider the function $f(x) = \sin(1/x)$ on the interval $(0, 1)$.
(5) Consider a function $f$ on an interval $[a, b]$ which is only piecewise continuous. This means that $f$ is continuous at all points $c \in [a, b]$ except for finitely many discontinuities $c_i$, $a < c_i < b$, at which it has finite one-sided limits. Because of the additivity of the integral with respect to the interval of integration, see 6.2.8(1), the last theorem implies that in this case the integral
$$F(x) = \int_a^x f(t)\,dt$$
exists for all $x \in [a, b]$ and the derivative of the function $F(x)$ exists at all points $x$ where $f$ is continuous. It can be verified that $F(x)$ is continuous at the remaining points. So it is a continuous function on the whole interval $[a, b]$. When evaluating the integral by primitive functions, it is necessary to choose its individual parts so that they are connected continuously at the points $c_i$. Then the entire integral can again be computed as the difference of the values of the function $F(x)$ at the boundary points.
(6) Lagrange's mean value theorem for differentiable functions has an analogue which is called the integral mean value theorem. Suppose $f(x)$ is continuous on an interval $[a, b]$ and its primitive function is $F(x)$. The mean value theorem claims that there exists a point $c$, $a < c < b$, such that
$$\int_a^b f(x)\,dx = F(b) - F(a) = F'(c)(b - a) = f(c)(b - a).$$
This statement can be derived directly from the definition of the Riemann integral. It can be used in the final step of the proof of the fundamental theorem of integral calculus.
6.2.14. Differential equations. Theorem 6.2.9 can be understood in the following way. Given a continuous function $f(x)$ on a bounded interval $[a, b]$, the set of all functions $y$ of one variable $x$ satisfying the equality
$$y' = f(x)$$
is given by the formula
$$y(x) = \int_a^x f(t)\,dt + C$$
with the constant $C = y(a)$. This is the simplest instance of differential equations. More generally, ordinary differential equations of first order are given as
$$y' = f(x, y),$$
where $f(x, y)$ depends on two variables $x$ and $y$. A solution to this equation is a function $y = y(x)$, such that the
which leads to a system of equations
$$0 = A + C,$$
$$-\tfrac15 = A - 2C + D,$$
$$\tfrac35 = C - 2D,$$
$$-\tfrac25 = -2A + D.$$
Note that this system must have exactly one solution (which is uniquely determined by any three of the given equations). The sought solution is then
$$A = \frac{1}{25}, \quad C = -\frac{1}{25}, \quad D = -\frac{8}{25}.$$
Thus
$$\int \frac{x}{(x-1)^2(x^2+2x+2)}\,dx = \int \frac{dx}{25(x-1)} + \int \frac{dx}{5(x-1)^2} - \int \frac{x+8}{25(x^2+2x+2)}\,dx$$
$$= \frac{1}{25}\ln|x-1| - \frac{1}{5(x-1)} - \frac{1}{50}\ln(x^2+2x+2) - \frac{7}{25}\operatorname{arctg}(x+1) + C,$$
where we used
$$\int \frac{x+8}{x^2+2x+2}\,dx = \frac12\int \frac{2x+2}{x^2+2x+2}\,dx + 7\int \frac{dx}{(x+1)^2+1} = \frac12\ln(x^2+2x+2) + 7\operatorname{arctg}(x+1) + C.$$
□
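The coefficients of the partial fraction decomposition in 6.B.27 can be double-checked with exact rational arithmetic. The following short sketch is an illustrative addition, not the authors' procedure: it eliminates $C$ and $D$ by hand (as described in the comments), uses Python's standard `fractions` module, and the helper name `rhs` is hypothetical.

```python
from fractions import Fraction as Fr

# B follows from substituting x = 1 into the polynomial identity
B = Fr(1, 5)

# From 0 = A + C and -2/5 = -2A + D we get C = -A and D = 2A - 2/5.
# Substituting into -1/5 = A - 2C + D gives -1/5 = 5A - 2/5.
A = (Fr(-1, 5) + Fr(2, 5)) / 5
C = -A
D = 2 * A - Fr(2, 5)


def rhs(x):
    # right-hand side of x = A(x-1)(x^2+2x+2) + B(x^2+2x+2) + (Cx+D)(x-1)^2
    q = x * x + 2 * x + 2
    return A * (x - 1) * q + B * q + (C * x + D) * (x - 1) ** 2


print(A, B, C, D)
print([rhs(Fr(k)) for k in range(-3, 4)])
```

Since both sides of the identity are polynomials of degree at most three, agreement at more than four points already confirms the identity for all $x$.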
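6.B.28. Determine
(a) $\int \frac{x^3+2x^2+x-1}{x^2-x+1}\,dx$, $x \in \mathbb{R}$;
(b) $\int \frac{x^8}{x^8-1}\,dx$, $x \neq \pm 1$.
Solution. Case (a). First we must do the division of polynomials
$$(x^3+2x^2+x-1) : (x^2-x+1) = x + 3 + \frac{3x-4}{x^2-x+1},$$
to obtain a proper rational function (with the degree of the numerator lower than the degree of the denominator). Now we'll compute
$$\int \frac{x^3+2x^2+x-1}{x^2-x+1}\,dx = \int (x+3)\,dx + \int \frac{3x-4}{x^2-x+1}\,dx = \frac{x^2}{2} + 3x + \frac32\int \frac{2x-1}{x^2-x+1}\,dx - \frac52\int \frac{dx}{\left(x-\frac12\right)^2+\left(\frac{\sqrt3}{2}\right)^2}$$
$$= \frac{x^2}{2} + 3x + \frac32\ln(x^2-x+1) - \frac{5\sqrt3}{3}\operatorname{arctg}\frac{2x-1}{\sqrt3} + C.$$
Case (b). We have
$$\int \frac{x^8}{x^8-1}\,dx = \int 1\,dx + \frac18\int \frac{dx}{x-1} - \frac18\int \frac{dx}{x+1} - \frac14\int \frac{dx}{x^2+1} + \frac{\sqrt2}{8}\int \frac{x-\sqrt2}{x^2-\sqrt2x+1}\,dx - \frac{\sqrt2}{8}\int \frac{x+\sqrt2}{x^2+\sqrt2x+1}\,dx$$
$$= x + \frac18\ln|x-1| - \frac18\ln|x+1| - \frac14\operatorname{arctg} x + \frac{\sqrt2}{16}\ln(x^2-\sqrt2x+1) - \frac{\sqrt2}{8}\operatorname{arctg}(\sqrt2x-1) - \frac{\sqrt2}{16}\ln(x^2+\sqrt2x+1) - \frac{\sqrt2}{8}\operatorname{arctg}(\sqrt2x+1) + C.$$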
equality is true upon substitution. Similarly, dependence on higher derivatives of $y$ may be included.
We return to this concept in Chapter 8, see 8.3.2. For the present, we discuss one very special type of equation, the equation with separated variables
$$y' = f(x)\,g(y),$$
and add a few observations concerning analytic solutions. Rewrite the equation in terms of the differentials, cf. 6.1.11,
$$\frac{1}{g(y)}\,dy = f(x)\,dx.$$
Find the primitive functions on both sides to determine the unknown function $y = y(x)$ implicitly.
Indeed, if $G(y)$ and $F(x)$ are the primitive functions with $G'(y) = \frac{1}{g(y)}$ and $F'(x) = f(x)$, and $y(x)$ satisfies $G(y(x)) = F(x)$, then differentiating both sides with respect to $x$ yields
$$0 = G'(y(x))\,y'(x) - F'(x) = \frac{y'(x)}{g(y(x))} - f(x),$$
as expected. Of course, it is necessary to be careful with the values $y$ for which $g(y) = 0$, which need to be discussed separately.
For example, the equation $y' = y$ leads to the implicit definition
$$\ln|y| = x + C,$$
which for positive $y$ provides $y(x) = D\,e^x$ with the positive constant $D = e^C$. The constant solution $y = 0$ corresponds to $D = 0$, and negative values of $y$ correspond to negative constants $D$ in the same expression.
If $y(0) = 1$, we recover the exponential $y(x) = e^x$.
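The example $y' = y$, $y(0) = 1$ can also be checked numerically. The sketch below uses the explicit Euler scheme, which is not discussed in the text and serves here only as an assumed, simple way of approximating the solution of $y' = f(x)g(y)$; the function name `euler` and the number of steps are illustrative choices.

```python
import math


def euler(f, g, x0, y0, x1, n):
    """Approximate y(x1) for y' = f(x) * g(y), y(x0) = y0,
    using n steps of the explicit Euler scheme."""
    h = (x1 - x0) / n
    x, y = x0, y0
    for _ in range(n):
        y += h * f(x) * g(y)  # follow the tangent direction for one step
        x += h
    return y


# y' = y corresponds to f(x) = 1 and g(y) = y
approx = euler(lambda x: 1.0, lambda y: y, 0.0, 1.0, 1.0, 100000)
print(approx, math.e, abs(approx - math.e))
```

For this particular equation the scheme returns $(1 + 1/n)^n$, so the printed value approaches $e = y(1)$ as the number of steps grows.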
6.2.15. Analytic solutions. In the next part of this chapter, we shall prove that power series are differentiated and integrated term by term. Thus the solution $y(x)$ to the equation $y' = f(x)$ with a known analytic function $f(x) = \sum_{n=0}^{\infty} a_n x^n$ is
$$y(x) = \sum_{n=0}^{\infty} \frac{a_n}{n+1}\,x^{n+1} + y(0),$$
where $y(0)$ is the free integration constant. The solution is defined on the convergence domain of the power series. Of course, we might use series centered at other points $x_0$ when prescribing the initial value $y(x_0)$. (We shall prove much later, in Chapter 8, that there is always a unique solution with the given initial value $y(0)$.)
The equation $y' = y$ discussed above had the analytic solution $e^x$, too. Let us consider the general case of this type, i.e. equations of the form
$$(1) \qquad y' = f(y)$$
with an analytic right-hand side $f(y)$. Given the initial condition $y(x_0) = y_0$, straightforward differentiation with the help
□
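6.B.29. Compute
$$\int \frac{2x^4+2x^2-5x+1}{x\,(x^2-x+1)^2}\,dx, \quad x \neq 0.$$
Solution. We have
$$\int \frac{2x^4+2x^2-5x+1}{x\,(x^2-x+1)^2}\,dx = \ln|x| + \frac12\int \frac{2x-1}{x^2-x+1}\,dx + \frac72\int \frac{dx}{x^2-x+1} + \frac12\int \frac{2x-1}{(x^2-x+1)^2}\,dx - \frac{11}{2}\int \frac{dx}{(x^2-x+1)^2}.$$
With $t = x^2-x+1$, $dt = (2x-1)\,dx$ in the integrals containing $2x-1$, and with $u = \frac{2x-1}{\sqrt3}$, $du = \frac{2}{\sqrt3}\,dx$ in the remaining two, this equals
$$\ln|x| + \frac12\ln(x^2-x+1) - \frac{1}{2\,(x^2-x+1)} + \frac{14}{3}\int \frac{dx}{\left(\frac{2x-1}{\sqrt3}\right)^2+1} - \frac{88}{9}\int \frac{dx}{\left[\left(\frac{2x-1}{\sqrt3}\right)^2+1\right]^2}$$
$$= \ln\left|x\sqrt{x^2-x+1}\right| - \frac{1}{2\,(x^2-x+1)} + \frac{7\sqrt3}{3}\int \frac{du}{u^2+1} - \frac{44\sqrt3}{9}\int \frac{du}{(u^2+1)^2}$$
$$= \ln\left|x\sqrt{x^2-x+1}\right| - \frac{1}{2\,(x^2-x+1)} + \frac{7\sqrt3}{3}\operatorname{arctg} u - \frac{22\sqrt3}{9}\left(\operatorname{arctg} u + \frac{u}{u^2+1}\right) + C$$
$$= \ln\left|x\sqrt{x^2-x+1}\right| - \frac{\sqrt3}{9}\operatorname{arctg}\frac{2x-1}{\sqrt3} - \frac{11x-4}{3\,(x^2-x+1)} + C.$$
□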
6.B.30. Integrate
(a) $\int \frac{x}{1+x^4}\,dx$;
(b) $\int \frac{5\ln x}{x\ln^3 x + x\ln^2 x - 2x}\,dx$, $x > 0$, $x \neq e$.
Solution. Case (a). The advantage of the method of integrating rational functions described above is its universality (using it, we can find primitive functions of every rational function). Sometimes though, using the substitution method or integrating by parts is more convenient. For example, with $y = x^2$, $dy = 2x\,dx$,
$$\int \frac{x}{1+x^4}\,dx = \int \frac{dy}{2\,(1+y^2)} = \frac12\int \frac{dy}{1+y^2} = \frac12\operatorname{arctg} y + C = \frac12\operatorname{arctg} x^2 + C.$$
Case (b). Using the substitution $y = \ln x$, $dy = \frac1x\,dx$, we obtain an integral of a rational function
$$\int \frac{5\ln x}{x\ln^3 x + x\ln^2 x - 2x}\,dx = \int \frac{5\ln x}{\ln^3 x + \ln^2 x - 2}\,\frac{dx}{x} = \int \frac{5y}{y^3+y^2-2}\,dy = \int \frac{dy}{y-1} + \int \frac{-y+2}{y^2+2y+2}\,dy$$
$$= \int \frac{dy}{y-1} - \frac12\int \frac{2y+2}{y^2+2y+2}\,dy + 3\int \frac{dy}{(y+1)^2+1}$$
of the chain rule and the equation (1) shows
$$y'(x_0) = f(y_0),$$
$$y''(x_0) = f'(y)\,y'\big|_{x=x_0} = f'(y)\,f(y)\big|_{x=x_0} = f'(y_0)\,f(y_0),$$
$$y'''(x_0) = \bigl(f''(y)\,y'\,f(y) + f'(y)\,f'(y)\,y'\bigr)\big|_{x=x_0} = f''(y_0)\,(f(y_0))^2 + (f'(y_0))^2 f(y_0).$$
Two crucial observations are due here. First, given the initial condition $y(x_0) = y_0$, all derivatives $y^{(n)}(x_0)$ are given at this point by the equation. Thus, if an analytic solution exists, we know it explicitly. So we have to focus on the convergence of the known formal expression of the series $y(x) = \sum_{n=0}^{\infty} \frac{1}{n!}\,y^{(n)}(x_0)(x - x_0)^n$ and we arrive at the theorem below.
In its proof, the second observation will be most helpful: the expressions for the derivatives $y^{(n)}$ are universal polynomials $P_n$
$$y^{(n)}(x) = P_n\bigl(f(y(x)), f'(y(x)), \dots, f^{(n-1)}(y(x))\bigr)$$
in the derivatives of the function $f$, all with non-negative coefficients and independent of the particular equation.⁵
Cauchy-Kovalevskaya Theorem in dimension one
Theorem. Assume $f(y)$ is a real analytic function convergent on the interval $(x_0 - a, x_0 + a) \subset \mathbb{R}$ and consider the differential equation (1) with the condition $y(x_0) = y_0$. Then the formal power series $y(x) = \sum_{n=0}^{\infty} \frac{1}{n!}\,y^{(n)}(x_0)(x - x_0)^n$ converges on a neighborhood of $x_0$ and provides the solution to (1) satisfying the initial condition.
Proof. The second observation above suggests how to prove the convergence of the "candidate series" similarly to the way we proved the convergence of power series in general, i.e. by finding another converging series whose partial sums bound ours from above. This was Cauchy's original approach to this theorem and we talk about the method of majorants.
Without loss of generality we shall fix $x_0 = 0$ and $y(0) = 0$ (we may always use the shifted quantities $z = y - y_0$ and $t = x - x_0$ to transform the general case). Assume we can find another analytic function $g(x) = \sum_{n=0}^{\infty} \frac{1}{n!}\,b_n x^n$ with all $b_n = g^{(n)}(0) \ge 0$, i.e. $g$ has all derivatives non-negative at the origin, such that $g^{(n)}(0) \ge |f^{(n)}(0)|$ for all $n$.
Now, replace $f$ in the equation (1) by $g$ and write the formal power series $z(x) = \sum_{n=0}^{\infty} \frac{1}{n!}\,z^{(n)}(0)\,x^n$ for the potential solution of this equation as above. In particular, we deduce (recall the
⁵ Although we shall not need the explicit formulae for these polynomials, they are well known under the name Faà di Bruno's formula. In principle, they are a direct generalization of the Leibniz rule to higher order derivatives.
$$= \ln|y-1| - \frac12\ln(y^2+2y+2) + 3\operatorname{arctg}(y+1) + C = \ln|\ln x - 1| - \frac12\ln(\ln^2 x + 2\ln x + 2) + 3\operatorname{arctg}(\ln x + 1) + C.$$
□
For an arbitrary function $f$ that is continuous and bounded on a bounded interval $(a, b)$, the so-called Newton-Leibniz formula
$$(1) \qquad \int_a^b f(x)\,dx = \bigl[F(x)\bigr]_a^b := \lim_{x \to b-} F(x) - \lim_{x \to a+} F(x)$$
holds, where $F'(x) = f(x)$, $x \in (a, b)$. We emphasise that under the given conditions, the primitive function $F$ always exists and so do both proper limits in (1). Hence to compute the definite integral, we only need to find the antiderivative and determine the respective one-sided limits (or only the values of the primitive function, if it is continuous at the boundary points of the interval).
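6.B.31. Determine
(a) $\int \frac{dx}{x\left(\sqrt[5]{x^2} + \sqrt{x}\right)}$, $x > 0$;
(b) $\int \frac{x+1}{\sqrt[3]{3x+1}}\,dx$, $x \neq -\frac13$;
(c) $\int \frac1x \sqrt{\frac{x+1}{x-1}}\,dx$, $x \in \mathbb{R} \setminus [-1, 1]$;
(d) $\int \frac{dx}{(x+4)\sqrt{x^2+3x-4}}$, $x \in (-\infty, -4) \cup (1, +\infty)$;
(e) $\int \frac{dx}{1+\sqrt{-x^2+x+2}}$, $x \in (-1, 2)$;
(f) $\int \frac{dx}{(x-1)\sqrt{x^2+x+1}}$, $x \neq 1$.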
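Solution. In this problem, we'll illustrate the use of the substitution method while integrating expressions containing roots. Case (a). If the integral is in the form of
$$\int f\left(\sqrt[p(1)]{x}, \sqrt[p(2)]{x}, \dots, \sqrt[p(j)]{x}\right) dx$$
for certain numbers $p(1), p(2), \dots, p(j) \in \mathbb{N}$ and a rational function $f$ (of several variables), the substitution $t^n = x$ is suggested, where $n$ is the least common multiple of the numbers $p(1), \dots, p(j)$. Using this substitution, we can always reduce the integrand (the integrated function) to a rational function, which we can always integrate. With $t^{10} = x$, $dx = 10\,t^9\,dt$, we'll get
$$\int \frac{dx}{x\left(\sqrt[5]{x^2} + \sqrt{x}\right)} = 10\int \frac{t^9}{t^{10}\,(t^4+t^5)}\,dt = 10\int \frac{dt}{t^6+t^5} = 10\int \left(\frac1t - \frac1{t^2} + \frac1{t^3} - \frac1{t^4} + \frac1{t^5} - \frac1{t+1}\right) dt$$
$$= 10\left[\ln t + \frac1t - \frac1{2t^2} + \frac1{3t^3} - \frac1{4t^4} - \ln(1+t)\right] + C = \ln\frac{x}{\left(1+\sqrt[10]{x}\right)^{10}} + \frac{10}{\sqrt[10]{x}} - \frac{5}{\sqrt[5]{x}} + \frac{10}{3\sqrt[10]{x^3}} - \frac{5}{2\sqrt[5]{x^2}} + C.$$
Case (b). For integrals
$$\int f\left(x, \sqrt[p(1)]{ax+b}, \sqrt[p(2)]{ax+b}, \dots, \sqrt[p(j)]{ax+b}\right) dx,$$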
universal polynomials $P_n$ have non-negative coefficients)
$$z^{(n)}(0) = P_n\bigl(g(z(0)), \dots, g^{(n-1)}(z(0))\bigr) \ge P_n\bigl(|f(y(0))|, \dots, |f^{(n-1)}(y(0))|\bigr) \ge |y^{(n)}(0)|$$
and, consequently, convergence of $z(x)$ will imply absolute convergence of $y(x)$, i.e. the claim of the Theorem. We try to find a majorant in the form of a geometric series.
Let us pick $r > 0$, smaller than the radius of convergence of $f$. Then there is a constant $C > 0$ such that the derivatives $a_n = f^{(n)}(0)$ satisfy $\left|\frac{1}{n!}\,a_n r^n\right| \le C$ for all $n$, i.e. $|a_n| \le C\,\frac{n!}{r^n}$ (the series would certainly not converge otherwise). We may recognize the derivatives of a geometric series and write
$$(2) \qquad g(z) = C\,\frac{r}{r - z}$$
with derivatives $g^{(n)}(0) = C\,\frac{n!}{r^n}$.
Finally, we have to prove that the solution of the equation $z' = g(z)$ is analytic. We can easily integrate this equation with separated variables directly. Written with the help of differentials, $(r - z)\,dz = C r\,dx$. Thus, the implicit equation reads $\frac12 (r - z)^2 = -C r x + D$, where the constant $D$ is determined by $z(0) = 0$. Consequently $D = \frac12 r^2$ and a simple computation reveals the solution of the implicit equation
$$z(x) = r\left(1 \pm \sqrt{1 - \frac{2Cx}{r}}\right).$$
The option with the minus sign satisfies our initial condition. This clearly is an analytic function in a neighbourhood of $x = 0$, $g$ provides the requested majorant, and the proof is finished. □
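The closed-form solution of the majorant equation $z' = g(z)$ can be checked numerically. The sketch below is an added illustration with arbitrarily chosen sample values $r = 2$ and $C = 3$ (assumptions, not taken from the text); it compares a symmetric difference quotient of $z(x) = r\bigl(1 - \sqrt{1 - 2Cx/r}\bigr)$ with $g(z(x)) = Cr/(r - z(x))$ for $|x| < r/(2C)$. The helper name `residual` is hypothetical.

```python
import math

r, C = 2.0, 3.0  # sample values; z(x) below is defined for x < r / (2 * C) = 1/3


def z(x):
    # solution with the minus sign, satisfying z(0) = 0
    return r * (1.0 - math.sqrt(1.0 - 2.0 * C * x / r))


def g(w):
    # the geometric-series majorant g(z) = C r / (r - z)
    return C * r / (r - w)


def residual(x, h=1e-6):
    # difference between a numerical derivative of z and the right-hand side g(z)
    return (z(x + h) - z(x - h)) / (2.0 * h) - g(z(x))


for x in (-0.2, 0.0, 0.1, 0.2, 0.3):
    print(x, z(x), residual(x))
```

The residuals stay close to zero on the whole sample, in agreement with $z'(x) = C/\sqrt{1 - 2Cx/r} = g(z(x))$.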
6.2.16. Improper integrals. When discussing the integration of rational functions $f$, there is a need to consider definite integrals over intervals where $f(x)$ has improper (one-sided) limits. Here $f$ is neither continuous nor bounded. Thus earlier definitions and results may not apply. We speak of "improper" integrals.
A simple solution is to discuss the definite integral on a smaller sub-interval, and determine whether the limit value of such a definite integral exists when the boundary approaches the problematic point. If it does, the corresponding improper integral exists and equals this limit. We illustrate this procedure by an example:
$$I = \int_0^2 \frac{dx}{(2-x)^{1/4}}.$$
This is an improper integral, because the integrand $f(x) = (2 - x)^{-1/4} = \frac{1}{\sqrt[4]{2-x}}$ has its left-sided limit $\infty$ at the point $b = 2$. The integrand is continuous at all other points. Thus,
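where again $p(1), \dots, p(j) \in \mathbb{N}$, $f$ is a rational expression and $a, b \in \mathbb{R}$, we choose the substitution $t^n = ax + b$ while preserving the meaning of $n$. In this way, with $t^3 = 3x+1$, $dx = t^2\,dt$, we'll get
$$\int \frac{x+1}{\sqrt[3]{3x+1}}\,dx = \int \frac{\frac{t^3-1}{3}+1}{t}\,t^2\,dt = \frac13\int (t^4 + 2t)\,dt = \frac13\left(\frac{t^5}{5} + t^2\right) + C = \frac{t^2\,(t^3+5)}{15} + C = \sqrt[3]{(3x+1)^2}\;\frac{x+2}{5} + C.$$
Case (c). Another generalization is given by the integrals of the type
$$\int f\left(x, \sqrt[p(1)]{\frac{ax+b}{cx+d}}, \sqrt[p(2)]{\frac{ax+b}{cx+d}}, \dots, \sqrt[p(j)]{\frac{ax+b}{cx+d}}\right) dx,$$
with the only additional condition on the values $a, b, c, d \in \mathbb{R}$ being $ad - bc \neq 0$. Preserving the meaning of the aforementioned symbols, we now put $t^n = \frac{ax+b}{cx+d}$. Specifically, with $t^2 = \frac{x+1}{x-1}$, $x = \frac{t^2+1}{t^2-1}$, $dx = \frac{-4t}{(t^2-1)^2}\,dt$,
$$\int \frac1x\sqrt{\frac{x+1}{x-1}}\,dx = \int \frac{t^2-1}{t^2+1}\cdot t\cdot\frac{-4t}{(t^2-1)^2}\,dt = \int \frac{-4t^2}{(t^2+1)(t^2-1)}\,dt = \int \left(\frac{1}{t+1} - \frac{1}{t-1} - \frac{2}{t^2+1}\right) dt$$
$$= \ln|t+1| - \ln|t-1| - 2\operatorname{arctg} t + C = \ln\left|\frac{\sqrt{\frac{x+1}{x-1}}+1}{\sqrt{\frac{x+1}{x-1}}-1}\right| - 2\operatorname{arctg}\sqrt{\frac{x+1}{x-1}} + C.$$
The simplifications
$$\ln\left|\frac{\sqrt{|x+1|}+\sqrt{|x-1|}}{\sqrt{|x+1|}-\sqrt{|x-1|}}\right| = \ln\frac{\left(\sqrt{|x+1|}+\sqrt{|x-1|}\right)^2}{\bigl|\,|x+1| - |x-1|\,\bigr|} = 2\ln\left(\sqrt{|x+1|}+\sqrt{|x-1|}\right) - \ln 2$$
for $x \in (-\infty, -1) \cup (1, \infty)$ then allow us to write
$$\int \frac1x\sqrt{\frac{x+1}{x-1}}\,dx = 2\ln\left(\sqrt{|x+1|}+\sqrt{|x-1|}\right) - 2\operatorname{arctg}\sqrt{\frac{x+1}{x-1}} + C.$$
Cases (d), (e), (f). Now we'll focus on the integrals
$$\int f\left(x, \sqrt{ax^2+bx+c}\right) dx,$$
where we expect $a \neq 0$ and $b^2 - 4ac \neq 0$ for otherwise arbitrary numbers $a, b, c \in \mathbb{R}$. Recall that $f$ is a rational expression. We'll distinguish two cases, when the quadratic polynomial $ax^2 + bx + c$ has real roots and when it doesn't.
If $a > 0$ and the polynomial $ax^2 + bx + c$ has real roots $x_1, x_2$, we'll use the representation
$$\sqrt{ax^2+bx+c} = \sqrt a\,\sqrt{(x-x_1)(x-x_2)} = \sqrt a\,|x - x_1|\,\sqrt{\frac{x-x_2}{x-x_1}}$$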
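for $0 < \delta < 2$, consider the integrals (substituting $y = 2 - x$)
$$I_\delta = \int_0^{2-\delta} \frac{dx}{(2-x)^{1/4}} = \int_\delta^2 y^{-1/4}\,dy = \left[\frac43\,y^{3/4}\right]_\delta^2 = \frac43\left[2^{3/4} - \delta^{3/4}\right].$$
Notice that $dy = -dx$. When $x = 2 - \delta$, $y = \delta$. When $x = 0$, $y = 2$.
The limit when $\delta \to 0$ from the right clearly exists, so the improper integral is evaluated:
$$I = \int_0^2 \frac{dx}{\sqrt[4]{2-x}} = \frac43\,2^{3/4}.$$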
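The limiting procedure can be observed numerically. The following sketch is an added illustration: it approximates $I_\delta$ by the midpoint rule (a quadrature choice made here, not taken from the text) and compares the result with the closed form $\frac43\bigl[2^{3/4} - \delta^{3/4}\bigr]$; the function names and the number of subintervals are assumptions.

```python
def I_delta(delta):
    """Closed form of the integral of (2 - x)**(-1/4) over [0, 2 - delta]."""
    return 4.0 / 3.0 * (2.0 ** 0.75 - delta ** 0.75)


def midpoint_rule(f, a, b, n):
    # composite midpoint rule with n subintervals of equal length
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def integrand(x):
    return (2.0 - x) ** -0.25


limit = 4.0 / 3.0 * 2.0 ** 0.75  # value of the improper integral I
for delta in (0.5, 0.1, 0.01):
    numeric = midpoint_rule(integrand, 0.0, 2.0 - delta, 200000)
    print(delta, numeric, I_delta(delta), limit - I_delta(delta))
```

The last column shows the remaining gap $\frac43\,\delta^{3/4}$, which tends to zero as $\delta \to 0$ from the right.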
We proceed in the same way to integrate over an unbounded interval. In this case, we speak of improper Riemann integrals of the first kind. The integrals of unbounded functions on finite intervals are improper Riemann integrals of the second kind.
More explicitly, for $a \in \mathbb{R}$
$$I = \int_a^\infty f(x)\,dx = \lim_{b \to \infty} \int_a^b f(x)\,dx,$$
if the integrals and the limit on the right hand side exist. Similarly, we can have a finite upper bound and an infinite lower bound. If both bounds are infinite, we can evaluate the integral as a sum of two integrals with a chosen fixed bound $a$ in the middle, as in
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{a} f(x)\,dx + \int_a^{\infty} f(x)\,dx.$$
Its existence and its value do not depend on the choice of such a bound, because by changing it, we only change both summands by the same finite value, but with opposite sign.
At the same time, a limit in which the upper and lower bounds approach $\pm\infty$ at the same speed can lead to a different result! For example
$$\lim_{a \to \infty} \int_{-a}^{a} x\,dx = \lim_{a \to \infty} \left[\frac{x^2}{2}\right]_{-a}^{a} = 0,$$
even though the values of the integrals $\int_a^b x\,dx$ with $a$ fixed and $b \to \infty$ diverge to infinity.
The integrated functions may have several discontinuities with infinite one-sided limits, and the interval of integration may be unbounded. Then the interval of integration must be split in such a way that the individual intervals include only one of the above phenomena.
Hence when evaluating the improper integral of a rational function, divide the given interval according to the discontinuities of the integrated function. Then compute all the improper integrals separately.
6.2.17. New acquisitions to the ZOO. It might seem that indefinite integrals can be described in terms of known elementary functions. But this is false.
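and let $t^2 = \frac{x-x_2}{x-x_1}$. If $a < 0$ and the polynomial $ax^2 + bx + c$ has real roots $x_1 < x_2$, we'll use the representation
$$\sqrt{ax^2+bx+c} = \sqrt{-a}\,\sqrt{(x-x_1)(x_2-x)} = \sqrt{-a}\,(x-x_1)\,\sqrt{\frac{x_2-x}{x-x_1}}$$
and let $t^2 = \frac{x_2-x}{x-x_1}$. If the polynomial $ax^2 + bx + c$ doesn't have real roots (necessarily for $a > 0$), we choose the substitution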
On the contrary, nearly all continuous functions lead to integrals which we cannot express in this way. Functions obtained by integration often appear in applications. Many of them have names and there are efficient methods how to approximate them numerically (we shall come to this point briefly in 6.3.11 below).
In the methods of signal processing, the function
$$\sqrt{ax^2+bx+c} = \pm\sqrt a \cdot x \pm t$$
with any choice of the signs. Note that we of course choose the signs so that we get an expression that is as easy to integrate as possible. In all these cases, these substitutions again lead to rational functions. Hence
(d)
$$\operatorname{sinc}(x) = \frac{\sin(x)}{x}$$
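$$\int \frac{dx}{(x+4)\sqrt{x^2+3x-4}} = \int \frac{dx}{(x+4)\sqrt{(x-1)(x+4)}} = \int \frac{dx}{(x+4)\,|x+4|\,\sqrt{\frac{x-1}{x+4}}}.$$
With $t^2 = \frac{x-1}{x+4}$, $x = \frac{5}{1-t^2} - 4$, $dx = \frac{10t}{(1-t^2)^2}\,dt$, this equals
$$\int \frac{\frac{10t}{(1-t^2)^2}}{\frac{5}{1-t^2}\left|\frac{5}{1-t^2}\right| t}\,dt = \frac25\operatorname{sgn}(1-t^2)\int 1\,dt = \frac25\operatorname{sgn}(1-t^2)\,t + C = \frac25\operatorname{sgn}(x+4)\sqrt{\frac{x-1}{x+4}} + C;$$
(e)
$$\int \frac{dx}{1+\sqrt{-x^2+x+2}} = \int \frac{dx}{1+\sqrt{-(x-2)(x+1)}} = \int \frac{dx}{1+(x+1)\sqrt{\frac{2-x}{x+1}}}.$$
With $t^2 = \frac{2-x}{x+1}$, $x = \frac{3}{t^2+1} - 1$, $dx = \frac{-6t}{(t^2+1)^2}\,dt$, this equals
$$\int \frac{-6t}{(t^2+1)(t^2+3t+1)}\,dt = \int \left(\frac{2}{t^2+3t+1} - \frac{2}{t^2+1}\right) dt = \int \left(\frac{2\sqrt5}{5}\,\frac{2}{2t+3-\sqrt5} - \frac{2\sqrt5}{5}\,\frac{2}{2t+3+\sqrt5} - \frac{2}{t^2+1}\right) dt$$
$$= \frac{2\sqrt5}{5}\ln\left|2t+3-\sqrt5\right| - \frac{2\sqrt5}{5}\ln\left|2t+3+\sqrt5\right| - 2\operatorname{arctg} t + C = \frac{2\sqrt5}{5}\ln\frac{2\sqrt{\frac{2-x}{x+1}}+3-\sqrt5}{2\sqrt{\frac{2-x}{x+1}}+3+\sqrt5} - 2\operatorname{arctg}\sqrt{\frac{2-x}{x+1}} + C;$$
(f)
$$\int \frac{dx}{(x-1)\sqrt{x^2+x+1}}.$$
With $\sqrt{x^2+x+1} = x + t$, i.e. $x^2+x+1 = x^2+2xt+t^2$, we have $x = \frac{1-t^2}{2t-1}$, $dx = \frac{-2\,(t^2-t+1)}{(2t-1)^2}\,dt$, $x - 1 = -\frac{t^2+2t-2}{2t-1}$ and $x + t = \frac{t^2-t+1}{2t-1}$, so that the integral equals
$$\int \frac{2}{t^2+2t-2}\,dt$$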
is important (cf. the discussion of the Fourier transform in 7.2.6). Check for yourself that it is a smooth function with limit values
$$f(0) = 1, \quad f'(0) = 0, \quad f''(0) = -\tfrac13.$$
This even function has its absolute maximum at the point $x = 0$. It oscillates with a fast decreasing amplitude as $x$ approaches infinity.
The sine integral function is defined by
$$\operatorname{Si}(x) = \int_0^x \operatorname{sinc}(t)\,dt.$$
Other important functions are Fresnel's sine and cosine integrals
$$\operatorname{FresnelS}(x) = \int_0^x \sin\left(\tfrac12\pi t^2\right) dt,$$
$$\operatorname{FresnelC}(x) = \int_0^x \cos\left(\tfrac12\pi t^2\right) dt.$$
The function $\operatorname{Si}(x)$ is shown in the left figure. Both Fresnel's functions are shown on the right.
An even more important way to get new functions is to add a free parameter to the integral. One of the most important mathematical functions ever is the Gamma function. It is defined for all positive real numbers $z$ by
$$\Gamma(z) = \int_0^\infty e^{-t}\,t^{z-1}\,dt.$$
It can be proved that this function is analytic at all points $0 < z \in \mathbb{R}$. For small $z \in \mathbb{N}$, we can evaluate:
$$\Gamma(1) = \int_0^\infty e^{-t}\,dt = \left[-e^{-t}\right]_0^\infty = 1,$$
$$\Gamma(2) = \int_0^\infty e^{-t}\,t\,dt = \left[-e^{-t}\,t\right]_0^\infty + \int_0^\infty e^{-t}\,dt = 1,$$
$$\Gamma(3) = \int_0^\infty e^{-t}\,t^2\,dt = 0 + 2\int_0^\infty e^{-t}\,t\,dt = 0 + 2 = 2.$$
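$$= \int \left(\frac{\frac{\sqrt3}{3}}{t+1-\sqrt3} - \frac{\frac{\sqrt3}{3}}{t+1+\sqrt3}\right) dt = \frac{\sqrt3}{3}\ln\left|t+1-\sqrt3\right| - \frac{\sqrt3}{3}\ln\left|t+1+\sqrt3\right| + C$$
$$= \frac{\sqrt3}{3}\ln\left|\frac{t+1-\sqrt3}{t+1+\sqrt3}\right| + C = \frac{\sqrt3}{3}\ln\left|\frac{\sqrt{x^2+x+1}-x+1-\sqrt3}{\sqrt{x^2+x+1}-x+1+\sqrt3}\right| + C.$$
□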
6.B.32. Using a suitable substitution, compute
$$\int \frac{dx}{x+\sqrt{x^2+x-1}}, \quad x \in \left(-\infty, \frac{-\sqrt5-1}{2}\right) \cup \left(\frac{\sqrt5-1}{2}, +\infty\right).$$
Solution. Even though the quadratic polynomial under the root has real roots $x_1, x_2$, we won't solve this problem by the substitution $t^2 = \frac{x-x_2}{x-x_1}$. We could proceed that way, but we'll rather use the method we introduced for the complex roots case. That's because this method yields a very simple integral of a rational function, as can be seen from the following calculation. With $\sqrt{x^2+x-1} = x + t$, i.e. $x^2+x-1 = x^2+2xt+t^2$, we have $x = \frac{t^2+1}{1-2t}$ and $dx = \frac{-2t^2+2t+2}{(1-2t)^2}\,dt$, hence
$$\int \frac{dx}{x+\sqrt{x^2+x-1}} = \int \frac{-2t^2+2t+2}{(t+2)(1-2t)}\,dt = \int \left(1 - \frac{2}{t+2} - \frac12\,\frac{1}{t-\frac12}\right) dt = t - 2\ln|t+2| - \frac12\ln\left|t-\frac12\right| + C$$
$$= \sqrt{x^2+x-1} - x - 2\ln\left(\sqrt{x^2+x-1} - x + 2\right) - \frac12\ln\left|\sqrt{x^2+x-1} - x - \frac12\right| + C.$$
Note that each recommended substitution (see the above problems) can usually be replaced in specific problems by another substitution, which leads to the result in a much easier way. An undeniable advantage of the recommended substitutions is their universality though: by using them, one can compute all integrals of the respective type. □
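As a quick plausibility check of the result of 6.B.32, one can compare a numerical derivative of the antiderivative with the integrand at a few points of both intervals of the domain. This sketch is an added illustration; the helper names and the sample points are assumptions.

```python
import math


def integrand(x):
    return 1.0 / (x + math.sqrt(x * x + x - 1))


def antiderivative(x):
    # t is the substitution variable from sqrt(x^2 + x - 1) = x + t
    t = math.sqrt(x * x + x - 1) - x
    return t - 2 * math.log(t + 2) - 0.5 * math.log(abs(t - 0.5))


def derivative(F, x, h=1e-6):
    # symmetric difference quotient approximating F'(x)
    return (F(x + h) - F(x - h)) / (2 * h)


sample_points = [-5.0, -3.0, -2.0, 1.0, 2.0, 5.0]
errors = [abs(derivative(antiderivative, x) - integrand(x)) for x in sample_points]
print(max(errors))
```

A small maximal deviation indicates that the primitive function was computed correctly on both parts of the domain.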
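6.B.33. For $x > 0$ determine
(a) $\int \frac{(2+5x)^3}{\sqrt[4]{x^3}}\,dx$;
(b) $\int \frac{\sqrt[3]{1+\sqrt[4]{x}}}{\sqrt{x}}\,dx$;
(c) $\int \frac{dx}{\sqrt[4]{1+x^4}}$.
Solution. All three given integrals are binomial, i.e. they can be written as
$$\int x^m (a + b x^n)^p\,dx$$
for some $a, b \in \mathbb{R}$, $m, n, p \in \mathbb{Q}$. The binomial integrals are usually solved by applying the substitution method. If $p \in \mathbb{Z}$ (not necessarily $p < 0$), we choose the substitution $x = t^s$, where $s$ is the common denominator of the numbers $m$ and $n$; if $\frac{m+1}{n} \in \mathbb{Z}$ and $p \notin \mathbb{Z}$, we choose $a + b x^n = t^s$, where $s$ is the denominator of the number $p$; and if $\frac{m+1}{n} + p \in \mathbb{Z}$ ($p \notin \mathbb{Z}$, $\frac{m+1}{n} \notin \mathbb{Z}$), we choose $a + b x^n = t^s x^n$,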
Integration by parts reveals immediately
$$\Gamma(z + 1) = z\,\Gamma(z).$$
Hence for all positive integers $n$ this function yields the value of the factorial:
$$\Gamma(n) = (n-1)!.$$
The following figure shows the behaviour of the function
$$f(x) = \ln(\Gamma(x + 1)).$$
If we draw the function $x \ln x - x$ instead, there is not much difference to be seen. Hence it seems that the factorial $n!$ grows similarly to $e^{n \ln n - n} = n^n e^{-n}$. This is the famous Stirling's approximation formula. More precisely, one can verify
$$\sqrt{2\pi}\,n^{n+\frac12}\,e^{-n} \le n! \le e\,n^{n+\frac12}\,e^{-n}.$$
In order to understand the qualitative behavior of such functions (e.g. their differentiability) we need to understand the limit processes much better. Before diving into this in the next section, we introduce several direct applications of the Riemann integral.
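The relation $\Gamma(n) = (n-1)!$ and the two-sided Stirling estimate can be observed directly with Python's standard `math` module. The sketch is an added illustration; the helper name `stirling_bounds` and the range of $n$ are choices made here.

```python
import math


def stirling_bounds(n):
    """Lower and upper Stirling-type bounds for n!."""
    core = n ** (n + 0.5) * math.exp(-n)
    return math.sqrt(2.0 * math.pi) * core, math.e * core


for n in range(1, 11):
    lower, upper = stirling_bounds(n)
    # math.gamma(n + 1) should agree with n! up to rounding
    print(n, math.gamma(n + 1), lower, math.factorial(n), upper)
```

In the printed table, the lower bound is already within a few percent of $n!$ for small $n$, while the upper bound is attained for $n = 1$.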
6.2.18. Riemann measurable sets. The definition of the Riemann integral is motivated by the concept of the area of rectangles in the plane with coordinates $x$ and $y$. The definite integral $\int_a^b f(x)\,dx$ is designed to correspond to the area of the region bounded by the $x$ axis, the values of the function $y = f(x)$ and the boundary lines $x = a$, $x = b$. Moreover, the area of the region above the $x$ axis is given with a positive sign, while values under the axis lead to a negative sign.
From geometry, the length of an interval on the real line, and the area of a parallelogram determined by two vectors in the plane are basic concepts. This extends to the volume of a parallelepiped in the Euclidean vector space $\mathbb{R}^n$. The areas/volumes of other subsets are yet to be defined. For some simple objects like triangles, polygons and polyhedrons, their area is given naturally by the generally expected properties of area (invariance with respect to Euclidean motions and additivity with respect to finite unions of disjoint objects). Open questions include: What are the "measurable objects" in the real line? How can the concept of area and volume be extended to higher dimensions?
The questions are answered partially with the help of the Riemann integral. First, measure the "volume" of one-dimensional subsets.
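where $s$ is the denominator of $p$. In these three cases, a reduction to an integration of a rational function is guaranteed. Hence we can easily compute
(a) with $p \in \mathbb{Z}$, $x = t^4$, $dx = 4t^3\,dt$:
$$\int \frac{(2+5x)^3}{\sqrt[4]{x^3}}\,dx = \int x^{-\frac34}(2+5x)^3\,dx = 4\int (2+5t^4)^3\,dt = 4\int \left(8 + 60t^4 + 150t^8 + 125t^{12}\right) dt$$
$$= 4\left(8t + 12t^5 + \frac{50}{3}t^9 + \frac{125}{13}t^{13}\right) + C = 4\left(8\sqrt[4]{x} + 12\sqrt[4]{x^5} + \frac{50}{3}\sqrt[4]{x^9} + \frac{125}{13}\sqrt[4]{x^{13}}\right) + C;$$
(b) with $1 + x^{\frac14} = t^3$, $x = (t^3-1)^4$, $dx = 12\,t^2\,(t^3-1)^3\,dt$:
$$\int \frac{\sqrt[3]{1+\sqrt[4]{x}}}{\sqrt{x}}\,dx = \int x^{-\frac12}\left(1+x^{\frac14}\right)^{\frac13} dx = 12\int t^3\,(t^3-1)\,dt = 12\int (t^6 - t^3)\,dt = 12\left(\frac{t^7}{7} - \frac{t^4}{4}\right) + C$$
$$= 12\sqrt[3]{\left(1+\sqrt[4]{x}\right)^4}\left(\frac{1+\sqrt[4]{x}}{7} - \frac14\right) + C;$$
(c) with $1 + x^4 = t^4 x^4$, $x = (t^4-1)^{-\frac14}$, $dx = -t^3\,(t^4-1)^{-\frac54}\,dt$:
$$\int \frac{dx}{\sqrt[4]{1+x^4}} = -\int \frac{t^2}{t^4-1}\,dt = -\int \frac{t^2}{(t-1)(t+1)(t^2+1)}\,dt = -\frac14\int \left(\frac{1}{t-1} - \frac{1}{t+1} + \frac{2}{t^2+1}\right) dt$$
$$= -\frac14\left(\ln|t-1| - \ln|t+1| + 2\operatorname{arctg} t\right) + C = -\frac14\left(\ln\left|\frac{\sqrt[4]{1+x^4}-x}{\sqrt[4]{1+x^4}+x}\right| + 2\operatorname{arctg}\frac{\sqrt[4]{1+x^4}}{x}\right) + C.$$
□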
6.B.34. For $x \in \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$, integrate
(a) f ^   T3r; . 2 da;
v 7  J   1+4 cos2 x+3 sm^ x '
(b) f i, 1 2 da;
v 7  J   1+sm2 x '
(c) f 75—-— da.
v 7  J  2—cos x
Solution. Integrals in the form of $\int f(\sin x, \cos x)\,dx$ for some rational function $f$ are usually solved by the substitution method. If $f(\sin x, -\cos x) = -f(\sin x, \cos x)$, we choose $t = \sin x$; if $f(-\sin x, \cos x) = -f(\sin x, \cos x)$, we choose $t = \cos x$; and if $f(-\sin x, -\cos x) = f(\sin x, \cos x)$, then $t = \operatorname{tg} x$. If none of these equalities
We say that the subset $A \subset \mathbb{R}$ is (Riemann) measurable, if the function $\chi_A : \mathbb{R} \to \mathbb{R}$,
$$\chi_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A, \end{cases}$$
is Riemann integrable. That is, the (improper) integral
$$m(A) = \int_{-\infty}^{\infty} \chi_A(x)\,dx$$
exists (the finiteness of its value doesn't matter). The function $\chi_A$ is called the characteristic function of the set A, the value m(A) is called the Riemann measure of the set A. Notice that for an interval A = [a, b] this yields the value
$$m(A) = \int_{-\infty}^{\infty} \chi_A(x)\,dx = \int_a^b dx = b - a,$$
just as expected.
The elementary properties of the Riemann integral imply that this definition of "size" has the expected properties. The measure of a union of finitely many measurable pairwise disjoint subsets is the sum of their measures. In particular, every finite set A has zero measure.
If instead we choose a countable union, this property is no longer true. For example, consider the set $\mathbb{Q}$ of all rational numbers as the union of one-element subsets. While every set containing only finitely many points has zero measure by our definition, the characteristic function $\chi_{\mathbb{Q}}$ is not Riemann integrable.
The upper Riemann integral of the characteristic function $\chi_A$ corresponds to the infimum of the sums of lengths of finitely many disjoint intervals by which we can cover the given set A. The lower integral is the supremum of the sums of lengths of finitely many disjoint intervals that can be embedded into the set A. We proceed in the same way in higher dimensions when defining the Jordan measure. For the definition of area (volume) in higher-dimensional space we generalize the concept of the Riemann integral. We return to this in chapter 8. We remark here that the area of a plane figure bounded by a graph of a function in the way described above is consistent with expectations.
6.2.19. Mean value of a function. For a finite set of n numbers, the mean value, or arithmetic mean, is obtained by summing the numbers and dividing by n. For a Riemann integrable function f(x) on an interval (finite or infinite) [a, b], the mean value is defined by
$$m(f) = \frac{1}{b-a}\int_a^b f(x)\,dx.$$
By definition, m(f) is the altitude of the rectangle (oriented according to the sign) over the interval [a, b], which has the same area as that of the region between the x axis and the graph of f(x). Hence the integral mean value theorem is true in general:
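hold, the substitution $t = \operatorname{tg}\frac{x}{2}$ is used. We'll show it on the given integrals.
Case (a). In the denominator, we have
$$1 + 4\cos^2 x + 3\sin^2 x = 4 + \cos^2 x$$
and in the numerator only the sine function to an odd power, so the substitution $t = \cos x$, where $dt = -\sin x\,dx$, allows us to replace all the sines and cosines and thus obtain
$$\int \frac{\sin^3 x}{1+4\cos^2 x+3\sin^2 x}\,dx = \int \frac{\sin x\,(1-\cos^2 x)}{4+\cos^2 x}\,dx = \int \frac{t^2-1}{4+t^2}\,dt = \int\left(1-\frac{5}{4+t^2}\right)dt$$
$$= t - \frac52\operatorname{arctg}\frac{t}{2} + C = \cos x - \frac52\operatorname{arctg}\frac{\cos x}{2} + C.$$
Case (b). Because both the sine and cosine appear here to an even power, the substitution $t = \operatorname{tg} x$ leads to
$$\sin^2 x = \frac{t^2}{1+t^2},\qquad \cos^2 x = \frac{1}{1+t^2},\qquad dx = \frac{dt}{1+t^2},$$
by which we obtain
$$\int \frac{dx}{1+\sin^2 x} = \int \frac{1}{1+\frac{t^2}{1+t^2}}\cdot\frac{dt}{1+t^2} = \int \frac{dt}{1+2t^2} = \frac{1}{\sqrt2}\operatorname{arctg}\left(\sqrt2\,t\right) + C = \frac{\sqrt2}{2}\operatorname{arctg}\left(\sqrt2\operatorname{tg} x\right) + C.$$
Case (c). Now we'll use the universal substitution $t = \operatorname{tg}\frac{x}{2}$, where
$$\sin x = \frac{2t}{1+t^2},\qquad \cos x = \frac{1-t^2}{1+t^2},\qquad dx = \frac{2\,dt}{1+t^2}.$$
Then we can determine
$$\int \frac{dx}{2-\cos x} = \int \frac{\frac{2}{1+t^2}}{2-\frac{1-t^2}{1+t^2}}\,dt = 2\int\frac{dt}{1+3t^2} = \frac{2}{\sqrt3}\operatorname{arctg}\left(\sqrt3\,t\right) + C = \frac{2\sqrt3}{3}\operatorname{arctg}\left(\sqrt3\operatorname{tg}\frac{x}{2}\right) + C.$$
□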
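Definite integrals.
6.B.35. Compute the definite integrals
$$\int_{\pi/6}^{\pi/3}\operatorname{tg}^2 x\,dx,\qquad \int_0^{\pi/4}\frac{x}{\cos^2 x}\,dx.$$
Solution. For $x \neq \frac{\pi}{2} + k\pi$, where $k \in \mathbb{Z}$, we have
$$\int \operatorname{tg}^2 x\,dx = \operatorname{tg} x - x + C,$$
as we have computed earlier. This implies that
$$\int_{\pi/6}^{\pi/3}\operatorname{tg}^2 x\,dx = \left[\operatorname{tg} x - x\right]_{\pi/6}^{\pi/3} = \sqrt3 - \frac{\pi}{3} - \left(\frac{1}{\sqrt3}-\frac{\pi}{6}\right) = \frac{2}{\sqrt3}-\frac{\pi}{6}.$$
Of course, definite integrals can be also computed directly. For example, the substitution $y = \operatorname{tg} x$ yields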
Proposition. If f(x) is a Riemann integrable function on an interval [a, b], then there exists a number m(f) satisfying
$$\int_a^b f(x)\,dx = m(f)(b-a).$$
6.2.20. Length of a space curve. The Riemann integral can be effectively used to compute the length of a curve in a multidimensional Euclidean vector space $\mathbb{R}^n$. For the sake of simplicity, we deal with a curve in $\mathbb{R}^2$ with coordinates x, y. Suppose a parametric description of a curve
$$F : \mathbb{R} \to \mathbb{R}^2,\qquad F(t) = [f(t), g(t)]$$
is given. Look at it as a trajectory of a movement. Assume that f(t) and g(t) have piece-wise continuous derivatives.
By differentiating the map F(t) we obtain vectors corresponding to the speed of the movement along this trajectory. Hence the total length of the curve (i.e. distance traveled over time between the values t = a, t = b) is given by the integral over the interval [a, b], with the integrated function h(t) being the length of the vectors F'(t). Therefore the length s is given by the formula
$$s = \int_a^b h(t)\,dt = \int_a^b \sqrt{(f'(t))^2 + (g'(t))^2}\,dt.$$
The result can be seen intuitively as a corollary of Pythagoras' theorem: the linear increment $\Delta s$ of the length of the curve corresponding to the increment $\Delta t$ of the variable t is given by the proportion in the right triangle and thus at the level of differentials
$$ds = \sqrt{(f'(t))^2 + (g'(t))^2}\,dt.$$
In the special case when the curve is the graph of a function y = f(x) between points a < b, we obtain
$$s = \int_a^b \sqrt{1 + (f'(x))^2}\,dx$$
and at the level of differentials,
$$ds = \sqrt{1 + (y'(x))^2}\,dx,$$
just as expected.
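The length formula can be checked numerically. The following Python sketch is only an illustration: the helper name `curve_length`, the midpoint rule and the number of subintervals are assumptions made here, not part of the text. It approximates the integral of the length of F'(t) for the unit circle and for the graph of y = x² on [0, 1], where the closed form is known.

```python
import math

def curve_length(df, dg, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of sqrt(f'(t)^2 + g'(t)^2) over [a, b]."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        t = a + (i + 0.5) * h
        total += math.hypot(df(t), dg(t))  # length of the vector F'(t)
    return total * h

# Unit circle F(t) = [cos t, sin t], hence F'(t) = [-sin t, cos t].
circle = curve_length(lambda t: -math.sin(t), math.cos, 0.0, 2 * math.pi)

# Graph of y = x^2 on [0, 1] as the curve F(t) = [t, t^2], hence F'(t) = [1, 2t].
parabola = curve_length(lambda t: 1.0, lambda t: 2 * t, 0.0, 1.0)
# Closed form of the integral of sqrt(1 + 4x^2) over [0, 1], for comparison.
parabola_exact = math.sqrt(5) / 2 + math.asinh(2) / 4

print(circle, 2 * math.pi)
print(parabola, parabola_exact)
```

Both printed pairs should agree to many decimal places; a parametrization with only piece-wise continuous derivatives would need the interval split at the break points.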
As an example, we calculate the circumference of the unit circle as twice the length of the graph of the function $y = \sqrt{1-x^2}$ over [−1, 1]. We know that the result is $2\pi$, because it is defined in this way.
$$s = 2\int_{-1}^{1}\sqrt{1+(y')^2}\,dx = 2\int_{-1}^{1}\sqrt{1+\frac{x^2}{1-x^2}}\,dx = 2\int_{-1}^{1}\frac{dx}{\sqrt{1-x^2}} = 2\left[\arcsin x\right]_{-1}^{1} = 2\pi.$$
If we instead use
$$y = \sqrt{r^2-x^2} = r\sqrt{1-(x/r)^2}$$
and bounds [−r, r] in the previous calculation, by substituting x = rt we obtain the circumference of the circle with radius
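$$\int_{\pi/6}^{\pi/3}\operatorname{tg}^2 x\,dx = \int_{\pi/6}^{\pi/3}\frac{\sin^2 x}{\cos^2 x}\,dx = \int_{1/\sqrt3}^{\sqrt3}\frac{y^2}{1+y^2}\,dy = \int_{1/\sqrt3}^{\sqrt3}\left(1-\frac{1}{1+y^2}\right)dy = \left[y - \operatorname{arctg} y\right]_{1/\sqrt3}^{\sqrt3} = \frac{2}{\sqrt3}-\frac{\pi}{6},$$
where $y = \operatorname{tg} x$ and $dx = \frac{dy}{1+\operatorname{tg}^2 x} = \frac{dy}{1+y^2}$.
When doing the substitution, we only need to remember to change the limits of the integral to the values obtained by substituting: $\sqrt3 = \operatorname{tg}(\pi/3)$, $1/\sqrt3 = \operatorname{tg}(\pi/6)$.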
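We'll compute the second integral by integration by parts for the definite integral. (Note that we also found the primitive function to the function $y = x\cos^{-2} x$ earlier.) With $F(x) = x$, $G'(x) = \frac{1}{\cos^2 x}$, $F'(x) = 1$, $G(x) = \operatorname{tg} x$, we have
$$\int_0^{\pi/4}\frac{x}{\cos^2 x}\,dx = \left[x\operatorname{tg} x\right]_0^{\pi/4} - \int_0^{\pi/4}\operatorname{tg} x\,dx = \left[x\operatorname{tg} x\right]_0^{\pi/4} + \int_0^{\pi/4}\frac{-\sin x}{\cos x}\,dx$$
$$= \left[x\operatorname{tg} x\right]_0^{\pi/4} + \left[\ln(\cos x)\right]_0^{\pi/4} = \frac{\pi}{4} + \ln\frac{\sqrt2}{2} = \frac{\pi}{4} - \frac{\ln 2}{2}.$$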
□
6.B.36.   Compute the definite integrals
(a) Jo 77!=? ^
(b) /12t^t^;
W Jo
+
JO  \ e   +3      cos2 x
Solution. We have (a)
l
dx;
y = 1 — x2 dy = —2a; dx
I-
■dy ■
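(b) with the substitution $z = x + \sqrt{x^2-1}$, for which $\frac{dz}{z} = \frac{dx}{\sqrt{x^2-1}}$,
$$\int_1^2\frac{dx}{\sqrt{x^2-1}} = \int_1^{2+\sqrt3}\frac{dz}{z} = \left[\ln z\right]_1^{2+\sqrt3} = \ln\left(2+\sqrt3\right);$$
(c) with the substitutions $p = e^x$, $dp = e^x\,dx$ and $s = p/\sqrt3$, $ds = dp/\sqrt3$,
$$\int_0^1\left(\frac{e^x}{e^{2x}+3}+\frac{1}{\cos^2 x}\right)dx = \int_0^1\frac{e^x}{e^{2x}+3}\,dx + \int_0^1\frac{dx}{\cos^2 x} = \int_1^e\frac{dp}{p^2+3} + \left[\operatorname{tg} x\right]_0^1$$
$$= \frac{\sqrt3}{3}\int_{1/\sqrt3}^{e/\sqrt3}\frac{ds}{s^2+1} + \operatorname{tg} 1 = \frac{\sqrt3}{3}\left[\operatorname{arctg} s\right]_{1/\sqrt3}^{e/\sqrt3} + \operatorname{tg} 1 = \frac{\sqrt3}{3}\left(\operatorname{arctg}\frac{e}{\sqrt3}-\frac{\pi}{6}\right) + \operatorname{tg} 1;$$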
r:
$$s(r) = 2\int_{-r}^{r}\sqrt{1+\frac{(x/r)^2}{1-(x/r)^2}}\,dx = 2\int_{-1}^{1}\frac{r}{\sqrt{1-t^2}}\,dt = 2r\left[\arcsin t\right]_{-1}^{1} = 2\pi r.$$
The result is of course well known from elementary geometry. Nevertheless, by using integral calculus, we derive the important fact that the length of a circle is linearly dependent on its diameter 2r. The number $\pi$ is exactly the ratio appearing in this dependency.
6.2.21. Areas and volumes. The Riemann integral can be used to compute areas or volumes of shapes defined by a graph of a function. As an example, calculate the area of a circle with radius r. The quarter-circle bounded by the function $\sqrt{r^2-x^2}$ for $0 \le x \le r$ determines one quarter of the area. Use the substitution x = r sin t, dx = r cos t dt (using the corollary for $I_2$ in the paragraph 6.2.6) to obtain by symmetry
$$a(r) = 4\int_0^r\sqrt{r^2-x^2}\,dx = 4r^2\int_0^{\pi/2}\cos^2 t\,dt = 4r^2\int_0^{\pi/2}\sin^2 t\,dt = \frac12\,4r^2\int_0^{\pi/2}\left(\cos^2 t+\sin^2 t\right)dt = 2r^2\int_0^{\pi/2}dt = \pi r^2.$$
It is worth noticing that this well known formula is derived from the principles of integral calculus. The area of a circle is not only proportional to the square of the radius, but this proportion is again given by the constant $\pi$.
Notice the ratio of the area to the perimeter of a circle:
$$\frac{\pi r^2}{2\pi r} = \frac{r}{2}.$$
The square with the same area has the side of length $\sqrt{\pi}\,r$ and therefore its perimeter is $4\sqrt{\pi}\,r$. Hence the perimeter of a square with the area of the unit circle is $4\sqrt{\pi}$, compared to the perimeter $2\pi$ of the unit circle, which is about 0.8 less. It can be shown that in fact the circle is the shape with the smallest perimeter among all shapes with the same area. We derive such results in comments about the calculus of variations in chapter 9.
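The arithmetic behind the comparison of the two perimeters is easy to reproduce. The short Python sketch below only evaluates the two expressions for the unit circle; the variable names are chosen here for illustration.

```python
import math

# Perimeter of the square whose area equals that of the unit circle (side sqrt(pi)).
square_perimeter = 4 * math.sqrt(math.pi)
# Perimeter of the unit circle itself.
circle_perimeter = 2 * math.pi

difference = square_perimeter - circle_perimeter
print(square_perimeter, circle_perimeter, difference)
```

The difference comes out at roughly 0.81, in line with the "about 0.8" quoted in the text.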
Another analogy of this approach is the computation of the volume or the surface area of a solid of revolution. Such a set in $\mathbb{R}^3$ is defined by plotting the graph of a function y = f(x) (for x in an interval [a, b]) in the plane xy and rotating this plane around the x axis. This is exactly what happens when producing pottery on a jigger: the hands shape the clay in the form of y = f(x).
When computing the area of the surface, an increment $\Delta x$ causes the area to increase by the multiple of the length $\Delta s$ of the curve given by the graph of the function y = f(x) and the size of the circle with radius f(x). Hence the surface
□
6.B.37.   Prove that
20
0
Solution. Because
the geometric meaning of the definite integral implies
4 = I i dx < I im dx < I x9 dx = is
□
6.B.38. Without symbols of differentiation and integration, express
$$\left(\int_x^0 t^5\ln(t+1)\,dt\right)',\qquad x \in (-1, 1),$$
if the differentiation is done with respect to x.
Solution. Integration is often thought of as the inverse operation to differentiation. In this problem, we'll use this "inverse-ness". The function
$$F(x) := \int_0^x t^5\ln(t+1)\,dt,\qquad x \in (-1, 1),$$
is clearly an antiderivative of the function $f(x) := x^5\ln(x+1)$ on the interval (−1, 1), i.e. by differentiating it, we'll get exactly f. Hence
$$\left(\int_x^0 t^5\ln(t+1)\,dt\right)' = -\left(\int_0^x t^5\ln(t+1)\,dt\right)' = -x^5\ln(x+1).$$
□
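Improper integrals.
6.B.39. Decide if
$$\int_1^{+\infty}\frac{\operatorname{arctg} x}{x\sqrt{x}}\,dx \in \mathbb{R}.$$
Solution. The improper integral represents the area of the figure between the graph of the positive function
$$y = \frac{\operatorname{arctg} x}{x\sqrt{x}},\qquad x \ge 1,$$
and the x axis (from the left, the figure is bounded by the line x = 1). Hence the integral is a positive real number, or equals $+\infty$. We know that
$$\frac{\pi}{4} \le \operatorname{arctg} x < \frac{\pi}{2},\qquad x \in [1, +\infty).$$
But that implies
$$\frac{\pi}{2} = \frac{\pi}{4}\int_1^{+\infty}x^{-3/2}\,dx < \int_1^{+\infty}\frac{\operatorname{arctg} x}{x\sqrt{x}}\,dx < \frac{\pi}{2}\int_1^{+\infty}x^{-3/2}\,dx = \pi,$$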
area A(f) is computed by the formula
$$A(f) = 2\pi\int_a^b f(x)\,ds = 2\pi\int_a^b f(x)\sqrt{1+(f'(x))^2}\,dx,$$
where the differential ds is given by the increment of the length of the curve y = f(x), see above. If instead we determine the solid of revolution by its boundary parametrized in the xy plane by a pair of functions [x(t), y(t)], then the corresponding differential of the length s has the form $ds = \sqrt{(x'(t))^2+(y'(t))^2}\,dt$. Thus we obtain
$$A = 2\pi\int_a^b y(t)\sqrt{(x'(t))^2+(y'(t))^2}\,dt.$$
When computing the volume of the same solid, the increment $\Delta x$ causes the volume to increase by the multiple of this increment and the area of the circle with radius f(x). Hence it is given by the formula
$$V(f) = \pi\int_a^b (f(x))^2\,dx.$$
As an example of using the formulas for surface and volume, we derive the well known formulas for the surface of the sphere and the volume of the ball with radius r:
$$A_r = 2\pi\int_{-r}^{r} r\sqrt{1-(x/r)^2}\,\frac{1}{\sqrt{1-(x/r)^2}}\,dx = 2\pi r\int_{-r}^{r}dx = 4\pi r^2,$$
$$V_r = \pi\int_{-r}^{r}\left(r^2-x^2\right)dx = \pi\left[r^2x-\frac13x^3\right]_{-r}^{r} = \frac43\pi r^3.$$
Similarly to the circle, the ball is also the object which has the smallest surface area among all objects with the given volume. That is the reason why soap bubbles almost always assume this shape.
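The two formulas for solids of revolution can be tried out numerically. The Python sketch below is an illustration under stated assumptions: the function names, the midpoint rule and the choice r = 2 are made up here. It evaluates the volume and surface integrals for the half-circle $f(x) = \sqrt{r^2-x^2}$ and compares them with $\frac43\pi r^3$ and $4\pi r^2$.

```python
import math

def midpoint(func, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def volume_of_revolution(f, a, b):
    # V(f) = pi * integral of f(x)^2
    return math.pi * midpoint(lambda x: f(x) ** 2, a, b)

def surface_of_revolution(f, df, a, b):
    # A(f) = 2*pi * integral of f(x) * sqrt(1 + f'(x)^2)
    return 2 * math.pi * midpoint(lambda x: f(x) * math.sqrt(1 + df(x) ** 2), a, b)

r = 2.0
half_circle = lambda x: math.sqrt(r * r - x * x)
half_circle_derivative = lambda x: -x / math.sqrt(r * r - x * x)

volume = volume_of_revolution(half_circle, -r, r)
surface = surface_of_revolution(half_circle, half_circle_derivative, -r, r)
print(volume, 4 / 3 * math.pi * r ** 3)
print(surface, 4 * math.pi * r ** 2)
```

The midpoint rule never evaluates the integrand at the end points x = ±r, where the derivative of the half-circle is unbounded, which is why this simple rule is used here.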
6.2.22. Integral criterion for series. Using the improper integral, we can also decide the question of convergence for a class of infinite series:
Theorem. Let $\sum_{n=1}^{\infty} f(n)$ be a series such that the function $f : \mathbb{R} \to \mathbb{R}$ is positive and nonincreasing on the interval $(1, \infty)$. Then this series converges if and only if the integral
$$\int_1^{\infty} f(x)\,dx$$
converges.
Proof. If the integral is interpreted as the area of a region under the curve, the criterion is clear. Indeed, notice that the given series diverges or converges if and only if the same is true for the same series without the first summand. Moreover, by the monotonicity of f(x), there are the following estimates
i.e. in particular
$$\int_1^{+\infty}\frac{\operatorname{arctg} x}{x\sqrt{x}}\,dx \in \mathbb{R}.$$
□
The formula (1) can be also used in the case when the function f is unbounded or the interval (a, b) is unbounded. We then speak of the so called improper integrals. For the improper integrals, the limits on the right hand side may be improper and may not exist at all. If one of the limits doesn't exist or we receive an expression $\infty - \infty$, it means that the integral doesn't exist ($\infty - \infty$ doesn't have the character of an indefinite expression in this case). We say the integral oscillates. In every other case, we have the result (recall that $\infty + \infty = +\infty$, $-\infty - \infty = -\infty$, $\pm\infty + a = \pm\infty$ for $a \in \mathbb{R}$).
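6.B.40. Determine
(a) $\int_1^{\infty}\sin x\,dx$;
(b) $\int_1^{\infty}\frac{dx}{x^4+x^2}$;
(c) $\int_0^4\frac{dx}{\sqrt{x}}$;
(d) $\int_{-1}^{1}\frac{dx}{x^2}$.
Solution. Case (a). We can immediately determine
$$\int_1^{\infty}\sin x\,dx = \left[-\cos x\right]_1^{\infty} = \lim_{x\to\infty}(-\cos x) + \cos 1.$$
Because the limit on the right hand side doesn't exist, the integral oscillates.
Cases (b), (c). Analogously, we can easily compute
$$\int_1^{\infty}\frac{dx}{x^4+x^2} = \int_1^{\infty}\frac{dx}{x^2(x^2+1)} = \int_1^{\infty}\left(\frac{1}{x^2}-\frac{1}{1+x^2}\right)dx = \left[-\frac1x-\operatorname{arctg} x\right]_1^{\infty}$$
$$= \lim_{x\to\infty}\left(-\frac1x-\operatorname{arctg} x\right) + 1 + \operatorname{arctg} 1 = 0 - \frac{\pi}{2} + 1 + \frac{\pi}{4} = 1 - \frac{\pi}{4}$$
and even more easily
$$\int_0^4\frac{dx}{\sqrt{x}} = \left[2\sqrt{x}\right]_0^4 = 4 - 0 = 4,$$
where the primitive function is continuous from the right at the origin (thus the limit equals the value of the function). Case (d). If we'd mindlessly compute
$$\int_{-1}^{1}\frac{dx}{x^2} = \left[-\frac1x\right]_{-1}^{1} = -1 - 1 = -2,$$
we'd receive an obviously wrong result (a negative value while integrating a positive function). The reason why the Newton-Leibniz formula cannot be applied in this way is the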
$$s'_k = \sum_{n=2}^{k} f(n) \le \int_1^k f(x)\,dx \le \sum_{n=1}^{k-1} f(n) = s_{k-1},$$
because $s'_k$ is the lower sum of the Riemann integral while $s_{k-1}$ is the upper sum. Thus, the integral converges if and only if the series does, as expected. □
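The two-sided estimate used in the proof can be observed on concrete functions. The Python sketch below is illustrative only: the helper `integral_test_bounds` and the two sample functions are assumptions made here. It computes the sum f(2) + ... + f(k), the integral of f over [1, k] from a known antiderivative, and the sum f(1) + ... + f(k−1), once for a convergent case and once for a divergent one.

```python
import math

def integral_test_bounds(f, antiderivative, k):
    """Return (lower sum, integral of f over [1, k], upper sum) for a nonincreasing positive f."""
    lower = sum(f(n) for n in range(2, k + 1))   # f(2) + ... + f(k)
    upper = sum(f(n) for n in range(1, k))       # f(1) + ... + f(k-1)
    integral = antiderivative(k) - antiderivative(1)
    return lower, integral, upper

# Convergent case: f(x) = 1/x^2 with antiderivative -1/x.
conv = integral_test_bounds(lambda x: 1.0 / x ** 2, lambda x: -1.0 / x, 1000)
# Divergent case: f(x) = 1/x with antiderivative ln x.
div = integral_test_bounds(lambda x: 1.0 / x, math.log, 1000)

print(conv)
print(div)
```

In the first case all three numbers stay bounded as k grows, in the second all three grow like ln k, which is exactly the dichotomy the criterion describes.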
3. Sequences, series and limit processes
While building a menagerie of functions, we encountered power series, which extend the collection of all polynomials in a natural way, see 5.4.3. We obtain a class of analytic functions in this way. But we have yet to prove that power series are continuous functions. We show below that not only are they continuous but it is possible to differentiate and integrate a power series term by term.
Moreover, functions often depend on further parameters which are dummy when differentiating or integrating, but we need to understand how the result behaves with respect to these parameters. For instance, what about the differentiability of the Gamma function introduced above? Or, when computing volume or area depending on a free parameter, how to minimize it?
Finally, in the end of this chapter we briefly introduce some more advanced concepts of integration.
6.3.1. How well behaved is a sequence of functions? We return to the discussion of the limits of sequences of functions and the sum of series of functions in view of the methods of differential and integral calculus. Consider a convergent series of functions
$$S(x) = \sum_{n=1}^{\infty} f_n(x)$$
on an interval [a, b]. Natural questions include:
• If all functions $f_n(x)$ are continuous at some point $x_0 \in [a, b]$, is the function S(x) also continuous at the point $x_0$?
• If all functions $f_n(x)$ are differentiable at some point $x \in [a, b]$, is the function S(x) also differentiable there and does the equality $S'(x) = \sum_{n=1}^{\infty} f'_n(x)$ hold?
• If all functions $f_n(x)$ are Riemann integrable on an interval [a, b], is the function S(x) also integrable there and does the equality $\int_a^b S(x)\,dx = \sum_{n=1}^{\infty}\int_a^b f_n(x)\,dx$ hold?
Notice that it does not matter whether we discuss series or sequences, since the former are just limits of the sequences of the partial sums. First, we demonstrate by examples that the answers to all three questions above are "NO!". Then we find additional conditions on the convergence of the series (or sequences) which guarantee the validity of all three statements. Later we shall mention alternative concepts of integration which are more satisfactory than the Riemann integral for even wider classes of functions.
discontinuity of the given function at the origin. But if we use the additivity rule
$$\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx,$$
which always holds if the integrals on the right hand side are sensible, we'll find the correct result
$$\int_{-1}^{1}\frac{dx}{x^2} = \int_{-1}^{0}\frac{dx}{x^2} + \int_0^1\frac{dx}{x^2} = \left[-\frac1x\right]_{-1}^{0} + \left[-\frac1x\right]_0^1 = \lim_{x\to0-}\left(-\frac1x\right) - 1 - 1 - \lim_{x\to0+}\left(-\frac1x\right) = \infty - 2 + \infty = +\infty.$$
Note that the even character of the function $y = x^{-2}$ also implies
$$\int_{-1}^{1}\frac{dx}{x^2} = 2\int_0^1\frac{dx}{x^2} = 2\cdot\infty = +\infty.$$
□
6.B.41.   Compute the definite integrals
(a) /0°
2
-2 J
JO (x+2)5
(b) J^2 In I a; I da;;
(d) Jl^dx;
(e) f, —r— dx.
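Solution. We have
(a)
$$\int_0^{\infty}\frac{dx}{(x+2)^5} = -\frac14\left(\lim_{x\to\infty}(x+2)^{-4} - 2^{-4}\right) = -\frac14\left(0-\frac{1}{16}\right) = \frac{1}{64};$$
(b) by parts with $F(x) = \ln x$, $G'(x) = 1$, $F'(x) = \frac1x$, $G(x) = x$,
$$\int_{-2}^{2}\ln|x|\,dx = \int_{-2}^{0}\ln|x|\,dx + \int_0^2\ln|x|\,dx = 2\int_0^2\ln x\,dx = 2\left(\left[x\ln x\right]_0^2 - \int_0^2 1\,dx\right) = 2\left(\left[x\ln x\right]_0^2 - \left[x\right]_0^2\right)$$
$$= 2\left(2\ln 2 - \lim_{x\to0+}(x\ln x) - 2 + 0\right) = 4\ln 2 - 4;$$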
(c)
(d)
o r_o-*ic
-1
-1
J ueu du
t = y/X
-- -2 f lim e"
u = 1/x du = —\ dx
= 2 / e"* dt =
i
- 2.
= — / ueu du =
F(u) = u	F'(u) = 1
G'(u) = eu	G(u) = eu
6.3.2. Examples of nasty sequences. (1) Consider the functions
$$f_n(x) = (\sin x)^n$$
on the interval $[0, \pi]$. The values of these functions are non-negative and smaller than one at all points $0 \le x \le \pi$, except for $x = \frac{\pi}{2}$, where the value is 1. Hence on the whole interval $[0, \pi]$, these functions converge to the function
$$f(x) = \lim_{n\to\infty} f_n(x) = \begin{cases} 0 & \text{for all } x \neq \frac{\pi}{2}, \\ 1 & \text{for } x = \frac{\pi}{2}, \end{cases}$$
point by point. The limit of the sequence of functions $f_n$ is a discontinuous function, even though all functions $f_n(x)$ are continuous. The problematic point is an inner point of the interval.
The same phenomenon occurs for a series of functions, because the sum is the limit of partial sums. Hence in the previous example, it suffices to express $f_n$ as the n-th partial sum. For example, the first terms of such a series are $\sin x$, $(\sin x)^2 - \sin x$, etc. The figure plots the functions $f_m(x)$ for $m = n^3$, $n = 1, \dots, 10$.
(2) We look at the second question, i.e. badly behaving derivatives. A natural idea on the same principle as above is to construct a sequence of functions which have the same nonzero derivative at one point, but become smaller and smaller. So they converge pointwise to the identically zero function.
The next figure plots the functions
$$f_n(x) = x(1-x^2)^n$$
on the interval [−1, 1] for values $n = m^2$, $m = 1, \dots, 10$.
It is immediate that $\lim_{n\to\infty} f_n(x) = 0$ and that all functions $f_n(x)$ are smooth. Their derivative at x = 0 is
$$f'_n(0) = \left((1-x^2)^n - 2nx^2(1-x^2)^{n-1}\right)\Big|_{x=0} = 1$$
for all n. But the limit function for the sequence $f_n$ has a zero derivative at every point.
[u^}_l - J e?du= [u^}_l - [e"]_L =
- —   lim  u &u — - + lim
u—y — oo
u—y — oo
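(e) with the substitution $r = \ln x$, $dr = \frac1x\,dx$,
$$\int_1^2\frac{dx}{x\ln x} = \int_0^{\ln 2}\frac{dr}{r} = \left[\ln r\right]_0^{\ln 2} = \ln(\ln 2) - \lim_{r\to0+}\ln r = \ln(\ln 2) + \infty = +\infty.$$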
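□
6.B.42. Compute the improper integrals
$$\int_0^{\infty}x^2e^{-x}\,dx;\qquad \int_{-\infty}^{\infty}\frac{dx}{e^x+e^{-x}}.$$
Solution. Because the improper integral is a special case of a definite integral, we have at our disposal the basic methods to compute them. By integration by parts (first with $F(x) = x^2$, $G'(x) = e^{-x}$, $F'(x) = 2x$, $G(x) = -e^{-x}$, then with $F(x) = x$, $G'(x) = e^{-x}$, $F'(x) = 1$, $G(x) = -e^{-x}$), we obtain
$$\int_0^{\infty}x^2e^{-x}\,dx = \left[-x^2e^{-x}\right]_0^{\infty} + 2\int_0^{\infty}xe^{-x}\,dx = -\lim_{x\to\infty}\frac{x^2}{e^x} + 2\left[-xe^{-x}\right]_0^{\infty} + 2\int_0^{\infty}e^{-x}\,dx$$
$$= 0 - 2\lim_{x\to\infty}\frac{x}{e^x} + 2\left[-e^{-x}\right]_0^{\infty} = 0 + 2\left(\lim_{x\to\infty}\left(-e^{-x}\right) + 1\right) = 2.$$
The substitution method (with $y = e^x$, $dy = e^x\,dx$) then yields
$$\int_{-\infty}^{\infty}\frac{dx}{e^x+e^{-x}} = \int_{-\infty}^{\infty}\frac{e^x}{e^{2x}+1}\,dx = \int_0^{\infty}\frac{dy}{y^2+1} = \left[\operatorname{arctg} y\right]_0^{\infty} = \lim_{y\to\infty}\operatorname{arctg} y = \frac{\pi}{2},$$
where the new limits of the integral are derived from the limits
$$\lim_{x\to-\infty}e^x = 0,\qquad \lim_{x\to+\infty}e^x = +\infty.$$
□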
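6.B.43. Compute
$$\int_0^{\infty}x^{2n+1}e^{-x^2}\,dx,\qquad n \in \mathbb{N}.$$
Solution. We'll first solve this problem by the substitution method (with $y = x^2$, $dy = 2x\,dx$) and then repeatedly apply integration by parts (first with $F(y) = y^n$, $G'(y) = e^{-y}$, $F'(y) = ny^{n-1}$, $G(y) = -e^{-y}$), yielding
$$\int_0^{\infty}x^{2n+1}e^{-x^2}\,dx = \frac12\int_0^{\infty}y^ne^{-y}\,dy = \frac12\left(\left[-y^ne^{-y}\right]_0^{\infty} + n\int_0^{\infty}y^{n-1}e^{-y}\,dy\right) = \frac{n}{2}\int_0^{\infty}y^{n-1}e^{-y}\,dy =$$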
(3) The counterexample to the third statement is in 6.2.18 already. The characteristic function $\chi_{\mathbb{Q}}$ of rational numbers can be expressed as a sum of countably many functions, which are numbered exactly by rational numbers. They are zero everywhere except for the single point after which they are named, where the value is 1. Riemann integrals of all such functions are zero, but the sum is not a Riemann integrable function.
This example illustrates the fundamental flaw of the Riemann integral, to which we return later.
We present an example where the limit function f is integrable, all functions $f_n$ are continuous, but the value of the integral is not the limit of the integrals of $f_n$. We modify the sequence of the functions $x(1-x^2)^n$ used above. They integrate to $\int_0^1 x(1-x^2)^n\,dx = \frac{1}{2(n+1)}$. Thus, we consider the functions
$$f_n(x) = 2(n+1)x(1-x^2)^n.$$
These functions with $n = 1, \dots, 10$ are on the next diagram.
We verify that the values of these functions converge to zero for every $x \in [0, 1]$ (for example $\ln(f_n(x)) \to -\infty$ for $0 < x < 1$). But for all n
$$\int_0^1 f_n(x)\,dx = 1 \neq 0.$$
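The behaviour of this last sequence is easy to observe numerically. The following Python sketch is an illustration with hypothetical helper names (`f`, `integral_01`) and a plain midpoint rule chosen here: the integrals over [0, 1] stay at 1 while the values at a fixed point, say x = 0.5, shrink towards zero.

```python
def f(n, x):
    """The n-th function 2(n+1) x (1 - x^2)^n of the example."""
    return 2 * (n + 1) * x * (1 - x * x) ** n

def integral_01(n, steps=200_000):
    """Midpoint-rule approximation of the integral of f(n, .) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f(n, (i + 0.5) * h) for i in range(steps))

integrals = {n: integral_01(n) for n in (1, 10, 100)}
values_at_half = {n: f(n, 0.5) for n in (1, 10, 100)}

print(integrals)        # each value is close to 1
print(values_at_half)   # the values tend to 0 as n grows
```

The mass of the integral concentrates in an ever narrower peak near x = 0, which is why the pointwise limit does not see it.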
6.3.3. Uniform convergence. A reason for the failure in all three previous examples is the fact that the speed of pointwise convergence of the values $f_n(x) \to f(x)$ varies dramatically from point to point. Hence a natural idea is to confine the problem to cases where the convergence has roughly the same speed over the whole interval:
Uniform convergence
Definition. We say that the sequence of functions $f_n(x)$ converges uniformly on the interval [a, b] to the limit f(x), if for every positive number $\varepsilon$, there exists a natural number $N \in \mathbb{N}$ such that for all $n \ge N$ and all $x \in [a, b]$ the inequality
$$|f_n(x) - f(x)| < \varepsilon$$
holds.
A series of functions converges uniformly on an interval, if the sequence of its partial sums converges uniformly.
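$$\frac{n}{2}\left(\left[-y^{n-1}e^{-y}\right]_0^{\infty} + (n-1)\int_0^{\infty}y^{n-2}e^{-y}\,dy\right) = \frac{n(n-1)}{2}\int_0^{\infty}y^{n-2}e^{-y}\,dy = \cdots = \frac{n!}{2}\int_0^{\infty}y\,e^{-y}\,dy = \frac{n!}{2}\int_0^{\infty}e^{-y}\,dy = \frac{n!}{2},$$
where each step is an integration by parts with $F(y) = y^k$, $G'(y) = e^{-y}$, $F'(y) = ky^{k-1}$, $G(y) = -e^{-y}$ for $k = n-1, \dots, 1$.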
Although the choice of the number N depends on the chosen $\varepsilon$, it is independent of the point $x \in [a, b]$. This is the difference from pointwise convergence, where N depends on both $\varepsilon$ and x. We visualise the definition graphically in this way: if we consider a zone created by a translation of the limit function f(x) to $f(x) \pm \varepsilon$ for an arbitrarily small, but fixed positive $\varepsilon$, all of the functions $f_n(x)$ will fall into this zone, except for finitely many of them. The first and the last of the nasty examples above do not have this property; in the second example, the sequence of derivatives $f'_n$ lacked it.
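The "zone" picture amounts to asking whether the largest distance between $f_n$ and the limit over the whole interval tends to zero. The Python sketch below approximates this largest distance on a finite grid, which is an assumption of the sketch (a grid maximum only bounds the true supremum from below), for two sequences from 6.3.2 whose pointwise limit on [0, 1] is zero.

```python
def sup_distance(func, grid):
    """Largest value of |func(x)| over the grid, i.e. distance from the zero limit."""
    return max(abs(func(x)) for x in grid)

grid = [i / 10_000 for i in range(10_001)]  # equidistant points of [0, 1]

def g(n):
    # second example: x (1 - x^2)^n
    return lambda x: x * (1 - x * x) ** n

def h(n):
    # third example: 2(n+1) x (1 - x^2)^n
    return lambda x: 2 * (n + 1) * x * (1 - x * x) ** n

indices = (1, 10, 100, 1000)
sup_g = [sup_distance(g(n), grid) for n in indices]
sup_h = [sup_distance(h(n), grid) for n in indices]

print(sup_g)  # decreasing towards 0: the convergence is uniform
print(sup_h)  # increasing: the convergence is not uniform
```

The first list shrinks, so the functions $x(1-x^2)^n$ themselves do converge uniformly and only their derivatives misbehave; the second list grows without bound, matching the failure of the integral limit.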
□
6.B.44. Depending on $a \in \mathbb{R}^+$, determine the integral
Lengths, areas, surfaces, volumes.
6.B.45. Determine the length of the curve given parametrically:
$$x = \sin^2 t,\qquad y = \cos^2 t,\qquad \text{for } t \in \left[0, \tfrac{\pi}{2}\right].$$
Solution. According to ??, the length of a curve is given by the integral
$$\int_0^{\pi/2}\sqrt{(x'(t))^2+(y'(t))^2}\,dt = \int_0^{\pi/2}\sqrt{(\sin 2t)^2+(-\sin 2t)^2}\,dt = \int_0^{\pi/2}\sqrt2\,\sin 2t\,dt = \sqrt2.$$
If we realize that the given curve is a part of the line y = 1 − x (since $\sin^2 t + \cos^2 t = 1$) and moreover the segment with boundary points [0, 1] (for t = 0) and [1, 0] (for $t = \frac{\pi}{2}$), we can immediately write its length $\sqrt2$. □
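6.B.46. Determine the length of a curve given parametrically:
$$x = t^2,\qquad y = t^3,\qquad \text{for } t \in \left[0, \sqrt5\right].$$
Solution. We'll again determine the length l by using the formula ??:
$$l = \int_0^{\sqrt5}\sqrt{4t^2+9t^4}\,dt = \int_0^{\sqrt5}t\sqrt{9t^2+4}\,dt = \frac12\int_0^5\sqrt{9u+4}\,du = \frac{1}{27}\left[(9u+4)^{3/2}\right]_0^5 = \frac{335}{27},$$
where we substituted $u = t^2$, $du = 2t\,dt$.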
6.3.4. The three claims in the following theorem say that all three generally false properties discussed in 6.3.1 are true for uniform convergence (but beware the subtleties when differentiating).
Consequences of uniform convergence
Theorem. (1) Let $f_n(x)$ be a sequence of functions continuous on a closed interval [a, b] and converging uniformly to the function f(x) on this interval. Then f(x) is also continuous on the interval [a, b].
(2) Let $f_n(x)$ be a sequence of Riemann integrable functions on a finite closed interval [a, b] which converge uniformly to the function f(x) on this interval. Then f(x) is Riemann integrable, and
$$\int_a^b f(x)\,dx = \int_a^b\left(\lim_{n\to\infty}f_n(x)\right)dx = \lim_{n\to\infty}\int_a^b f_n(x)\,dx.$$
□
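6.B.47. Determine the area to the right of the line x = 3 and bounded by the graph of the function $y = \frac{1}{x^3-1}$ and the x axis.
Solution. The area is given by the improper integral $\int_3^{\infty}\frac{dx}{x^3-1}$. We'll compute it using decomposition into partial fractions:
$$\frac{1}{x^3-1} = \frac{Ax+B}{x^2+x+1}+\frac{C}{x-1},\qquad 1 = (Ax+B)(x-1)+C(x^2+x+1).$$
Substituting x = 1 gives 1 = 3C, and comparing the coefficients at $x^0$ and $x^2$ gives 1 = C − B and 0 = A + C. Hence
$$C = \frac13,\qquad B = -\frac23,\qquad A = -\frac13,$$
and we can write
$$\int\frac{dx}{x^3-1} = \frac13\int\left(\frac{1}{x-1}-\frac{x+2}{x^2+x+1}\right)dx.$$
Now we'll separately determine the indefinite integral $\int\frac{x+2}{x^2+x+1}\,dx$:
$$\int\frac{x+2}{x^2+x+1}\,dx = \int\frac{x+\frac12}{\left(x+\frac12\right)^2+\frac34}\,dx + \frac32\int\frac{dx}{\left(x+\frac12\right)^2+\frac34}.$$
With the substitution $t = x^2+x+1$, $dt = 2\left(x+\frac12\right)dx$ in the first integral and $s = x+\frac12$, $ds = dx$ in the second integral, this equals
$$\frac12\int\frac{dt}{t} + \frac32\int\frac{ds}{s^2+\frac34} = \frac12\ln(x^2+x+1) + 2\int\frac{ds}{\left(\frac{2s}{\sqrt3}\right)^2+1}.$$
The further substitution $u = \frac{2s}{\sqrt3}$, $du = \frac{2}{\sqrt3}\,ds$ in the second integral yields
$$\frac12\ln(x^2+x+1) + \sqrt3\int\frac{du}{u^2+1} = \frac12\ln(x^2+x+1) + \sqrt3\arctan(u) = \frac12\ln(x^2+x+1) + \sqrt3\arctan\left(\frac{2x+1}{\sqrt3}\right).$$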
(3) Let $f_n(x)$ be a sequence of functions differentiable on a closed interval [a, b] and assume $f_n(x_0) \to f(x_0)$ at some point $x_0 \in [a, b]$. Moreover, assume all derivatives $g_n(x) = f'_n(x)$ are continuous and converge uniformly to the function g(x) on the same interval. Then the function $f(x) = f(x_0) + \int_{x_0}^x g(t)\,dt$ is differentiable on the interval [a, b], the functions $f_n(x)$ converge to f(x) and $f'(x) = g(x)$. In other words,
$$\frac{d}{dx}f(x) = \frac{d}{dx}\left(\lim_{n\to\infty}f_n(x)\right) = \lim_{n\to\infty}\left(\frac{d}{dx}f_n(x)\right).$$
Proof of the first claim. Fix an arbitrary point $x_0 \in [a, b]$ and let $\varepsilon > 0$ be given. It is required to show that
$$|f(x) - f(x_0)| < \varepsilon$$
for all x close enough to $x_0$. From the definition of uniform convergence,
$$|f_n(x) - f(x)| < \varepsilon$$
for all $x \in [a, b]$ and all sufficiently large n. Choose some n with this property and consider $\delta > 0$ such that
$$|f_n(x) - f_n(x_0)| < \varepsilon$$
for all x in the $\delta$-neighbourhood of $x_0$. That is possible because $f_n(x)$ are continuous for all n. Then
$$|f(x) - f(x_0)| \le |f(x) - f_n(x)| + |f_n(x) - f_n(x_0)| + |f_n(x_0) - f(x_0)| < 3\varepsilon$$
for all x in the $\delta$-neighbourhood of $x_0$. This is the desired inequality with the bound $3\varepsilon$. □
Remark. In fact, the arguments in the proof show a more general claim. Indeed, if the functions $f_n(x)$ converge uniformly to f(x) on [a, b], and the individual functions $f_n(x)$ have the limits (or one-sided limits) $\lim_{x\to x_0}f_n(x) = a_n$, then the limit $\lim_{x\to x_0}f(x)$ exists if and only if the limit $\lim_{n\to\infty}a_n = a$ exists. Then they are equal, that is,
$$a = \lim_{n\to\infty}\left(\lim_{x\to x_0}f_n(x)\right) = \lim_{x\to x_0}\left(\lim_{n\to\infty}f_n(x)\right).$$
The reader should be able to modify the above proof for this situation.
6.3.5. Proof of the second claim. The proof of this part of the theorem is based upon a generalization of the properties of Cauchy sequences of numbers to uniform convergence of functions. In this way we can work with the existence of the limit of a sequence of integrals without needing to know the limit.
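In total, for the improper integral we can write:
$$\int_3^{\infty}\frac{dx}{x^3-1} = \frac13\lim_{\delta\to\infty}\left[\ln|x-1| - \frac12\ln(x^2+x+1) - \sqrt3\arctan\left(\frac{2x+1}{\sqrt3}\right)\right]_3^{\delta}$$
$$= \frac13\lim_{\delta\to\infty}\left(\ln|\delta-1| - \frac12\ln(\delta^2+\delta+1) - \sqrt3\arctan\left(\frac{2\delta+1}{\sqrt3}\right)\right) - \frac13\ln 2 + \frac16\ln 13 + \frac{\sqrt3}{3}\arctan\frac{7}{\sqrt3}$$
$$= \frac16\ln 13 + \frac{1}{\sqrt3}\arctan\frac{7}{\sqrt3} - \frac13\ln 2 - \frac{\sqrt3}{6}\pi.$$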
□
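6.B.48. Determine the surface and volume of a circular paraboloid created by rotating a part of the parabola $y = 2x^2$ for $x \in [0, 1]$ around the y axis.
Solution. The formulas stated in the text are true for rotating the curves around the x axis! Hence it's necessary either to integrate the given curve with respect to the variable y, or to transform the problem: the same solid is obtained by rotating the curve $y = \sqrt{x/2}$, $x \in [0, 2]$, around the x axis. Then
$$V = \pi\int_0^2\frac{x}{2}\,dx = \pi,$$
$$S = 2\pi\int_0^2\sqrt{\frac{x}{2}}\sqrt{1+\frac{1}{8x}}\,dx = 2\pi\int_0^2\sqrt{\frac{x}{2}+\frac{1}{16}}\,dx = \pi\,\frac{17\sqrt{17}-1}{24}.$$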
□
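6.B.49. Compute the area S of a figure composed of two parts of the plane bounded by the lines x = 0, x = 1, x = 4, the x axis and the graph of the function
$$y = \frac{1}{\sqrt[3]{x-1}}.$$
Solution. First realize that
$$\frac{1}{\sqrt[3]{x-1}} < 0,\quad x \in [0, 1),\qquad \frac{1}{\sqrt[3]{x-1}} > 0,\quad x \in (1, 4],$$
and
$$\lim_{x\to1-}\frac{1}{\sqrt[3]{x-1}} = -\infty,\qquad \lim_{x\to1+}\frac{1}{\sqrt[3]{x-1}} = +\infty.$$
The first part of the figure (below the x axis) is thus bounded by the curves
$$y = 0,\qquad x = 0,\qquad x = 1,\qquad y = \frac{1}{\sqrt[3]{x-1}},$$
with an area given by the improper integral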
Uniformly Cauchy sequences
Definition. The sequence of functions $f_n(x)$ on the interval [a, b] is uniformly Cauchy, if for every (small) positive number $\varepsilon$, there exists a (large) natural number N such that for all $x \in [a, b]$ and all $n, m \ge N$,
$$|f_n(x) - f_m(x)| < \varepsilon.$$
Every uniformly convergent sequence of functions on the interval [a, b] is also uniformly Cauchy on the same interval. To see this, it suffices to notice the usual bound
$$|f_n(x) - f_m(x)| \le |f_n(x) - f(x)| + |f(x) - f_m(x)|$$
based on the triangle inequality.
Before coming to the proof of 6.3.4(2), we mention the following:
Proposition. Every uniformly Cauchy sequence of functions $f_n(x)$ on the interval [a, b] uniformly converges to some function f on this interval.
Proof. Of course, the condition for a sequence of functions to be uniformly Cauchy implies that for all $x \in [a, b]$, the sequence of values $f_n(x)$ is a Cauchy sequence of real (or complex) numbers. Hence the sequence of functions $f_n(x)$ converges pointwise to some function f(x).
Choose N large enough so that
$$|f_n(x) - f_m(x)| < \varepsilon$$
for some small positive $\varepsilon$ chosen beforehand and all $n, m \ge N$, $x \in [a, b]$. Now choose one such n and fix it, then
$$|f_n(x) - f(x)| = \lim_{m\to\infty}|f_n(x) - f_m(x)| \le \varepsilon$$
for all $x \in [a, b]$. Hence the sequence $f_n(x)$ converges to its limit uniformly. □
Proof of the second claim in 6.3.4. Recall that every uniformly convergent sequence of functions is also uniformly Cauchy and that the Riemann sums of all single terms $f_n(x)$ of the sequence converge to $\int_a^b f_n(x)\,dx$ independently of the choice of the partition and the representatives. Hence, if
$$|f_n(x) - f_m(x)| < \varepsilon$$
for all $x \in [a, b]$, then also
$$\left|\int_a^b f_n(x)\,dx - \int_a^b f_m(x)\,dx\right| \le \varepsilon|b-a|.$$
Therefore the sequence of numbers $\int_a^b f_n(x)\,dx$ is Cauchy, and hence convergent.
The Riemann sums of the limit function f(x) can be made arbitrarily close to those of $f_n(x)$ for large n, by the same argument as above. So f(x) is integrable. Moreover,
$$\left|\int_a^b f_n(x)\,dx - \int_a^b f(x)\,dx\right| \le \varepsilon|b-a|,$$
so the limit value is as expected.
□
$$S_1 = -\int_0^1\frac{dx}{\sqrt[3]{x-1}};$$
while the second part (above the x axis), which is bounded by the curves
$$y = 0,\qquad x = 1,\qquad x = 4,\qquad y = \frac{1}{\sqrt[3]{x-1}},$$
has an area of
$$S_2 = \int_1^4\frac{dx}{\sqrt[3]{x-1}}.$$
Since
$$\int\frac{dx}{\sqrt[3]{x-1}} = \frac32\sqrt[3]{(x-1)^2} + C,$$
the sum $S_1 + S_2$ can be obtained as
$$S = -\lim_{\delta\to0+}\left(\frac32\sqrt[3]{\delta^2}-\frac32\right) + \lim_{\delta\to0+}\left(\frac32\sqrt[3]{9}-\frac32\sqrt[3]{\delta^2}\right) = \frac32\left(1+\sqrt[3]{9}\right).$$
We have shown among other things that the given figure has a finite area, even though it's unbounded (both from the top and the bottom). (If we approach x = 1 from the right, or from the left, its altitude grows beyond measure.) Recall here the indefinite expression of type $0\cdot\infty$. Namely, the figure is bounded if we limit ourselves to $x \in [0, 1-\delta] \cup [1+\delta, 4]$ for an arbitrarily small $\delta > 0$. □
6.B.50. Determine the average velocity $v_p$ of a solid in the time interval [1, 2], if its velocity is
$$v(t) = \frac{t}{\sqrt{1+t^2}},\qquad t \in [1, 2].$$
Omit the units.
Solution. To solve the problem, it suffices to realize that the sought average velocity is the mean value of the function v on the interval [1, 2]. Hence
$$v_p = \frac{1}{2-1}\int_1^2\frac{t}{\sqrt{1+t^2}}\,dt = \int_2^5\frac{dx}{2\sqrt{x}} = \sqrt5-\sqrt2,$$
with $1 + t^2 = x$, $t\,dt = dx/2$.
□
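6.B.51. Compute the length s of a part of the curve called tractrix given by the parametric description
$$f(t) = r\cos t + r\ln\left(\operatorname{tg}\frac{t}{2}\right),\qquad g(t) = r\sin t,\qquad t \in [\pi/2, a],$$
where $r > 0$, $a \in (\pi/2, \pi)$.
Solution. Since
$$f'(t) = -r\sin t + \frac{r}{2\operatorname{tg}\frac{t}{2}\cos^2\frac{t}{2}} = -r\sin t + \frac{r}{\sin t} = \frac{r\cos^2 t}{\sin t},$$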
6.3.6. Proof of the third claim. For the corresponding result about derivatives, extra care is needed regarding the assumptions. If the functions
$$\tilde f_n(x) = f_n(x) - f_n(x_0)$$
are considered instead of $f_n(x)$, the derivatives do not change. Hence without loss of generality it can be assumed that all functions satisfy $f_n(x_0) = 0$.
Then one of the assumptions of the theorem is satisfied automatically. For all $x \in [a, b]$, we can write
$$f_n(x) = \int_{x_0}^x g_n(t)\,dt.$$
Because the functions $g_n$ converge uniformly to g on all of [a, b], the functions $f_n(x)$ converge to
$$f(x) = \int_{x_0}^x g(t)\,dt.$$
The function g is a uniform limit of continuous functions, thus g is again continuous. By 6.2.8, on the relations between the Riemann integral and the primitive function, the proof is finished.
6.3.7. Uniform convergence of series. For infinite series, the corresponding results follow as a corollary:
Consequences for uniform convergence of series
Theorem. Consider a sequence of functions $f_n(x)$ on the interval $[a,b]$.
(1) If all the functions $f_n(x)$ are continuous on $[a,b]$ and the series
$$S(x) = \sum_{n=1}^{\infty} f_n(x)$$
converges uniformly to the function $S(x)$, then $S(x)$ is continuous on $[a,b]$.
(2) If all the functions $f_n(x)$ are Riemann integrable on $[a, b]$ and the series $S(x) = \sum_{n=1}^{\infty} f_n(x)$ converges uniformly to $S(x)$ on $[a, b]$, then $S(x)$ is integrable on $[a, b]$ and
$$\int_a^b S(x)\,dx = \int_a^b \left(\sum_{n=1}^{\infty} f_n(x)\right) dx = \sum_{n=1}^{\infty}\int_a^b f_n(x)\,dx.$$
(3) If all the functions $f_n(x)$ are continuously differentiable on the interval $[a, b]$, if the series $S(x) = \sum_{n=1}^{\infty} f_n(x)$ converges for some $x_0 \in [a,b]$, and if the series $T(x) = \sum_{n=1}^{\infty} f_n'(x)$ converges uniformly on $[a, b]$, then the series $S(x)$ converges. $S(x)$ is continuously differentiable on $[a, b]$ and $S'(x) = T(x)$. That is:
$$\frac{d}{dx}\sum_{n=1}^{\infty} f_n(x) = \sum_{n=1}^{\infty}\frac{d}{dx} f_n(x).$$
$g'(t) = r\cos t$ on the interval $[\pi/2, a]$, for the length $s$ we get
$$s = \int_{\pi/2}^{a}\sqrt{\frac{r^2\cos^4 t}{\sin^2 t} + r^2\cos^2 t}\;dt = \int_{\pi/2}^{a}\sqrt{\frac{r^2\cos^2 t}{\sin^2 t}}\;dt = -r\int_{\pi/2}^{a}\frac{\cos t}{\sin t}\,dt = -r\,[\ln(\sin t)]_{\pi/2}^{a} = -r\ln(\sin a).$$
□
6.B.52. Compute the volume of a solid created by rotation of a bounded surface, whose boundary is the curve $x^4 - 9x^2 + y^4 = 0$, around the x axis.
Solution. If $[x, y]$ is a point on the curve $x^4 - 9x^2 + y^4 = 0$, clearly this curve also passes through the points $[-x,y]$, $[x,-y]$, $[-x,-y]$. Thus it is symmetric with respect to both axes $x$, $y$. For $y = 0$, we have $x^2(x-3)(x+3) = 0$, i.e. the x axis is intersected by the boundary curve at the points $[-3,0]$, $[0,0]$, $[3,0]$. In the first quadrant, it can then be expressed as a graph of the function
$$f(x) = \sqrt[4]{9x^2 - x^4}, \quad x \in [0, 3].$$
The sought volume is thus double (here we consider $x > 0$) the integral
$$\int_0^3 \pi f^2(x)\,dx = \pi\int_0^3 \sqrt{9x^2 - x^4}\,dx.$$
Using the substitution $t = \sqrt{9 - x^2}$ ($x\,dx = -t\,dt$), we can easily compute
$$\int_0^3 \sqrt{9x^2 - x^4}\,dx = \int_0^3 x\sqrt{9 - x^2}\,dx = -\int_3^0 t^2\,dt = 9,$$
and obtain the result $18\pi$. □
6.B.53. Torricelli's trumpet, 1641. Let a part of a branch of the hyperbola $xy = 1$ for $x > a$, where $a > 0$, rotate around the x axis. Show that the solid of revolution created in this manner has a finite volume $V$ and simultaneously an infinite surface $S$.
Solution. We know that
$$V = \pi\int_a^{+\infty}\left(\frac{1}{x}\right)^2 dx = \pi\int_a^{+\infty}\frac{1}{x^2}\,dx = \pi\left(\lim_{x\to+\infty}\left(-\frac{1}{x}\right) + \frac{1}{a}\right) = \frac{\pi}{a}$$
and
$$S = 2\pi\int_a^{+\infty}\frac{1}{x}\sqrt{1 + \left(-\frac{1}{x^2}\right)^2}\,dx = 2\pi\int_a^{+\infty}\frac{\sqrt{x^4+1}}{x^3}\,dx > 2\pi\int_a^{+\infty}\frac{1}{x}\,dx = 2\pi\left(\lim_{x\to+\infty}\ln x - \ln a\right) = +\infty.$$
The fact that the given solid (the so-called Torricelli's trumpet) cannot be painted with a finite amount of color, but can be filled with a finite amount of fluid, is called Torricelli's paradox. But realize that a real color painting has a
6.3.8. Test of uniform convergence. A simple way to test that a series of functions converges uniformly is to use a comparison with the absolute convergence of a suitable series of numbers. This is often called the Weierstrass test.
Suppose a sequence of functions $f_n(x)$ is given on an interval $I = [a,b]$ satisfying
$$|f_n(x)| \le a_n \in \mathbb{R}$$
for suitable real constants $a_n$ and for all $x \in [a,b]$. Let
$$s_k(x) = \sum_{n=1}^{k} f_n(x)$$
for distinct indices $k$. For $k > m$,
$$|s_k(x) - s_m(x)| = \left|\sum_{n=m+1}^{k} f_n(x)\right| \le \sum_{n=m+1}^{k} |f_n(x)| \le \sum_{n=m+1}^{k} a_n.$$
If the series of the (nonnegative) constants $\sum_{n=1}^{\infty} a_n$ is convergent, then the sequence of its partial sums is a Cauchy sequence. But then the sequence of partial sums $s_n(x)$ is uniformly Cauchy.
By 6.3.5 the following is verified:
The Weierstrass test
Theorem. Let $f_n(x)$ be a sequence of functions defined on the interval $[a, b]$ with $|f_n(x)| \le a_n \in \mathbb{R}$.
If the series of numbers $\sum_{n=1}^{\infty} a_n$ is convergent, then the series $S(x) = \sum_{n=1}^{\infty} f_n(x)$ converges uniformly.
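The estimate behind the Weierstrass test can be observed numerically. The sketch below assumes the illustrative choice $f_n(x) = \sin(nx)/n^2$ with $a_n = 1/n^2$ (this example is not taken from the text) and compares the largest observed value of $|s_k(x) - s_m(x)|$ on a grid with the tail sum $\sum_{n=m+1}^{k} a_n$.

```python
import math

def partial_sum(x, k):
    """s_k(x) for the illustrative series with f_n(x) = sin(n x) / n^2."""
    return sum(math.sin(n * x) / n**2 for n in range(1, k + 1))

def sup_difference(m, k, grid):
    """Largest |s_k(x) - s_m(x)| over the sample points in grid."""
    return max(abs(partial_sum(x, k) - partial_sum(x, m)) for x in grid)

def tail_bound(m, k):
    """Sum of the majorants a_n = 1 / n^2 for n = m+1, ..., k."""
    return sum(1.0 / n**2 for n in range(m + 1, k + 1))

grid = [-math.pi + 2.0 * math.pi * i / 400 for i in range(401)]
m, k = 20, 200
observed = sup_difference(m, k, grid)
bound = tail_bound(m, k)
print(observed, bound)
```

The bound does not depend on $x$, which is exactly why the partial sums are uniformly Cauchy. A grid can only sample the supremum, so this is an illustration rather than a proof.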
6.3.9. Consequences for power series. The Weierstrass test has important consequences for power series
$$S(x) = \sum_{n=0}^{\infty} a_n (x - x_0)^n$$
centered at a point $x_0$.
We saw earlier in 5.4.8 that each power series converges on an entire interval $(x_0 - \delta, x_0 + \delta)$. The radius of convergence $\delta \ge 0$ can be zero or $\infty$ (see 5.4.12).
In the proof of Theorem 5.4.8, a comparison with a suitable geometric series is used to verify the convergence of the series $S(x)$. By the Weierstrass test, the series $S(x)$ converges uniformly on every compact (i.e. bounded closed) interval $[a, b]$ contained in the interval $(x_0 - \delta, x_0 + \delta)$. Thus the following crucial result is proved:
nonzero width, which the computation doesn't take into account. For example, if we would paint it from the inside, a single drop of color would undoubtedly "block" the trumpet of infinite length. □
Differentiation and integration of power series
Theorem. Every power series $S(x)$ is continuous and continuously differentiable at all points inside its interval of convergence. The function $S(x)$ is Riemann integrable there, and both differentiation and integration can be done term by term.
Abel's theorem states that power series are continuous even at the boundary points of their domain when they converge there (including possible infinite limits). We do not prove it here.
The pleasant properties of power series also reveal limitations on their use in practical modelling. In particular, it is not possible to approximate piecewise continuous or non-differentiable functions very well by using power series. Of course, it should be possible to find better sets of functions $f_n(x)$ than just the powers $f_n(x) = x^n$, up to constants. The best known examples are Fourier series and wavelets discussed in the next chapter.
6.3.10. Laurent series. We return to the smooth function $f(x) = e^{-1/x^2}$ from paragraph 6.1.5 in the context of Taylor series expansions. It is not analytic at the origin, because all its derivatives are zero there and the function is strictly positive at all other points. At all points $x_0 \ne 0$ this function is given by its convergent Taylor series with radius of convergence $r = |x_0|$. At the origin the Taylor series coincides with the function only at the single point 0.
Substitute the expression $-1/x^2$ for $x$ in the power series for $e^x$. The result is the series of functions
$$e^{-1/x^2} = \sum_{n=0}^{\infty}\frac{(-1)^n}{n!}\,x^{-2n}.$$
The series converges at all points $x \ne 0$. It gives a good idea about the behaviour near the exceptional point $x = 0$.
Thus we consider the following series similar to power series but more general:
Laurent⁶ series
A series of functions of the form
$$S(x) = \sum_{n=-\infty}^{\infty} a_n (x - x_0)^n$$
is called a Laurent series centered at $x_0$. The series is convergent if both its parts with positive and negative exponents converge separately.
⁶ Pierre Alphonse Laurent (1813-1854) was a French engineer and military officer. He submitted his generalization of the Taylor series into the Grand Prix competition of the French Académie des Sciences. For formal reasons it was not considered. It was published much later, after the author's death.
C. Power series
6.C.1. Expand the function $\ln(1 + x)$ into a power series at the points 0 and 1 and determine all $x \in \mathbb{R}$ for which these series converge.
Solution. First we determine the expansion at the point 0. To expand a function into a power series at a given point is the same as to determine its Taylor expansion at that point. We can easily see that
$$[\ln(x + 1)]^{(n)} = \frac{(-1)^{n+1}(n-1)!}{(x+1)^n},$$
so after computing the derivatives at zero, we have
$$\ln(x + 1) = \ln 1 + \sum_{n=1}^{\infty} a_n x^n, \quad\text{where}\quad a_n = \frac{(-1)^{n+1}(n-1)!}{n!} = \frac{(-1)^{n+1}}{n}.$$
Thus we can write
$$\ln(x + 1) = x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4 + \cdots = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}x^n.$$
For the radius of convergence, we can use the limit of the quotient of consecutive coefficients of the power series:
$$r = \frac{1}{\lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|} = \frac{1}{\lim_{n\to\infty}\frac{n}{n+1}} = 1.$$
Hence the series converges for arbitrary $x \in (-1,1)$. For $x = -1$ we get the harmonic series (with a negative sign), for $x = 1$ we get the alternating harmonic series, which converges by the Leibniz criterion. Thus the given series converges exactly for $x \in (-1,1]$.
Analogously, for the expansion at the point 1, by computing the above derivatives at 1, we get
$$\ln(x + 1) = \ln 2 + \frac{1}{2}(x-1) - \frac{1}{2\cdot 2^2}(x-1)^2 + \frac{1}{3\cdot 2^3}(x-1)^3 - \cdots = \ln 2 + \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n\cdot 2^n}(x-1)^n,$$
and for the radius of convergence of this series, we get
$$r = \frac{1}{\lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|} = \frac{1}{\lim_{n\to\infty}\frac{n\,2^n}{(n+1)\,2^{n+1}}} = \frac{1}{\frac{1}{2}} = 2.$$
The first series converges for $-1 < x \le 1$, the second for $-1 < x \le 3$. □
The importance of Laurent series can be seen with rational functions. Consider such a function $S(x) = f(x)/g(x)$ with coprime polynomials $f$ and $g$ and consider a root $x_0$ of
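The two expansions of $\ln(1+x)$ from 6.C.1 can be compared with a library logarithm. This is a minimal sketch; the function names and the numbers of terms are illustrative assumptions, and the slow convergence at the boundary point $x = 1$ is visible in the looser tolerance needed there.

```python
import math

def log1p_series_at_0(x, terms):
    """Partial sum of sum_{n>=1} (-1)^(n+1) x^n / n, the expansion at the point 0."""
    return sum((-1) ** (n + 1) * x**n / n for n in range(1, terms + 1))

def log1p_series_at_1(x, terms):
    """Partial sum of ln 2 + sum_{n>=1} (-1)^(n+1) (x-1)^n / (n 2^n), the expansion at the point 1."""
    tail = sum((-1) ** (n + 1) * (x - 1.0) ** n / (n * 2.0**n)
               for n in range(1, terms + 1))
    return math.log(2.0) + tail

inside = log1p_series_at_0(0.5, 40)       # well inside (-1, 1)
boundary = log1p_series_at_0(1.0, 1000)   # alternating harmonic series
shifted = log1p_series_at_1(2.5, 200)     # inside (-1, 3), outside (-1, 1]

print(inside, math.log(1.5))
print(boundary, math.log(2.0))
print(shifted, math.log(3.5))
```

At $x = 1$ the Leibniz criterion bounds the error of the $N$-th partial sum by $1/(N+1)$, so a thousand terms give only about three correct digits.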
6.C.2. Expand the function $\cos^2(x)$ into a power series (i.e. determine its Taylor expansion) at the point 0 and determine for which real numbers this series converges. ○
6.C.3. Expand the function $\sin^2(x)$ into a power series at the point 0 and determine for which real numbers this series converges. ○
6.C.4. Expand the function $\ln(x^3 + 3x^2 + 3x + 1)$ into a power series at the point 0 and determine for which $x \in \mathbb{R}$ it converges. ○
6.C.5. Expand the function $\ln\sqrt{x}$ into a power series at the point 1 and determine for which $x \in \mathbb{R}$ it converges. ○
6.C.6. On the interval of convergence $(-1,1)$, determine the sum of the series
$$\sum_{n=1}^{\infty} n(n+1)x^n.$$
Solution. We have
$$\sum_{n=1}^{\infty} n(n+1)x^n = \sum_{n=1}^{\infty} n\left(x^{n+1}\right)' = \left(\sum_{n=1}^{\infty} n x^{n+1}\right)' = \left(x^2\sum_{n=1}^{\infty} n x^{n-1}\right)' = \left(x^2\sum_{n=1}^{\infty}\left(x^n\right)'\right)'$$
$$= \left(x^2\left(-1 + \sum_{n=0}^{\infty} x^n\right)'\right)' = \left(x^2\left(\frac{1}{1-x}\right)'\right)' = \left(\frac{x^2}{(1-x)^2}\right)' = \frac{2x}{(1-x)^3}$$
for all $x \in (-1,1)$. □
6.C.7. For a convergent series
<^n V^+100'
77=0
estimate the error of the approximation of its sum by the partial sum S9999. O
6.C.8. With an error lesser than 1/10, approximately compute
z /
10
Ina; dx.
o
6.C.9. Compute
lim
77—>-oo
1+Jr+\/1+-2r+"'+v'l+l
the polynomial $g(x)$. If the multiplicity of this root is $s$, then after multiplication we obtain the function $\tilde S(x) = S(x)(x - x_0)^s$, which is analytic on some neighbourhood of $x_0$. Therefore we can write
$$S(x) = \frac{a_{-s}}{(x-x_0)^s} + \dots + \frac{a_{-1}}{x-x_0} + a_0 + a_1(x - x_0) + \dots = \sum_{n=-s}^{\infty} a_n (x-x_0)^n.$$
Consider the two parts of the Laurent series separately:
$$S(x) = S_- + S_+ = \sum_{n=-\infty}^{-1} a_n (x-x_0)^n + \sum_{n=0}^{\infty} a_n (x-x_0)^n.$$
For the series $S_+$, Theorem 5.4.8 implies that its radius of convergence $R$ is given by $R^{-1} = \limsup_{n\to\infty}\sqrt[n]{|a_n|}$. Apply the same idea to the series $S_-$, which is a power series in the variable $1/(x - x_0)$. It converges whenever $1/|x - x_0| < 1/r$, that is for $|x - x_0| > r$, where $r = \limsup_{n\to\infty}\sqrt[n]{|a_{-n}|}$.
Notice that the conclusions about convergence remain true even for complex values of $x$ substituted into the expression. Laurent series can be considered as functions defined on a domain in the complex plane. We return to this in chapter 9. The following theorem is proved already.
Convergence of the Laurent series on the annulus
Theorem. The Laurent series $S(x)$ centered at $x_0$ converges for all $x \in \mathbb{C}$ satisfying $r < |x - x_0| < R$ and diverges for all $x$ satisfying $|x - x_0| < r$ or $|x - x_0| > R$, where
$$r = \limsup_{n\to\infty}\sqrt[n]{|a_{-n}|}, \quad R^{-1} = \limsup_{n\to\infty}\sqrt[n]{|a_n|}.$$
The Laurent series need not converge at any point, because possibly $R < r$. In the above case of rational functions expanded into Laurent series at one of the roots of the denominator, only finitely many coefficients $a_{-n}$ are nonzero, so clearly $r = 0$ and, as expected, the series converges in a punctured neighbourhood of this point $x_0$. The radius $R$ is given by the distance to the closest other root of the denominator. In the case of the first example, the function $e^{-1/x^2}$, we have $r = 0$ and $R = \infty$.
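The Laurent expansion of $e^{-1/x^2}$ at the origin, obtained by substituting $-1/x^2$ into the exponential series, can be evaluated directly. The sketch below is only an illustration under that assumption; the function name and the number of terms are hypothetical choices. It shows that the series reproduces the function at points $x \ne 0$, in agreement with $r = 0$ and $R = \infty$.

```python
import math

def laurent_partial_sum(x, terms):
    """Partial sum of sum_{n>=0} (-1)^n / (n! x^(2n)) for x != 0."""
    y = -1.0 / (x * x)
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term          # add y^n / n!
        term *= y / (n + 1)    # update to y^(n+1) / (n+1)!
    return total

for x in (0.5, 1.0, 3.0):
    print(x, laurent_partial_sum(x, 60), math.exp(-1.0 / (x * x)))
```

For points very close to the origin the terms grow large before they decay, so floating point cancellation eventually spoils the partial sums even though the series itself converges.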
6.3.11. Numerical approximation of integration. Just as in paragraph 6.1.16, we use the Taylor expansion to propose simple approximations of integration. We deal with an integral $I = \int_a^b f(x)\,dx$ of an analytic function $f(x)$ and a uniform partition of the interval $[a, b]$ using points $a = x_0, x_1, \dots, x_n = b$ with distances $x_i - x_{i-1} = h > 0$. Denote the points in the middle of the intervals of the partition by $x_{i+1/2}$ and the values of the function at the points of the partition by $f(x_i) = f_i$.
Compute the contribution of one segment of the partition to the integral by the Taylor expansion and the previous theorem. Integrate symmetrically around the middle values so
o
6.C.10. Applications of the integral criterion of convergence. Now let's get back to (number) series. Thanks to the integral criterion of convergence (see 6.2.19), we can decide the question of convergence for a wider class of series. Decide whether the following sums converge or diverge:
oo
a) T
' nlnn' n=l
oo
b) E
71=1
Solution. First notice that we cannot decide the convergence of any of these series by using the ratio or root test (all the limits $\lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|$ and $\lim_{n\to\infty}\sqrt[n]{|a_n|}$ equal 1). Using the integral criterion for convergence of series, we obtain: a)
1
■ da;
/    - dí = lim [ln(i);
!    x\n(x) Jo hence the given series diverges.
oo,
b)
; da; :
lim
i5 —>-oo
1,
hence the given series converges.
□
6.C.11. Using the integral criterion, decide the convergence of the series
$$\sum_{n=1}^{\infty}\frac{1}{(n+1)\ln^2(n+1)}.$$
Solution. The function
$$f(x) = \frac{1}{(x+1)\ln^2(x+1)}, \quad x \in [1, +\infty)$$
is clearly positive and nonincreasing on its whole domain, thus the given series converges if and only if the integral $\int_1^{+\infty} f(x)\,dx$ converges. By using the substitution $y = \ln(x + 1)$ (where $dy = dx/(x + 1)$), we can compute
$$\int_1^{+\infty}\frac{1}{(x+1)\ln^2(x+1)}\,dx = \int_{\ln 2}^{+\infty}\frac{1}{y^2}\,dy = \frac{1}{\ln 2}.$$
Hence the series converges. □
Uniform convergence.
6.C.12.   Does the sequence of functions
yn = e^,   i£l,   n e N converge uniformly on R?
Solution. The sequence {y^JnGN converges pointwise to the constant function y = 1 on R, since
that the derivatives of odd orders cancel each other out while integrating:
$$\int_{-h/2}^{h/2} f(x_{i+1/2} + t)\,dt = \int_{-h/2}^{h/2}\left(\sum_{n=0}^{\infty}\frac{1}{n!}f^{(n)}(x_{i+1/2})\,t^n\right)dt = \sum_{k=0}^{\infty}\left(\int_{-h/2}^{h/2}\frac{1}{k!}f^{(k)}(x_{i+1/2})\,t^k\,dt\right) = \sum_{k=0}^{\infty}\frac{h^{2k+1}}{2^{2k}(2k+1)!}f^{(2k)}(x_{i+1/2}).$$
A simple numerical approximation of integration on one segment of the partition is the trapezoidal rule. This uses the area of a trapezoid given by the points $[x_i, 0]$, $[x_i, f_i]$, $[x_{i+1}, 0]$, $[x_{i+1}, f_{i+1}]$ for approximation. This area is
$$P_i = \frac{1}{2}(f_i + f_{i+1})h.$$
In total, the integral $I$ is approximated by
$$I_{\text{trap}} = \sum_{i=0}^{n-1} P_i = \frac{h}{2}\left(f_0 + 2f_1 + \dots + 2f_{n-1} + f_n\right).$$
Compare $I_{\text{trap}}$ to the exact value of $I$ computed by contributions over the individual segments of the partition. Express the values $f_i$ by the middle values $f_{i+1/2}$ and the derivatives $f^{(k)}_{i+1/2}$ in the following way:
$$f_{i+1/2\pm 1/2} = f_{i+1/2} \pm \frac{h}{2}f'_{i+1/2} + \frac{h^2}{2!\,2^2}f''_{i+1/2} \pm \frac{h^3}{3!\,2^3}f^{(3)}_{i+1/2} + \dots$$
Thus, the contribution $P_i$ to the approximation is
$$P_i = \frac{1}{2}(f_i + f_{i+1})h = h\left(f_{i+1/2} + \frac{h^2}{8}f''_{i+1/2}\right) + O(h^5).$$
Estimate the error $\Delta_i = I - I_{\text{trap}}$ over one segment of the partition:
$$\Delta_i = h\left(f_{i+1/2} + \frac{h^2}{24}f''_{i+1/2} - f_{i+1/2} - \frac{h^2}{8}f''_{i+1/2} + O(h^4)\right) = -\frac{h^3}{12}f''_{i+1/2} + O(h^5).$$
The total error is thus estimated as
$$|I - I_{\text{trap}}| \le \frac{1}{12}\,n\,h^3\,|f''| + n\,O(h^5) = \frac{1}{12}(b-a)h^2|f''| + O(h^4),$$
where $|f''|$ represents an upper estimate for $|f''(x)|$ over the interval of integration.
If the linear approximation of the function over the individual segments does not suffice, we can try approximations by quadratic polynomials. To do so, three values are always needed, so work with segments of the partition in pairs. Suppose $n = 2m$ and consider $x_i$ with odd indices. We choose
$$f_{i+1} = f(x_i + h) = f_i + \alpha_i h + \beta_i h^2,$$
$$f_{i-1} = f(x_i - h) = f_i - \alpha_i h + \beta_i h^2,$$
lim e«
77 —SOO
x e.
But the computation
yn (V2n) = e > 2   for all n G N implies that it's not a uniform convergence. (In the definition of uniform convergence, it suffices to consider e G (0,1).) □
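6.C.13. Decide whether the series
$$\sum_{n=1}^{\infty}\frac{\sqrt{x}\cdot n}{n^4 + x^2}$$
converges uniformly on the interval $(0, +\infty)$.
Solution. Using the notation
$$f_n(x) = \frac{\sqrt{x}\cdot n}{n^4 + x^2}, \quad x > 0, \quad n \in \mathbb{N},$$
we have
$$f_n'(x) = \frac{n\left(n^4 - 3x^2\right)}{2\sqrt{x}\left(n^4 + x^2\right)^2}, \quad x > 0, \quad n \in \mathbb{N}.$$
From now on, let $n \in \mathbb{N}$ be arbitrary. The inequalities
$f_n'(x) > 0$ for $x \in (0, n^2/\sqrt{3})$ and $f_n'(x) < 0$ for $x \in (n^2/\sqrt{3}, +\infty)$ imply that the maximum of the function $f_n$ is attained exactly at the point $x = n^2/\sqrt{3}$. Since
$$f_n\left(\frac{n^2}{\sqrt{3}}\right) = \frac{\sqrt[4]{27}}{4n^2}, \quad \sum_{n=1}^{\infty}\frac{\sqrt[4]{27}}{4n^2} = \frac{\sqrt[4]{27}}{4}\sum_{n=1}^{\infty}\frac{1}{n^2} < +\infty,$$
according to the Weierstrass test, the series $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly on the interval $(0, +\infty)$. □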
6.C.14. For $x \in [-1,1]$, sum the series
$$\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n(n+1)}x^{n+1}.$$
Solution. First notice that by the symbol for an indefinite integral we denote one specific primitive function (keeping the same variable), which should be understood as a so-called function of the upper limit, with the lower limit equal to zero. Using the theorem about integration of a power series for $x \in (-1,1)$, we obtain
„00 (-1)"+
Z^77=l 77(77+1
77+1
E
(-1)
77+1
dx
= /E~i((-i)n+1/^n-1^)^
HJi:Zi(-zr-1dx)dx = f(fi-x+:
dx) dx = J (J      dx) dx =
Since
E
(-1)
77+1
xn ) dx = / In (1 + x) + d dx,
which implies
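$$\beta_i = \frac{1}{2h^2}\left(f_{i+1} + f_{i-1} - 2f_i\right).$$
The approximation of the integral over two segments of the partition between $x_{i-1}$ and $x_{i+1}$ is now estimated by the following expression (notice that we integrate the quadratic polynomial with the requested values $f_{i-1}$, $f_i$, $f_{i+1}$ at the points $x_{i-1}$, $x_i$, $x_{i+1}$, respectively; it is not necessary to know the constant $\alpha_i$):
$$P_i = \int_{-h}^{h} f_i + \alpha_i t + \beta_i t^2\,dt = 2hf_i + \frac{2}{3}\beta_i h^3 = 2hf_i + \frac{h}{3}\left(f_{i+1} + f_{i-1} - 2f_i\right) = \frac{h}{3}\left(f_{i+1} + f_{i-1} + 4f_i\right).$$
This procedure is called Simpson's rule¹. The entire integral, for a partition into $2n$ segments, is now approximated by
$$I_{\text{Simp}} = \frac{h}{3}\left(f_0 + 4\sum_{m=0}^{n-1} f_{2m+1} + 2\sum_{m=1}^{n-1} f_{2m} + f_{2n}\right).$$
As with the trapezoidal rule above, the total error is estimated by
$$|I - I_{\text{Simp}}| \le \frac{1}{180}(b-a)h^4|f^{(4)}| + O(h^5),$$
where $|f^{(4)}|$ represents an upper bound for $|f^{(4)}(x)|$ over the interval of integration.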
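The two rules of 6.3.11 and their error orders can be tried out on a simple integral. The sketch below assumes the test integral $\int_0^\pi \sin x\,dx = 2$ (an illustrative choice, not from the text); the function names are hypothetical, and `n` denotes the number of segments of the partition.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n segments of width h = (b - a) / n."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def simpson(f, a, b, n):
    """Composite Simpson's rule; n must be even because segments are used in pairs."""
    if n % 2 != 0:
        raise ValueError("Simpson's rule needs an even number of segments")
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))

exact = 2.0
errors = {}
for n in (16, 32):
    errors[n] = (abs(trapezoid(math.sin, 0.0, math.pi, n) - exact),
                 abs(simpson(math.sin, 0.0, math.pi, n) - exact))
    print(n, errors[n])

# Halving h should divide the errors by about 4 and about 16, respectively.
trap_ratio = errors[16][0] / errors[32][0]
simp_ratio = errors[16][1] / errors[32][1]
print(trap_ratio, simp_ratio)
```

The observed ratios reflect the leading terms $h^2$ and $h^4$ in the two error estimates. Since $|f''| \le 1$ and $|f^{(4)}| \le 1$ for the sine, the errors can also be compared directly with $(b-a)h^2/12$ and $(b-a)h^4/180$.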
6.3.12. Integrals dependent on parameters. When integrating a function $f(x, y_1, \dots, y_n)$ of one real variable $x$ depending on further real parameters $y_1, \dots, y_n$ with respect to the single variable $x$, the result is a function $F(y_1, \dots, y_n)$ depending on all the parameters. Such a function $F$ often occurs in practice. For instance, we can look for the volume or area of a body which depends on parameters, and determine the minimal and maximal values (with additional constraints as well). Often it is desirable to interchange the operations of differentiation and integration. That this can be done is proved below. We begin with an examination of continuous dependency on the parameters.
For the sake of simplicity, we shall deal with functions $f(x,y)$ depending on two variables, $x \in [a,b]$, $y \in [c,d]$. We say $f$ is continuous on $I = [a,b] \times [c, d] \subset \mathbb{R}^2 = \mathbb{C}$ if for each $z = (x,y)$ from the domain of $f$ and each $\varepsilon > 0$ there is some $\delta > 0$ such that $|f(w) - f(z)| < \varepsilon$ if $w \in O_\delta(z)$. (Notice that the definition is the same as for univariate functions, only we use the distance in the plane.)
The function $f(x, y)$ is called uniformly continuous if for each $\varepsilon > 0$ there is $\delta > 0$ such that for any two points $z, w$ in $I \subset \mathbb{R}^2 = \mathbb{C}$, $|z - w| < \delta$ implies $|f(z) - f(w)| < \varepsilon$. Exactly the same argument as with univariate functions, based on the fact that every open cover of a compact set in the complex
we know from the continuity of the given functions that
This way of approximating the integral is attributed to the English mathematician and inventor Thomas Simpson (1710-1761).
$$\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}x^n = \ln(1+x) + C_1, \quad x \in (-1,1).$$
The choice $x = 0$ then yields $0 = \ln 1 + C_1$, i.e. $C_1 = 0$. Next, integrating by parts with $u = \ln(1 + x)$, $u' = \frac{1}{1+x}$, $v' = 1$, $v = x$, we get
$$\int \ln(1+x)\,dx = x\ln(1+x) - \int\frac{x}{1+x}\,dx = x\ln(1+x) - \int 1 - \frac{1}{1+x}\,dx = x\ln(1+x) - x + \ln(1+x) + C_2 = (x+1)\ln(x+1) - x + C_2.$$
Since the given series converges at the point $x = 0$ with a sum of 0, analogously as for $C_1$,
$0 = 1\cdot\ln 1 - 0 + C_2$ implies that $C_2 = 0$. In total, we have for $x \in (-1,1)$:
$$\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n(n+1)}x^{n+1} = (x+1)\ln(x+1) - x.$$
Moreover, according to Abel's theorem (see 6.3.9), the sum of the given series equals the (potentially improper) limit of the function $(x + 1)\ln(x + 1) - x$ at the points $-1$ and $1$. In our case, both limits are proper (at the point 1, the function is even continuous, and the value of the limit at the point 1 then equals the value of the function, $2\ln 2 - 1$). For computing the value of the limit at the point $-1$, we use L'Hospital's rule:
$$\lim_{x\to -1+}\left((x+1)\ln(x+1) - x\right) = \lim_{t\to 0+} t\ln t + 1 = \lim_{t\to 0+}\frac{\ln t}{\frac{1}{t}} + 1 = \lim_{t\to 0+}\frac{\frac{1}{t}}{-\frac{1}{t^2}} + 1 = \lim_{t\to 0+}(-t) + 1 = 1.$$
Of course, the convergence of the series at the points $\pm 1$ can be verified directly. It is even possible to directly deduce that
$$\sum_{n=1}^{\infty}\frac{1}{n(n+1)} = 1$$
(by writing out $\frac{1}{n(n+1)} = \frac{1}{n} - \frac{1}{n+1}$). □
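The closed form found in 6.C.14 can be checked against partial sums of the series, including the two boundary points covered by Abel's theorem. This is a rough numerical sketch; the function names and the numbers of terms are illustrative assumptions.

```python
import math

def series_partial_sum(x, terms):
    """Partial sum of sum_{n>=1} (-1)^(n+1) x^(n+1) / (n (n+1))."""
    return sum((-1) ** (n + 1) * x ** (n + 1) / (n * (n + 1))
               for n in range(1, terms + 1))

def closed_form(x):
    """(x+1) ln(x+1) - x, extended by its limit value 1 at x = -1."""
    if x == -1.0:
        return 1.0
    return (x + 1.0) * math.log(x + 1.0) - x

inner = series_partial_sum(0.5, 60)
right_end = series_partial_sum(1.0, 2000)
left_end = series_partial_sum(-1.0, 10000)

print(inner, closed_form(0.5))
print(right_end, closed_form(1.0))
print(left_end, closed_form(-1.0))
```

At $x = -1$ all terms are positive and the partial sums equal $1 - 1/(N+1)$, so the convergence there is visibly slower than inside the interval.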
6.C.15. Sum of a series. Using theorem 6.3.5 "about the interchange of a limit and an integral of a sequence of uniformly convergent functions", we now sum the number series
$$\sum_{n=1}^{\infty}\frac{1}{n\cdot 2^n}.$$
We use the fact that
$$\int_2^{\infty}\frac{dx}{x^{n+1}} = \frac{1}{n\cdot 2^n}.$$
Solution. On the interval $(2,\infty)$, the series of functions $\sum_{n=1}^{\infty}\frac{1}{x^{n+1}}$ converges uniformly. That is implied for example by the Weierstrass test: each of the functions $\frac{1}{x^{n+1}}$ is decreasing on the interval $(2, \infty)$, thus their values are at most $\frac{1}{2^{n+1}}$; the series $\sum_{n=1}^{\infty}\frac{1}{2^{n+1}}$ is convergent though (it is a geometric series with quotient $\frac{1}{2}$). Hence according
plane contains a finite subcover, cf. Theorem 5.2.8(5), provides the following lemma (cf. the proof of Theorem 6.2.11).
Lemma. Each continuous function $f(x,y)$ on $I = [a,b] \times [c, d]$ is uniformly continuous.
Now we are ready for the following important claim:
Theorem. Assume $f(x, y)$ is a function defined for all $x$ lying in a bounded interval $[a, b]$ and all $y$ in a bounded interval $[c, d]$, continuous on $I = [a, b] \times [c, d]$. Consider the (Riemann) integral
$$F(y) = \int_a^b f(x,y)\,dx.$$
Then the function $F(y)$ is continuous on $[c, d]$.
Proof. Fix a point $\hat y \in [c, d]$ and a small $\varepsilon > 0$, and choose a neighbourhood $W$ of $\hat y$ such that for all $y \in W \subset [c, d]$ and all $x \in [a, b]$ (remember $f$ is uniformly continuous)
$$|f(x,y) - f(x,\hat y)| < \varepsilon.$$
The Riemann integral of continuous functions is evaluated by approximations by finite sums (equivalently: upper, lower, or Riemann sums with arbitrary representatives, see paragraph 6.2.9).
The goal is to establish that the Riemann sums for the integrals with parameters $y$ and $\hat y$ cannot differ much. In the following estimate for any partition with $k$ intervals and representatives $\xi_i$, first use the standard properties of the absolute value and then exploit the choice of $W$:
$$\left|\sum_{i=0}^{k-1} f(\xi_i, y)(x_{i+1} - x_i) - \sum_{i=0}^{k-1} f(\xi_i, \hat y)(x_{i+1} - x_i)\right| \le \sum_{i=0}^{k-1}|f(\xi_i, y) - f(\xi_i, \hat y)|(x_{i+1} - x_i) \le \varepsilon(b - a).$$
It follows that the limit values $F(y)$ and $F(\hat y)$, for any sequences of partitions and representatives, cannot differ by more than $\varepsilon(b - a)$ either, so the function $F$ is continuous. □
6.3.13. Integrating twice. The fact that the integral $F(y) = \int_a^b f(x, y)\,dx$ of a continuous function $f : [a, b] \times [c, d] \to \mathbb{R}$ in the plane is again a continuous function $F : [c, d] \to \mathbb{R}$ allows us to repeat the integration and write
$$(1)\quad I = \int_c^d\int_a^b f(x,y)\,dx\,dy = \int_c^d\left(\int_a^b f(x,y)\,dx\right)dy.$$
The next theorem is the simplest version of the claim known as the Fubini theorem.
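Fubini theorem
Theorem. Consider a continuous function $f : [a, b] \times [c, d] \to \mathbb{R}$ in the plane $\mathbb{R}^2$. The multiple integration (1) is well defined and does not depend on the order of integration, i.e.,
$$I = \int_c^d\left(\int_a^b f(x,y)\,dx\right)dy = \int_a^b\left(\int_c^d f(x,y)\,dy\right)dx.$$
Proof. We know $f$ is uniformly continuous on the product of intervals $[a,b] \times [c, d]$ in the plane. Thus, for each $\varepsilon > 0$ there is $\delta > 0$ such that $|f(x_1, y_1) - f(x_2, y_2)| < \varepsilon$ whenever $|x_1 - x_2| < \delta$ and $|y_1 - y_2| < \delta$.
We know both Riemann integrals in (1) exist, thus we may fix a sequence $\Xi_k$ of partitions of the interval $[a, b]$ into $k$ subintervals $[x_{i-1}, x_i]$ of equal length $(b-a)/k$ and with representatives $\xi_{i,k}$, $i = 1, \dots, k$, and similarly for the interval $[c, d]$ with the subintervals $[y_{j-1}, y_j]$ and representatives $\eta_{j,k}$, $j = 1, \dots, k$. Then we may write
$$I = \int_c^d\left(\lim_{k\to\infty}\sum_{i=1}^{k} f(\xi_{i,k}, y)\,\frac{1}{k}(b-a)\right)dy.$$
If $(b-a)/k < \delta$, then
$$\left|\int_a^b f(x,y)\,dx - \sum_{i=1}^{k} f(\xi_{i,k}, y)\,\frac{1}{k}(b-a)\right| \le \sum_{i=1}^{k}\left|\int_{x_{i-1}}^{x_i} f(x, y)\,dx - f(\xi_{i,k}, y)\,\frac{1}{k}(b-a)\right| \le \sum_{i=1}^{k}\int_{x_{i-1}}^{x_i}|f(x,y) - f(\xi_{i,k}, y)|\,dx \le \varepsilon(b-a).$$
Thus, the convergence of
$$\lim_{k\to\infty}\sum_{i=1}^{k} f(\xi_{i,k}, y)\,\frac{1}{k}(b-a) = F(y) = \int_a^b f(x,y)\,dx$$
is uniform on $[c, d]$. In particular, we may swap the integral and the limit to obtain
$$I = \lim_{k\to\infty}\int_c^d\left(\sum_{i=1}^{k} f(\xi_{i,k}, y)\,\frac{1}{k}(b-a)\right)dy = \lim_{k\to\infty}\sum_{i=1}^{k}\left(\int_c^d f(\xi_{i,k}, y)\,dy\right)\frac{1}{k}(b-a) = \lim_{k,\ell\to\infty}\sum_{i=1}^{k}\sum_{j=1}^{\ell} f(\xi_{i,k}, \eta_{j,\ell})\,\frac{1}{k}(b-a)\,\frac{1}{\ell}(d-c).$$
Clearly, the same result will appear if we swap the order of the integration. □
to the Weierstrass test, the series of functions $\sum_{n=1}^{\infty}\frac{1}{x^{n+1}}$ converges uniformly. We can even write the resulting function explicitly. Its value at any $x \in (2, \infty)$ is the value of the geometric series with quotient $\frac{1}{x}$, so if we denote the limit by $f(x)$, we have
$$f(x) = \sum_{n=1}^{\infty}\frac{1}{x^{n+1}} = \frac{1}{x^2}\cdot\frac{1}{1 - \frac{1}{x}} = \frac{1}{x(x-1)}.$$
By using (6.3.7) (2), we get
$$\sum_{n=1}^{\infty}\frac{1}{n\cdot 2^n} = \sum_{n=1}^{\infty}\int_2^{\infty}\frac{dx}{x^{n+1}} = \int_2^{\infty}\left(\sum_{n=1}^{\infty}\frac{1}{x^{n+1}}\right)dx = \int_2^{\infty}\frac{1}{x(x-1)}\,dx = \lim_{\delta\to\infty}\int_2^{\delta}\frac{1}{x-1} - \frac{1}{x}\,dx$$
$$= \lim_{\delta\to\infty}\left[\ln(\delta - 1) - \ln(\delta) - \ln(1) + \ln 2\right] = \lim_{\delta\to\infty}\ln\left(\frac{\delta - 1}{\delta}\right) + \ln 2 = \ln\left(\lim_{\delta\to\infty}\frac{\delta-1}{\delta}\right) + \ln 2 = \ln 2.$$
□
6.C.16. Consider the function $f(x) = \sum_{n=1}^{\infty} n e^{-nx}$. Determine
$$\int_{\ln 2}^{\ln 3} f(x)\,dx.$$
Solution. Similarly as in the previous case, the Weierstrass test for uniform convergence implies that the series of functions $\sum_{n=1}^{\infty} n e^{-nx}$ converges uniformly on the interval $(\ln 2, \ln 3)$, since each of the functions $n e^{-nx}$ is less than $\frac{n}{2^n}$ on $(\ln 2, \ln 3)$ and the series $\sum_{n=1}^{\infty}\frac{n}{2^n}$ converges, which can be seen for example from the ratio test for convergence of series:
$$\lim_{n\to\infty}\frac{a_{n+1}}{a_n} = \lim_{n\to\infty}\frac{(n+1)\,2^{-(n+1)}}{n\,2^{-n}} = \lim_{n\to\infty}\frac{1}{2}\cdot\frac{n+1}{n} = \frac{1}{2}.$$
In total, according to (6.3.7) (2), we have
$$\int_{\ln 2}^{\ln 3} f(x)\,dx = \int_{\ln 2}^{\ln 3}\sum_{n=1}^{\infty} n e^{-nx}\,dx = \sum_{n=1}^{\infty}\int_{\ln 2}^{\ln 3} n e^{-nx}\,dx = \sum_{n=1}^{\infty}\left[-e^{-nx}\right]_{\ln 2}^{\ln 3} = \sum_{n=1}^{\infty}\left(\frac{1}{2^n} - \frac{1}{3^n}\right) = 1 - \frac{1}{2} = \frac{1}{2}.$$
□
6.C.17. Determine the following limit (give reasons for the procedure of computation):
$$\lim_{n\to\infty}\int\frac{\cos\frac{x}{n}}{\left(1 + \frac{x}{n}\right)^n}\,dx.$$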
Solution. First we'll determine lim
n—voo
)" ■ The sequence of these functions converges pointwise and we have
\ n >
lim
n—¥oo i \ _|_ — j
l
lim   1 + I)
(77) J_
ex
It can be shown that the given sequence converges uniformly. Then according to (6.3.5),
lim
n—too
1 + 5)
■ da;
•--^
n—too (1 _|_ ^ J
'1=1
ex
dx
We leave the verification of uniform convergence to the reader (we only point out that the discussion is more complicated than in the previous cases).
□
6.C.18. By using differentiation, obtain the Taylor expansion of the function $y = \cos x$ from the Taylor expansion of the function $y = \sin x$ centered at the origin. ○
6.C.19. Find the analytic function whose Taylor series is
$$x - \frac{1}{3}x^3 + \frac{1}{5}x^5 - \frac{1}{7}x^7 + \cdots$$
for $x \in [-1,1]$.
○
6.C.20. From the knowledge of the sum of a geometric series, derive the Taylor series of function
y ~ 5+2x
centered at the origin. Then determine its radius of convergence, o
6.C.21. Expand the function
y ~ 3^2^' x G (—§' I)
to a Taylor series centered at the origin. O
6.C.22. Expand the function $\cos^2(x)$ into a power series at the point $\pi/4$ and determine for which $x \in \mathbb{R}$ this series converges. ○
6.C.23. Express the function $y = e^x$ defined on the whole real axis as an infinite polynomial with terms of the form $a_n(x - 1)^n$ and express the function $y = 2^x$ defined on $\mathbb{R}$ as an infinite polynomial with terms $a_n x^n$. ○
6.C.24. Find a function / such that for x G K, the sequence of functions
n2x2-\-l '
n G N
6.3.14. Differentiation in the integrals. We are ready to discuss the differentiation of integrals with respect to parameters. The following result is extremely useful. For instance, we shall use it in the next chapter when examining integral transforms.
Differentiation with respect to parameters
Theorem. Consider a continuous function $f(x,y)$ defined for all $x$ from a finite interval $[a, b]$ and for all $y$ in another finite interval $[c, d]$, a point $c \in [c, d]$, and the integral
$$F(y) = \int_a^b f(x,y)\,dx.$$
If there exists the continuous derivative $\frac{\partial}{\partial y}f$ on a neighbourhood of the point $c$, then $\frac{d}{dy}F(c)$ exists as well and
$$\frac{d}{dy}F(c) = \int_a^b\frac{\partial}{\partial y}f(x,c)\,dx.$$
Proof. By the assumed continuity of all functions and the already known continuous dependence of integrals on parameters, some knowledge about univariate antiderivatives can be used. The result is then a simple consequence of the Fubini theorem. Denote
$$G(y) = \int_a^b\frac{\partial}{\partial y}f(x,y)\,dx, \quad F(y) = \int_a^b f(x,y)\,dx$$
and compute, invoking the Fubini theorem, the antiderivative
$$H(y) = \int_{y_0}^{y} G(z)\,dz = \int_{y_0}^{y}\left(\int_a^b\frac{\partial}{\partial z}f(x,z)\,dx\right)dz = \int_a^b\left(\int_{y_0}^{y}\frac{\partial}{\partial z}f(x,z)\,dz\right)dx = \int_a^b\left(f(x,y) - f(x,y_0)\right)dx = F(y) - F(y_0).$$
Finally, differentiating with respect to $y$ yields
$$G(y) = \frac{d}{dy}H(y) = \frac{d}{dy}F(y),$$
as desired.
□
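The theorem on differentiation with respect to a parameter can be tested numerically. The sketch below assumes the illustrative integrand $f(x,y) = \sin(xy)$ on $[0,1]$ (not taken from the text), for which $F(y) = (1 - \cos y)/y$; it compares a central difference quotient of $F$ with the integral of $\partial f/\partial y = x\cos(xy)$. The helper names and the step size are hypothetical choices.

```python
import math

def simpson(f, a, b, n=200):
    """Composite Simpson's rule with an even number n of segments."""
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))

def F(y):
    """F(y) = integral of sin(x y) over x in [0, 1]."""
    return simpson(lambda x: math.sin(x * y), 0.0, 1.0)

def integral_of_partial(y):
    """Integral over x in [0, 1] of the partial derivative x cos(x y)."""
    return simpson(lambda x: x * math.cos(x * y), 0.0, 1.0)

y0 = 1.3
step = 1e-4
difference_quotient = (F(y0 + step) - F(y0 - step)) / (2.0 * step)
under_integral = integral_of_partial(y0)
closed_form = (y0 * math.sin(y0) - (1.0 - math.cos(y0))) / y0**2

print(difference_quotient, under_integral, closed_form)
```

All three numbers should agree to several decimal places; the difference quotient is the least accurate of them because of its truncation error.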
6.3.15. The Riemann-Stieltjes integral. To end this chapter, we mention briefly some other concepts of integration. Mostly we confine ourselves to remarks and comments. Readers interested in a thorough explanation should consult another source.
First, a modification of the Riemann integral, which is useful when discussing probability and statistics. In the discussion of integration, we summed infinitely many linearized (infinitely) small increments of the area given by a function $f(x)$. We omitted the possibility that for different values of $x$ we could take the increments with different weights. This
to it. Is this convergence uniform on R? O 6.C.25. Does the series
oo
E ^fe, kde converge uniformly on the whole real axis? O
6.C.26. By using differentiation, obtain the Taylor expansion of the function $y = \cos x$ from the Taylor expansion of the function $y = \sin x$ centered at the origin. ○
6.C.27. Approximate
(a) the cosine of ten degrees with a precision of at least $10^{-5}$;
(b) the definite integral J^2 x42^ with a precision of at least 10"3.
o
6.C.28. Determine the power expansion centered at x0 = 0 of function
X
f(x) = J e*2 dt, i6R. o
o
6.C.29. Find the analytic function whose Taylor series is
x - tj x3 + i x5 - j x7 + ■
for a; G [-1,1]
O
6.C.30. From the knowledge of the sum of a geometric series, derive the Taylor series of function
y ~ 5+2x
centered at the origin. Then determine its radius of convergence, o
6.C.31. Use the derivatives of functions y = tga and y = cotg x to find the indefinite integrals of functions
(a) y = cotg2 a,   a G (0,tt);
(b) v= ^x^x. ie(°.f)-
O
6.C.32. By repeated use of integration by parts, for all $x \in \mathbb{R}$ determine
(a) $\int x^2\sin x\,dx$;
(b) $\int x^2 e^x\,dx$.
o
6.C.33. For example by using integration by parts, determine
can be arranged at the infinitesimal level by exchanging the differential $dx$ for $\varphi(x)\,dx$ for some suitable function $\varphi$. Imagine that at some point $x_0$, the increment of the integrated quantity is given by $\alpha f(x_0)$ independently of the size of the increment of $x$. For example, we may observe the probability that the amount of alcohol per mille in the blood of a driver at a test will be at most $x$. We might like to integrate over the possible values in the interval $[0,x]$. With quite a large probability the value is 0. Thus for any integral sum, the segment containing zero contributes a constant nonzero amount, independent of the norm of the partition. We cannot simulate such behaviour by multiplying the differential $dx$ by some real function. Instead we generalize the Riemann integral in the following way:
Riemann-Stieltjes integral
Choose a real nondecreasing function $g$ on a finite interval $[a,b]$. For every partition $\Xi$ with representatives $\xi_i$ and points of the partition
$a = x_0, x_1, \dots, x_n = b$, define the Riemann-Stieltjes integral sum of the function $f(x)$ as
$$S_\Xi = \sum_{i=1}^{n} f(\xi_i)\left(g(x_i) - g(x_{i-1})\right).$$
The Riemann-Stieltjes integral
$$I = \int_a^b f(x)\,dg(x)$$
exists and its value is $I$, if for every real $\varepsilon > 0$ there exists a norm of the partition $\delta > 0$ such that for all partitions $\Xi$ with norm smaller than $\delta$,
$$|S_\Xi - I| < \varepsilon.$$
For example, choose $g(x)$ on the interval $[0,1]$ as a piecewise constant function with finitely many discontinuities $c_1, \dots, c_k$ and "jumps"
$$\alpha_i = \lim_{x\to c_i+} g(x) - \lim_{x\to c_i-} g(x).$$
Then the Riemann-Stieltjes integral exists for every continuous $f(x)$ and equals
$$\int_0^1 f(x)\,dg(x) = \sum_{i=1}^{k}\alpha_i f(c_i).$$
By the same technique as used for the Riemann integral, we define upper and lower sums and the upper and lower Riemann-Stieltjes integrals. For bounded functions they always exist, and their values coincide if and only if the Riemann-Stieltjes integral in the above sense exists.
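The example with a piecewise constant $g$ can be reproduced with a few lines of code. The sketch below assumes three hypothetical jump points and jump sizes on $[0,1]$, a uniform partition, and midpoints as representatives; none of these specific values come from the text.

```python
import math

def step_function(points, jumps):
    """Nondecreasing step function that jumps by jumps[i] at points[i]."""
    def g(x):
        return sum(alpha for c, alpha in zip(points, jumps) if x >= c)
    return g

def riemann_stieltjes_sum(f, g, a, b, n):
    """Integral sum of f with respect to g on a uniform partition with midpoint representatives."""
    total = 0.0
    for i in range(1, n + 1):
        left = a + (b - a) * (i - 1) / n
        right = a + (b - a) * i / n
        xi = 0.5 * (left + right)
        total += f(xi) * (g(right) - g(left))
    return total

points = [0.25, 0.5, 0.75]   # discontinuities c_i (illustrative)
jumps = [0.2, 0.5, 0.3]      # jump sizes alpha_i (illustrative)
g = step_function(points, jumps)

approx = riemann_stieltjes_sum(math.cos, g, 0.0, 1.0, 1000)
expected = sum(alpha * math.cos(c) for c, alpha in zip(points, jumps))
print(approx, expected)
```

Only the segments containing a jump contribute to the sum, and each contributes $\alpha_i f(\xi)$ with $\xi$ close to $c_i$, which is why the sum approaches $\sum_i \alpha_i f(c_i)$ as the norm of the partition decreases.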
We have already encountered problems with the Riemann integration of functions that are "too jumpy". For a
$\int x \ln^2 x \, dx$
function g(x) on a finite interval [a, b] define its variation by
for x > 0.
6.C.34. Using integration by parts, determine
$\int (2 - x^2)\, e^x \, dx$ on the whole real line.
6.C.35. Integrate
(a) $\int (2x + 5)^{10}\,dx$, $x \in \mathbb{R}$;
(b) / ^ dx> x > o;
(c) $\int e^{-x^3} x^2\,dx$,
(d) $\int \frac{15 \arcsin^2 x}{\sqrt{1 - x^2}}\,dx$, $x \in (-1, 1)$;
(e) / ^        a. > 0;
o
o
■ dx,
(g) Je^+3
(h) $\int \sin\sqrt{x}\,dx$,
x G E; x > 0
by using the substitution method.
o
6.C.36. For $x \in (0, 1)$, by using suitable substitutions, reduce the integrals
f -r2    /  x    /fr-       f _!%3i_
to integrals of rational functions.
o
6.C.37. For $x \in (-\pi/2, \pi/2)$ compute
$$\int \frac{dx}{1 + \sin^2 x}$$
using the substitution $t = \operatorname{tg} x$.
6.C.38. How many distinct primitive functions of the function $y = \cos(\ln x)$ exist on the interval $(0, 10)$? O
6.C.39. Give an example of a function $f$ on the interval $I = [0, 1]$ that does not have a primitive function on $I$. O
6.C.40. Using the Newton integral, compute
(a) $\int_0^{\pi} \sin x\,dx$;
(b) $\int_0^1 \operatorname{arctg} x\,dx$;
(c) $\int_{-\pi/4}^{3\pi/4} \frac{\cos x}{1 + \sin x}\,dx$;
(d) $\int_{1/e}^{e} |\ln x|\,dx$.
o
6.C.41. Compute
dx.
$$\operatorname{var}_a^b g = \sup_{\Xi} \sum_{i=1}^{n} |g(x_i) - g(x_{i-1})|,$$
where the supremum is taken over all partitions E of the interval [a, b]. If the supremum is infinite, we say that g(x) has unbounded variation on [a, b]. Otherwise we say that g is a function with bounded variation on [a, b]. A function is of bounded variation if and only if it can be written as the difference of two monotonic functions.
As in the discussion of the Riemann integral, we derive the following theorem. We invite the reader to add the details of its proof. The main tools are the mean value theorem and the uniform continuity of continuous functions on closed bounded intervals. The variation of $g$ over the interval $[a, b]$ plays the role of the length of the interval in the earlier proofs dealing with Riemann integration.
Properties of the Riemann-Stieltjes integral
Theorem. Let $f(x)$ and $g(x)$ be real functions on a finite interval $[a, b]$.
(1) Suppose $g(x)$ is non-decreasing and continuously differentiable. Then the Riemann integral on the left hand side and the Riemann-Stieltjes integral on the right hand side either both exist or both do not exist. In the former case, their values are equal:
$$\int_a^b f(x)\,g'(x)\,dx = \int_a^b f(x)\,dg(x).$$
(2) If $f(x)$ is continuous and $g(x)$ is a function with bounded variation, then the integral $\int_a^b f(x)\,dg(x)$ exists.
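The two situations described above (a step function $g$ with jumps, and a smooth $g$) can be checked numerically. The following is only a minimal sketch: the helper names `rs_sum` and `uniform_partition`, the choice $f = \cos$, the jump positions and sizes, and $g(x) = x^2$ are illustrative assumptions, not taken from the text.

```python
import math

def rs_sum(f, g, points, tags):
    """Riemann-Stieltjes sum: sum of f(tag_i) * (g(x_i) - g(x_{i-1}))."""
    return sum(f(t) * (g(x1) - g(x0))
               for t, x0, x1 in zip(tags, points, points[1:]))

def uniform_partition(a, b, n):
    """Equidistant partition of [a, b] with midpoints as representatives."""
    points = [a + (b - a) * i / n for i in range(n + 1)]
    tags = [(x0 + x1) / 2 for x0, x1 in zip(points, points[1:])]
    return points, tags

f = math.cos
points, tags = uniform_partition(0.0, 1.0, 10_001)

# Case 1 (assumed data): g is piecewise constant with jumps alpha at points c.
jumps = [(0.3, 0.5), (0.7, 2.0)]          # pairs (c_i, alpha_i)

def g_step(x):
    return sum(alpha for c, alpha in jumps if x > c)

step_sum = rs_sum(f, g_step, points, tags)
step_expected = sum(alpha * f(c) for c, alpha in jumps)   # sum of alpha_i * f(c_i)

# Case 2 (assumed data): g(x) = x^2 is smooth, so dg = g'(x) dx = 2x dx.
smooth_rs = rs_sum(f, lambda x: x * x, points, tags)
riemann = rs_sum(lambda x: f(x) * 2 * x, lambda x: x, points, tags)
closed_form = 2 * (math.sin(1) + math.cos(1) - 1)         # integral of 2x cos x over [0, 1]

print(step_sum, step_expected)
print(smooth_rs, riemann, closed_form)
```

In the step case only the two segments containing a jump contribute, which is exactly why the sum approaches $\sum_i \alpha_i f(c_i)$ as the partition is refined.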
6.3.16. Kurzweil-Henstock integral. The last topic in this chapter is a modification of the Riemann integral, which fixes the unfortunate behaviour at the third point in the paragraph 6.3.1. That is, the limits of the non-decreasing sequences of integrable functions are again integrable. Then we can interchange the order of the limit process and integration in these cases, just as with uniform convergence.
Notice what is the essence of the problem. Intuitively we assume that very small sets must have zero size. Thus the changes of values of the functions on such sets should not change the integral. Moreover, a countable union of such sets which are "negligible for the purpose of integration" should also have zero size. We would expect for example that the set of rational numbers inside a finite interval would have this property, hence its characteristic function should be integrable and the value of such an integral should be zero.
We say that a set $A \subset \mathbb{R}$ has zero measure, if for every $\varepsilon > 0$ there is a covering of the set $A$ by a countable system
o
6.C.42. For arbitrary real numbers a < b determine
$\int_a^b \operatorname{sgn} x\,dx$.
Recall that $\operatorname{sgn} x = 1$ for $x > 0$; $\operatorname{sgn} x = -1$ for $x < 0$; and $\operatorname{sgn} 0 = 0$. O
6.C.43. Compute the definite integral
} x3 i
o
6.C.44. For example by repeated integration by parts, compute
$\int_0^{\pi/2} e^{2x} \cos x\,dx$.
o
6.C.45. Determine
$\int_{-1}^{1} x^2 e^x\,dx$.
o
o
6.C.46. Compute the integral
i
using the substitution method. 6.C.47. Compute
e8 In 2
(b) / jl dx.
o
o
6.C.48. Which of the positive numbers
$p := \int_0^{\pi/2} \cos^7 x\,dx$, $q := \int_0^{\pi} \cos^2 x\,dx$
is bigger? O
6.C.49. Determine the signs of these three numbers (values of integrals)
2 tt 2tt
a := fx3 2X dx;    b := f cosx dx;    c := f dx.
-2 0 0
of open intervals $J_i$, $i = 1, 2, \dots$, such that
$$\sum_{i=1}^{\infty} m(J_i) < \varepsilon.$$
Here $m(J_i)$ means the length of the interval $J_i$.
In the sequel, the statement "function / has the given property on a set B almost everywhere" means that / has this property at all points except for a subset A c B of zero measure. For example, the characteristic function of rational numbers is zero almost everywhere. A piece-wise continuous function is continuous almost everywhere.
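To see why the rational numbers in a finite interval form a set of zero measure, one can cover the $i$-th rational by an open interval of length $\varepsilon / 2^{i+1}$. The sketch below does this for the rationals in $[0, 1]$ with exact arithmetic; the enumeration order and the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import gcd

def rationals_in_unit_interval():
    """Enumerate each rational in [0, 1] exactly once, by increasing denominator."""
    yield Fraction(0)
    yield Fraction(1)
    q = 2
    while True:
        for p in range(1, q):
            if gcd(p, q) == 1:
                yield Fraction(p, q)
        q += 1

def cover(eps, count):
    """Open intervals J_i around the first `count` rationals, with m(J_i) = eps / 2**(i+1)."""
    intervals = []
    for i, r in zip(range(1, count + 1), rationals_in_unit_interval()):
        half = Fraction(eps) / 2 ** (i + 2)     # half of the length of J_i
        intervals.append((r - half, r + half))
    return intervals

eps = Fraction(1, 100)
intervals = cover(eps, 50)
total_length = sum(right - left for left, right in intervals)
print(float(total_length), "<", float(eps))
```

The lengths form a geometric series with sum $\varepsilon/2$, so the bound holds no matter how many rationals are covered.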
Now we modify the definition of the Riemann integral so that restrictions on the Riemann sums are permitted, eliminating the effect of the values of the integrated function on sets of measure zero. This is achieved by a finer control of the size of the segments in the partition in the vicinity of problematic points.
A positive real function $\delta$ on a finite interval $[a, b]$ is called a gauge. A partition $\Xi$ of the interval $[a, b]$ with representatives $\xi_i$ is $\delta$-gauged, if
$$\xi_i - \delta(\xi_i) < x_{i-1} \le \xi_i \le x_i < \xi_i + \delta(\xi_i)$$
for all $i$.
The norm $\delta$ of the partition used in the Riemann integration is the special case of a constant gauge $\delta(x) = \delta > 0$. In order to restrict the Riemann sums to gauged partitions with representatives in the definition of the integral, it is necessary to know that for every gauge $\delta$, a $\delta$-gauged partition with representatives exists. Otherwise the condition in the definition could be satisfied vacuously. This statement is called Cousin's lemma. It is proved by exploiting the standard properties of suprema:
For a given gauge $\delta$ on $[a, b]$, denote by $M$ the set of all points $x \in [a, b]$ such that a $\delta$-gauged partition with representatives can be found on $[a, x]$. $M$ is nonempty and bounded, thus it has a supremum $s$. If $s \ne b$, then there is a gauged partition with a representative at $s$, where $s$ is in the interior of the last segment. This leads to a contradiction. Thus the supremum is $b$, but then $\delta(b) > 0$ and thus $b$ itself belongs to the set $M$.
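Cousin's lemma only guarantees existence, but for many concrete gauges a gauged partition can be found by repeated bisection. The sketch below is such a search, not the proof above: it tries the endpoints and the midpoint of each segment as representatives and gives up after a fixed number of halvings. The function name, the candidate representatives and the sample gauge are all assumptions made for illustration.

```python
def gauged_partition(delta, a, b, max_depth=60):
    """Return a delta-gauged partition of [a, b] as a list of (left, tag, right)."""
    def fits(left, tag, right):
        # Gauge condition: tag - delta(tag) < left <= tag <= right < tag + delta(tag)
        d = delta(tag)
        return tag - d < left <= tag <= right < tag + d

    def build(left, right, depth):
        mid = (left + right) / 2
        for tag in (left, mid, right):          # candidate representatives
            if fits(left, tag, right):
                return [(left, tag, right)]
        if depth == 0:
            raise RuntimeError("no gauged partition found within the bisection budget")
        return build(left, mid, depth - 1) + build(mid, right, depth - 1)

    return build(a, b, max_depth)

# Assumed sample gauge on [0, 1]: it shrinks towards 0, so the segment
# touching 0 is forced to use the representative 0 itself.
def delta(x):
    return 0.01 if x == 0 else x / 2

partition = gauged_partition(delta, 0.0, 1.0)
for left, tag, right in partition:
    print(f"[{left:.7f}, {right:.7f}]  representative {tag:.7f}")
```

The gauge forces small segments near the "problematic" point 0 and allows long segments elsewhere, which is the finer control that a constant norm cannot provide.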
Now we can state the following generalization of the Riemann integral.
We call it the K-integral.
o
6.C.50. Order the numbers
There are many equivalent definitions and thus also names for this K-integral. A complicated approach was introduced by Arnaud Denjoy around 1912. Thus the space of real functions integrable on an interval $[a, b]$ in this sense is often called the Denjoy space. Other people involved were Nikolai Luzin and Oskar Perron, and the integral can also be found under their names. The simple and beautiful definition was introduced in 1957 by Jaroslav Kurzweil, a Czech mathematician. Much of the theory was developed by Ralph Henstock (1923-2007), an English mathematician.
$A := \int_0^{\pi/2} \cos x \sin^2 x\,dx$, $B := \int_0^{\pi/2} \sin^2 x\,dx$, $C := \int_{-1}^{1} -x^5\, 5^x\,dx$,
d:=r^dx+J^dx + j^dx
2tt tv 10
by size. O
6.C.51. By considering the geometric meaning of the definite integral, determine
(a) $\int_{-2}^{2} |x - 1|\,dx$;
(b) $\int_{-0.10}^{0.10} \operatorname{tg} x\,dx$;
(c) $\int_0^{2\pi} \sin x\,dx$.
o
6.C.52. Compute $\int_{-1}^{1} |x|\,dx$. O
6.C.53. Determine
$\int_{-1}^{1} x^5 \sin^2 x\,dx$.
o
6.C.54. Without using the symbols of differentiation and integration, express
$$\left(\int_{x^2}^{a} 3t^2 \cos t\,dt\right)'$$
with variable $x \in \mathbb{R}$ and a real constant $a$, if we differentiate with respect to $x$. O
Kurzweil-Henstock integral
Definition. A function $f$ defined on a finite interval $[a, b]$ has a Kurzweil-Henstock integral
$$I = \int_a^b f(x)\,dx,$$
if for every $\varepsilon > 0$, there exists a gauge $\delta$ such that for every $\delta$-gauged partition with representatives $\Xi$, the inequality $|S_\Xi - I| < \varepsilon$ is true for the corresponding Riemann sum $S_\Xi$.
6.3.17. Basic properties. When defining the K-integral, we only restrict the set of partitions for which the Riemann sums are taken into account. Hence if the function is Riemann integrable, then it is K-integrable, and the two integrals are equal.
For the same reason, the argumentation in Theorem 6.2.8 about simple properties of the Riemann integral applies, and the K-integral behaves in the same way. In particular, a linear combination $c f(x) + d g(x)$ of integrable functions is again integrable and its integral is $c \int_a^b f(x)\,dx + d \int_a^b g(x)\,dx$, etc. To prove this, it suffices to think through some modifications when discussing the refined partitions, which moreover should be $\delta$-gauged.
The Kurzweil integral behaves as anticipated with respect to the sets of zero measure:
Theorem. Let $f$ be a function defined on the interval $[a, b]$ which is zero almost everywhere. Then the K-integral $\int_a^b f(x)\,dx$ exists and is zero.
Proof. The proof is an illustration of the idea that the influence of values on a "small" set can be removed by a suitable choice of gauge. Denote by $M$ the corresponding set of zero measure, outside of which $f(x) = 0$, and write $M_k \subset [a, b]$, $k = 1, 2, \dots$, for the subset of the points for which $k - 1 < |f(x)| \le k$. Because all the sets $M_k$ have zero measure, each of them can be covered by a countable system of pairwise disjoint open intervals $J_{k,i}$ such that the sum of their lengths is arbitrarily small.
Define the gauge $\delta(x)$ for $x \in M_k \cap J_{k,i}$ so that the intervals $(x - \delta(x), x + \delta(x))$ are still contained in $J_{k,i}$. Outside of $M$, $\delta$ is defined arbitrarily.
For any $\delta$-gauged partition $\Xi$ of the interval $[a, b]$ with points $x_0, \dots, x_n$ and representatives $\xi_0, \dots, \xi_{n-1}$, the corresponding Riemann sum is bounded as
$$\Bigl|\sum_{j=0}^{n-1} f(\xi_j)(x_{j+1} - x_j)\Bigr| \le \sum_{k=1}^{\infty} \sum_{\xi_j \in M_k} |f(\xi_j)|\,(x_{j+1} - x_j) \le \sum_{k=1}^{\infty} k \sum_{i} m(J_{k,i}).$$
To guarantee that this bound is smaller than a given $\varepsilon$, it suffices to choose the covering by the intervals $J_{k,i}$ so that
$$\sum_{i} m(J_{k,i}) < \frac{1}{k}\, 2^{-k}\, \varepsilon.$$
Since $\sum_{k=1}^{\infty} 2^{-k} = 1$, the result follows. □
Corollary. If the values of f(x) are changed on a set of zero measure, the K-integrability of f(x) is not changed, and neither is the value of its integral.
6.3.18. The fundamental theorems of Calculus. We conclude this chapter with a few remarks on the properties of integration procedures from the point of view of expectations and reality.9
In 6.2.9, we dealt with the relation between the derivatives $f(t)$ and the antiderivatives (integrals) $F(t)$. Since $f(t)$ was assumed continuous, two essential claims collapse into one, resulting in
$$F(t) = \int_{t_0}^{t} f(x)\,dx$$
up to the choice of the value of $F(t_0)$. In particular,
$$\int_{t_0}^{t} F'(x)\,dx = F(t) - F(t_0)$$
for all choices of $F$.
More generally, this can be split into two claims which hold for the K-integral under much milder conditions:
1ST and 2ND fundamental theorems of calculus
Theorem. (1) Suppose the K-integral $\int_a^b f(x)\,dx$ exists. Then $F(t) = \int_a^t f(x)\,dx$ is continuous. The derivative $F'(t)$ exists and equals $f(t)$ almost everywhere.
(2) Suppose $F(t)$ is a continuous function on $[a, b]$ and suppose $f(t) = F'(t)$ exists for all but countably many exceptional points $t$ in $[a, b]$. If $f(t)$ is defined arbitrarily at those points, then the K-integral $\int_a^t f(x)\,dx$ exists for all $t \in [a, b]$ and equals $F(t) - F(a)$.
6.3.19. K-integrability and Lebesgue measure. We illustrate the claims in the latter theorem on the indicator function $\chi_{\mathbb{Q}}$ of the rational numbers. Clearly its K-integral
$$F(t) = \int_a^t \chi_{\mathbb{Q}}(x)\,dx$$
exists ($\chi_{\mathbb{Q}}$ is zero almost everywhere) and equals zero. Its derivative $F'(t)$ is identically zero, and equals $\chi_{\mathbb{Q}}$ almost everywhere. This is a good example of a bounded function which is not Riemann integrable, but is integrable in the more general sense.
9 A very good and elementary exposition of the K-integral can be found in the short paper by Robert G. Bartle, Return to the Riemann Integral, The American Mathematical Monthly, Vol. 103, No. 8 (1996), 625-632.
There are many more K-integrable functions than Riemann integrable functions. There is no difference between proper and improper integrals. More precisely, the K-integral $\int_a^b f(x)\,dx$ exists if and only if the one-sided limit
$$\lim_{c \to a^+} \int_c^b f(x)\,dx$$
is well defined, and then their values coincide; similarly for the upper limit $b$. This is due to the freedom in the choice of the gauges.
There is only an indirect proof that there are bounded functions on a compact interval which are not K-integrable, based on some set-theoretic arguments, but there are no explicit constructions of such functions available.
We say that a set of real numbers $M$ is measurable if the K-integral of its indicator function $\chi_M$ exists. The assignment $m : M \mapsto \int_a^b \chi_M(x)\,dx$ for all measurable sets $M \subset [a, b]$ has the properties of a measure. The set of such measurable sets $M$ is closed under finite intersections and countable unions. The measure $m$ is additive with respect to unions of at most countable systems of pairwise disjoint sets.
This measure coincides with the Lebesgue measure. This measure is used in another concept of integration, which is extremely useful in higher dimensional applications, the Lebesgue integral. We do not go into more details here. We remark that a real function $f$ is Lebesgue integrable if and only if its absolute value is K-integrable.
A big advantage of the K-integral compared to other concepts is the possibility of integrating many functions which are not integrable in absolute value. Compare the concepts of convergence and absolute convergence of series.
A typical example is the sine integral over all reals. The K-integral of the sinc function $\int_0^{\infty} \operatorname{sinc}(x)\,dx$ exists, while the absolute value $g(x) = |\operatorname{sinc}(x)|$ is not Lebesgue integrable. Such integrals are important in models for signal processing where it is necessary to aggregate potentially infinitely many interferences canceling each other by different signs.
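The contrast between the two integrals can be observed numerically. The sketch below assumes the unnormalized convention $\operatorname{sinc}(x) = \sin x / x$ (the text does not fix the normalization) and uses a plain midpoint rule; the function names and step counts are illustrative choices. The signed partial integrals settle near $\pi/2$, the known value of $\int_0^\infty \frac{\sin x}{x}\,dx$, while the partial integrals of the absolute value keep growing, roughly like a logarithm of the upper limit.

```python
import math

def sinc(x):
    """Unnormalized sinc (an assumed convention): sin(x)/x, with value 1 at x = 0."""
    return math.sin(x) / x if x != 0.0 else 1.0

def midpoint_rule(f, a, b, n):
    """Midpoint approximation of the integral of f over [a, b] with n segments."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def partial_integrals(periods, steps_per_period=400):
    """Integrals of sinc and |sinc| over [0, periods * pi]."""
    upper = periods * math.pi
    n = periods * steps_per_period
    signed = midpoint_rule(sinc, 0.0, upper, n)
    absolute = midpoint_rule(lambda x: abs(sinc(x)), 0.0, upper, n)
    return signed, absolute

signed_10, absolute_10 = partial_integrals(10)
signed_100, absolute_100 = partial_integrals(100)
print("signed:  ", signed_10, signed_100, "  pi/2 =", math.pi / 2)
print("absolute:", absolute_10, absolute_100)
```

The positive and negative lobes of the sinc function cancel each other, which is the same mechanism that lets a conditionally convergent series converge without converging absolutely.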
6.3.20. The convergence theorems. We have dealt with uniform convergence and Riemann integrability. With the K-integral, there is a much nicer and stronger theorem available. A special case is the monotone convergence theorem for uniformly bounded functions $f_0(x) \le f_1(x) \le \dots$.
Dominated convergence theorem
Theorem. Suppose $f_0, f_1, f_2, \dots$ are all K-integrable functions on an interval $[a, b]$, converging pointwise to the limit function $f$. If there are two K-integrable functions $g$ and $h$ satisfying
$$g(x) \le f_n(x) \le h(x)$$
for all $n \in \mathbb{N}$ and $x \in [a, b]$, then $f$ is K-integrable too, and
$$\int_a^b f(t)\,dt = \lim_{n \to \infty} \int_a^b f_n(t)\,dt.$$
For monotone convergence, there is a stronger result saying that a sufficient and necessary condition for the K-integrability of the pointwise limit is $\sup_n \int_a^b f_n(x)\,dx < \infty$.
This theorem could not be applied in our third example in 6.3.2. There the functions $f_n$ have a "bump" which gets larger but narrower when close to the origin. The functions cannot be dominated by an integrable function.
With the Riemann integral, a similar dominated convergence theorem can be proved, except that we have to guarantee the integrability of the pointwise limit $f$.
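The failure of domination for "bump" functions can be illustrated with a hypothetical stand-in (not necessarily the example of 6.3.2): triangular bumps $f_n$ of height $n$ supported on $[0, 2/n]$. Every $f_n$ has integral 1 over $[0, 1]$, yet $f_n(x) \to 0$ at every fixed point $x$, so the limit of the integrals differs from the integral of the limit. The names `bump` and `midpoint_rule` are assumptions of this sketch.

```python
def bump(n):
    """Triangular bump of height n supported on [0, 2/n]; its area equals 1."""
    def f(x):
        if 0.0 <= x <= 1.0 / n:
            return n * n * x
        if 1.0 / n < x <= 2.0 / n:
            return 2.0 * n - n * n * x
        return 0.0
    return f

def midpoint_rule(f, a, b, steps):
    """Midpoint approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# The integrals over [0, 1] stay at 1 for every n ...
integrals = {n: midpoint_rule(bump(n), 0.0, 1.0, 20_000) for n in (2, 10, 100)}
# ... while the values at a fixed point vanish as soon as 2/n drops below it.
values_at_fixed_point = [bump(n)(0.3) for n in (7, 10, 100, 1000)]

print(integrals)
print(values_at_fixed_point)
```

Since $f_n(1/n) = n$, any function dominating all $f_n$ from above must be at least $1/x$ at the points $x = 1/n$, which hints at why no integrable dominating function exists for this family.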
D. Extra examples for the whole chapter
6.D.1. Determine the significant properties of the function
$$f(x) = -\frac{x^2}{x+1}, \quad x \in \mathbb{R} \setminus \{-1\}.$$
O
By "significant properties" is meant items such as domain, range, zeros, extrema, stationary and inflection points, points of discontinuity, intervals of increasing, decreasing, convexity, concavity, and asymptotes if applicable. Briefly, all the relevant information required to sketch a graph.
6.D.2. Determine the significant properties of the function
$$f(x) = \frac{1 - x^3}{x^2}, \quad x \in \mathbb{R} \setminus \{0\}.$$
O
6.D.3. Determine the significant properties of the function
$$f(x) = \frac{x^3 - 3x^2 + 3x + 1}{x - 1}.$$
O
6.D.4. Determine the significant properties of the function
$$f(x) = \sqrt[3]{x}\, e^{-x}.$$
O
6.D.5. Determine the significant properties of the function
$$f(x) = \operatorname{arctg} \frac{x}{2 - x}.$$
O
6.D.6. Determine the significant properties of the function
$$f(x) = \frac{\ln x}{x}.$$
Especially, find the extremes, the points of inflection and the asymptotes and sketch its graph. O
6.D.7. Determine the significant properties of the function
$$f(x) = \ln(x^2 - 3x + 2) + x.$$
In particular, find the extremes, the points of inflection and the asymptotes. O
6.D.8. Determine the significant properties of the function
$$f(x) = (x^2 - 2)\, e^{x^2 - 1}.$$
In particular, find the extremes, the points of inflection and the asymptotes. O
6.D.9. Determine the significant properties of the function
$$f(x) = \ln(2x^2 - x - 1).$$
Among other things, find the extremes, the points of inflection and the asymptotes. O
6.D.10. Determine the significant properties of the function
$$f(x) = \frac{x^2 - 2}{x - 1}.$$
Among other things, find the extremes, the points of inflection and the asymptotes.
6.D.11. Using any basic formulas, determine a primitive function for the function
(a) $y = \sqrt{x \sqrt{x \sqrt{x}}}$, $x \in (0, +\infty)$;
(b) $y = (2^x + 3^x)^2$, $x \in \mathbb{R}$;
(c) y = 77=4p' xe (-M);
(d) $y = \frac{\cos x}{1 + \sin x}$.
6.D.13. Determine
J i+x4 dx,    x G .
6.D.14. Determine
S x2-2x+3 x £
6.D.15. For x G (0,1), compute
f ( 7 2+\\ H—7t=t=^ + 4sinx — 5cosx ) da.
6.D.17. Determine
Compute
(a) $\int x^n \ln x\,dx$, $x > 0$, $n \ne -1$;
(b) /       da,   a G R.
6.D.19. Determine
o
o
6.D.12. Find a primitive function for the function
y = *x + rra
on the interval (-2,2). O
o
o
o
6.D.16. Determine the indefinite integrals
(a) $\int \operatorname{arctg} x\,dx$, $x \in \mathbb{R}$;
(b) J^dx, a>0
using integration by parts. O
o
o
o
6.D.20. Compute the indefinite integral of the function
y= (x2+x+iy' x G
o
6.D.21. Determine
6.D.22. Integrate
6.D.23. Compute the integral
o
/ ^531 dx, x=£l.
o
J dx, l£Rs{l,2}.
O
6.D.24. For $x \in (0, \frac{\pi}{2})$, compute
(a) $\int \sin^3 x \cos^4 x\,dx$;
(b) $\int \frac{1 + \cos^2 x}{1 + \cos 2x}\,dx$;
(c) $\int 2 \sin^2 \frac{x}{2}\,dx$;
(d) $\int \cos^2 x\,dx$;
(e) $\int \cos^5 x \sqrt{\sin x}\,dx$;
(f) $\int \frac{dx}{\sin^2 x \cos^4 x}$;
(g) $\int \frac{dx}{\sin^3 x}$;
(h) $\int \frac{dx}{\sin x}$.
o
6.D.25. Compute the indefinite integral
$$\int \frac{1}{x^4 + 3x^3 + 5x^2 + 4x + 2}\,dx.$$
6.D.26. Compute the integral
"í siní
1 — cos2 t
dt.
6.D.27. Compute the integral
ln2 dx
3ex
6.D.28. Compute:
(i) J02 sin x sin 2x dx,
(ii) J sin2 x sin 2x dx.
6.D.29. Compute the improper integral
+oo
(a) / T^;
— oo +oo
(b) / f;
0
(c) J^^dx;
o
1
(d) J ln I a; I dx.
-l
6.D.30. Determine
6.D.31. Compute the improper integrals
6.D.32. Compute
371-/2
r Tf4^ dx.
J     l+sm x 0
+oo
/ midx.
+oo
6.D.33. By using the substitution method, compute
0 _   2 °° -(1/x)
J x e~x dx;    J —^— dx.
-oo 0
6.D.34. Compute the integrals
/ ^ dx>   / ^ ^    I ^ dx-
6.D.35. Find the values of a G R, for which
+oo
(a)  / ff G R;
i i
dx
(b) / £ G R;
o
+oo
(c) / sin(ax) dx G
6.D.36. For which p, g G R is the integral
o
o
o
+oo
r dx
J xP(lnx)i 2
finite? O 6.D.37. Decide if the following is true:
+oo
(a) /
— oo +oo
(b) / j% e R;
— oo +oo
(c) / ^¥+f dx G R. l
O
Solutions of the exercises
6A.4.
6A.5.p(5){x) = 12-5!;p(6)(x) = 0. 6A.6. 212 e2x + cos x. 6AJ.fi2S> (x) = -sinx + 226 e2a:. 6A.8. All of them.
6A.9. (a)u(0) = 6m/s; (b)t = 3s, s(3) = 16m; (c)v(A) = -2m/s,a(4) = -2m/s2 6A.12. f + i (x - 1) - i (x - l)2 + i (x - l)3. 6.A./3. (a) 1 + ^; (b) 1 - ^; (c) x - ^; (d) x + ^; (e) x + x2 + ^. 6.A./4. 2 (x - 1) - (x - l)2 + f (x - l)3 - i (x - l)4. 6.A./5.
3(l+s)3 '
. „;„ 1 o ^___-
180        6-1803 '
6A.16.x-lr;sml  sa — -       3; lim^oi---j-
6.A./7. E^=0 |i',n>8,ii6N.
6.A./S. (x - l)3 + 3 (x - l)2 + (x - 1) + 4.
6.A.26. 1    102.2 + 104.4!-6.A.27.1 — 3x + ^jx4; above the tangent line.
6.A.30. It is convex on the intervals $(-\infty, 0)$ and $(0, 1/2)$; concave on the interval $(1/2, +\infty)$. It has only one asymptote, the line $y = \pi/4$ (at $\pm\infty$).
6.A.31. (a) $y = 0$ at $-\infty$; (b) $x = 2$ (vertical), $y = 1$ at $\pm\infty$. 6.A.32. $y = 0$ for $x \to \pm\infty$. 6.A.33. $y = \ln 10$, $y = x + \ln 3$.
S~„ , Sup = ^, S~n, w = ^; yes, it is. 6.B.15. 2x3 + 3x2 - 2x - 13 + ^gff. 6.B.16.X3-\x + l + ^TT).
6.B.17. (a) 4b +      ~       (b) § - 4j +        + ■ 6.B./S. -4j + 4 -
x — 2       xJ x
6.B.19. 44 +
x+l    ' x2—4x+13' 2    . 2o;-3
6.B.20. 4 - ^ +
ar3"—a;+2 *
6.5.21. - - 4 + 4 - -4t.
00     A    -I-   B  -i- £. -I-       -Da:+E 1 FaJ+G
OJi^i. :c_2 ~r '    x        (3x2+x+4)2  ~*~ 3x2+x+t
6.B.23.1 + 4 - 3 + -V 6.B.24.(a)31n|x-2|;(b)7-42F. 6.B.44.       for a e (0,1), 00 else. 6.C.2.
i=0
converges for all real x. 6.C.3.
2n-l
2n
E(-!)K
n = l
converges for all real x.
(2n)!
6.C.4.
3(-l)"+1
/(-) = £■
X
converges for x e (—1,1].
6.C.5. It's good to realize we're expanding \ ln(x).
/(x) = ^(-l)8+1^(x-l)8,
Converges on interval (0, 2].
6.C.7. The error belongs to the interval (0,1/200).
6.C.8. 0 < If Ins dx < ^,     x Ins dx = In4 - f.
6.C.P.     Vx"dx = | (2v^2 - 1).
6.C.19. y = arctg x. 6.C.20. Exactly for x e
6.C.22.
(-l)i+122i   / 7TN2i+l
/(x) = l/2 + ^-
i=o   (2J + 1)!   V 4 The series converges for all x e _R.
6.C.23. e~ o £ (* -1)"; e~ o ^
6.C.24. /(x) = x, x e R; yes. 6.C.25. No.
"•l—'<»• Z^„=0   (2n)!   X '
6.C.27. (a) 1 -         + j^r! (b) \ -6C28 r°° _-_T2"+1
O.C.-iO. _,„ = _, (2n + l) n! X
6.C.29. y = arctg x.
6.C.30. Exactly for $x \in \left(-\frac{5}{2}, \frac{5}{2}\right)$, we have
$$\frac{1}{5 + 2x} = \frac{1}{5} \sum_{n=0}^{\infty} \left(-\frac{2}{5}\right)^n x^n.$$
6.C.31. (a) -cotg x - x + C; (b) tg x - cotg x + C.
6.C.32. (a) $-x^2 \cos x + 2x \sin x + 2\cos x + C$; (b) $e^x (x^2 - 2x + 2) + C$. 6.C.33. $\frac{x^2}{4}\left(2\ln^2 x - 2\ln x + 1\right) + C$. 6.C.34. $(2x - x^2)\,e^x + C$.
6.C.35. (a) (2a:+25)11 + C; (b)        + C; (c) -f e^3 + C; (d) 5 arcsin3 x + C; (e) ^ + C;
(f) arctg2     + C; (g)     arctg ^ ex j + C; (h)2 sin     - 2^ cos     + C.
6.C.36. For example 1 — x = t2x gives f ^_^2 ^4 dt; and \/x2 + x + 1 = x + j/ leads to f g2^_2.
6.C.37. $\frac{\sqrt{2}}{2} \operatorname{arctg}\left(\sqrt{2}\,\operatorname{tg} x\right) + C$.
6.C.38. Infinitely many.
6.C.39. For example, $f$ can attain a value of 1 at rational points of the interval and be zero at irrational points. 6.C.40. (a) $2$; (b) $\frac{\pi}{4} - \frac{\ln 2}{2}$; (c) $2 \ln(1 + \sqrt{2})$; (d) $2 - \frac{2}{e}$.
6.C.42. \b\ - \a\. 6.C.43. i In 2. 6.C.44. \ (e* - 2). 6.C.45.e- 5e_1. 6.C.46. §.
6.C.47. (a) 4; (b) ±=£^.
6.C.48. $p < q$.
6.C.49. $a > 0$; $b = 0$; $c > 0$.
6.C.50. $C < D = 0 < A < B$.
6.C.51. (a) 5; (b) 0; (c) 0.
6.C.52. 1.
6.C.53. 0.
6.C.54. $-6x^5 \cos x^2$.
6.D.1. The range is $(-\infty, 0] \cup [4, +\infty)$. The function $f$ is neither odd, even, nor periodic. It has a single discontinuity $x_0 = -1$ with
$$\lim_{x \to -1^+} f(x) = -\infty, \qquad \lim_{x \to -1^-} f(x) = +\infty.$$
The graph intersects the $x$ axis only at the origin. The function is positive for $x < -1$ and not positive for $x > -1$. It can be shown easily that
$$\lim_{x \to -\infty} f(x) = +\infty, \qquad \lim_{x \to +\infty} f(x) = -\infty;$$
$$f'(x) = -\frac{x(x+2)}{(x+1)^2}, \qquad f''(x) = -\frac{2}{(x+1)^3}, \qquad x \in \mathbb{R} \setminus \{-1\}.$$
This implies that $f$ is increasing on the intervals $[-2, -1)$, $(-1, 0]$ and decreasing on the intervals $(-\infty, -2]$, $[0, +\infty)$. At the stationary point $x_1 = 0$ it reaches a strict local maximum and at the stationary point $x_2 = -2$ it has a local minimum $y_2 = 4$. It is convex on the interval $(-\infty, -1)$ and concave on the interval $(-1, +\infty)$. It does not have a point of inflection. The line $x = -1$ is a vertical asymptote, the inclined asymptote at $\pm\infty$ is the line $y = -x + 1$. For example, $f(-3) = 9/2$, $f'(-3) = -3/4$, $f(1) = -1/2$, $f'(1) = -3/4$.
6.D.2. The function is defined and continuous on $\mathbb{R} \setminus \{0\}$. It is neither odd, even, nor periodic. It is negative on the interval $(1, +\infty)$. The only point of intersection of the graph with the axes is the point $[1, 0]$. At the origin, $f$ has a discontinuity of the second kind and its range is $\mathbb{R}$, because
$$\lim_{x \to 0} f(x) = +\infty, \qquad \lim_{x \to +\infty} f(x) = -\infty, \qquad \lim_{x \to -\infty} f(x) = +\infty.$$
Moreover,
$$f'(x) = -\frac{x^3 + 2}{x^3}, \quad x \in \mathbb{R} \setminus \{0\}, \qquad f''(x) = \frac{6}{x^4}, \quad x \in \mathbb{R} \setminus \{0\}.$$
The only stationary point is $x_1 = -\sqrt[3]{2}$. The function $f$ is increasing on the interval $[x_1, 0)$, decreasing on the intervals $(-\infty, x_1]$, $(0, +\infty)$. Hence at $x_1$ it has a local minimum $y_1 = 3/\sqrt[3]{4}$. It has no points of inflection. It is convex on each interval of its domain. The line $x = 0$ is a vertical asymptote and the line $y = -x$ is an inclined asymptote at $\pm\infty$.
6.D.3. The function is defined and continuous on $\mathbb{R} \setminus \{1\}$. It is neither odd, even, nor periodic. The points of intersection of the graph of $f$ with the axes are the points $[1 - \sqrt[3]{2}, 0]$ and $[0, -1]$. At $x_0 = 1$, the function has a discontinuity of the second kind and its range is $\mathbb{R}$, which follows from the limits
$$\lim_{x \to 1^-} f(x) = -\infty, \qquad \lim_{x \to 1^+} f(x) = +\infty, \qquad \lim_{x \to \pm\infty} f(x) = +\infty.$$
After the arrangement
$$f(x) = (x - 1)^2 + \frac{2}{x - 1}, \quad x \in \mathbb{R} \setminus \{1\},$$
it is not difficult to compute
$$f'(x) = \frac{2\left((x - 1)^3 - 1\right)}{(x - 1)^2}, \qquad f''(x) = \frac{2\left((x - 1)^3 + 2\right)}{(x - 1)^3}, \qquad x \in \mathbb{R} \setminus \{1\}.$$
The only stationary point is $x_1 = 2$. The function $f$ is increasing on the interval $[2, +\infty)$, decreasing on the intervals $(-\infty, 1)$, $(1, 2]$. Hence at the point $x_1$ it attains the local minimum $y_1 = 3$. It is convex on the intervals $(-\infty, 1 - \sqrt[3]{2})$, $(1, +\infty)$ and concave on the interval $(1 - \sqrt[3]{2}, 1)$. The point $x_2 = 1 - \sqrt[3]{2}$ is a point of inflection. The line $x = 1$ is a vertical asymptote. The function does not have any inclined asymptotes.
6.D.4. The function is defined and continuous on $\mathbb{R}$. It is neither odd, even, nor periodic. It attains positive values on the positive half-axis, negative values on the negative half-axis. The only point of intersection of the graph of $f$ with the axes is the point $[0, 0]$. The derivatives are
$$f'(x) = \frac{1 - 3x}{3\sqrt[3]{x^2}}\, e^{-x}, \quad x \in \mathbb{R} \setminus \{0\}, \qquad f'(0) = +\infty, \qquad f''(x) = \frac{9x^2 - 6x - 2}{9\sqrt[3]{x^5}}\, e^{-x}, \quad x \in \mathbb{R} \setminus \{0\}.$$
The only zero point of the first derivative is the point $x_0 = 1/3$. The function $f$ is increasing on the interval $(-\infty, 1/3]$ and decreasing on the interval $[1/3, +\infty)$. Hence at the point $x_0$, it has an absolute maximum $y_0 = 1/\sqrt[3]{3e}$. Since $\lim_{x \to -\infty} f(x) = -\infty$, its range is $(-\infty, y_0]$. The points of inflection are
$$x_1 = \frac{1 - \sqrt{3}}{3}, \qquad x_2 = 0, \qquad x_3 = \frac{1 + \sqrt{3}}{3}.$$
It is convex on the intervals $(x_1, x_2)$ and $(x_3, +\infty)$, concave on the intervals $(-\infty, x_1)$, $(x_2, x_3)$. The only asymptote is the line $y = 0$
at $+\infty$, i.e. $\lim_{x \to +\infty} f(x) = 0$.
6.D.5. The function is defined and continuous on $\mathbb{R} \setminus \{2\}$. It is neither odd, even, nor periodic. It is positive exactly on the interval $(0, 2)$. The only point of intersection of the graph of $f$ with the axes is the point $[0, 0]$. At $x_0 = 2$, a jump of size $\pi$ is observed, as follows from the limits
$$\lim_{x \to 2^-} f(x) = \frac{\pi}{2}, \qquad \lim_{x \to 2^+} f(x) = -\frac{\pi}{2}.$$
We have
$$f'(x) = \frac{1}{x^2 - 2x + 2}, \quad x \in \mathbb{R} \setminus \{2\}, \qquad f''(x) = -\frac{2(x - 1)}{(x^2 - 2x + 2)^2}, \quad x \in \mathbb{R} \setminus \{2\}.$$
The first derivative does not have a zero point. The function $f$ is therefore increasing at every point of its domain. Since
$$\lim_{x \to -\infty} f(x) = -\frac{\pi}{4}, \qquad \lim_{x \to +\infty} f(x) = -\frac{\pi}{4},$$
its range is the set $(-\pi/2, \pi/2) \setminus \{-\pi/4\}$. The function $f$ is convex on the interval $(-\infty, 1)$, concave on the intervals $(1, 2)$, $(2, +\infty)$. Thus the point $x_1 = 1$ is a point of inflection with $f(1) = \pi/4$. The only asymptote is the line $y = -\pi/4$ at $\pm\infty$.
6.D.6. Domain $\mathbb{R}^+$, global maximum at $x = e$, point of inflection $x = \sqrt{e^3}$, increasing on $(0, e)$, decreasing on $(e, \infty)$, concave on $(0, \sqrt{e^3})$, convex on $(\sqrt{e^3}, \infty)$, asymptotes $x = 0$ and $y = 0$, $\lim_{x \to 0^+} f(x) = -\infty$, $\lim_{x \to \infty} f(x) = 0$.
6.D.7. Domain $\mathbb{R} \setminus [1, 2]$. Local maximum at $x = \frac{1 - \sqrt{5}}{2}$, concave on the whole domain, asymptotes $x = 1$, $x = 2$. 6.D.8. Domain $\mathbb{R}$. Local minima at $-1$ and $1$, local maximum at $0$. Even function. Points of inflection $\pm\frac{\sqrt{2}}{2}$, no asymptotes. 6.D.9. Domain $\mathbb{R} \setminus [-\frac{1}{2}, 1]$. No global extremes. No inflection points, asymptotes $x = -\frac{1}{2}$, $x = 1$.
6.D.10. Domain $\mathbb{R} \setminus \{1\}$. No extremes. No points of inflection, convex on $(-\infty, 1)$, concave on $(1, \infty)$. Vertical asymptote $x = 1$. Inclined asymptote $y = x + 1$.
6D.U. (a) £ x-ftf; (b) £j + 2^ + ^; (c) =f£; (d) In (1 + sinx). &D./2.e* + 3arcsin§. 6.D.13. \ In (1 + x4) + C. 6.D.14. 2v/2arctg ^ + C.
6D.15. In J x ~x J + I arcsinx — 4cosx — 5 sinx + C.
6.D.16. (a) xarctgx - ln(1+3;2) + C; (b) ^ + C. 6.D.17. x - 2^/x~ + 2 in (1 + ^/xj + C. 6.D.18. (a) ^ lnx - 7^2 + C; (b) + C.
6.D.19. f In (x2 + 4x + 8) - i arctg s±l + C.
^tg ^ + 5^Fiy + G
6^.2/. i In £±£r + ^ arctg ^ + C.
6.D.22. § In |x - 1| - § In (x2 + x + 1) - ^ arctg ^ + C.
6D.23. In (| x - 1 | (x - 2)4) - ^ + x + C.
6.D.24. (a) - + C; (b) lJf + § + C; (c) x - sinx + C; (d) f +        + C; (e) f sini x - f sini x + £ sin^ x + C; (f)
4i + 2tgx-^ + C;(g)iln|tgf|-^^ + C;(h)ln|tgf|+G
6.D.25. \ ln(x2 + 2x + 2) - ± ln(x2 + x + 1) + iy^arctan ^12_tt^_^ + c.
6.D.26. iln(i±jff).
6.D.27. -| - fin 2. 6.D.2S.
co I.
(ll)    2 Sm x-
6.D.29. (a) tt; (b) +oo; (c) 20; (d) -2. 6.D.30. -oo. 6.D.31. ^tt. 6.D.32. 2^1 tt.
- |; 1.
e ' e      ez ' ez
6.D.35. (a) $a > 1$; (b) $a < 1$; (c) $a = 0$.
6.D.36. For $p > 1$, $q \in \mathbb{R}$ and for $p = 1$, $q > 1$. 6.D.37. (a) true; (b) false; (c) true.
CHAPTER 7
Continuous tools for modelling
How do we manage non-linear objects? - mainly by linear tools again...
A. Orthogonal systems of functions
If we want to understand three-dimensional objects, we often use (one or more) two-dimensional plane projections of them. The orthogonal projections are special in providing the closest images of the points of the objects in the chosen plane.
Similarly, we can understand complicated functions in terms of simpler ones. We consider their projections into the (real) vector space generated by those chosen functions. Perhaps we recall from Chapter 2 that the orthogonal projections
In this chapter, we mainly deal with applications of the tools of differential and integral calculus. We consider a variety of problems related to functions of one real variable.
The tools and procedures are similar to the ones shown in Chapter 3, i.e. we consider linear combinations of selected generators and linear transformations.
This chapter serves also as a useful consolidation of background material before considering functions of several variables, differential equations, and the calculus of variations.
We begin by asking how to approximate a given function by linear combinations from a given set of generators. Approximation considerations lead to the general concept of distance. We illustrate the concepts on rudiments of the Fourier series. Our intuition from the Euclidean spaces of low dimensions is extended to infinite dimensional spaces, particularly the concept of orthogonal projections.
The next part of this chapter focuses on integral operators. These are linear mappings on functions which are defined in terms of integrals. Especially, we pay attention to convolutions and Fourier analysis. Throughout all these considerations, we work with real or complex valued functions of one variable.
Only then do we introduce the elements of the theory of metric spaces. This should shed light on the concepts of convergence and approximation on infinite dimensional spaces of functions. It will also cover our needs in analysis on Euclidean spaces $\mathbb{R}^n$ in the next chapter.
1. Fourier series
7.1.1. Spaces of functions. As usual, we begin by choosing appropriate sets of functions to use. We want enough functions so that our models can conveniently be applied in practice. At the same time, the functions must be sufficiently "smooth" so that we can integrate and differentiate as needed.
All functions are defined on an interval $I = [a, b] \subset \mathbb{R}$, where $a < b$. The interval may be bounded (i.e., both $a$ and $b$ are finite), or unbounded (i.e., either $a = -\infty$, or $b = +\infty$, or both).
CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING
were easily computed in terms of inner products. Now we do the same for the infinite dimensional spaces of functions.
The inner product mimics the product of scalars. Actually, the simplest way is to introduce the inner product on suitable vector spaces of functions on a given interval $I = [a, b] \subset \mathbb{R}$ in this way:
$$\langle f, g \rangle = \int_a^b f(x)\,g(x)\,dx.$$
We refer to this inner product as L2, see 7.1.3 for more information, including the complex valued functions.
Such a scalar product allows us to calculate the projections to finite dimensional subspaces in the same way as we did in finite dimensional vector spaces.
7.A.1. Given the vector subspace $\langle x^2, 1/x \rangle$ of the space of real-valued functions on the interval $[1, 2]$, with the L2 product, complete the function $1/x$ to an orthogonal basis of the subspace. Determine the orthogonal projection of the function $x$ onto it, and compute the distance of the function $x$ from this subspace.
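Solution. First, consider the basis. It is required that the function $1/x$ be one of the vectors of the basis. The vector space in question is generated by two linearly independent functions, thus its dimension is 2. All of the vectors in it are of the form $a \cdot \frac{1}{x} + b \cdot x^2$ for some $a, b \in \mathbb{R}$. It remains to find one more vector of the basis which is orthogonal to the function $f_1 = 1/x$. According to the Gram-Schmidt process, we seek it in the form $f_2 = x^2 + k \cdot \frac{1}{x}$, $k \in \mathbb{R}$. The real constant $k$ can be determined from the condition of orthogonality:
$$0 = \left\langle \frac{1}{x},\, x^2 + k \cdot \frac{1}{x} \right\rangle = \left\langle \frac{1}{x},\, x^2 \right\rangle + k \left\langle \frac{1}{x},\, \frac{1}{x} \right\rangle.$$
Therefore,
$$k = -\frac{\left\langle \frac{1}{x},\, x^2 \right\rangle}{\left\langle \frac{1}{x},\, \frac{1}{x} \right\rangle} = -\frac{\int_1^2 \frac{1}{x} \cdot x^2\,dx}{\int_1^2 \frac{1}{x} \cdot \frac{1}{x}\,dx} = -\frac{3/2}{1/2} = -3.$$
Thus, the requested orthogonal basis is $\left(\frac{1}{x},\; x^2 - \frac{3}{x}\right)$.
Next, we calculate the projection $p_x$ of the function $x$ onto this subspace (see (1) on page 113). We find
$$p_x = \frac{\left\langle x,\, \frac{1}{x} \right\rangle}{\left\langle \frac{1}{x},\, \frac{1}{x} \right\rangle} \cdot \frac{1}{x} + \frac{\left\langle x,\, x^2 - \frac{3}{x} \right\rangle}{\left\langle x^2 - \frac{3}{x},\, x^2 - \frac{3}{x} \right\rangle} \left(x^2 - \frac{3}{x}\right) = \frac{2}{x} + \frac{15}{34} \left(x^2 - \frac{3}{x}\right).$$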
Spaces of piecewise smooth functions
We denote by $S^0 = S^0[a,b]$ the set of all piecewise continuous functions on $I = [a,b]$ with real or complex values. Otherwise put, all functions $f$ in $S^0 = S^0[a,b]$ have only finitely many points of discontinuity on bounded intervals. Moreover, $f$ has finite one-sided left and right limits at every point in $[a,b]$. In particular, $f$ is bounded on all bounded subintervals.
For every natural number $k \ge 1$, we consider the set of all piecewise continuous functions $f$ such that all their derivatives up to order $k$ (inclusive) lie in $S^0$. We denote this set by $S^k[a,b]$, or briefly $S^k$. Note that the derivatives of functions in $S^k$ need not exist at all points, but their one-sided limits must exist.
If the interval $I$ is unbounded, we often consider only those functions with compact support. A function has compact support if it is identically zero outside some bounded interval of the real line. For unbounded intervals, we denote by $S^k_c$ the subset of those functions in $S^k$ which have compact support.
Functions in $S^0$ are always Riemann integrable on the bounded interval $I = [a,b]$, with both
$$\Big|\int_a^b f(x)\,dx\Big| < \infty \quad\text{and}\quad \int_a^b |f(x)|\,dx < \infty.$$
Both integrals are finite for unbounded intervals if the function $f$ has compact support.
7.1.2. Distance between functions. The properties of limits and derivatives ensure that $S^k$ and $S^k_c$ are vector spaces. In finite-dimensional spaces, the distance between vectors can be expressed by means of the differences of the coordinate components. In spaces of functions, we proceed analogously and utilize the absolute value of real or complex numbers and the Euclidean distance in the following way:
The $L_1$-distance of functions
The $L_1$-distance between functions $f$ and $g$ in $S^0$ is defined by
$$\|f - g\|_1 = \int_a^b |f(x) - g(x)|\,dx.$$
If $g = 0$, then the distance from $f$ to the zero function, namely $\|f\|_1$, is called the $L_1$-norm (i.e. length, or size) of $f$.
The $L_1$-distance between functions $f$ and $g$ (when both are real valued) expresses the area enclosed by the graphs of these functions, regardless of which function takes greater values. We observe that $\|f - g\|_1 \ge 0$.
Since $f$ and $g$ are both piecewise continuous functions, $\|f - g\|_1 = 0$ only if $f$ and $g$ differ in their values at most at the points of discontinuity, and hence at only finitely many points on any bounded interval. Recall that we can change
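Finally, the distance of a vector from the subspace is given by the norm of the difference between this vector and its projection. In this case:
$$\|x - p_x\|_2 = \left(\int_1^2 \Big(x - \frac{2}{x} - \frac{15}{34}\Big(x^2 - \frac{3}{x}\Big)\Big)^2 dx\right)^{1/2} = \frac{1}{\sqrt{408}} \approx 0.0495.$$
□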
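The numbers in this solution can be double-checked with exact rational arithmetic. The following sketch is only an illustration and not part of the original exercise: the helper names `integral_power` and `inner` are hypothetical, and functions are encoded as dictionaries mapping exponents to coefficients, which suffices here because no product of the functions involved contains the power $x^{-1}$.

```python
from fractions import Fraction

def integral_power(p, a=1, b=2):
    """Exact integral of x**p over [a, b] for an integer p != -1."""
    q = p + 1
    return (Fraction(b) ** q - Fraction(a) ** q) / q

def inner(f, g):
    """L2 inner product on [1, 2]; f and g are dicts {exponent: coefficient}."""
    return sum(cf * cg * integral_power(pf + pg)
               for pf, cf in f.items() for pg, cg in g.items())

f1 = {-1: Fraction(1)}             # the function 1/x
x2 = {2: Fraction(1)}              # the function x^2
x = {1: Fraction(1)}               # the function x to be projected

# Gram-Schmidt step: f2 = x^2 + k * (1/x) must be orthogonal to f1.
k = -inner(f1, x2) / inner(f1, f1)
f2 = {2: Fraction(1), -1: k}

# Projection coefficients of x with respect to the orthogonal basis (f1, f2).
c1 = inner(x, f1) / inner(f1, f1)
c2 = inner(x, f2) / inner(f2, f2)

# Squared distance via Pythagoras: ||x||^2 minus the squared norm of the projection.
dist_sq = inner(x, x) - c1 ** 2 * inner(f1, f1) - c2 ** 2 * inner(f2, f2)
print(k, c1, c2, dist_sq)
```

The squared distance is computed from the orthogonality of the basis, so no square roots are needed until the very end.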
7.A.2. Consider the real vector space of functions on the interval $[1,2]$ generated by the functions $\frac{1}{x}, \frac{1}{x^2}, \frac{1}{x^3}$, with the $L_2$ product. Complete the function $\frac{1}{x}$ to an orthogonal basis of this space. Determine also the projections of the functions $\frac{1}{x^4}$ and $x$ onto the above vector space. Then find their distances from this vector space.
Solution. As in the previous exercise, use the Gram-Schmidt orthogonalization process (with the given scalar product). Successively,
$$f_1(x) = \frac{1}{x},\qquad f_2(x) = \frac{1}{x^2} - \frac{3}{4x},\qquad f_3(x) = \frac{1}{x^3} - \frac{3}{2x^2} + \frac{13}{24x}.$$
The projection of $x$ is $2f_1 + 96\big(-\frac{3}{4} + \ln 2\big)f_2 + 5760\big(-\frac{3}{2}\ln 2 + \frac{25}{24}\big)f_3$, while the distance is approximately 0.035.
The illustrations show the functions $x$ and $1/x^4$ and their approximations. We can see that the function $\frac{1}{x^4}$, which is the one whose shape is similar to that of the generators, is approximated much better by the projection. □
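The projections in this exercise involve $\ln 2$, so a numerical check is convenient. The sketch below assumes the generators are $1/x$, $1/x^2$, $1/x^3$ on $[1,2]$ and approximates the $L_2$ inner product with a composite Simpson rule; the helpers `simpson`, `inner`, `gram_schmidt` and `project` are hypothetical names introduced only for this illustration.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def inner(f, g):
    """L2 inner product on [1, 2], approximated numerically."""
    return simpson(lambda t: f(t) * g(t), 1.0, 2.0)

def gram_schmidt(funcs):
    """Orthogonalize funcs in the given order (without normalization)."""
    basis = []
    for f in funcs:
        coeffs = [inner(f, g) / inner(g, g) for g in basis]
        prev = list(basis)
        basis.append(lambda t, f=f, coeffs=coeffs, prev=prev:
                     f(t) - sum(c * g(t) for c, g in zip(coeffs, prev)))
    return basis

generators = [lambda t: 1 / t, lambda t: 1 / t**2, lambda t: 1 / t**3]
basis = gram_schmidt(generators)

def project(target):
    """Return the projection coefficients and the distance of target from the span."""
    coeffs = [inner(target, g) / inner(g, g) for g in basis]
    residual = lambda t: target(t) - sum(c * g(t) for c, g in zip(coeffs, basis))
    return coeffs, math.sqrt(inner(residual, residual))

coeffs_x, dist_x = project(lambda t: t)           # the function x
coeffs_q, dist_q = project(lambda t: 1 / t**4)    # the function 1/x^4
print(coeffs_x, dist_x)
print(coeffs_q, dist_q)
```

The distance is computed directly from the residual rather than from a difference of squared norms, which avoids cancellation when the residual is small.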
But there are plenty of inner products on functions. We mention two of them in the following exercises.
7.A.3. Verify that for each interval $I = [a,b] \subset \mathbb{R}$ and positive continuous function $\omega$ on $I$, the formula
the value of any function at a finite number of points, without changing the value of the integral.
If, in particular, $f$ and $g$ are both continuous on $[a,b]$, then $\|f - g\|_1 = 0$ implies $f(x) = g(x)$ for all $x \in [a,b]$. Indeed, if $f(x_0) \ne g(x_0)$ at a point $x_0$, $a \le x_0 \le b$, and if $f$ and $g$ are both continuous at $x_0$, then $f$ and $g$ also differ on some small neighbourhood of $x_0$, and this neighbourhood, in turn, contributes a non-zero value to the integral, so that $\|f - g\|_1 > 0$.
If we have three functions $f$, $g$, and $h$, then, of course,
$$\int_a^b |f(x) - g(x)|\,dx = \int_a^b |f(x) - h(x) + h(x) - g(x)|\,dx \le \int_a^b |f(x) - h(x)|\,dx + \int_a^b |h(x) - g(x)|\,dx,$$
so the usual triangle inequality
$$\|f - g\|_1 \le \|f - h\|_1 + \|h - g\|_1$$
holds. To derive this inequality, we used only the triangle inequality for the scalars; thus it is valid for functions $f, g \in S^0$ with complex values as well.
$\|f - g\|_1$ is not the only way to measure the distance between two functions $f$ and $g$. Here is another:
The $L_2$-distance
The $L_2$-distance between functions $f$ and $g$ in $S^0$ is defined by
$$\|f - g\|_2 = \left(\int_a^b |f(x) - g(x)|^2\,dx\right)^{1/2}.$$
If $g = 0$, then $\|f\|_2$, the distance from $f$ to the zero function, is called the $L_2$-norm of $f$.
Clearly $\|f\|_2 \ge 0$. Moreover, $\|f\|_2 = 0$ implies that $f(x) = 0$ for all $x$ except for a finite set of points in any bounded interval. As above for the $L_1$-norm, $\|f - g\|_2 = 0$ only if $f$ and $g$ differ in their values at most at the points of discontinuity, and hence at only finitely many points on any bounded interval. In particular, if $f$ and $g$ are both continuous for all $x$, then $\|f - g\|_2 = 0$ implies $f(x) = g(x)$ for all $x$.
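To make the two distances concrete, we can evaluate them numerically for a simple pair of functions. The sketch below takes $f(x) = x$ and $g(x) = x^2$ on $[0,1]$ (our own choice of example, not from the text) and uses a composite Simpson rule; the function names are hypothetical.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def l1_distance(f, g, a, b):
    """L1 distance: the area enclosed between the two graphs."""
    return simpson(lambda x: abs(f(x) - g(x)), a, b)

def l2_distance(f, g, a, b):
    """L2 distance: square root of the integral of the squared difference."""
    return math.sqrt(simpson(lambda x: abs(f(x) - g(x)) ** 2, a, b))

f = lambda x: x
g = lambda x: x * x
d1 = l1_distance(f, g, 0.0, 1.0)   # exact value is 1/6
d2 = l2_distance(f, g, 0.0, 1.0)   # exact value is sqrt(1/30)
print(d1, d2)
```

The exact values quoted in the comments follow from $\int_0^1 (x - x^2)\,dx = 1/6$ and $\int_0^1 (x - x^2)^2\,dx = 1/30$.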
The square of $\|f\|_2$ for a function $f$ is
$$\|f\|_2^2 = \int_a^b |f(x)|^2\,dx$$
and it is related to the well-defined mapping of pairs of real or complex functions to scalars (symmetric and bilinear in the real case)
$$\langle f, g\rangle = \int_a^b f(x)\overline{g(x)}\,dx$$
since
$$\langle f, f\rangle = \int_a^b f(x)\overline{f(x)}\,dx = \int_a^b |f(x)|^2\,dx = \|f\|_2^2.$$
We can therefore use all the properties of inner products in unitary spaces as described in Chapter 3. In particular, the
$$\langle f, g\rangle_\omega = \int_a^b f(x)g(x)\omega(x)\,dx$$
defines an inner product on the continuous functions on $I$. Verify that choosing $I = (-1,1)$ and $\omega(x) = (1 - x^2)^{-1/2}$, the functions
$$T_k(x) = \cos(k\arccos(x)),\qquad k \in \mathbb{N},$$
form an orthogonal system of polynomials with respect to this inner product. These polynomials are called Chebyshev polynomials.
Solution. We compare the defining formula with the $L_2$ inner product above: Consider the substitution $x = \varphi(z)$, where $\varphi$ is the inverse function to $z = \int_a^x \omega(t)\,dt$. The inverse exists, since $\omega$ is positive and so $z$ is a strictly increasing function of $x$. Thus, $dz = \omega(x)\,dx$ and
$$\langle f, g\rangle_\omega = \int_a^b f(x)g(x)\omega(x)\,dx = \int_{\varphi^{-1}(a)}^{\varphi^{-1}(b)} f(\varphi(z))\,g(\varphi(z))\,dz.$$
In particular, the "coordinate change" $x = \varphi(z)$ identifies the vector space of continuous functions on $I$ with the space of continuous functions on another interval equipped with the $L_2$ inner product, and so $\langle\ ,\ \rangle_\omega$ is an inner product. Of course, the properties of the inner product can be checked directly.
In this special case, $\omega(x) = -\frac{d}{dx}\arccos(x)$, and thus the substitution $z = \arccos(x)$, i.e. $x = \cos z$, yields
$$\langle T_r, T_s\rangle_\omega = \int_0^\pi \cos(rz)\cos(sz)\,dz.$$
We are dealing with improper Riemann integrals (integrating the unbounded function $\omega$), but this does not cause any problem. By using the well-known trigonometric formula
$$\cos(rz)\cos(sz) = \tfrac{1}{2}\big(\cos((r - s)z) + \cos((r + s)z)\big)$$
the integral is easily evaluated. It vanishes for all $r \ne s$.
The remaining task is to show that $T_k(x)$ is a polynomial. The definition itself shows directly that
$$T_0(x) = 1,\qquad T_1(x) = x.$$
By the above trigonometric formula,
$$T_{n+1}(x) + T_{n-1}(x) = \cos((n+1)\arccos(x)) + \cos((n-1)\arccos(x)) = 2\cos(n\arccos(x))\cos(\arccos(x)) = 2x\,T_n(x),$$
so that for all positive $n$,
$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x).$$
inner product satisfies both linearity in the first argument and the Hermitian symmetry
$$\langle f, g\rangle = \overline{\langle g, f\rangle}.$$
It is a symmetric bilinear mapping in the real case.
7.1.3. Orthogonality. In Chapters 2 and 3, we dealt with finite-dimensional real or complex vector spaces. Most properties derived there concerned pairs or finite sets of vectors. Now, we can do just the same with functions. We restrict our definition of the inner product to any vector subspace generated by only finitely many functions $f_1, \dots, f_k$ (over real or complex numbers, according to our need). We again obtain a well-defined inner product on this finite-dimensional vector subspace and so all our considerations from finite-dimensional linear algebra apply again.
As an example, consider the monomial functions $f_i(x) = x^i$, $i = 0, \dots, k$. In $S^0$, these generate the $(k+1)$-dimensional vector subspace $\mathbb{R}_k[x]$ of all polynomials of degree at most $k$. The inner product of two such polynomials is given by integration. Every polynomial of degree at most $k$ is uniquely expressed as a linear combination of the generators $f_0, \dots, f_k$. Moreover, if we can arrange the choice of generators to satisfy
$$(1)\qquad \langle f_i, f_j\rangle = \begin{cases} 0 & \text{for } i \ne j,\\ 1 & \text{for } i = j,\end{cases}$$
then the computations become much easier than they would otherwise.
Recall the Gram-Schmidt orthogonalization procedure, see 2.3.20. This procedure transforms any system of linearly independent generators $f_i$ into new (again linearly independent) orthogonal generators $g_i$ of the same subspace, i.e. $\langle g_i, g_j\rangle = 0$ for all $i \ne j$. We can calculate them step by step. Put $g_1 = f_1$, and
$$g_{\ell+1} = f_{\ell+1} + a_1g_1 + \dots + a_\ell g_\ell,\qquad a_i = -\frac{\langle f_{\ell+1}, g_i\rangle}{\|g_i\|^2}$$
for $\ell \ge 1$.
To illustrate, we apply this procedure to the three polynomials $1, x, x^2$ on the interval $[-1,1]$. Put $g_1 = 1$, and generate the sequence
$$g_1 = 1,$$
$$g_2 = x - \frac{1}{\|g_1\|^2}\Big(\int_{-1}^{1} x\cdot 1\,dx\Big)\cdot g_1 = x - 0 = x,$$
$$g_3 = x^2 - \frac{1}{\|g_1\|^2}\Big(\int_{-1}^{1} x^2\cdot 1\,dx\Big)\cdot g_1 - \frac{1}{\|g_2\|^2}\Big(\int_{-1}^{1} x^2\cdot x\,dx\Big)\cdot g_2 = x^2 - \frac{1}{3}.$$
The corresponding orthogonal basis of the space $\mathbb{R}_2[x]$ of all polynomials of degree less than three on the interval $[-1,1]$ is $1,\ x,\ x^2 - 1/3$. Rescaling by appropriate numbers so that
This is the recurrent definition of Chebyshev polynomials. That all Tk(x) are polynomials now follows by induction.
□
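The recurrence and the orthogonality relation can be checked numerically. The sketch below is an illustration under our own naming (`chebyshev`, `weighted_inner` are hypothetical helpers): it evaluates $T_n$ by the three-term recurrence and computes the weighted inner product after the substitution $x = \cos z$, using a midpoint rule on $[0, \pi]$ so that the singularity of the weight at $x = \pm 1$ never has to be evaluated.

```python
import math

def chebyshev(n, x):
    """Evaluate T_n(x) via T_{k+1} = 2x T_k - T_{k-1}, with T_0 = 1 and T_1 = x."""
    prev, cur = 1.0, x
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur

def weighted_inner(r, s, n=4000):
    """<T_r, T_s> with weight (1 - x^2)^(-1/2), computed in the variable z = arccos(x)."""
    h = math.pi / n
    total = 0.0
    for i in range(n):
        x = math.cos((i + 0.5) * h)        # midpoint of the i-th subinterval of [0, pi]
        total += chebyshev(r, x) * chebyshev(s, x)
    return total * h

# The recurrence reproduces the defining formula cos(n arccos x).
check_value = chebyshev(5, 0.3) - math.cos(5 * math.acos(0.3))

print(weighted_inner(2, 3))   # distinct indices: close to zero
print(weighted_inner(3, 3))   # equal positive indices: close to pi / 2
print(weighted_inner(0, 0))   # index zero: close to pi
```

The diagonal values show that the system is orthogonal but not orthonormal: the squared norms are $\pi$ for $T_0$ and $\pi/2$ for all other $T_n$.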
the basis elements all have length 1, yields the orthonormal basis
7.A.4. Show that the choice of the weight function $\omega(x) = e^{-x}$ and the interval $I = [0, \infty)$ in the previous example leads to an inner product for which the Laguerre polynomials
$$L_n(x) = \sum_{k=0}^{n} \binom{n}{k}\frac{(-1)^k}{k!}\,x^k$$
form an orthonormal system. ○
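For the Laguerre polynomials, the claim can be tested exactly for the first few indices, because $\int_0^\infty x^m e^{-x}\,dx = m!$. The following sketch is only a finite check, not a proof; it assumes the standard formula $L_n(x) = \sum_{k=0}^{n} \binom{n}{k}\frac{(-1)^k}{k!}x^k$, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def laguerre(n):
    """Coefficients of L_n, lowest degree first: C(n, k) * (-1)^k / k!."""
    return [Fraction((-1) ** k * comb(n, k), factorial(k)) for k in range(n + 1)]

def inner(p, q):
    """Inner product with weight e^(-x) on [0, infinity), using the moments m!."""
    return sum(a * b * factorial(i + j)
               for i, a in enumerate(p) for j, b in enumerate(q))

# Gram matrix of L_0, ..., L_4; an orthonormal system gives the identity matrix.
gram = [[inner(laguerre(m), laguerre(n)) for n in range(5)] for m in range(5)]
for row in gram:
    print([str(entry) for entry in row])
```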
7.A.5. Check that the orthonormal systems obtained in the previous two examples coincide with the result of the corresponding Gram-Schmidt orthogonalisation procedure applied to the system $1, x, x^2, \dots, x^n, \dots$, using the inner products $\langle\ ,\ \rangle_\omega$, possibly only up to signs. ○
Given a finite-dimensional vector (sub)space of functions, calculate first the orthogonal (or orthonormal) basis of this subspace by the Gram-Schmidt orthogonalization process (see 2.3.20). Then determine the orthogonal projection as before. See the formula (1) on page 113.
7.A.6. Given the vector subspace $\langle \sin(x), x\rangle$ of the space of real-valued functions on the interval $[0, \pi]$, complete the function $x$ to an orthogonal basis of the subspace and determine the orthogonal projection of the function \ sin(x) onto it. ○
7.A.7. Given the vector subspace $\langle \cos(x), x\rangle$ of the space of real-valued functions on the interval $[0, \pi]$, complete the function $\cos(x)$ to an orthogonal basis of the subspace and determine the orthogonal projection of the function $\sin(x)$ onto it. ○
B. Fourier series
Having a countable system of orthogonal functions $T_k$, $k = 0, 1, \dots$, as in the examples above, we may sequentially project a given function $f$ to the subspaces $V_k = \langle T_0, \dots, T_k\rangle$. If the limit of these projections exists, this determines a series built on linear combinations of $T_k$. Under additional conditions, this should allow us to differentiate or integrate the function $f$ in a similar way as we did with power series.
We consider one particular orthogonal system of periodic functions, namely that of J. B. J. Fourier.
The periodic functions are those describing periodic processes, i.e. $f(t+T) = f(t)$ for some positive constant $T \in \mathbb{R}$,
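$$h_1 = \frac{1}{\sqrt{2}},\qquad h_2 = \sqrt{\frac{3}{2}}\,x,\qquad h_3 = \frac{1}{2}\sqrt{\frac{5}{2}}\,(3x^2 - 1).$$
For example, $h_1 = \|g_1\|^{-1}g_1$ and $\|g_1\|^2 = \int_{-1}^{1} 1^2\,dx = 2$.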
We could easily continue this procedure in order to find orthonormal generators of $\mathbb{R}_k[x]$. The resulting polynomials are called Legendre polynomials.
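The Gram-Schmidt steps for the monomials on $[-1,1]$ can be carried out exactly with rational arithmetic. The sketch below is an illustration with hypothetical helper names; polynomials are stored as coefficient lists (lowest degree first), and the inner product uses $\int_{-1}^{1} x^m\,dx = 2/(m+1)$ for even $m$ and $0$ for odd $m$.

```python
from fractions import Fraction

def inner(p, q):
    """Exact L2 inner product on [-1, 1] of two polynomials given as coefficient lists."""
    total = Fraction(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if (i + j) % 2 == 0:                     # odd powers integrate to zero
                total += Fraction(a) * Fraction(b) * Fraction(2, i + j + 1)
    return total

def gram_schmidt(polys):
    """Orthogonalize the polynomials in the given order (without normalization)."""
    basis = []
    for f in polys:
        g = [Fraction(c) for c in f]
        for h in basis:
            coef = inner(f, h) / inner(h, h)
            for k, c in enumerate(h):
                g[k] -= coef * c                     # subtract the component along h
        basis.append(g)
    return basis

monomials = [[0] * i + [1] for i in range(4)]        # 1, x, x^2, x^3
basis = gram_schmidt(monomials)
norms_sq = [inner(g, g) for g in basis]
for g, n2 in zip(basis, norms_sq):
    print([str(c) for c in g], str(n2))
```

Dividing each orthogonal polynomial by the square root of its squared norm gives the orthonormal generators; only that last step leaves the rational numbers.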
Considering all Legendre polynomials $h_i$, $i = 1, 2, \dots$, we have an infinite orthonormal set of generators such that polynomials of all degrees are uniquely expressed as their finite linear combinations.
7.1.4. Orthogonal systems of functions. Generalizing the latter example, suppose we have three polynomials $h_1, h_2, h_3$ forming an orthonormal set. For any polynomial $h$, we can put
$$H = \langle h, h_1\rangle h_1 + \langle h, h_2\rangle h_2 + \langle h, h_3\rangle h_3.$$
We claim that $H$ is the (unique) linear combination of $h_1, h_2, h_3$ which minimizes the $L_2$-distance $\|h - H\|_2$. See 3.4.3.
The coefficients for the best approximation of a given function by a function from a selected subspace are obtained by the integration introduced in the definition of the inner product.
This example of computing the best approximation of $h$ by a linear combination of the given orthonormal generators suggests the following generalization:
Orthogonal systems of functions
Every (at most) countable system of linearly independent functions in $S^0[a,b]$ such that the inner product of each pair of distinct functions is zero is called an orthogonal system of functions. If all the functions $f_n$ in the sequence are pairwise orthogonal, and if for all $n$ the norm $\|f_n\|_2 = 1$, we talk about an orthonormal system of functions.
Consider an orthogonal system of functions $f_n \in S^0[a,b]$ and suppose that for (real or complex) constants $c_n$, the series
$$F(x) = \sum_{n=0}^{\infty} c_nf_n(x)$$
converges uniformly on a finite interval $[a,b]$. Notice that the limit function $F(x)$ does not need to belong to $S^0[a,b]$, but this is not our concern now.
By uniform convergence, the inner product $\langle F, f_n\rangle$ can be expressed in terms of the particular summands (see Corollary 6.3.7), obtaining
$$\langle F, f_n\rangle = \sum_{m=0}^{\infty} c_m\int_a^b f_m(x)\overline{f_n(x)}\,dx = c_n\|f_n\|_2^2,$$
called the period of $f$, and all $t \in \mathbb{R}$. One of the fundamental periodic processes which occur in applications is a general simple harmonic oscillation in mechanics. The function $f(t)$ which describes the position of the point mass on the line at time $t$ is of the form
$$(1)\qquad f(t) = a\sin(\omega t + b)$$
for certain constants $a, \omega > 0$, $b \in \mathbb{R}$. In the diagram on the left, $f(t) = \sin(t + 1)$ and on the right, $f(t) = \sin(4t + 4)$:
Applying the standard trigonometric formula
$$\sin(\alpha + \beta) = \cos\alpha\sin\beta + \sin\alpha\cos\beta,$$
with $\alpha, \beta \in \mathbb{R}$, we write the function $f(t)$ alternatively as
$$(2)\qquad f(t) = c\cos(\omega t) + d\sin(\omega t),$$
where $c = a\sin b$, $d = a\cos b$.
7.B.1. Show that the system of functions $1, \sin(x), \cos(x), \dots, \sin(nx), \cos(nx), \dots$ is orthogonal with respect to the $L_2$ inner product on the interval $I = [-\pi, \pi]$. ○
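Before proving the orthogonality by hand, it can be helpful to see it numerically. The sketch below builds the Gram matrix of the first seven functions of the system on $[-\pi, \pi]$ with a composite Simpson rule; it is only a numerical illustration for a finite part of the system, and the names `simpson`, `inner` and `system` are hypothetical.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def inner(f, g):
    """L2 inner product on [-pi, pi]."""
    return simpson(lambda x: f(x) * g(x), -math.pi, math.pi)

# The functions 1, sin(x), cos(x), sin(2x), cos(2x), sin(3x), cos(3x).
system = [lambda x: 1.0]
for n in range(1, 4):
    system.append(lambda x, n=n: math.sin(n * x))
    system.append(lambda x, n=n: math.cos(n * x))

size = len(system)
gram = [[inner(f, g) for g in system] for f in system]
off_diagonal = max(abs(gram[i][j]) for i in range(size) for j in range(size) if i != j)
diagonal = [gram[i][i] for i in range(size)]
print(off_diagonal)   # numerically zero
print(diagonal)       # squared norms: 2*pi for the constant, pi for the others
```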
Building an orthogonal system of periodic functions $\sin(nx)$ and $\sin(nx + \pi/2) = \cos(nx)$ leads to the classical Fourier series.¹
In application problems, we often meet the superposition of different harmonic oscillations. The superposition of finitely many harmonic oscillations is expressed by sums of functions of the form
$$f_n(x) = a_n\cos(n\omega x) + b_n\sin(n\omega x)$$
for $n \in \{0, 1, \dots, m\}$. For $n \ge 1$, these particular functions have prime period $2\pi/(n\omega)$. Therefore, their sum
$$(3)\qquad \frac{a_0}{2} + \sum_{n=1}^{m} \big(a_n\cos(n\omega x) + b_n\sin(n\omega x)\big)$$
is a periodic function with period $T = 2\pi/\omega$.
¹The Fourier series are named in honour of the French mathematician and physicist Jean B. J. Fourier, who was the first to apply the Fourier series in practice in his work from 1822 devoted to the issue of heat conduction (he began to deal with this issue in 1804-1811). He introduced mathematical methods which even nowadays lie at the core of theoretical physics. He did not pay much attention to physics himself.
since each term in the sum is 0 except when $m = n$. Exactly as in the example above, each finite sum $\sum_{n=0}^{k} c_nf_n(x)$ is the best approximation of the function $F(x)$ among the linear combinations of the first $k + 1$ functions $f_n$ in the orthogonal system.
Actually, we can generalize the definition further to any vector space of functions with an inner product. See the exercise 7.A.3 for such an example. For the sake of simplicity we confine ourselves to the L2 distance, but the reader can check that the proofs work in general.
We extend our results from finite-dimensional spaces to infinite-dimensional ones. Instead of finite linear combinations of base vectors, we have infinite series of pairwise orthogonal functions. The following theorem gives us a transparent and very general answer to the question as to how well the partial sums of such a series can approximate a given function:
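7.1.5. Theorem. Let $f_n$, $n = 1, 2, \dots$, be an orthogonal sequence of (real or complex) functions in $S^0[a,b]$ and let $g \in S^0[a,b]$ be an arbitrary function. Put
$$c_n = \|f_n\|_2^{-2}\int_a^b g(x)\overline{f_n(x)}\,dx.$$
Then
(1) For any fixed $n \in \mathbb{N}$, the expression which has the least $L_2$-distance from $g$ among all linear combinations of functions $f_1, \dots, f_n$ is
$$h_n = \sum_{i=1}^{n} c_if_i(x).$$
(2) The series $\sum_{n=1}^{\infty} |c_n|^2\|f_n\|_2^2$ always converges, and moreover
$$\sum_{n=1}^{\infty} |c_n|^2\|f_n\|_2^2 \le \|g\|_2^2.$$
(3) The equality $\sum_{n=1}^{\infty} |c_n|^2\|f_n\|_2^2 = \|g\|_2^2$ holds if and only if
$$\lim_{k\to\infty} \|g - s_k\|_2 = 0,$$
where $s_k = \sum_{n=1}^{k} c_nf_n$ denotes the $k$-th partial sum.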
Before presenting the proof, we consider the meaning of the individual statements of this theorem. Since we are working with an arbitrarily chosen orthogonal system of functions, we cannot expect that all functions can be approximated by linear combinations of the functions $f_n$.
For instance, if we consider the case of Legendre orthogonal polynomials on the interval $[-1,1]$ and restrict ourselves to even degrees only, surely we can approximate only even functions in a reasonable way. Nevertheless, the first statement of the theorem says that the best approximation possible (in the $L_2$-distance) is by the partial sums as described.
The second and third statements can be perceived as an analogy to the orthogonal projections onto subspaces in terms of Cartesian coordinates. Indeed, if for a given function $g$, the series $F(x) = \sum_{n=1}^{\infty} c_nf_n(x)$ converges pointwise, then the function $F(x)$ is, in a certain sense, the orthogonal projection of $g$ into the vector subspace of all such series.
7.B.2. Show that the system of functions $1, \sin(n\omega x), \cos(n\omega x)$, for all positive integers $n$, is orthogonal with respect to the $L_2$ inner product on the interval $[-\pi/\omega, \pi/\omega]$. ○
When projecting a given function orthogonally to the subspace of functions (3), the key concept is the set of Fourier coefficients $a_n$ and $b_n$, $n \in \mathbb{N}$.
7.B.3. Find the Fourier series for the periodic extension of the function
(a) $g(x) = 0$, $x \in [-\pi, 0)$, $g(x) = \sin x$, $x \in [0, \pi)$;
(b) $g(x) = |x|$, $x \in [-\pi, \pi)$;
(c) $g(x) = 0$, $x \in [-1, 0)$, $g(x) = x + 1$, $x \in [0, 1)$.
Solution. Before starting the computations, consider the illustrations of the resulting approximation of the given functions. The first two display the finite approximation in the cases a) and b) up to n = 5, while the third illustration for the case c) goes up to n = 20. Clearly the approximation of the discontinuous function is much slower and it also demonstrates the Gibbs phenomenon. This is the overshooting in the jumps, which is proportional to the magnitudes of the jumps.
The case (a). Direct calculation gives (using formulae from 7.1.6)
$$a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} g(x)\,dx = \frac{1}{\pi}\int_{-\pi}^{0} 0\,dx + \frac{1}{\pi}\int_{0}^{\pi} \sin x\,dx = \frac{1}{\pi}\big[-\cos x\big]_0^{\pi} = \frac{2}{\pi},$$
The second statement is called Bessel's inequality and it is an analogy of the finite-dimensional proposition that the size of the orthogonal projection of a vector cannot be larger than the original vector. The equality from the third statement is called Parseval's theorem and it says that if a given vector does not decrease in length by the orthogonal projection onto a given subspace, then it belongs to this subspace.
On the other hand, the theorem does not claim that the partial sums of the considered series need to converge pointwise to some function. There is no analogy to this phenomenon in the finite-dimensional world. In general, the series $F(x)$ need not be convergent for any given $x$, even under the assumption of the equality in (3). However, if the series $\sum_{n=1}^{\infty} |c_n|$ converges to a finite value, and if all the functions $f_n$ are bounded uniformly on $I$, then the series $F(x) = \sum_{n=1}^{\infty} c_nf_n(x)$ converges at every point $x$. Yet it need not converge to the function $g$ everywhere. We return to this problem later.
The proof of all three statements of the theorem is similar to the case of finite-dimensional Euclidean spaces. That is to be expected since the bounds for the distances of $g$ from the partial sum $f$ are constructed in the finite-dimensional linear hull of the functions concerned:
Proof of theorem 7.1.5. Choose any linear combination $f = \sum_{n=1}^{k} a_nf_n$ and calculate its distance from $g$. We obtain
$$\begin{aligned}
\Big\|g - \sum_{n=1}^{k} a_nf_n\Big\|_2^2 &= \int_a^b \Big|g(x) - \sum_{n=1}^{k} a_nf_n(x)\Big|^2 dx\\
&= \int_a^b \Big(g(x) - \sum_{n=1}^{k} a_nf_n(x)\Big)\overline{\Big(g(x) - \sum_{n=1}^{k} a_nf_n(x)\Big)}\,dx\\
&= \int_a^b |g(x)|^2\,dx - \int_a^b \sum_{n=1}^{k} g(x)\overline{a_nf_n(x)}\,dx - \int_a^b \sum_{n=1}^{k} a_nf_n(x)\overline{g(x)}\,dx + \int_a^b \sum_{n=1}^{k} a_nf_n(x)\overline{a_nf_n(x)}\,dx\\
&= \|g\|_2^2 - \sum_{n=1}^{k} \overline{a_n}c_n\|f_n\|_2^2 - \sum_{n=1}^{k} a_n\overline{c_n}\|f_n\|_2^2 + \sum_{n=1}^{k} |a_n|^2\|f_n\|_2^2\\
&= \|g\|_2^2 + \sum_{n=1}^{k} \|f_n\|_2^2\big((c_n - a_n)\overline{(c_n - a_n)} - |c_n|^2\big)\\
&= \|g\|_2^2 + \sum_{n=1}^{k} \|f_n\|_2^2\big(|c_n - a_n|^2 - |c_n|^2\big).
\end{aligned}$$
Since we are free to choose an as we please, we minimize the last expression by choosing an = cn, for each n. This completes the proof of the first statement. With this choice of an, we obtain Bessel's identity
$$\Big\|g - \sum_{n=1}^{k} c_nf_n\Big\|_2^2 = \|g\|_2^2 - \sum_{n=1}^{k} |c_n|^2\|f_n\|_2^2.$$
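$$\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} g(x)\cos(nx)\,dx = \frac{1}{\pi}\int_{-\pi}^{0} 0\,dx + \frac{1}{\pi}\int_{0}^{\pi} \sin x\cos(nx)\,dx\\
&= \frac{1}{2\pi}\int_{0}^{\pi} \big(\sin((1+n)x) + \sin((1-n)x)\big)\,dx\\
&= \frac{1}{2\pi}\Big[-\frac{\cos((1+n)x)}{1+n} - \frac{\cos((1-n)x)}{1-n}\Big]_0^{\pi}\\
&= \frac{1}{2\pi}\Big(-\frac{\cos((1+n)\pi)}{1+n} - \frac{\cos((1-n)\pi)}{1-n} + \frac{1}{1+n} + \frac{1}{1-n}\Big)\\
&= \frac{1}{2\pi}\Big(\frac{(-1)^n + 1}{1+n} + \frac{(-1)^n + 1}{1-n}\Big) = \frac{1}{\pi}\cdot\frac{(-1)^n + 1}{1 - n^2},
\end{aligned}$$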
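$$a_1 = \frac{1}{\pi}\int_{0}^{\pi} \sin x\cos x\,dx = \frac{1}{2\pi}\int_{0}^{\pi} \sin(2x)\,dx = 0,$$
$$b_1 = \frac{1}{\pi}\int_{-\pi}^{\pi} g(x)\sin x\,dx = \frac{1}{\pi}\int_{-\pi}^{0} 0\,dx + \frac{1}{\pi}\int_{0}^{\pi} \sin^2 x\,dx = \frac{1}{2\pi}\int_{0}^{\pi} \big(1 - \cos(2x)\big)\,dx = \frac{1}{2\pi}\Big[x - \frac{\sin(2x)}{2}\Big]_0^{\pi} = \frac{1}{2},$$
$$\begin{aligned}
b_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} g(x)\sin(nx)\,dx = \frac{1}{\pi}\int_{-\pi}^{0} 0\,dx + \frac{1}{\pi}\int_{0}^{\pi} \sin x\sin(nx)\,dx\\
&= \frac{1}{2\pi}\int_{0}^{\pi} \big(\cos((1-n)x) - \cos((1+n)x)\big)\,dx = \frac{1}{2\pi}\Big[\frac{\sin((1-n)x)}{1-n} - \frac{\sin((1+n)x)}{1+n}\Big]_0^{\pi} = 0
\end{aligned}$$
for $n \in \mathbb{N}\setminus\{1\}$. Thus, we arrive at the Fourier series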
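$$\frac{1}{\pi} + \frac{\sin x}{2} + \sum_{n=2}^{\infty} \Big(\frac{(-1)^n + 1}{\pi(1 - n^2)}\cos(nx)\Big).$$
Since $(-1)^n + 1 = 0$ when $n$ is odd, and $= 2$ when $n$ is even, we can put $n = 2m$ to obtain the series
$$\frac{1}{\pi} + \frac{\sin x}{2} - \frac{2}{\pi}\sum_{m=1}^{\infty} \frac{\cos(2mx)}{4m^2 - 1}.$$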
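The case (b). The given function is a sawtooth-shaped oscillation. Its expression as a Fourier series is very important in practice. Since the function $g$ is even on $(-\pi, \pi)$, it is immediate that $b_n = 0$ for all $n \in \mathbb{N}$. Therefore, it suffices to determine $a_n$ for $n \in \mathbb{N}$:
$$a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} g(x)\,dx = \frac{2}{\pi}\int_{0}^{\pi} x\,dx = \frac{2}{\pi}\Big[\frac{x^2}{2}\Big]_0^{\pi} = \pi.$$
For other $n \in \mathbb{N}$, use integration by parts to get
$$\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} g(x)\cos(nx)\,dx = \frac{2}{\pi}\int_{0}^{\pi} x\cos(nx)\,dx\\
&= \frac{2}{\pi}\Big[\frac{x}{n}\sin(nx)\Big]_0^{\pi} - \frac{2}{n\pi}\int_{0}^{\pi} \sin(nx)\,dx = \frac{2}{n^2\pi}\big[\cos(nx)\big]_0^{\pi} = \frac{2}{n^2\pi}\big((-1)^n - 1\big).
\end{aligned}$$
So $a_n = -\frac{4}{n^2\pi}$ for $n$ odd, $a_n = 0$ for $n$ even.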
Since the left-hand side is non-negative, it follows that
$$\sum_{n=1}^{k} |c_n|^2\|f_n\|_2^2 \le \|g\|_2^2.$$
Let $k \to \infty$. Since every non-decreasing sequence of real numbers which is bounded from above has a limit, it follows that
$$\sum_{n=1}^{\infty} |c_n|^2\|f_n\|_2^2 \le \|g\|_2^2,$$
which is Bessel's inequality.
If equality occurs in Bessel's inequality, then statement (3) follows straight from the definitions and Bessel's identity proved above. □
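Bessel's inequality can be observed numerically. The sketch below is an illustration under our own assumptions: it takes $g(x) = x$ on $[-\pi, \pi]$ and the orthogonal sequence $f_n(x) = \sin(nx)$, computes the coefficients $c_n$ by a composite Simpson rule, and accumulates the partial sums $\sum_{n \le k} |c_n|^2\|f_n\|_2^2$. The variable names are hypothetical.

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

lo, hi = -math.pi, math.pi
g = lambda x: x
norm_g_sq = simpson(lambda x: g(x) ** 2, lo, hi)              # ||g||^2

partial = 0.0
bessel_sums = []
for n in range(1, 31):
    fn = lambda x, n=n: math.sin(n * x)
    norm_fn_sq = simpson(lambda x: fn(x) ** 2, lo, hi)        # ||f_n||^2
    c_n = simpson(lambda x: g(x) * fn(x), lo, hi) / norm_fn_sq
    partial += abs(c_n) ** 2 * norm_fn_sq
    bessel_sums.append(partial)

print(norm_g_sq)
print(bessel_sums[0], bessel_sums[9], bessel_sums[-1])
```

The partial sums form a non-decreasing sequence bounded above by $\|g\|_2^2$, exactly the situation used in the proof when letting $k \to \infty$.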
An orthogonal system of functions is called a complete orthogonal system on an interval I = [a,b] for some space of functions on I if and only if Parseval's equality holds for every function g in this space.
7.1.6. Fourier series. The coefficients $c_n$ from the previous theorem are called the Fourier coefficients of a given function in the (abstract) Fourier series. The previous theorem indicates that we are able to work with countable orthogonal systems of functions $f_n$ in much the same way as with finite orthogonal bases of vector spaces.
There are, however, essential differences:
• It is not easy to decide what the set of convergent or uniformly convergent series
$$F(x) = \sum_{n} c_nf_n$$
looks like.
• For a given integrable function, we can find only the "best approximation possible" by such a series $F(x)$ in the sense of $L_2$-distance.
In the case when we have an orthonormal system of functions $f_n$, the formulae mentioned in the theorem are simpler, but still there is no further improvement in the approximations.
The choice of an orthogonal system of functions for use in practice should address the purpose for which the approximations are needed. The name "Fourier series" itself refers to the following choice of a system of real-valued functions:
Fourier's orthogonal system
$$1,\ \sin x,\ \cos x,\ \sin 2x,\ \cos 2x,\ \dots,\ \sin nx,\ \cos nx,\ \dots$$
An elementary exercise on integration by parts shows that this is an orthogonal system of functions on the interval $[-\pi, \pi]$.
These functions are periodic with common period $2\pi$ (see the definition below). "Fourier analysis", which builds
This determines the Fourier series of a function of sawtooth-shaped oscillation as
$$\frac{\pi}{2} + \frac{2}{\pi}\sum_{n=1}^{\infty} \Big(\frac{(-1)^n - 1}{n^2}\cos(nx)\Big) = \frac{\pi}{2} - \frac{4}{\pi}\sum_{n=1}^{\infty} \frac{\cos((2n-1)x)}{(2n-1)^2} = \frac{\pi}{2} - \frac{4}{\pi}\Big(\cos x + \frac{\cos(3x)}{3^2} + \frac{\cos(5x)}{5^2} + \cdots\Big).$$
This series could have been found by an easier means, namely by integrating the Fourier series of Heaviside's function (see "square wave function" in 7.1.9).
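To see how quickly this series approaches $|x|$, we can compare its partial sums with the function on a grid. The sketch below assumes the series $\frac{\pi}{2} - \frac{4}{\pi}\sum_{n \ge 1} \frac{\cos((2n-1)x)}{(2n-1)^2}$; `partial_sum` is a hypothetical helper.

```python
import math

def partial_sum(x, terms):
    """Partial sum of pi/2 - (4/pi) * sum cos((2n-1)x) / (2n-1)^2 with the given number of terms."""
    s = math.pi / 2
    for n in range(1, terms + 1):
        k = 2 * n - 1
        s -= 4 / math.pi * math.cos(k * x) / k ** 2
    return s

# Equally spaced points of [-pi, pi], including x = 0.
grid = [-math.pi + i * 2 * math.pi / 400 for i in range(401)]

errors = []
for terms in (1, 5, 50):
    err = max(abs(partial_sum(x, terms) - abs(x)) for x in grid)
    errors.append(err)
    print(terms, err)
```

Because the coefficients decay like $1/n^2$, the series converges uniformly, and the maximal error on the grid shrinks steadily as more terms are added.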
The case (c). The period for this function is $T = 2$, and so $\omega = 2\pi/T = \pi$. Use the more general formulae from 7.1.6, namely
$$a_0 = \frac{2}{T}\int_{x_0}^{x_0+T} g(x)\,dx = \int_{-1}^{1} g(x)\,dx = \int_{-1}^{0} 0\,dx + \int_{0}^{1} (x+1)\,dx = \frac{3}{2},$$
$$a_n = \frac{2}{T}\int_{x_0}^{x_0+T} g(x)\cos(n\omega x)\,dx = \int_{-1}^{1} g(x)\cos(n\pi x)\,dx = \int_{-1}^{0} 0\,dx + \int_{0}^{1} (x+1)\cos(n\pi x)\,dx = \frac{(-1)^n - 1}{n^2\pi^2},$$
$$b_n = \frac{2}{T}\int_{x_0}^{x_0+T} g(x)\sin(n\omega x)\,dx = \int_{-1}^{1} g(x)\sin(n\pi x)\,dx = \int_{-1}^{0} 0\,dx + \int_{0}^{1} (x+1)\sin(n\pi x)\,dx = \frac{1 - 2(-1)^n}{n\pi}.$$
The calculation of a0 was simple and needs no further comment. As for determining the integrals at an and bn, it sufficed to use integration by parts once. Thus, the desired Fourier series is
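$$\frac{3}{4} + \sum_{n=1}^{\infty} \Big(\frac{(-1)^n - 1}{n^2\pi^2}\cos(n\pi x) + \frac{1 - 2(-1)^n}{n\pi}\sin(n\pi x)\Big).$$
Some refinements of the expression are available. For instance, for $n \in \mathbb{N}$,
$$a_n = -\frac{2}{n^2\pi^2}\ \text{for } n \text{ odd},\qquad a_n = 0\ \text{for } n \text{ even},$$
and, similarly,
$$b_n = \frac{3}{n\pi}\ \text{for } n \text{ odd},\qquad b_n = -\frac{1}{n\pi}\ \text{for } n \text{ even}.$$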
□
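The closed forms for case (c) can be compared with direct numerical integration. The sketch below assumes the formulae $a_n = \frac{(-1)^n - 1}{n^2\pi^2}$ and $b_n = \frac{1 - 2(-1)^n}{n\pi}$; since $T = 2$ and $g$ vanishes on $[-1, 0)$, only the integrals over $[0, 1]$ contribute. The helper names are hypothetical.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def a_num(n):
    """a_n = (2/T) * integral of g(x) cos(n pi x) over one period, with T = 2."""
    return simpson(lambda x: (x + 1) * math.cos(n * math.pi * x), 0.0, 1.0)

def b_num(n):
    """b_n = (2/T) * integral of g(x) sin(n pi x) over one period, with T = 2."""
    return simpson(lambda x: (x + 1) * math.sin(n * math.pi * x), 0.0, 1.0)

differences = []
for n in range(1, 7):
    a_closed = ((-1) ** n - 1) / (n * math.pi) ** 2
    b_closed = (1 - 2 * (-1) ** n) / (n * math.pi)
    differences.append((a_num(n) - a_closed, b_num(n) - b_closed))
    print(n, a_num(n), a_closed, b_num(n), b_closed)
```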
In the next exercise we show that the calculation of the Fourier series does not always require an integration. Especially in the case when the function $g$ is a sum of products (powers) of functions $y = \sin(mx)$, $y = \cos(nx)$ for $m, n \in \mathbb{N}$, one can rewrite $g$ as a finite linear combination of basic functions.
upon this orthogonal system, allows us to work with all piecewise continuous periodic functions with extraordinary efficiency. Since many physical, chemical, and biological data are perceived, received, or measured, in fact, by frequencies of the signals (the measured quantities), it is really an essential mathematical tool. Biologists and engineers often use the word "signal" in the sense of "function".
Periodic functions
A real or complex valued function $f$ defined on $\mathbb{R}$ is called a periodic function with period $T > 0$ if $f(x + T) = f(x)$ for every $x \in \mathbb{R}$.
It is evident that sums and products of periodic functions with the same period are again periodic functions with the same period.
We note that the integral $\int_{x_0}^{x_0+T} f(x)\,dx$ of a periodic function $f$ on an interval whose length equals the period $T$ is independent of the choice of $x_0 \in \mathbb{R}$. To prove it, it is enough to suppose $0 \le x_0 < T$, using a translation by a suitable multiple of $T$. Then,
$$\int_{x_0}^{x_0+T} f(x)\,dx = \int_{x_0}^{T} f(x)\,dx + \int_{T}^{x_0+T} f(x)\,dx = \int_{x_0}^{T} f(x)\,dx + \int_{0}^{x_0} f(x)\,dx = \int_{0}^{T} f(x)\,dx.$$
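The independence of the starting point is easy to observe numerically. In the sketch below, the $2\pi$-periodic sample function $\sin^2 x + \cos 3x$ and the starting points are our own arbitrary choices, and `simpson` is a hypothetical helper implementing the composite Simpson rule.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

T = 2 * math.pi
f = lambda x: math.sin(x) ** 2 + math.cos(3 * x)    # periodic with period 2*pi

# Integrate over one full period starting from several different points x0.
values = [simpson(f, x0, x0 + T) for x0 in (0.0, 0.7, -2.5, 10.0)]
print(values)   # all entries agree; the exact value is pi
```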
Fourier series
The series of functions
$$F(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \big(a_n\cos(nx) + b_n\sin(nx)\big)$$
from Theorem 7.1.5, with coefficients
$$a_n = \frac{1}{\pi}\int_{x_0}^{x_0+2\pi} g(x)\cos(nx)\,dx,\qquad b_n = \frac{1}{\pi}\int_{x_0}^{x_0+2\pi} g(x)\sin(nx)\,dx,$$
is called the Fourier series of a function $g$ on the interval $[x_0, x_0 + 2\pi]$.
The coefficients $a_n$ and $b_n$ are called Fourier coefficients of the function $g$.
If $T$ is the time taken for one revolution of an object moving round the unit circle at constant speed, then that constant speed is $\omega = 2\pi/T$. In practice, we often want to work with Fourier series with an arbitrary primary period $T$ of the functions, not just $2\pi$. Then we should employ the functions $\cos(\omega nx)$, $\sin(\omega nx)$, where $\omega = \frac{2\pi}{T}$. By substitution $t = \omega x$, we can verify the orthogonality of the new system of functions and recalculate the coefficients in the Fourier series $F(x)$ of a function $g$ on the interval $[x_0, x_0 + T]$:
$$F(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \big(a_n\cos(n\omega x) + b_n\sin(n\omega x)\big),$$
which have values
$$a_n = \frac{2}{T}\int_{x_0}^{x_0+T} g(x)\cos(n\omega x)\,dx,\qquad b_n = \frac{2}{T}\int_{x_0}^{x_0+T} g(x)\sin(n\omega x)\,dx.$$
7.1.7. The complex Fourier coefficients. Parametrize the unit circle in the form:
$$e^{i\omega t} = \cos\omega t + i\sin\omega t.$$
For all integers $m, n$ with $m \ne n$,
$$\int_{-\pi}^{\pi} e^{imx}e^{-inx}\,dx = \int_{-\pi}^{\pi} e^{i(m-n)x}\,dx = \frac{1}{i(m-n)}\big[e^{i(m-n)x}\big]_{-\pi}^{\pi} = 0.$$
Thus for $m \ne n$, the integral $\langle e^{imx}, e^{inx}\rangle = 0$.
Fourier's complex orthogonal system
Note that if $m = n$, then
$$\int_{-\pi}^{\pi} e^{imx}e^{-imx}\,dx = \int_{-\pi}^{\pi} 1\,dx = 2\pi.$$
The orthogonality of this system can be easily used to recover the orthogonality of the real Fourier's system: Rewrite the above result as
$$\int_{-\pi}^{\pi} (\cos mx + i\sin mx)(\cos nx - i\sin nx)\,dx = 0.$$
By expanding and separating into real and imaginary parts we get both
$$\int_{-\pi}^{\pi} (\cos mx\cos nx + \sin mx\sin nx)\,dx = 0,\qquad \int_{-\pi}^{\pi} (\sin mx\cos nx - \cos mx\sin nx)\,dx = 0.$$
By replacing $n$ with $-n$, we have also
$$\int_{-\pi}^{\pi} (\cos mx\cos nx - \sin mx\sin nx)\,dx = 0,\qquad \int_{-\pi}^{\pi} (\sin mx\cos nx + \cos mx\sin nx)\,dx = 0,$$
and hence, with $m \ne n$,
$$\int_{-\pi}^{\pi} \cos mx\cos nx\,dx = 0,\qquad \int_{-\pi}^{\pi} \sin mx\sin nx\,dx = 0,$$
7.B.4. Determine the Fourier coefficients of the function
(a) $g(x) = \sin(2x)\cos(3x)$, $x \in [-\pi, \pi]$;
(b) $g(x) = \cos^4 x$, $x \in [-\pi, \pi]$.
Solution. Case (a). Using suitable trigonometric identities,
$$\sin(2x)\cos(3x) = \tfrac{1}{2}\big(\sin(2x + 3x) + \sin(2x - 3x)\big) = \tfrac{1}{2}\sin(5x) - \tfrac{1}{2}\sin(x).$$
It follows that the Fourier coefficients are all zero except for $b_1 = -1/2$, $b_5 = 1/2$.
Case (b). Similarly, from
$$\cos^4 x = (\cos^2 x)^2 = \Big(\frac{1 + \cos(2x)}{2}\Big)^2 = \tfrac{1}{4}\big(1 + 2\cos(2x) + \cos^2(2x)\big) = \tfrac{1}{4}\Big(1 + 2\cos(2x) + \frac{1 + \cos(4x)}{2}\Big) = \tfrac{3}{8} + \tfrac{1}{2}\cos(2x) + \tfrac{1}{8}\cos(4x),\quad x \in \mathbb{R}.$$
Hence $a_0 = 3/4$, $a_2 = 1/2$, $a_4 = 1/8$, and the other coefficients are all zero. □
7.B.5. Given the Fourier series of a function $f$ on the interval $[-\pi, \pi]$ with coefficients $a_m$, $b_n$, $m \in \mathbb{N}$, $n \in \mathbb{Z}^+$, prove the following statements:
(a) If $f(x) = f(x + \pi)$, $x \in [-\pi, 0]$, then $a_{2k-1} = b_{2k-1} = 0$ for every $k \in \mathbb{N}$.
(b) If $f(x) = -f(x + \pi)$, $x \in [-\pi, 0]$, then $a_0 = a_{2k} = b_{2k} = 0$ for every $k \in \mathbb{N}$.
Solution. The case (a). For any $k \in \mathbb{N}$, the statement can be easily proved directly by calculations, but we provide here a conceptual explanation.
The definition of the function $f$ ensures it is periodic with period $\pi$. Thus we may write its Fourier series on the shorter interval $[-\frac{\pi}{2}, \frac{\pi}{2}]$ as follows
$$f(t) = \frac{\tilde a_0}{2} + \sum_{n=1}^{\infty} \big(\tilde a_n\cos(2nt) + \tilde b_n\sin(2nt)\big).$$
Clearly this must be also the Fourier series of the same function $f$ over the interval $[-\pi, \pi]$ and so the claim is proved.
Alternatively, if
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \big(a_n\cos(nx) + b_n\sin(nx)\big)$$
then
$$f(x + \pi) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \big(a_n\cos(nx + n\pi) + b_n\sin(nx + n\pi)\big) = \frac{a_0}{2} + \sum_{n=1}^{\infty} (-1)^n\big(a_n\cos(nx) + b_n\sin(nx)\big).$$
The two series are the same only when the odd coefficients are zero.
The case (b). Similarly, if
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \big(a_n\cos(nx) + b_n\sin(nx)\big)$$
then
$$-f(x + \pi) = -\frac{a_0}{2} - \sum_{n=1}^{\infty} \big(a_n\cos(nx + n\pi) + b_n\sin(nx + n\pi)\big) = -\frac{a_0}{2} + \sum_{n=1}^{\infty} (-1)^{n+1}\big(a_n\cos(nx) + b_n\sin(nx)\big).$$
The two series are the same only when the even coefficients are zero. □
Complex Fourier series. It is sometimes convenient (and often easier) to express the Fourier series using the complex coefficients $c_n$ instead of the real coefficients $a_n$ and $b_n$. This is a straightforward consequence of the facts
$$e^{in\omega x} = \cos(n\omega x) + i \sin(n\omega x)$$
or, vice versa,
$$\cos(n\omega x) = \tfrac{1}{2}\bigl(e^{in\omega x} + e^{-in\omega x}\bigr), \qquad \sin(n\omega x) = \tfrac{1}{2i}\bigl(e^{in\omega x} - e^{-in\omega x}\bigr).$$
The resulting series for a real or complex valued function $g$ on the interval $[-\pi, \pi]$ is $F(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}$ with
$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-inx} g(x)\, dx.$$
See the explanation in 7.1.7. We need just one formula for $c_n$, rather than one for $a_n$ and another one for $b_n$.
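7.B.6. Compute the complex version of the Fourier series $F(x)$ of the $2\pi$-periodic function $g(x)$ defined by $g(x) = 0$ if $-\pi < x < 0$, while $g(x) = 1$ if $0 < x < \pi$.
Solution. We have, for $n \neq 0$,
$$c_n = \frac{1}{2\pi} \int_{0}^{\pi} e^{-inx}\, dx = \frac{1}{2\pi i n}\bigl(1 - (-1)^n\bigr),$$
while $c_0 = \frac{1}{2\pi} \int_{0}^{\pi} dx = 1/2$. So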
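$$\int_{-\pi}^{\pi} \sin mx \cos nx\, dx = 0,$$
which proves again the orthogonality of the real valued Fourier system.
Note the case $m = n > 0$, when
$$\int_{-\pi}^{\pi} \cos^2 nx\, dx = \|\cos(nx)\|_2^2 = \pi, \qquad \int_{-\pi}^{\pi} \sin^2 nx\, dx = \|\sin(nx)\|_2^2 = \pi.$$
If $n = 0$, then $\|1\|_2^2 = 2\pi$.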
For a (real or complex) function $f(t)$ with $-T/2 < t < T/2$, and all integers $n$, we can define, in this context, its complex Fourier coefficients by the complex numbers
$$c_n = \frac{1}{T} \int_{-T/2}^{T/2} f(t)\, e^{-i\omega n t}\, dt.$$
The relation between the coefficients $a_n$ and $b_n$ of the Fourier series (after recalculating the formulae for these coefficients for functions with a general period of length $T$) and these complex coefficients $c_n$ follows from the definitions: $c_0 = a_0/2$, and for natural numbers $n$, we have
$$c_n = \tfrac{1}{2}(a_n - i b_n), \qquad c_{-n} = \tfrac{1}{2}(a_n + i b_n).$$
If the function $f$ is real valued, $c_n$ and $c_{-n}$ are complex conjugates of each other.
We note here that for real valued functions with period $2\pi$, the Bessel inequality in this notation becomes
$$\tfrac{1}{2}|a_0|^2 + \sum_{n=1}^{\infty} \bigl(|a_n|^2 + |b_n|^2\bigr) \le \frac{1}{\pi}\|f\|_2^2 = \frac{1}{\pi} \int_{-\pi}^{\pi} |f(t)|^2\, dt.$$
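$$F(x) = \frac{1}{2} + \sum_{n=1}^{\infty} c_n e^{inx} + \sum_{n=-\infty}^{-1} c_n e^{inx} = \frac{1}{2} + \sum_{n=1}^{\infty} \frac{1}{2\pi}\, \frac{1 - (-1)^n}{in}\, e^{inx} + \sum_{n=-\infty}^{-1} \frac{1}{2\pi}\, \frac{1 - (-1)^n}{in}\, e^{inx}.$$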
We have expressed the Fourier series $F(t)$ for a function $f(t)$ in the form
$$F(t) = \sum_{n=-\infty}^{\infty} c_n e^{i\omega n t}.$$
In both cases of real or complex valued functions, the corresponding Fourier series can be written in this form. In general, the coefficients are complex. We return to this expression later, in particular when dealing with Fourier transforms.
For fixed $T$, the expression $\omega = 2\pi/T$ describes how the frequency changes if $n$ is increased by one.
We wish to show that the Fourier orthogonal system is complete on $S^0[-\pi, \pi]$. This needs a thorough technical preparation. So now, we only formulate the main result, and add some practical notes. The proof of the theorem will be discussed later, see 7.1.15.
7.1.8. Theorem. Consider a bounded interval $[a, b]$ of length $T = b - a$. Let $f$ be a real or complex valued function in $S^1[a, b]$ (i.e. a piecewise continuous function with a piecewise continuous first derivative), extended periodically on $\mathbb{R}$. Then:
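In the second of the sums, replace $n$ with $-n$. The second sum is then
$$\sum_{n=1}^{\infty} -\frac{1}{2\pi}\, \frac{1 - (-1)^n}{in}\, e^{-inx},$$
so that
$$F(x) = \frac{1}{2} + \frac{1}{2\pi} \sum_{n=1}^{\infty} \frac{1 - (-1)^n}{in} \bigl(e^{inx} - e^{-inx}\bigr) = \frac{1}{2} + \frac{1}{2\pi} \sum_{n=1}^{\infty} \frac{1 - (-1)^n}{in}\, 2i \sin(nx) = \frac{1}{2} + \frac{1}{\pi} \sum_{n=1}^{\infty} \frac{1 - (-1)^n}{n} \sin(nx).$$
The terms when $n$ is even are all 0. For $n$ odd, put $n = 2m - 1$ to obtain
$$F(x) = \frac{1}{2} + \frac{2}{\pi} \sum_{m=1}^{\infty} \frac{\sin((2m-1)x)}{2m-1} = \frac{1}{2} + \frac{2}{\pi}\left(\sin x + \frac{\sin 3x}{3} + \frac{\sin 5x}{5} + \cdots\right).$$
Clearly, this is similar to the example in 7.1.9, which could also have been used directly to obtain the result by an easy transformation. □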
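The closed form for $c_n$ in 7.B.6 can be checked numerically. The following is only a minimal sketch: the helper names `step`, `complex_coefficient` and `closed_form` are illustrative assumptions, and the integral is approximated by a plain midpoint rule rather than by any library routine.

```python
import cmath
import math


def step(x):
    """The function from 7.B.6: 0 on (-pi, 0) and 1 on (0, pi)."""
    return 1.0 if x > 0 else 0.0


def complex_coefficient(g, n, samples=20000):
    """Midpoint-rule approximation of c_n = (1/2pi) * int_{-pi}^{pi} g(x) e^{-inx} dx."""
    h = 2 * math.pi / samples
    total = 0j
    for k in range(samples):
        x = -math.pi + (k + 0.5) * h  # midpoints never hit the jump at x = 0
        total += g(x) * cmath.exp(-1j * n * x)
    return total * h / (2 * math.pi)


def closed_form(n):
    """c_0 = 1/2 and c_n = (1 - (-1)^n) / (2 pi i n) for n != 0."""
    if n == 0:
        return 0.5
    return (1 - (-1) ** n) / (2j * math.pi * n)


for n in range(-3, 4):
    print(n, complex_coefficient(step, n), closed_form(n))
```

For even $n \neq 0$ both columns are (numerically) zero, and $c_{-n}$ is the complex conjugate of $c_n$, as expected for a real valued function.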
7.B.7. Expand the following functions $g(x)$ on the interval $[-\pi, \pi]$ into the Fourier series, in the form
(i) $g(x) = x$, if $-\pi < x < \pi$.
(ii) $g(x) = |x|$, if $-\pi < x < \pi$.
(iii) $g(x) = x^2$, if $-\pi < x < \pi$.
(iv) $g(x) = 1$ if $-1/2 < x < 1/2$ and $g(x) = 0$ otherwise.
(v) $g(x) = 0$ if $-\pi < x < 0$, and $g(x) = x$ if $0 < x < \pi$.
○
7.B.8. Repeat the task from the previous problem with the functions $g$ defined again by the formulas (i) through (v), but taken on the interval $0 < x < 2\pi$. ○
7.B.9. Determine the convergence and uniform convergence of the Fourier series for the function $g(x) = e^{-x}$, $x \in [-1, 1)$.
Solution. It is not necessary to calculate the corresponding Fourier series if we wish only to check convergence (for the similar calculation in the case of the function $e^x$ see 7.B.13). Define the function $s(x)$ on $\mathbb{R}$ with period $T = 2$ as follows:
$$s(x) = g(x) = e^{-x}, \quad x \in (-1, 1),$$
$$s(1) = \frac{g(-1) + \lim_{x \to 1^-} g(x)}{2} = \frac{e + e^{-1}}{2}.$$
(1) The partial sums of its Fourier series converge pointwise to the function
$$g(x) = \frac{1}{2}\Bigl(\lim_{y \to x^+} f(y) + \lim_{y \to x^-} f(y)\Bigr).$$
(2) Moreover, if $f$ is a continuous periodic function with a piecewise continuous derivative, then the pointwise convergence of its Fourier series is uniform.
(3) The $L_2$-distance $\|s_N - f\|_2$ converges to zero for $N \to \infty$.
7.1.9. Periodic extension of functions. The Fourier series converges, of course, outside the original interval $[-T/2, T/2]$, since it is a periodic function on $\mathbb{R}$. The Heaviside function $g$ is defined by
$$g(x) = \begin{cases} -1 & \text{if } -\pi < x < 0, \\ \phantom{-}1 & \text{if } 0 < x < \pi. \end{cases}$$
(We do not need to define the values at zero and at the end points of the interval, because these do not affect the coefficients of the Fourier series.) The periodic extension of the Heaviside function onto all of $\mathbb{R}$ is usually called the square wave function.
Since $g$ is an odd function, the coefficients of the functions $\cos(nx)$ are all zero. The coefficients of $\sin(nx)$ are given by
$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} g(x) \sin(nx)\, dx = \frac{2}{\pi} \int_{0}^{\pi} \sin(nx)\, dx = \frac{2}{n\pi}\bigl(1 - (-1)^n\bigr).$$
Thus the Fourier series of $g$ is
$$g(x) = \frac{4}{\pi}\left(\sin(x) + \frac{1}{3}\sin(3x) + \frac{1}{5}\sin(5x) + \cdots\right).$$
The partial sums of its first five and fifty terms, respectively, are shown in the following diagrams.
If the interval $[-T/2, T/2]$ is chosen for the prime period $T$ of such a square wave function, the resulting Fourier series is
$$g(x) = \frac{4}{\pi}\left(\sin(\omega x) + \frac{1}{3}\sin(3\omega x) + \frac{1}{5}\sin(5\omega x) + \cdots\right).$$
The number $\omega = \frac{2\pi}{T}$ is also called the phase frequency of the wave.
As the number of terms of the series increases, the approximation gets much better except in a (still shrinking)
This function is the sum of the Fourier series in question, cf. Theorem 7.1.8. In other words, the Fourier series converges to the periodic function $s(x)$. Moreover, this convergence is uniform on every closed interval which contains none of the points $2k + 1$, $k \in \mathbb{Z}$. This follows from the continuity of the functions $g$ and $g'$ on $(-1, 1)$. On the other hand, the convergence cannot be uniform on any interval $(c, d)$ such that $[c, d]$ contains an odd integer. This is because the uniform limit of continuous functions is always a continuous function, and the periodic extension of $s$ is not continuous at the odd integers. Thus, the series converges to the function $g$ on $(-1, 1)$, yet this convergence is uniform only on the subintervals $[c, d]$ which satisfy the restriction $-1 < c < d < 1$. □
7.B.10. Determine the cosine Fourier series (see the definitions at the end of the paragraph 7.1.10) for a periodic extension of the function
$$g(x) = 1, \ x \in [0, 1), \qquad g(x) = 0, \ x \in [1, 4).$$
Determine also the sine Fourier series for
$$f(x) = x - 1, \ x \in (0, 2), \qquad f(x) = 3 - x, \ x \in [2, 4).$$
Solution. We have already encountered the construction of a cosine Fourier series in 7.B.4(b) and also 7.B.3(b). It is the case of the Fourier series of an even function. Therefore, the first thing to do is to extend the definition of the function $g$ to the interval $(-4, 0)$ so that it is even. This means
$$g(x) := 1 \text{ for } x \in (-1, 0), \qquad g(x) := 0 \text{ for } x \in (-4, -1].$$
Now we can consider its periodic extension onto the whole $\mathbb{R}$ with $x_0 = -4$, period $T = 8$ and $\omega = \pi/4$.
Necessarily $b_n = 0$ for all $n \in \mathbb{N}$ in a cosine Fourier series. We determine the Fourier coefficients $a_n$, $n \in \mathbb{N}$:
$$a_0 = \frac{2}{T} \int_{x_0}^{x_0+T} g(x)\, dx = \frac{1}{4} \int_{-1}^{1} 1\, dx = \frac{1}{2} \int_{0}^{1} 1\, dx = \frac{1}{2},$$
$$a_n = \frac{2}{T} \int_{x_0}^{x_0+T} g(x) \cos(n\omega x)\, dx = \frac{1}{2} \int_{0}^{1} \cos\frac{n\pi x}{4}\, dx = \frac{2}{n\pi} \sin\frac{n\pi}{4}.$$
We use
$$(1) \qquad \int_{-a}^{a} f(x)\, dx = 2 \int_{0}^{a} f(x)\, dx,$$
which is valid for every even function $f$ integrable on the interval $[0, a]$.
The expression $\sin(n\pi/4)$ is conveniently left as is, rather than the alternative of sorting out when it attains which
neighbourhood of the discontinuity point. There, the maximum of the deviation remains roughly the same. This is a general property of Fourier series and it is called the Gibbs phenomenon.
In accordance with 7.1.8(1), the Fourier series of the Heaviside function converges to the mean of the two one-sided limits at the points of discontinuity.
Of course, we cannot expect the convergence of Fourier series for functions $g$ with discontinuity points to be uniform (then, the function $g$ would have to be continuous itself, being a uniform limit of continuous functions). However, the convergence is uniform on any closed subinterval where the original function is continuous.
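The Gibbs phenomenon described above can be observed numerically on the square wave series $\frac{4}{\pi}\bigl(\sin x + \frac{1}{3}\sin 3x + \cdots\bigr)$ from 7.1.9. The sketch below is illustrative only; the function names are assumptions. It evaluates the partial sum at $x = \pi/(2N)$, where the derivative of the sum of the first $N$ non-zero terms vanishes for the first time to the right of the jump.

```python
import math


def square_wave_partial_sum(x, terms):
    """Sum of the first `terms` non-zero terms (4/pi) * sin((2k-1)x) / (2k-1)."""
    return 4 / math.pi * sum(
        math.sin((2 * k - 1) * x) / (2 * k - 1) for k in range(1, terms + 1)
    )


def first_peak(terms):
    """Value of the partial sum at x = pi / (2 * terms), just right of the jump."""
    return square_wave_partial_sum(math.pi / (2 * terms), terms)


for terms in (5, 50, 500):
    # The peak moves towards the jump, but its height stays near 1.18.
    print(terms, round(first_peak(terms), 4))

# At the jump itself every partial sum equals the mean of the one-sided limits.
print(square_wave_partial_sum(0.0, 50))
```

The overshoot of roughly 18% of the half-jump does not decrease with the number of terms, while the value at the discontinuity stays at the mean value 0, in accordance with 7.1.8(1).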
7.1.10. Utilizing symmetry of functions. We consider the problem of finding the Fourier series of the function $f(x) = x^2$ on the interval $[0, 1]$. If we just periodically extend this function from the given interval $[0, 1]$, the resulting function would not be continuous, and so the convergence at integers would be as slow as in the case of a square wave function. However, we can easily extend the domain of $f$ to the interval $[-1, 1]$, so that $f(x) = x^2$ is an even function for $-1 < x < 1$. If we then extend periodically, the result is continuous. The resulting Fourier series then converges uniformly, and since $f$ is then even, only the coefficients $a_n$ are non-zero.
For $n > 0$, iterated application of integration by parts yields:
$$a_n = \frac{2}{2} \int_{-1}^{1} x^2 \cos\Bigl(\frac{2\pi n x}{2}\Bigr)\, dx = 2 \int_{0}^{1} x^2 \cos(\pi n x)\, dx = \frac{4}{\pi^2 n^2} (-1)^n.$$
The remaining coefficient is
$$a_0 = \frac{2}{2} \int_{-1}^{1} x^2\, dx = 2 \int_{0}^{1} x^2\, dx = \frac{2}{3}.$$
The entire series giving the periodic extension of $x^2$ from the interval $[-1, 1]$ thus equals
$$g(x) = \frac{1}{3} + \frac{4}{\pi^2} \sum_{n=1}^{\infty} \frac{(-1)^n}{n^2} \cos(\pi n x).$$
By the Weierstrass criterion, this series converges uniformly. Therefore, $g(x)$ is continuous. By theorem 7.1.8, $g(x) = f(x) = x^2$ on the interval $[-1, 1]$. Thus our series approximates the function $x^2$ on the interval $[0, 1]$ better (i.e. faster) than we could achieve with the periodic extension of the function from the interval $[0, 1]$ only. If we put $x = 1$ and rearrange, we obtain the remarkable result
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.$$
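As a quick numerical sanity check of the series for $x^2$ and of the resulting value of $\sum 1/n^2$, one may evaluate partial sums directly. This is only a sketch; the name `x_squared_series` and the truncation at 2000 terms are arbitrary choices made for illustration.

```python
import math


def x_squared_series(x, terms):
    """Partial sum 1/3 + (4/pi^2) * sum_{n<=terms} (-1)^n cos(pi n x) / n^2."""
    return 1 / 3 + 4 / math.pi ** 2 * sum(
        (-1) ** n / n ** 2 * math.cos(math.pi * n * x) for n in range(1, terms + 1)
    )


for x in (0.0, 0.5, 1.0):
    print(x, x_squared_series(x, 2000), x ** 2)

# Putting x = 1 corresponds to the sum of 1/n^2, compared here with pi^2/6.
basel_partial = sum(1 / n ** 2 for n in range(1, 2001))
print(basel_partial, math.pi ** 2 / 6)
```

The truncation error is bounded by $\frac{4}{\pi^2}\sum_{n>N} n^{-2}$, independently of $x$, which reflects the uniform convergence mentioned above.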
We proceed with a further illustration. Because of the uniform convergence, we can differentiate term by term and
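of its five different values. Thus we write the cosine Fourier series in the form:
$$\frac{1}{4} + \sum_{n=1}^{\infty} \frac{2}{n\pi} \sin\frac{n\pi}{4} \cos\frac{n\pi x}{4}.$$
The sine Fourier series of the function $f$ can be determined analogously from the odd extension of the given segment. Again, $T = 8$ and $\omega = \pi/4$ for the function $f$. This time it is the coefficients $a_n$, $n \in \mathbb{N} \cup \{0\}$, which are zero. To find the remaining coefficients, use integration by parts and (1) (the product of two odd functions is an even function). For all $n \in \mathbb{N}$,
$$b_n = \frac{2}{T} \int_{x_0}^{x_0+T} f(x) \sin(n\omega x)\, dx = \frac{1}{2}\left(\int_{0}^{2} (x-1) \sin\frac{n\pi x}{4}\, dx - \int_{2}^{4} (x-3) \sin\frac{n\pi x}{4}\, dx\right)$$
$$= \frac{1}{2}\left(\left[(1-x)\frac{4}{n\pi}\cos\frac{n\pi x}{4} + \frac{16}{n^2\pi^2}\sin\frac{n\pi x}{4}\right]_{0}^{2} + \left[(x-3)\frac{4}{n\pi}\cos\frac{n\pi x}{4} - \frac{16}{n^2\pi^2}\sin\frac{n\pi x}{4}\right]_{2}^{4}\right)$$
$$= \frac{2}{n\pi}\bigl((-1)^n - 1\bigr) + \frac{16}{n^2\pi^2} \sin\frac{n\pi}{2}.$$
Immediately, $b_n = 0$ when $n$ is even. So the sine Fourier series can be written
$$\sum_{n=1}^{\infty} \left(\frac{2}{n\pi}\bigl((-1)^n - 1\bigr) + \frac{16}{n^2\pi^2} \sin\frac{n\pi}{2}\right) \sin\frac{n\pi x}{4}$$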
calculate the Fourier series on the interval $-1 < x < 1$, for the function $x$:
$$x = \frac{2}{\pi} \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} \sin(\pi n x).$$
This series cannot converge uniformly since the periodic extension of the function $x$ from the interval $[-1, 1]$ is not a continuous function. However, it does converge pointwise for $-1 < x < 1$ (see our reasoning about alternating series in 5.4.6), thus we have verified the above equality.
Similarly, we can integrate the Fourier series of $x^2$ term by term, obtaining
$$\frac{1}{3}x^3 = \frac{1}{3}x + \frac{4}{\pi^3} \sum_{n=1}^{\infty} \frac{(-1)^n}{n^3} \sin(\pi n x).$$
This is valid for $-1 < x < 1$. It is not valid for other values of $x$, since the series is periodic, but the other two terms are not. Of course, we may substitute the above Fourier series of the function $x$ and thus obtain the Fourier series for $x^3$ on the interval $[-1, 1]$ this way.
In this context, we use the following terminology:
The sine and cosine Fourier series
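Keeping only the odd indices, i.e. replacing $n$ by $2n - 1$, the sine Fourier series of $f$ reads
$$\sum_{n=1}^{\infty} \left(\frac{-4}{(2n-1)\pi} + \frac{16\,(-1)^{n+1}}{(2n-1)^2\pi^2}\right) \sin\frac{(2n-1)\pi x}{4}.$$
□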
7.B.11. Express the function $g(x) = \cos x$, $x \in (0, \pi)$, as the cosine Fourier series and the sine Fourier series.
Solution. Start with the sine series. This is the series of the odd extension of the cosine function.
Necessarily, $a_n = 0$, $n \in \mathbb{N} \cup \{0\}$, for the sine series. Further,
$$b_1 = \frac{2}{\pi} \int_{0}^{\pi} \cos x \sin x\, dx = \frac{1}{\pi} \int_{0}^{\pi} \sin(2x)\, dx = 0.$$
For all $n \ge 2$,
$$b_n = \frac{2}{\pi} \int_{0}^{\pi} \cos x \sin(nx)\, dx = \frac{1}{\pi} \int_{0}^{\pi} \bigl(\sin((n+1)x) + \sin((n-1)x)\bigr)\, dx$$
$$= -\frac{1}{\pi}\left[\frac{\cos((n+1)x)}{n+1} + \frac{\cos((n-1)x)}{n-1}\right]_{0}^{\pi} = \frac{2n\bigl((-1)^n + 1\bigr)}{(n^2-1)\pi}.$$
So $b_n = 0$ for odd $n \in \mathbb{N}$, and $b_n = \frac{4n}{(n^2-1)\pi}$ for even $n$. We conclude that
$$\cos x = \sum_{n=1}^{\infty} \frac{8n}{(4n^2-1)\pi} \sin(2nx), \quad x \in (0, \pi).$$
Of course, for the even extension of the function $g$ in question,
$$g(x) = \cos x, \quad x \in (-\pi, \pi).$$
Thus the right hand side is the uniquely given cosine Fourier series.
□
For a given (real or complex) valued function $f$ on an interval $[0, T)$ of length $T > 0$, the Fourier series of its even periodic extension (with period $2T$) is called the cosine Fourier series of $f$, while the Fourier series of the odd periodic extension of $f$ is called the sine Fourier series of the function $f$.
7.1.11. General Fourier series and wavelets. In the case of a general orthogonal system of functions $f_n$ and the series generated from it, we often talk about the general Fourier series with respect to the orthogonal system of functions $f_n$.
Fourier series and further tools built upon them are used for processing various signals, illustrations, and other data. In fact, these mathematical tools also underpin many fundamental models in science, including, for example, modeling of the function of the brain, as well as much of theoretical physics.
The periodic nature of the sine and cosine functions used in classical Fourier series, and their simple scaling by increasing the frequency by unit steps, limit their usability. In many applications, the nature of the data may suggest more convenient and possibly more efficient orthogonal systems of functions.
Requirements for fast numerical processing usually include quick scalability and the possibility of easy translations by constant values. In other words, we want to be able to zoom quickly in and out with respect to the frequencies and, at the same time, to localize in time.
7.B.12. Write the Fourier series of the $\pi$-periodic function which equals cosine on the interval $(-\pi/2, \pi/2)$. Then write the cosine Fourier series of the $2\pi$-periodic function
$$y = |\cos x|.$$
Solution. We are looking for one Fourier series only, since the second part of the problem is just a reformulation of the first part. Therefore, we construct the Fourier series for the function $g(x) = \cos x$, $x \in [-\pi/2, \pi/2]$. Since $g$ is even, $b_n = 0$, $n \in \mathbb{N}$. We compute
$$a_n = \frac{2}{\pi} \int_{-\pi/2}^{\pi/2} \cos x \cos(2nx)\, dx = \frac{2}{\pi} \int_{-\pi/2}^{\pi/2} \frac{1}{2}\bigl(\cos((2n+1)x) + \cos((2n-1)x)\bigr)\, dx$$
$$= \frac{1}{\pi}\left[\frac{\sin((2n+1)x)}{2n+1} + \frac{\sin((2n-1)x)}{2n-1}\right]_{-\pi/2}^{\pi/2} = \frac{2}{\pi}\left(\frac{(-1)^n}{2n+1} + \frac{(-1)^{n+1}}{2n-1}\right) = \frac{4}{\pi}\, \frac{(-1)^{n+1}}{4n^2-1}$$
for every $n \in \mathbb{N}$. The calculation is also valid for $n = 0$, thus $a_0 = 4/\pi$. The desired Fourier series is
$$\frac{2}{\pi} + \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{4n^2-1} \cos(2nx).$$
□
7.B.13. Expand the function $g(x) = e^x$ into
(a) a Fourier series on the interval [0,1);
(b) a cosine Fourier series on the interval [0,1];
(c) a sine Fourier series on the interval (0,1].
Solution. Note the differences between the three cases as shown in the diagrams. The approximation differs in relation to the continuity. The first diagram uses n = 5, the second uses n = 3, while the last diagram uses n = 10.
Fast scalability can be achieved by having just one wavelet mother function$^1$ $\psi$, if possible with compact support, from which we create countably many functions $\psi_{j,k}$, $j, k \in \mathbb{Z}$, by integer translations and dyadic dilations:
$$\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k).$$
It is wise to rescale and choose $\psi$ with $L_2$-norm equal to one (as a function on $\mathbb{R}$). Then the coefficients $2^{j/2}$ ensure that the same is true for all $\psi_{j,k}$.
Of course, the shape of the mother wavelet $\psi$ should match and cover all typical local behaviour of the data to be processed. We say that $\psi$ is an orthogonal mother wavelet if the resulting countable system of functions $\psi_{j,k}$ is orthogonal and, at the same time, it is "reasonably dense" in the space of functions with integrable squares. We come to this concept in more detail later on.
The efficiency of wavelet analysis is another issue, which needs further ideas and concepts. We do not have space here to go into details, but the readers may find many excellent specialized books in this fascinating area of applied Mathematics. Here we consider one simple example.
7.1.12. The Haar wavelets. Perhaps the first question to start with is how to effectively approximate any given function with piecewise constant ones.
For various reasons, it is good if our mother wavelet $\psi$ has zero mean, too. Thus we want to consider an analogue of the Heaviside function
$$\psi(x) = \begin{cases} \phantom{-}1 & x \in [0, 1/2), \\ -1 & x \in [1/2, 1), \\ \phantom{-}0 & \text{otherwise.} \end{cases}$$
As a straightforward exercise we may check that, indeed, the resulting system of functions $\psi_{j,k}$ is orthonormal. Another exercise shows, using finite linear combinations of these functions, that we may approximate any constant function with given precision over a bounded interval. In an exercise we shall see that this already verifies the density properties required for the orthogonal mother wavelet functions.
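The orthonormality claimed above can also be checked numerically for a handful of indices. The sketch below is a plain illustration under stated assumptions: the names `haar_mother`, `haar` and `inner_product` are hypothetical, and the $L_2$ inner product is computed by a midpoint rule on a dyadic grid, which integrates these particular piecewise constant functions without discretization error as long as the grid is finer than their breakpoints.

```python
import math


def haar_mother(x):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1:
        return -1.0
    return 0.0


def haar(j, k, x):
    """psi_{j,k}(x) = 2^(j/2) * psi(2^j x - k)."""
    return 2 ** (j / 2) * haar_mother(2 ** j * x - k)


def inner_product(f, g, left=-8.0, right=8.0, cells_per_unit=64):
    """Midpoint-rule L2 inner product on [left, right] with dyadic cells."""
    h = 1.0 / cells_per_unit
    cells = int((right - left) * cells_per_unit)
    return h * sum(
        f(left + (i + 0.5) * h) * g(left + (i + 0.5) * h) for i in range(cells)
    )


indices = [(j, k) for j in (-1, 0, 1, 2) for k in range(-2, 3)]
for (j1, k1) in indices:
    for (j2, k2) in indices:
        value = inner_product(
            lambda x: haar(j1, k1, x), lambda x: haar(j2, k2, x)
        )
        expected = 1.0 if (j1, k1) == (j2, k2) else 0.0
        assert abs(value - expected) < 1e-9
print("all tested pairs are orthonormal")
```

Only scales $-1 \le j \le 2$ and shifts $|k| \le 2$ are tested here; this is evidence for these indices, not a proof of the general statement.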
Now we consider the question of effective treatment. Notice that we can also use the characteristic function $\varphi$ of the interval $[0, 1)$ and write, with $\varphi_{j,k}(x) = 2^{j/2}\varphi(2^j x - k)$,
$$\psi(x) = \varphi(2x) - \varphi(2x - 1) = \frac{1}{\sqrt{2}}\, \varphi_{1,0}(x) - \frac{1}{\sqrt{2}}\, \varphi_{1,1}(x).$$
$^1$ The roots of wavelets go back to various attempts to localize the basic signals in both time and frequency, with diverse motivations from engineering and other applications. The name wavelet seems to be related to the idea of having a wave-like signal which begins and ends with zero amplitude. Since the late 1970s, these attempts have been related to many names (e.g. Morlet, Meyer) and wavelet theory became the main tool in signal analysis. Of course, the first examples of wavelets are much older; Haar's construction goes back to 1909. Actually, many of the wavelet types do not represent orthogonal systems of functions; they rather share the idea of a combination of high pass filters and low pass filters. The reader is advised to consult the extremely rich literature, if interested in more details.
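We use the formulae ($\alpha, \beta \in \mathbb{R}$)
$$(1) \qquad \int e^x \cos(\alpha x)\, dx = \frac{e^x\bigl(\alpha \sin(\alpha x) + \cos(\alpha x)\bigr)}{1 + \alpha^2} + C,$$
$$(2) \qquad \int e^x \sin(\beta x)\, dx = \frac{e^x\bigl(\sin(\beta x) - \beta \cos(\beta x)\bigr)}{1 + \beta^2} + C,$$
both of which can be obtained by two integrations by parts. Actually, the second one was computed in detail in 6.B.5(d). We obtain
(a) $$a_n = 2 \int_{0}^{1} e^x \cos(2n\pi x)\, dx = 2\left[\frac{e^x\bigl(2n\pi \sin(2n\pi x) + \cos(2n\pi x)\bigr)}{1 + 4n^2\pi^2}\right]_{0}^{1} = \frac{2(e - 1)}{1 + 4n^2\pi^2}, \quad n \in \mathbb{N} \cup \{0\},$$
$$b_n = 2 \int_{0}^{1} e^x \sin(2n\pi x)\, dx = 2\left[\frac{e^x\bigl(\sin(2n\pi x) - 2n\pi \cos(2n\pi x)\bigr)}{1 + 4n^2\pi^2}\right]_{0}^{1} = \frac{4n\pi(1 - e)}{1 + 4n^2\pi^2}, \quad n \in \mathbb{N};$$
(b) $$a_n = 2 \int_{0}^{1} e^x \cos(n\pi x)\, dx = 2\left[\frac{e^x\bigl(n\pi \sin(n\pi x) + \cos(n\pi x)\bigr)}{1 + n^2\pi^2}\right]_{0}^{1} = \frac{2\bigl((-1)^n e - 1\bigr)}{1 + n^2\pi^2}, \quad n \in \mathbb{N} \cup \{0\};$$
(c) $$b_n = 2 \int_{0}^{1} e^x \sin(n\pi x)\, dx = 2\left[\frac{e^x\bigl(\sin(n\pi x) - n\pi \cos(n\pi x)\bigr)}{1 + n^2\pi^2}\right]_{0}^{1} = \frac{2n\pi\bigl(1 + (-1)^{n+1} e\bigr)}{1 + n^2\pi^2}, \quad n \in \mathbb{N}.$$
Substitution then yields the corresponding Fourier series for $g(x)$:
(a) $$e - 1 + 2(e - 1) \sum_{n=1}^{\infty} \frac{\cos(2n\pi x)}{1 + 4n^2\pi^2} + 4\pi(1 - e) \sum_{n=1}^{\infty} \frac{n \sin(2n\pi x)}{1 + 4n^2\pi^2};$$
(b) $$e - 1 + 2 \sum_{n=1}^{\infty} \frac{\bigl((-1)^n e - 1\bigr) \cos(n\pi x)}{1 + n^2\pi^2};$$
(c) $$2\pi \sum_{n=1}^{\infty} \frac{n\bigl(1 + (-1)^{n+1} e\bigr) \sin(n\pi x)}{1 + n^2\pi^2}.$$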
The function $\varphi$ plays the role of the father wavelet function and it itself satisfies
$$\varphi(x) = \varphi(2x) + \varphi(2x - 1) = \frac{1}{\sqrt{2}}\, \varphi_{1,0}(x) + \frac{1}{\sqrt{2}}\, \varphi_{1,1}(x), \qquad \varphi_{1,k}(x) = \sqrt{2}\, \varphi(2x - k).$$
This can be interpreted as differencing and averaging the two consecutive values at the half scale.
With these properties, there is no need for an explicit analytic form of $\psi$ and $\varphi$, since we can find their values recurrently in all dyadic points $x$. Indeed,
$$\varphi(2^{j-1} n) = \varphi(2^j n) + \varphi(2^j n - 1).$$
The function $\varphi$ has another useful feature. Namely, we can obtain the unit constant function by adding all its integer translations:
$$\sum_{k \in \mathbb{Z}} \varphi(x - k) = 1$$
for all $x \in \mathbb{R}$.
Finally, nearly all the coefficients in the general Fourier series with the base $\psi_{j,k}$ vanish for piecewise constant functions. On the contrary, the function $\varphi$ "sees" the constants. In engineering terminology, this is an instance of the high pass filter and low pass filter.
7.1.13. Example. To illustrate the above considerations, we approximate the following function $f(x)$ on $\mathbb{R}$ by the Haar wavelets,
$$f(x) = \begin{cases} 0.3(x + 3), & -2 < x < -1, \\ 0.7(x + 1), & 0 < x < 1. \end{cases}$$
The function in question is not periodic and could not be approximated well by classical Fourier series. The individual functions $\psi_{j,k}$ from 7.1.11 have compact support, but in order to approximate constant or linear behaviour, we still need a large number of them. The following illustrations have been acquired in Maple working with indices $|j| \le n$ and $|k| \le n$, the first one with $n = 5$, the second one with $n = 10$.
The approximation on the sides of the interval is not as good as in the middle, because we do not include enough shifts, i.e. values of $k$. One of the motivations for the scaling and shifting in the construction of wavelets is the hope for a small number of non-zero coefficients. But this does not mean that most of the coefficients would be zero. In our case the percentage of non-zero coefficients for $n = 1, 2, \dots, 10$ is 55.6, 44, 38.8, 34.6, 32.2, 30.8, 29.8, 28.7, 28.0, 27.4.
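7.B.14. Express the function $g(x) = \pi^2 - x^2$ on the interval $[-\pi, \pi]$ in the form of a Fourier series. Using this expression, sum the two series
$$\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2}.$$
Solution. We could take advantage of the function $g$ being even, and calculate the non-zero coefficients $a_n$ by integration by parts. However, in 7.1.10 the Fourier series for the function $f(x) = x^2$ on the interval $[-1, 1]$ is derived. This proves the identity
$$x^2 = \frac{1}{3} + \frac{4}{\pi^2} \sum_{n=1}^{\infty} \frac{(-1)^n}{n^2} \cos(n\pi x), \quad x \in (-1, 1),$$
valid also for $x = \pm 1$. By adding $\pi^2$ and rescaling, it follows that
$$\pi^2 - x^2 = \frac{2\pi^2}{3} + 4 \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2} \cos(nx), \quad x \in [-\pi, \pi].$$
Of course, one can also calculate the Fourier series of the function $g$ directly.
Substituting $x = 0$ and $x = \pi$ then gives
$$\pi^2 = \frac{2\pi^2}{3} + 4 \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2} \qquad \text{and} \qquad 0 = \frac{2\pi^2}{3} - 4 \sum_{n=1}^{\infty} \frac{1}{n^2},$$
i.e.
$$\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2} = \frac{\pi^2}{12}, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.$$
In other words,
$$\pi^2 = 12\left(1 - \frac{1}{2^2} + \frac{1}{3^2} - \frac{1}{4^2} + \cdots\right) = 6\left(1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots\right).$$
□
7.B.15. Sum the series
$$\sum_{n=1}^{\infty} \frac{1}{(2n-1)^2}.$$
Solution. To determine the value of this series, one can successfully apply several known Fourier series. For instance, the Fourier series
$$\frac{\pi}{2} - \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{\cos((2n-1)x)}{(2n-1)^2}$$
was calculated for the function $g(x) = |x|$, $x \in [-\pi, \pi)$, see 7.B.3(b). Since this function is continuous on $[-\pi, \pi)$ and $|-\pi| = |\pi|$,
$$|x| = \frac{\pi}{2} - \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{\cos((2n-1)x)}{(2n-1)^2}, \quad x \in [-\pi, \pi].$$
Substituting $x = 0$ gives
$$0 = \frac{\pi}{2} - \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{1}{(2n-1)^2}, \qquad \text{hence} \qquad \sum_{n=1}^{\infty} \frac{1}{(2n-1)^2} = \frac{\pi^2}{8}.$$
7.1.14. Concluding remarks. A series of famous wavelets $D_m$ has been named after Ingrid Daubechies. They are constructed by very similar recurrent averaging and differencing relations based on certain natural requirements. Just as an indication, consider the slightly more general recurrent relations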
$$\varphi(x) = \sqrt{2} \sum_{k=0}^{N} h_k\, \varphi(2x - k),$$
$$\psi(x) = \sqrt{2} \sum_{k=0}^{N} g_k\, \varphi(2x - k)$$
with yet unknown constants $g_k$ and $h_k$. If we want the mother wavelet $\psi(x - k)$ to have zero coefficients in the resulting series for all polynomials up to the order $N - 1$, then we must ensure that
$$\int_{-\infty}^{\infty} x^k \psi(x)\, dx = 0$$
for all $k = 0, 1, \dots, N - 1$. Similar conditions determine the Daubechies wavelets.
The standards of JPEG2000 are based on similar wavelets and such techniques provide tools for professional compression of visual data in the film industry, or the format DjVu for compressed publications.
In the diagram below, there are the Daubechies D4 mother and father wavelets.
In real applications, the orthogonality of the mother wavelet can be relaxed. As long as the functions $\psi_{k,l}$ are linearly independent and generate the whole space of interest, we always get the dual basis with respect to the $L_2$ inner product.
[Figure: Daubechies D4 mother wavelet and father wavelet.]
7.1.15. Proof of theorem 7.1.8 about Fourier series. We return to the detailed proof of the basic properties of the classical Fourier series. The reader could enjoy reading first the general context of metrics and convergence which we will introduce later in this chapter. Thus, skipping the paragraphs up to 7.3.1, reading first there, and returning later might be a good idea. On the other hand, we do not need much from the general theory of metric spaces and so our considerations of various concepts of convergence in the proof could also be of assistance for the abstract developments later.
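7.B.16. Using the Fourier series of the function $g(x) = e^x$, $x \in [0, 2\pi)$, calculate $\sum_{n=1}^{\infty} \frac{1}{1+n^2}$.
Solution. (See (1), (2).) We have
$$a_0 = \frac{1}{\pi} \int_{0}^{2\pi} e^x\, dx = \frac{1}{\pi}\bigl(e^{2\pi} - 1\bigr),$$
$$a_n = \frac{1}{\pi} \int_{0}^{2\pi} e^x \cos(nx)\, dx = \frac{1}{\pi}\left[\frac{e^x\bigl(\cos(nx) + n \sin(nx)\bigr)}{1 + n^2}\right]_{0}^{2\pi} = \frac{e^{2\pi} - 1}{(1 + n^2)\pi},$$
$$b_n = \frac{1}{\pi} \int_{0}^{2\pi} e^x \sin(nx)\, dx = \frac{1}{\pi}\left[\frac{e^x\bigl(\sin(nx) - n \cos(nx)\bigr)}{1 + n^2}\right]_{0}^{2\pi} = -\frac{n\bigl(e^{2\pi} - 1\bigr)}{(1 + n^2)\pi}.$$
Therefore,
$$e^x = \frac{e^{2\pi} - 1}{\pi}\left(\frac{1}{2} + \sum_{n=1}^{\infty} \frac{\cos(nx) - n \sin(nx)}{1 + n^2}\right), \quad x \in (0, 2\pi).$$
However, no choice of $x \in (0, 2\pi)$ yields the series $\sum_{n=1}^{\infty} \frac{1}{1+n^2}$ on the right-hand side. It would be obtained for $x = 0$. The periodic extension of $g$ to $\mathbb{R}$ is not continuous at this point, so
$$\frac{g(0) + \lim_{x \to 2\pi^-} g(x)}{2} = \frac{e^{2\pi} - 1}{\pi}\left(\frac{1}{2} + \sum_{n=1}^{\infty} \frac{\cos 0 - n \sin 0}{1 + n^2}\right),$$
hence it follows that
$$\frac{1 + e^{2\pi}}{2} = \frac{e^{2\pi} - 1}{\pi}\left(\frac{1}{2} + \sum_{n=1}^{\infty} \frac{1}{1 + n^2}\right),$$
which can be refined to
$$\sum_{n=1}^{\infty} \frac{1}{1 + n^2} = \frac{(\pi - 1)e^{2\pi} + \pi + 1}{2\bigl(e^{2\pi} - 1\bigr)}.$$
□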
A few more cases of interesting series of numbers are shown in the exercises at the end of this chapter, starting on page 502. We add one more such exercise which reveals a different strategy.
7.B.17. Using Parseval's identity for Fourier's orthogonal system (part (3) of the theorem 7.1.5), verify that
$$\sum_{n=1}^{\infty} \frac{1}{(2n-1)^4} = \frac{\pi^4}{96}.$$
Solution. It is imperative to choose an appropriate Fourier series. For instance, consider the Fourier series
$$\frac{\pi}{2} - \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{\cos((2n-1)x)}{(2n-1)^2}$$
which we obtained for the function $g(x) = |x|$, $x \in [-\pi, \pi)$ in 7.B.3(b). Parseval's identity
$$\frac{a_0^2}{2} + \sum_{n=1}^{\infty} a_n^2 + \sum_{n=1}^{\infty} b_n^2 = \frac{2}{T} \int_{x_0}^{x_0+T} \bigl(g(x)\bigr)^2\, dx$$
says, substituting for the $a$'s and $b$'s from our particular series, that
We do not worry here about necessary conditions for convergence, and many other formulations can be found in the literature. On the other hand, the statement of theorem 7.1.8 is quite simple and deals with many useful situations, as we have seen already.
Although we need only the $L_1$ and $L_2$ norms now, we should observe that for general $1 \le p < \infty$, the formula
$$\|f\|_p = \left(\int_{a}^{b} |f(x)|^p\, dx\right)^{1/p}$$
defines also a norm. See the definition in 7.3.1 and the paragraph on the $L_p$-norms and Hölder inequality in 7.3.4 below. Moreover, there is the norm given by the suprema of values of $|f|$ over the interval in question.
For the sake of simplicity, we always work in the space $S^0$ or $S^1$ with respect to the corresponding norm (which always makes sense there).
Hölder's inequality (applied to the functions $f$ and constant 1) yields the following bound on $S^0[a, b]$. Namely, for $p > 1$ and $1/p + 1/q = 1$,
$$\int_{a}^{b} |f(x)|\, dx \le \left(\int_{a}^{b} 1\, dx\right)^{1/q} \left(\int_{a}^{b} |f(x)|^p\, dx\right)^{1/p} = |b - a|^{1/q}\, \|f\|_p.$$
Replace $f$ with $f_n - f$. It is then clear from the above bound that $L_p$-convergence $f_n \to f$ implies, for any $p > 1$, $L_1$-convergence. (The terminology "$L_p$-convergence is stronger than $L_1$-convergence" is sometimes used.) With a modified bound, we can derive an even stronger proposition, namely that on a bounded interval, $L_q$-convergence is stronger than $L_p$-convergence whenever $q > p$; try this by yourselves.
If uniform boundedness of the sequence of functions $f_n$ is given, then there is a constant $C$ independent of $n$, so that
$$\|f_n\|_\infty \le C.$$
Then we can assert that $|f_n(x) - f(x)| \le 2C$, and it then follows that $L_1$-convergence implies $L_p$-convergence, since then
$$\int_{a}^{b} |f(x) - f_n(x)|^p\, dx = \int_{a}^{b} |f(x) - f_n(x)|^{p-1}\, |f(x) - f_n(x)|\, dx \le (2C)^{p-1} \int_{a}^{b} |f(x) - f_n(x)|\, dx,$$
which can be written
$$\|f - f_n\|_p \le (2C)^{1 - 1/p}\, \|f - f_n\|_1^{1/p}.$$
It follows that the $L_p$-norms on the space $S^0[a, b]$ are equivalent with respect to the convergence of uniformly bounded sequences of functions.
The most difficult (and most interesting) problem is to prove the first statement of the theorem 7.1.8, which in the
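$$\frac{\pi^2}{2} + \frac{16}{\pi^2} \sum_{n=1}^{\infty} \frac{1}{(2n-1)^4} = \frac{1}{\pi} \int_{-\pi}^{\pi} x^2\, dx = \frac{2}{\pi} \int_{0}^{\pi} x^2\, dx = \frac{2\pi^2}{3},$$
so
$$\sum_{n=1}^{\infty} \frac{1}{(2n-1)^4} = \left(\frac{2\pi^2}{3} - \frac{\pi^2}{2}\right) \frac{\pi^2}{16} = \frac{\pi^4}{96}.$$
□
There are other ways of obtaining this result, see for example (3) on page 502. We recommend comparing the solution of this exercise with the previous one.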
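The closed forms obtained in 7.B.15 through 7.B.17 are easy to confirm numerically by summing many terms. The sketch below is merely a sanity check; the helper `partial_sum` and the cut-off of 100000 terms are illustrative choices, not part of the exercises.

```python
import math


def partial_sum(term, terms):
    """Sum term(n) for n = 1, ..., terms."""
    return sum(term(n) for n in range(1, terms + 1))


N = 100000

odd_squares = partial_sum(lambda n: 1 / (2 * n - 1) ** 2, N)  # 7.B.15
shifted_squares = partial_sum(lambda n: 1 / (1 + n * n), N)   # 7.B.16
odd_fourth = partial_sum(lambda n: 1 / (2 * n - 1) ** 4, N)   # 7.B.17

closed_shifted = ((math.pi - 1) * math.exp(2 * math.pi) + math.pi + 1) / (
    2 * (math.exp(2 * math.pi) - 1)
)

print(odd_squares, math.pi ** 2 / 8)
print(shifted_squares, closed_shifted)
print(odd_fourth, math.pi ** 4 / 96)
```

The remaining tails are of order $1/N$ for the first two series and of order $1/N^3$ for the third, so agreement to about four decimal places is all that should be expected from this cut-off.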
We started our discussion of Fourier series with the simplest of periodic functions
$$f(t) = a \sin(\omega t + b)$$
for certain constants $a, \omega > 0$, $b \in \mathbb{R}$. They appear as the general solution to the homogeneous linear differential equation
$$(1) \qquad y'' + \omega^2 y = 0,$$
which arises in mechanics by Newton's law of force for a moving particle. Recall the brief introduction to the simplest differential equations in 6.2.14 on page 407. Much more follows in Chapter 8.
We mention that the function $f$ has period $T = 2\pi/\omega$. In mechanics, one often talks about frequency $1/T$. The positive value $a$ expresses the maximum displacement of the oscillating point from the equilibrium position and it is called the amplitude. The value $b$ determines the position of the point at the initial time $t = 0$ and it is called the initial phase, while $\omega$ is the angular frequency of the oscillation.
Similarly, the function $z = g(t)$ describes the dependence of voltage upon time $t$ in an electrical circuit with inductance $L$ and capacity $C$ and which is the solution of the differential equation
$$(2) \qquad z'' + \omega^2 z = 0.$$
The only difference between the equations (1) and (2) (besides the dissimilar physical interpretation) is the constant $\omega$. In the equation (1), there is $\omega^2 = k/m$ where $k$ is the proportionality constant and $m$ is the mass of the point, while in the equation (2), there is $\omega^2 = (LC)^{-1}$.
We illustrate how Fourier series can be applied in the theory of differential equations. Consider only the non-homogeneous (compare to (1)) differential equation
$$(3) \qquad y'' + a^2 y = f(x)$$
with $y$ an unknown in variable $x \in \mathbb{R}$, with a periodic, continuously differentiable function $f : \mathbb{R} \to \mathbb{R}$ on the right-hand
literature is often referred to as the Dirichlet condition, which seems to have been derived as early as 1824.
We begin by proving how this property of pointwise convergence implies the statements (2) and (3) of the theorem. Without loss of generality, we assume that we are working on the interval $[-\pi, \pi]$, i.e. with period $T = 2\pi$.
As the first step, we prepare simple bounds for the coefficients of the Fourier series. One bound is of course
$$|a_n(f)| \le \frac{1}{\pi} \int_{-\pi}^{\pi} |f(x)|\, dx,$$
and similarly for all the coefficients $b_n$. This is because both $\cos(x)$ and $\sin(x)$ are bounded by 1 in absolute value. However, if $f$ is a continuous function in $S^1[a, b]$, we can integrate by parts, thus obtaining
$$a_n(f) = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(nx)\, dx = \frac{1}{n\pi}\bigl[f(x) \sin(nx)\bigr]_{-\pi}^{\pi} - \frac{1}{n\pi} \int_{-\pi}^{\pi} f'(x) \sin(nx)\, dx = -\frac{1}{n}\, b_n(f').$$
We write $a_n(f)$ for the corresponding coefficient of the function $f$, and so on.
Iterating this procedure, we obtain a bound for functions $f$ in $S^{k+1}[-\pi, \pi]$ with continuous derivatives up to order $k$ inclusive:
$$|a_n(f)| \le \frac{1}{n^{k+1}\pi} \int_{-\pi}^{\pi} \bigl|f^{(k+1)}(x)\bigr|\, dx,$$
and similarly for $b_n(f)$.
Thus we can see that the "smoother" a function is, the more rapidly the Fourier coefficients approach zero. For sufficiently smooth functions $f$, the $n^k$-multiples of their Fourier coefficients $a_n$ and $b_n$ are bounded by the $L_1$-norm of their $k$-th derivative.
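The link between smoothness and the decay of coefficients can be illustrated on two functions from this chapter: the discontinuous square wave of 7.1.9 and the continuous function $|x|$ of 7.B.3(b). The following is a hedged sketch: the helper `fourier_coefficients` is an assumption made for illustration and uses a simple midpoint rule on $[-\pi, \pi]$.

```python
import math


def fourier_coefficients(f, n, samples=20000):
    """Midpoint-rule approximation of a_n and b_n of f on [-pi, pi]."""
    h = 2 * math.pi / samples
    xs = [-math.pi + (i + 0.5) * h for i in range(samples)]
    a = h / math.pi * sum(f(x) * math.cos(n * x) for x in xs)
    b = h / math.pi * sum(f(x) * math.sin(n * x) for x in xs)
    return a, b


def square_wave(x):
    """Heaviside-type function of 7.1.9: -1 on (-pi, 0), 1 on (0, pi)."""
    return 1.0 if x > 0 else -1.0


for n in (1, 3, 5, 7):
    _, b_square = fourier_coefficients(square_wave, n)
    a_abs, _ = fourier_coefficients(abs, n)
    # n * |b_n| stays constant for the jump function,
    # n^2 * |a_n| stays constant for the continuous function |x|.
    print(n, n * abs(b_square), n ** 2 * abs(a_abs))
```

For odd $n$ both printed quantities stay close to $4/\pi$: the coefficients of the square wave decay like $1/n$, while those of $|x|$ decay like $1/n^2$, which is consistent with (and here even better than) the bound $|a_n(f)| \le \frac{1}{n\pi}\int |f'|$ for continuous $f$ in $S^1$.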
Let $f$ be a continuous function in the space $S^1[a, b]$ such that the partial sums of its Fourier series converge pointwise to $f$. Then we can assert that
$$|s_N(x) - f(x)| = \left|\sum_{k=N+1}^{\infty} \bigl(a_k \cos(kx) + b_k \sin(kx)\bigr)\right| \le \sum_{k=N+1}^{\infty} \bigl(|a_k| + |b_k|\bigr).$$
The right-hand side can further be estimated by the coefficients $a'_n$ and $b'_n$ of the derivative $f'$. By applying in succession the inequality above, then Hölder's inequality for infinite series (with $p = q = 2$), and then the arithmetic-geometric inequality, we obtain
CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING
side and a constant $a > 0$. Let $T > 0$ be the prime period of the function $f$ and let its Fourier series on $[-T/2, T/2]$ be known, i.e. we assume
(4) $f(x) = \frac{A_0}{2} + \sum_{n=1}^{\infty}\left(A_n\cos\frac{2\pi n x}{T} + B_n\sin\frac{2\pi n x}{T}\right).$
7.B.18.
Prove that if the equation (3) has a periodic solution on $\mathbb{R}$, then the period of this solution is also a period of the function $f$. Further, prove that the equation (3) has a unique periodic solution with period $T$ if and only if
(1) $a \ne \frac{2\pi n}{T}$ for every $n \in \mathbb{N}$.
Solution. Let a function $y = g(x)$, $x \in \mathbb{R}$, be a solution of the equation (3) with $f(x) \not\equiv 0$ and with period $p > 0$. In order to substitute the function $g$ into a second-order differential equation, its second derivative $g''$ must exist. Since the functions $g, g', g'', \dots$ share the same period, the function
$$g''(x) + a^2 g(x) = f(x)$$
is also periodic with period $p$. In other words, the function $f$ is periodic as a linear combination of functions with period $p$. Thus, we have proved the first statement claiming that $p = lT$ for a certain $l \in \mathbb{N}$.
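Suppose that the function $y = g(x)$, $x \in \mathbb{R}$, is a periodic solution of the equation (3) with period $T$ and that it is expressed by a Fourier series as follows:
(2) $g(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(\omega n x) + b_n\sin(\omega n x)\big),$
where $\omega = 2\pi/T$. If $g$ satisfies the equation (3), it has a continuous second derivative on $\mathbb{R}$. Therefore, for $x \in \mathbb{R}$,
$$g'(x) = \sum_{n=1}^{\infty}\big(\omega n b_n\cos(\omega n x) - \omega n a_n\sin(\omega n x)\big),$$
(3) $g''(x) = \sum_{n=1}^{\infty}\big(-\omega^2 n^2 a_n\cos(\omega n x) - \omega^2 n^2 b_n\sin(\omega n x)\big).$
Substituting (4), (2) and (3) into (3) yields
$$a^2\frac{a_0}{2} + \sum_{n=1}^{\infty}\Big(\big(-\omega^2 n^2 a_n + a^2 a_n\big)\cos(n\omega x) + \big(-\omega^2 n^2 b_n + a^2 b_n\big)\sin(n\omega x)\Big) = \frac{A_0}{2} + \sum_{n=1}^{\infty}\big(A_n\cos(n\omega x) + B_n\sin(n\omega x)\big).$$
It follows that
(4) $a^2\frac{a_0}{2} = \frac{A_0}{2}$, that is, $a^2 a_0 = A_0$,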
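$$|s_N(x) - f(x)| \le \sum_{k=N+1}^{\infty}\frac{1}{k}\big(|a'_k| + |b'_k|\big) \le \Big(\sum_{k=N+1}^{\infty}\frac{1}{k^2}\Big)^{1/2}\Big(\sum_{k=N+1}^{\infty}\big(|a'_k|^2 + 2|a'_k||b'_k| + |b'_k|^2\big)\Big)^{1/2}$$
$$\le \Big(\sum_{k=N+1}^{\infty}\frac{1}{k^2}\Big)^{1/2}\Big(\sum_{k=N+1}^{\infty}\big(2|a'_k|^2 + 2|b'_k|^2\big)\Big)^{1/2} \le \sqrt{2}\,\Big(\int_N^{\infty}\frac{1}{x^2}\,dx\Big)^{1/2}\|f'\|_2.$$
Thus we have obtained not only a proof of the uniform convergence of our series to the anticipated value, but also a bound for the speed of the convergence:
$$\sup_x |s_N(x) - f(x)| \le \big(\sqrt{2}\,\|f'\|_2\big)\cdot\frac{1}{\sqrt{N}}.$$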
This proves the statement 7.1.8.(2), supposing the Dirichlet condition 7.1.8.(1) holds.
7.1.16. $L_2$-convergence. In the next step of our proof, we derive $L_2$-convergence of Fourier series under the condition of uniform convergence. The proof utilizes the common technique of approximating objects which are not continuous by ones which are. We describe it without further details. Interested readers should be able to fill in the gaps by themselves without any difficulties. First, we formulate the statement we need.
Lemma. The subset of continuous functions $f$ in $S^0[a, b]$ on a finite interval $[a, b]$ is a dense subset in this space with respect to the $L_2$-norm.
Here "dense" means that for any $g$ in $S^0[a, b]$ and any $\varepsilon > 0$, there is some continuous $f$ satisfying $\|f - g\|_2 < \varepsilon$. We deal with abstract topological concepts like this in the last part of this chapter.
The idea of the proof can be seen easily via the example of approximation of Heaviside's function $h$ on the interval $[-\pi, \pi]$. We recall that $h(x) = -1$ for $x < 0$, and $h(x) = 1$ for $x > 0$. For every $\delta$ satisfying $\pi > \delta > 0$, we define the function $f_\delta$ as $f_\delta(x) = x/\delta$ for $|x| \le \delta$ and $f_\delta(x) = h(x)$ otherwise. All the functions $f_\delta$ are continuous, in fact, piecewise linear. It can be calculated easily that $\|h - f_\delta\|_2 \to 0$ as $\delta \to 0$, so that $h$ can be approximated in the $L_2$-norm by a sequence of continuous functions.
All discontinuity points of a general function $f$ can be catered for in exactly the same way. There are only finitely many of them, and so all of the considered functions are limit points of sequences of continuous functions.
Now, our proof is already simple because for the given function $f$, the distance between the partial sums of its
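and for $n \in \mathbb{N}$,
(5) $(-\omega^2 n^2 + a^2)\,a_n = A_n, \qquad (-\omega^2 n^2 + a^2)\,b_n = B_n.$
There is exactly one pair of sequences $\{a_n\}_{n \in \mathbb{N}\cup\{0\}}$, $\{b_n\}_{n \in \mathbb{N}}$ satisfying these conditions if and only if
$$-\omega^2 n^2 + a^2 = -\left(\frac{2\pi n}{T}\right)^2 + a^2 \ne 0 \quad\text{for every } n \in \mathbb{N},$$
i.e., if (1) holds. In this case, the only solution of (3) with period $T$ is determined by the only solution
(6) $a_n = \frac{A_n}{-\omega^2 n^2 + a^2}, \qquad b_n = \frac{B_n}{-\omega^2 n^2 + a^2}, \qquad n \in \mathbb{N}$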
of the system (5). We emphasize that we utilized the uniform convergence of the series in (3). □
7.B.19. Using the solution of the previous problem, find all $2\pi$-periodic solutions of the differential equation
$$y'' + 2y = \sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2}, \qquad x \in \mathbb{R}.$$
Solution. The equation is in the form of (3) for $a = \sqrt{2}$ and the continuously differentiable function
$$f(x) = \sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2}, \qquad x \in \mathbb{R},$$
with prime period $T = 2\pi$. According to problem 7.B.18, the condition $\sqrt{2} \notin \mathbb{N}$ implies that there is exactly one $2\pi$-periodic solution. If we look for it as the value of the series
$$\frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big),$$
we know also that (see (4) and (6))
$$a_0 = a_n = 0, \qquad b_n = \frac{1}{n^2(2 - n^2)}, \qquad n \in \mathbb{N}.$$
Thus, the given equation has the unique $2\pi$-periodic solution
$$y = \sum_{n=1}^{\infty}\frac{\sin(nx)}{n^2(2 - n^2)}, \qquad x \in \mathbb{R}.$$
□
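The result of 7.B.19 can be tested numerically. The sketch below is only an illustration under our own assumptions: the series are truncated after 50 terms, the second derivative is replaced by a central difference with step `h`, and the names `rhs`, `solution` and `residual` are hypothetical helpers. Since the truncated $y$ solves the equation with the equally truncated right-hand side, the residual should vanish up to the discretization error.

```python
import math

def rhs(x, terms=50):
    """Partial sum of f(x) = sum_{n>=1} sin(n x) / n^2."""
    return sum(math.sin(n * x) / n**2 for n in range(1, terms + 1))

def solution(x, terms=50):
    """Partial sum of y(x) = sum_{n>=1} sin(n x) / (n^2 (2 - n^2))."""
    return sum(math.sin(n * x) / (n**2 * (2 - n**2)) for n in range(1, terms + 1))

def residual(x, h=1e-4, terms=50):
    """Value of y'' + 2y - f at x, with y'' approximated by a central second difference."""
    second_derivative = (solution(x + h, terms) - 2 * solution(x, terms)
                         + solution(x - h, terms)) / h**2
    return second_derivative + 2 * solution(x, terms) - rhs(x, terms)

for x in (0.3, 1.0, 2.5, -1.7):
    print(x, residual(x))
```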
C. Convolution and Fourier Transform
Convolution is a typical integral operation used for smoothing data. See 7.2.2 for the definition and basic comments. It is defined by the formula
$$(f * g)(y) = \int_{-\infty}^{\infty} f(x)\,g(y - x)\,dx.$$
The next problems introduce these features.
7.C.1. Find the convolutions $f * g$ and $f * h$ of the following functions and in each case check the "smoothing" of $f$.
Fourier series can be bounded by using a sequence of continuous functions $f_\varepsilon$ in this way (all norms in this paragraph are the $L_2$-norms):
$$\|f - s_N(f)\| \le \|f - f_\varepsilon\| + \|f_\varepsilon - s_N(f_\varepsilon)\| + \|s_N(f_\varepsilon) - s_N(f)\|,$$
and the particular summands on the right-hand side can be controlled.
The first one of them is at most $\varepsilon$, and according to the assumption of uniform convergence for continuous functions, the second summand can be bounded also by $\varepsilon$. Notice that the third term has the value of the partial sum of the Fourier series for $f - f_\varepsilon$. Thus,
$$\|f - f_\varepsilon - s_N(f - f_\varepsilon)\| \le \|f - f_\varepsilon\|.$$
Therefore by the triangle inequality,
$$\|s_N(f - f_\varepsilon)\| \le \|s_N(f - f_\varepsilon) - f + f_\varepsilon\| + \|f - f_\varepsilon\| \le 2\|f - f_\varepsilon\|.$$
Altogether, $\|f - s_N(f)\| \le 4\varepsilon$.
This verifies the $L_2$-convergence of the continuous functions $s_N(f)$ to $f$, which is what we wanted to prove.
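7.1.17. Dirichlet kernel. Finally, we arrive at the proof of the first statement of theorem 7.1.8. It follows from the definition of the Fourier series $F(t)$ for a function $f(t)$, using its expression with the complex exponential in 7.1.7, that the partial sums $s_N(t)$ can be written as
$$s_N(t) = \sum_{k=-N}^{N}\frac{1}{T}\left(\int_{-T/2}^{T/2} f(x)\,e^{-i\omega k x}\,dx\right)e^{i\omega k t},$$
where $T$ is the period we are working with and $\omega = 2\pi/T$. This expression can be rewritten as
(1) $s_N(t) = \int_{-T/2}^{T/2} K_N(t - x)\,f(x)\,dx,$
where the function
$$K_N(y) = \frac{1}{T}\sum_{k=-N}^{N} e^{i\omega k y}$$
is called the Dirichlet kernel. The sum is a (finite) geometric series with common ratio $e^{i\omega y}$. By multiplying by $e^{i\omega y}$ and then subtracting, we obtain
$$(1 - e^{i\omega y})\,K_N(y) = \frac{1}{T}\sum_{k=-N}^{N}\big(e^{i\omega k y} - e^{i\omega(k+1)y}\big) = \frac{1}{T}\big(e^{-iN\omega y} - e^{i(N+1)\omega y}\big).$$
Provided $\omega y$ is not a multiple of $2\pi$, we continue to obtain
$$K_N(y) = \frac{1}{T}\,\frac{e^{-iN\omega y} - e^{i(N+1)\omega y}}{1 - e^{i\omega y}} = \frac{1}{T}\,\frac{-e^{-i(N+1/2)\omega y} + e^{i(N+1/2)\omega y}}{e^{i\omega y/2} - e^{-i\omega y/2}} = \frac{1}{T}\,\frac{\sin((N + 1/2)\omega y)}{\sin(\omega y/2)},$$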
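$$f(x) = \sin(x) + \frac{2}{5}[\sin(6x)]^2 - \frac{1}{5}\sin(60x),$$
$$g(x) = \begin{cases}\frac{1}{2\varepsilon} & -\varepsilon < x < \varepsilon,\\ 0 & \text{otherwise,}\end{cases} \qquad \varepsilon > 0,$$
$$h(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \qquad \mu, \sigma \in \mathbb{R},\ \sigma > 0.$$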
Solution. The function $g$ is chosen to provide the mean of $f$ over (small) intervals of length $2\varepsilon$ and it is normalized so that the integral of $g$ over all of $\mathbb{R}$ is one. We should expect some smoothing of the oscillations of the function $f(x)$. Drawing the resulting functions by Maple shows:
The graph depicted in the upper left is the original $f$, while the graph in the upper right is of $g$ with $\varepsilon = 1/10$. The two lower graphs show their convolution with (respectively) parameters for $g$ selected as $\varepsilon = 1/10$ and $\varepsilon = 13/50$. It is straightforward to compute the convolution explicitly, too:
in which the key step was to multiply both the numerator and the denominator by $-e^{-i\omega y/2}$. When $y = 0$, $K_N(0) = \frac{1}{T}(2N + 1)$.
The last expression shows that $K_N(y)$ is an even function. By l'Hospital's rule, applied at $y = 0$, it is continuous everywhere. Since all the partial sums of the series for the constant function $f(x) = 1$ also equal 1, we obtain from the definition of the Dirichlet kernel, cf. (1), that
$$\int_{-T/2}^{T/2} K_N(x)\,dx = 1.$$
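Both facts about the Dirichlet kernel, the closed form and the unit integral over one period, are easy to check numerically. The following sketch is illustrative; the function names and the values of $N$ and $T$ used in the loop are our assumptions. It compares the defining sum with the closed form and integrates the kernel by the midpoint rule.

```python
import math

def dirichlet_sum(y, N, T):
    """K_N(y) = (1/T) * sum_{k=-N}^{N} exp(i*omega*k*y); the imaginary parts cancel in pairs."""
    omega = 2 * math.pi / T
    return (1 + 2 * sum(math.cos(omega * k * y) for k in range(1, N + 1))) / T

def dirichlet_closed(y, N, T):
    """Closed form (1/T) * sin((N + 1/2)*omega*y) / sin(omega*y/2), with the limit (2N+1)/T."""
    omega = 2 * math.pi / T
    s = math.sin(omega * y / 2)
    if abs(s) < 1e-12:  # omega*y is (numerically) a multiple of 2*pi
        return (2 * N + 1) / T
    return math.sin((N + 0.5) * omega * y) / (T * s)

def integral_over_period(N, T, samples=4000):
    """Midpoint-rule approximation of the integral of K_N over [-T/2, T/2]."""
    h = T / samples
    return h * sum(dirichlet_closed(-T / 2 + (j + 0.5) * h, N, T) for j in range(samples))

for y in (-0.9, -0.3, 0.0, 0.1, 0.77):
    print(y, dirichlet_sum(y, 5, 2.0), dirichlet_closed(y, 5, 2.0))
print(integral_over_period(5, 2.0), integral_over_period(15, 3.0))
```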
In the case of periodic functions, the integrals over intervals whose length equals the period are independent of the choice of the end points. Hence, changing the coordinates, we can also use the expression
$$s_N(x) = \int_{-T/2}^{T/2} K_N(y)\,f(x + y)\,dy$$
for the partial sums.
Finally, we are fully prepared. First, we consider the case when the function $f$ is continuous and differentiable at the point $x$. We want to prove that in this case, the Fourier series $F(x)$ for the function $f$ converges to the value $f(x)$ at the point $x$. We have
$$s_N(x) - f(x) = \int_{-T/2}^{T/2}\big(f(x + y) - f(x)\big)K_N(y)\,dy.$$
The integrand can be rewritten
$$\frac{1}{T}\,\frac{f(x + y) - f(x)}{\sin(\omega y/2)}\,\sin((N + 1/2)\omega y) = \frac{1}{T}\,\psi_x(y)\big(\cos(\omega y/2)\sin(N\omega y) + \sin(\omega y/2)\cos(N\omega y)\big),$$
where
$$\psi_x(y) = \frac{f(x + y) - f(x)}{\sin(\omega y/2)}$$
for $y \ne 0$, while $\psi_x(0) = 2f'(x)/\omega$. Since $f$ is differentiable and continuous at the point $x$, the function $\psi_x(y)$ is clearly continuous on the entire interval $[-T/2, T/2]$ (use L'Hospital's rule to see it).
Now, we can rewrite the integral expression for $s_N - f$ as
$$\frac{1}{T}\int_{-T/2}^{T/2}\big(\varphi_1(y)\sin(N\omega y) + \varphi_2(y)\cos(N\omega y)\big)\,dy$$
with the continuous and bounded functions
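$$\varphi_1(y) = \psi_x(y)\cos(\omega y/2), \qquad \varphi_2(y) = \psi_x(y)\sin(\omega y/2).$$
For the functions $f$ and $g$ of problem 7.C.1, the explicit convolution is
$$(f * g)(y) = \frac{1}{2\varepsilon}\int_{y-\varepsilon}^{y+\varepsilon}\left(\sin(x) + \frac{2}{5}\sin^2(6x) - \frac{1}{5}\sin(60x)\right)dx = \frac{1}{2\varepsilon}\left[-\cos(x) - \frac{1}{30}\sin(6x)\cos(6x) + \frac{x}{5} + \frac{1}{300}\cos(60x)\right]_{y-\varepsilon}^{y+\varepsilon}.$$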
Similarly, the other function h is a typical smoothing function which gives much more weight to the values of / near the point y, and much less weight to the values of / further from y. This is the famous Gaussian function. We meet it frequently. Using Maple again, we see that the integral of h
Thus, we deal with the Fourier coefficients of the functions $\varphi_1$ and $\varphi_2$, which converge to zero. Hence
$$\lim_{N\to\infty}\int_{-T/2}^{T/2}\varphi_1(y)\sin(N\omega y)\,dy = 0$$
and
$$\lim_{N\to\infty}\int_{-T/2}^{T/2}\varphi_2(y)\cos(N\omega y)\,dy = 0.$$
over R is one (we prove this in 13.2.8). It is not easy to find the convolution analytically, but Maple can do it approximately. The resulting diagrams are as follows.
As before, the graph depicted in the upper left is the original $f$, while the graph in the upper right is of $h$ with $\mu = 0$ and $\sigma = 1/10$. Below that, their convolutions are shown with parameters $\mu = 0$, $\sigma^2 = 1/60$ and (lower right) $\mu = 0$, $\sigma^2 = 5/60$.
□
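The convolution of the signal from 7.C.1 with the box kernel $g$ is just a moving average of $f$ over the window $(y - \varepsilon, y + \varepsilon)$, and this can be checked numerically. The sketch below is an illustration under our assumptions: the helper names are hypothetical, the primitive function is the one obtained by direct integration of $f$, and the midpoint rule stands in for the integral.

```python
import math

def f(x):
    """Signal from 7.C.1: sin(x) + (2/5) sin^2(6x) - (1/5) sin(60x)."""
    return math.sin(x) + 0.4 * math.sin(6 * x) ** 2 - 0.2 * math.sin(60 * x)

def antiderivative(x):
    """A primitive function of f, obtained by termwise integration."""
    return (-math.cos(x) - math.sin(6 * x) * math.cos(6 * x) / 30
            + x / 5 + math.cos(60 * x) / 300)

def conv_box_exact(y, eps):
    """(f * g)(y) for the box kernel g = 1/(2*eps) on (-eps, eps)."""
    return (antiderivative(y + eps) - antiderivative(y - eps)) / (2 * eps)

def conv_box_numeric(y, eps, samples=2000):
    """The same convolution as a midpoint-rule average of f over (y - eps, y + eps)."""
    h = 2 * eps / samples
    return sum(f(y - eps + (j + 0.5) * h) for j in range(samples)) / samples

for y in (-2.0, 0.0, 0.4, 3.1):
    print(y, f(y), conv_box_exact(y, 0.1), conv_box_numeric(y, 0.1))
```

Averaging over a window of half-width $\varepsilon$ multiplies the component $\sin(60x)$ by the factor $\sin(60\varepsilon)/(60\varepsilon)$, which is why the fast oscillation is suppressed much more strongly than the slow term $\sin(x)$.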
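7.C.2. Determine the convolution $f_1 * f_2$ where
$$f_1(x) = \frac{1}{x} \ \text{ for } x \ne 0, \qquad f_2(x) = \begin{cases} x & \text{for } x \in [-1, 1],\\ 0 & \text{otherwise.}\end{cases}$$
Solution.
$$(f_1 * f_2)(t) = \int_{-\infty}^{\infty} f_1(x)\,f_2(t - x)\,dx = \int_{-\infty}^{\infty}\frac{1}{x}\,f_2(t - x)\,dx.$$
Since $f_2(x) = 0$ outside the interval $[-1, 1]$, necessarily $-1 \le t - x \le 1$, that is, $t - 1 \le x \le t + 1$. So
$$(f_1 * f_2)(t) = \int_{t-1}^{t+1}\frac{1}{x}(t - x)\,dx = t\int_{t-1}^{t+1}\frac{1}{x}\,dx - 2.$$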
But this means $\lim_{N\to\infty}\big(s_N(x) - f(x)\big) = 0$, as desired. See 7.1.5.(2).
Now suppose the function $f$ or its derivative has a discontinuity at $x = 0$. Since the function belongs to $S^1$, it is continuous and differentiable on a neighbourhood of the point $x = 0$ (outside the point itself). Split $f$ into its even part $f_1$ and its odd part $f_2$. That is, write $f(x) = f_1(x) + f_2(x)$, where
$$f_1(x) = \tfrac{1}{2}\big(f(x) + f(-x)\big) \ \text{ for } x \ne 0, \qquad f_1(0) = \tfrac{1}{2}\Big(\lim_{x\to 0+} f(x) + \lim_{x\to 0-} f(x)\Big), \qquad f_2(x) = \tfrac{1}{2}\big(f(x) - f(-x)\big).$$
Then the even part $f_1(x)$ is continuous and differentiable at the point $x = 0$ because of the existence of the one-sided limits, and so on a neighbourhood of the point $x = 0$. Also $f_2(0) = 0$, and the Fourier series for $f_2$ contains only the terms with $\sin(n\omega x)$.
Thus we can refer to the previous continuous case and obtain, for the Fourier series $F(x)$ of the function $f$, the equation
$$F(0) = \tfrac{1}{2}\Big(\lim_{x\to 0+} f(x) + \lim_{x\to 0-} f(x)\Big) + 0,$$
which is what we wanted to prove.
In the case of discontinuity at a point other than $x = 0$, we can proceed similarly. This completes the proof. This also completes the proof of the statements (2) and (3) of theorem 7.1.8 where we required that the Dirichlet condition be true.
2. Integral operators
7.2.1. Functional. In the case of finite-dimensional vector spaces, we can regard the vectors as mappings from a finite set of fixed generators into the space of coordinates. The sums of vectors and the scalar multiples of vectors were then given by the corresponding operations with such functions. We also worked with the vector spaces of functions of a real variable in the same way when their values were scalars (or vectors as well).
The simplest linear mapping $\alpha$ between vector spaces maps vectors to scalars. It is called a linear functional. It is defined as the sum of products of coordinates $x_i$ of vectors with fixed values $\alpha_i = \alpha(e_i)$ at the generators $e_i$, i.e. by one-row matrices:
$$(x_1, \dots, x_n)^T \mapsto (\alpha_1, \dots, \alpha_n)\cdot(x_1, \dots, x_n)^T.$$
More complicated mappings, with values lying in the same space, are given similarly by square matrices. We approach linear operations on spaces of functions in an analogous way.
For simplicity, we work with the real vector space $S$ of all piecewise continuous real-valued functions having compact support and defined on the whole $\mathbb{R}$ or on an interval $I = [a, b]$. Linear mappings $S \to \mathbb{R}$ are called (real) linear functionals. Examples of such functionals can be given in
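This last integral is improper if $t - 1 < 0 < t + 1$, that is, if $-1 < t < 1$. For $t$ outside that interval, the integration gives
$$(f_1 * f_2)(t) = t\ln\left|\frac{t + 1}{t - 1}\right| - 2.$$
If instead $-1 < t < 1$, we can, for small $\varepsilon > 0$, replace $\int_{t-1}^{t+1}\frac{1}{x}\,dx$ with
$$\int_{t-1}^{-\varepsilon}\frac{1}{x}\,dx + \int_{\varepsilon}^{t+1}\frac{1}{x}\,dx,$$
which computes to $\ln|t + 1| - \ln|\varepsilon| + \ln|-\varepsilon| - \ln|t - 1|$. The terms in $\varepsilon$ cancel, so when we take the limit $\varepsilon \to 0$, we obtain the same answer for the integral as before. Thus
$$(f_1 * f_2)(t) = t\ln\left|\frac{t + 1}{t - 1}\right| - 2$$
for all values of $t$ except for $t = 1$ and $t = -1$.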
□
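The closed form from 7.C.2 can be compared with a direct numerical integration. The sketch below is illustrative and rests on our own choices: the helper names are hypothetical, the midpoint rule is used away from the singularity, and for $-1 < t < 1$ the symmetric cancellation of the odd term $t/x$ around the origin plays the role of the limit $\varepsilon \to 0$ from the solution.

```python
import math

def midpoint(func, lo, hi, samples=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / samples
    return h * sum(func(lo + (j + 0.5) * h) for j in range(samples))

def conv_formula(t):
    """Closed form t * ln|(t + 1)/(t - 1)| - 2, valid for t != 1 and t != -1."""
    return t * math.log(abs((t + 1) / (t - 1))) - 2

def conv_numeric(t):
    """Integral of (t - x)/x over [t - 1, t + 1], as a principal value if 0 lies inside."""
    integrand = lambda x: (t - x) / x
    lo, hi = t - 1, t + 1
    if lo < 0 < hi:
        c = min(-lo, hi)
        # On the symmetric part [-c, c] the odd term t/x cancels,
        # so only the constant term -1 contributes, giving -2c.
        symmetric_part = -2 * c
        rest = midpoint(integrand, c, hi) if hi > c else midpoint(integrand, lo, -c)
        return symmetric_part + rest
    return midpoint(integrand, lo, hi)

for t in (-4.0, -0.5, 0.0, 0.5, 3.0):
    print(t, conv_numeric(t), conv_formula(t))
```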
We calculate the convolution of two functions both of which have a bounded support.
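7.C.3. Determine the convolution $f_1 * f_2$ where
$$f_1(x) = \begin{cases} 1 - x^2 & \text{for } x \in [-1, 1],\\ 0 & \text{otherwise,}\end{cases} \qquad f_2(x) = \begin{cases} x & \text{for } x \in [0, 1],\\ 0 & \text{otherwise.}\end{cases}$$
Solution. Since the integrand is zero when $f_1(x) = 0$,
$$(f_1 * f_2)(t) = \int_{-\infty}^{\infty} f_1(x)\,f_2(t - x)\,dx = \int_{-1}^{1}(1 - x^2)\,f_2(t - x)\,dx.$$
But the integrand is also zero when $f_2(t - x) = 0$, so we need $0 \le t - x \le 1$, i.e. $t - 1 \le x \le t$, for the integrand to be nonzero. So for a nonzero value of $(f_1 * f_2)(t)$, we integrate over the intersection of the intervals $[t - 1, t]$ and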
two different ways — by evaluating the function's values (or its derivatives') at some fixed points or in terms of integration.
We can, for instance, consider the functional $L$ given by evaluating the function at a sole fixed point $x_0 \in I$, i.e.,
$$L(f) = f(x_0).$$
Or, we can have the functional given by integration of the product with a fixed function $g(x)$, i.e.,
$$L(f) = \int_a^b f(x)\,g(x)\,dx.$$
The function $g(x)$ in the previous example is a function which weighs the particular values representing the function $f(x)$ in the definition of the Riemann integral. The simplest case of such a functional is, of course, the Riemann integral itself, i.e. the case of $g(x) = 1$ for all points $x$. A good example is given by
$$g(x) = \begin{cases} 0 & \text{if } |x| \ge \varepsilon,\\ \frac{1}{2\varepsilon} & \text{if } |x| < \varepsilon,\end{cases}$$
for any $\varepsilon > 0$. The integral of the function $g$ over $\mathbb{R}$ equals one, and our linear functional can be perceived as a (uniform) averaging of the values of the function $f$ over the $\varepsilon$-neighbourhood of the origin. Similarly, we can work with the function
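$$g(x) = \begin{cases} 0 & \text{if } |x| \ge \varepsilon,\\ e^{\frac{1}{x^2 - \varepsilon^2}} & \text{if } |x| < \varepsilon,\end{cases}$$
which we used in the paragraph 6.1.5. This function is smooth on $\mathbb{R}$ with compact support on the interval $(-\varepsilon, \varepsilon)$.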
Our functional has the meaning of a weighted combination of the values, but this time, the weights of the input values decrease rapidly as their distance from the origin increases. The integral of $g$ over $\mathbb{R}$ is finite, but it is not equal to one. Dividing $g$ by this integral would lead to a functional which would have the meaning of a non-uniform averaging of a given function $f$.
Another example is the Gaussian function
$$g(x) = \frac{1}{\sqrt{\pi}}\,e^{-x^2},$$
which also has its integral over R equal to one (we verify this later). This time, all the input values x in the corresponding "average" have a non-zero weight, yet this weight becomes insignificantly small as the distance from the origin increases.
7.2.2. Function convolution. Integral functionals from the previous paragraph can easily be modified to obtain a "streamed averaging" of the values of a given function $f$ near a given point $y \in \mathbb{R}$:
$$L_y(f) = \int_{-\infty}^{\infty} f(x)\,g(y - x)\,dx.$$
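$[-1, 1]$. Consequently,
$$(f_1 * f_2)(t) = \begin{cases} 0 & \text{if } t \ge 2,\\[2pt] \int_{t-1}^{1}(1 - x^2)(t - x)\,dx = \frac{4t}{3} - t^2 + \frac{t^4}{12} & \text{if } 1 \le t < 2,\\[2pt] \int_{t-1}^{t}(1 - x^2)(t - x)\,dx = -\frac{t^2}{2} + \frac{1}{4} + \frac{2t}{3} & \text{if } 0 \le t < 1,\\[2pt] \int_{-1}^{t}(1 - x^2)(t - x)\,dx = -\frac{t^4}{12} + \frac{t^2}{2} + \frac{1}{4} + \frac{2t}{3} & \text{if } -1 \le t < 0,\\[2pt] 0 & \text{if } t < -1.\end{cases}$$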
□
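The piecewise polynomial from 7.C.3 can be checked against a direct numerical evaluation of the convolution integral. This is a minimal sketch under our assumptions: the function names are hypothetical, and the midpoint rule is only first-order accurate here because the integrand has a jump at $x = t - 1$, so the comparison uses a moderate tolerance.

```python
def f1(x):
    """f1(x) = 1 - x^2 on [-1, 1], zero otherwise."""
    return 1 - x * x if -1 <= x <= 1 else 0.0

def f2(x):
    """f2(x) = x on [0, 1], zero otherwise."""
    return x if 0 <= x <= 1 else 0.0

def conv_numeric(t, samples=20000):
    """Midpoint-rule approximation of the integral of f1(x) * f2(t - x) over [-1, 1]."""
    h = 2.0 / samples
    return h * sum(f1(x) * f2(t - x) for x in (-1 + (j + 0.5) * h for j in range(samples)))

def conv_piecewise(t):
    """The piecewise polynomial obtained in the solution of 7.C.3."""
    if t >= 2 or t < -1:
        return 0.0
    if t >= 1:
        return 4 * t / 3 - t**2 + t**4 / 12
    if t >= 0:
        return -t**2 / 2 + 1 / 4 + 2 * t / 3
    return -t**4 / 12 + t**2 / 2 + 1 / 4 + 2 * t / 3

for t in (-1.5, -0.5, 0.25, 0.5, 1.5, 2.5):
    print(t, conv_numeric(t), conv_piecewise(t))
```

The pieces also match at the break points $t = -1, 0, 1, 2$, so the convolution is a continuous function although $f_2$ is not.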
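7.C.4. Determine the convolution $f_1 * f_2$ of the functions
$$f_1(x) = \begin{cases} 1 - x & \text{for } x \in [-2, 1],\\ 0 & \text{otherwise,}\end{cases} \qquad f_2(x) = \begin{cases} 1 & \text{for } x \in [0, 1],\\ 0 & \text{otherwise.}\end{cases}$$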
o
The next topic is the Fourier transform, which is another example of an integral operator. This time the kernel $e^{-i\omega t}$ is complex (see 7.2.5 for the terminology). Thus the values on real functions are complex functions in general. This is a basic operation in mathematics, allowing the time and frequency analysis of signals and also the transitions between local and global behaviour.
7.C.5. Fix $\Omega > 0$. Recall that $\operatorname{sgn} t = 1$ if $t > 0$, $\operatorname{sgn} t = -1$ if $t < 0$, and $\operatorname{sgn} 0 = 0$.
Find the Fourier transform $\mathcal{F}(f)$ and the inverse Fourier transform $\mathcal{F}^{-1}(f)$ of the functions:
(a) $f(t) = \operatorname{sgn} t$ if $t \in (-\Omega, \Omega)$, and zero otherwise.
(b) $f(t) = 1$ if $t \in (-\Omega, \Omega)$, and zero otherwise.
Solution. The case (a).
The Fourier transform of the given function is
$$\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t)\,e^{-i\omega t}\,dt = \frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\operatorname{sgn} t\,\big(\cos(\omega t) - i\sin(\omega t)\big)\,dt$$
Convolution of functions of a real variable
The free parameter y in the definition of the functional Ly(f) can be perceived as a new independent variable, and our operation Ly actually maps functions to functions again,
$$f \mapsto \tilde f, \qquad \tilde f(y) = L_y(f) = \int_{-\infty}^{\infty} f(x)\,g(y - x)\,dx.$$
This operation is called the convolution of functions $f$ and $g$, denoted $f * g$.
The convolution is usually defined for real or complex valued functions on $\mathbb{R}$ with compact support.
By the transformation $t = z - x$, we can easily calculate that
$$(f * g)(z) = \int_{-\infty}^{\infty} f(x)\,g(z - x)\,dx = \int_{-\infty}^{\infty} f(z - t)\,g(t)\,dt = (g * f)(z).$$
Thus the convolution, considered as a binary operation of pairs of functions having compact support, is commutative.
Similarly, convolutions can be considered with integration over a finite interval; we only have to guarantee that the functions participating in them are well-defined. In particular, this can be done for periodic functions, integrating over an interval whose length equals the period.
Convolution is an extraordinarily useful tool for modeling the way in which we observe the data of an experiment or the influence of a medium through which information is transferred, for instance an analog audio or video signal affected by noise. The input value $f$ is the transferred information. The function $g$ is chosen so that it expresses the influence of the medium or the technical procedure used for the signal processing or the processing of any other data.
7.2.3. Gibbs phenomenon. Actually, we have already seen a useful case of convolution. In paragraph 7.1.17, we interpreted the partial sum of the Fourier series for a function $f$ as a convolution with the Dirichlet kernel $K_N(y) = \frac{1}{T}\sum_{k=-N}^{N} e^{i\omega k y}$. The figure shows this convolution kernel with $N = 5$ and $N = 15$.
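$$= \frac{1}{\sqrt{2\pi}}\left(\int_0^{\Omega}\big(\cos(\omega t) - i\sin(\omega t)\big)\,dt - \int_{-\Omega}^{0}\big(\cos(\omega t) - i\sin(\omega t)\big)\,dt\right).$$
Since cos and sin are respectively even and odd functions,
$$\mathcal{F}(f)(\omega) = \frac{-2i}{\sqrt{2\pi}}\int_0^{\Omega}\sin(\omega t)\,dt = i\sqrt{\frac{2}{\pi}}\,\frac{\cos(\omega\Omega) - 1}{\omega}.$$
The inverse Fourier transform is given by almost the same integral, with the kernel $e^{i\omega t}$ instead of $e^{-i\omega t}$. The integration is in the frequency domain with variable $\omega$. Thus, the only difference in the result is the sign:
$$\mathcal{F}^{-1}(f)(t) = \frac{2i}{\sqrt{2\pi}}\left[-\frac{\cos(\omega t)}{t}\right]_0^{\Omega} = i\sqrt{\frac{2}{\pi}}\,\frac{1 - \cos(\Omega t)}{t}.$$
Case (b) is computed similarly:
$$\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\big(\cos(\omega t) - i\sin(\omega t)\big)\,dt = \frac{2}{\sqrt{2\pi}}\int_0^{\Omega}\cos(\omega t)\,dt = \frac{2}{\sqrt{2\pi}}\left[\frac{\sin(\omega t)}{\omega}\right]_0^{\Omega} = \sqrt{\frac{2}{\pi}}\,\frac{\sin(\omega\Omega)}{\omega}.$$
The latter expression is often expressed by means of the function $\operatorname{sinc}(t) = \sin(t)/t$:
$$\mathcal{F}(f)(\omega) = \frac{2\Omega}{\sqrt{2\pi}}\operatorname{sinc}(\omega\Omega).$$
Here, the inverse Fourier transform has exactly the same result, because the sign change in the kernel does not affect the real part. Thus we only need to interchange the time and frequency variables:
$$\mathcal{F}^{-1}(f)(t) = \frac{2\Omega}{\sqrt{2\pi}}\operatorname{sinc}(\Omega t).$$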
□
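Both images from 7.C.5 can be confirmed by numerical integration. The sketch below is an illustration under our assumptions: the value $\Omega = 2$, the midpoint rule, and the names `fourier_transform`, `image_sign` and `image_box` are not taken from the text. It uses the normalization $\frac{1}{\sqrt{2\pi}}\int f(t)e^{-i\omega t}\,dt$ from the solution.

```python
import cmath
import math

def fourier_transform(f, omega, lo, hi, samples=20000):
    """Midpoint-rule value of (1/sqrt(2*pi)) * int_{lo}^{hi} f(t) exp(-i*omega*t) dt."""
    h = (hi - lo) / samples
    total = 0j
    for j in range(samples):
        t = lo + (j + 0.5) * h
        total += f(t) * cmath.exp(-1j * omega * t)
    return total * h / math.sqrt(2 * math.pi)

OMEGA_CUT = 2.0  # illustrative value of the parameter Omega

def sgn(t):
    """Sign function with sgn(0) = 0."""
    return (t > 0) - (t < 0)

def image_sign(omega):
    """Result of case (a): i * sqrt(2/pi) * (cos(Omega*omega) - 1) / omega."""
    return 1j * math.sqrt(2 / math.pi) * (math.cos(OMEGA_CUT * omega) - 1) / omega

def image_box(omega):
    """Result of case (b): sqrt(2/pi) * sin(Omega*omega) / omega."""
    return math.sqrt(2 / math.pi) * math.sin(OMEGA_CUT * omega) / omega

for omega in (0.5, 1.3, 4.0):
    print(omega,
          fourier_transform(sgn, omega, -OMEGA_CUT, OMEGA_CUT), image_sign(omega),
          fourier_transform(lambda t: 1.0, omega, -OMEGA_CUT, OMEGA_CUT), image_box(omega))
```

The printed values illustrate that the image of the odd function is purely imaginary while the image of the even function is real.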
The results are:
Notice that instead of integrals over the entire real line, we employ the integration over the basic period T of the periodic functions in question.
This interpretation allows us to explain the Gibbs phenomenon mentioned in paragraph 7.1.9. The point is that we know well the behaviour of the Dirichlet kernel near the origin. Thus, taking into account that the function $f$ is bounded over the whole period and has all one-sided limits of values and derivatives at each point of discontinuity, the effect of the convolution must be quite local. Consequently, the convolution with the Dirichlet kernel at a jump point $x$ of $f$ behaves the same way as we computed explicitly for the Heaviside function at $x = 0$. There the overshooting by the Fourier sums can be computed explicitly, and this explains the Gibbs effect in general.
We do not provide more details here. Readers may either work them out themselves (as a nontrivial exercise) or look them up in the literature.
7.2.4. Integral operators. In general, integral operators can depend on any number of values and derivatives of the function in its argument. For example, considering a function $F$ depending on $k + 2$ free arguments,
$$L(f)(y) = \int F\big(y, x, f(x), f'(x), \dots, f^{(k)}(x)\big)\,dx.$$
Convolution is one of many examples of a special class of such operators on spaces of functions
$$L(f)(y) = \int f(x)\,k(y, x)\,dx.$$
The function $k(y, x)$, dependent on two variables,
$$k : \mathbb{R}^2 \to \mathbb{R},$$
is called the kernel of the integral operator $L$.
The theory of integral operators is very useful and interesting. We focus only on an extraordinarily important special case, namely the Fourier transform $\mathcal{F}$, which has deep connections with Fourier series.
7.2.5. Fourier transform. Recall that a function $f(t)$, given by its converging Fourier series, equals
$$f(t) = \sum_{n=-\infty}^{\infty} c_n\,e^{i\omega_n t},$$
where the numbers $c_n$ are complex Fourier coefficients, and $\omega_n = n2\pi/T$ with period $T$, see paragraph 7.1.7.
After fixing $T$, the expression $\Delta\omega = 2\pi/T$ describes the change of the frequency caused by $n$ being increased by one. Thus it is just the discrete step by which we change the frequencies when calculating the coefficients of the Fourier series. The coefficient $1/T$ in the formula
$$c_n = \frac{1}{T}\int_{-T/2}^{T/2} f(t)\,e^{-i\omega_n t}\,dt$$
The first two diagrams below show the imaginary values of the Fourier image of the signum function from 7.C.5(a) with $\Omega = 20$ and $\Omega = 50$. The next two diagrams do the same for the characteristic function of the interval $|x| < \Omega$ from 7.C.5(b). The longer the interval with the constant values is, the more the image is concentrated around the origin. We can always use directly the simpler version of the transform for the odd and even functions. If the argument $f$ is odd, then only the sine part of the formula contributes and its Fourier transform is
$$\mathcal{F}(f)(\omega) = \frac{-2i}{\sqrt{2\pi}}\int_0^{\infty} f(t)\sin(\omega t)\,dt.$$
Similarly, for even functions $f$,
$$\mathcal{F}(f)(\omega) = \frac{2}{\sqrt{2\pi}}\int_0^{\infty} f(t)\cos(\omega t)\,dt.$$
In particular, the odd functions have pure imaginary images while the images of the even functions are real. More generally, every real function $f$ decomposes into its even and odd parts, $f = f_{\mathrm{even}} + f_{\mathrm{odd}}$, and the real and imaginary components of the Fourier image $\mathcal{F}(f)$ are the images of these two parts, respectively.
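7.C.6. Discover how the Fourier transform and its inverse behave under the translation $\tau_a$ in the variable,
$$\tau_a f(x) = f(x + a),$$
and the phase shift $\varphi_a$ defined as
$$\varphi_a f(x) = e^{iax} f(x),$$
always with $a \in \mathbb{R}$.
Solution. Evaluate the compositions $\mathcal{F}\circ\tau_a$ and $\mathcal{F}\circ\varphi_a$. This is easy:
$$\mathcal{F}\circ\tau_a f(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t + a)\,e^{-i\omega t}\,dt = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(x)\,e^{-i\omega(x - a)}\,dx = e^{ia\omega}\,\mathcal{F}f(\omega),$$
$$\mathcal{F}\circ\varphi_a f(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t)\,e^{iat}\,e^{-i\omega t}\,dt = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t)\,e^{-i(\omega - a)t}\,dt = \mathcal{F}f(\omega - a).$$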
then equals $\Delta\omega/2\pi$, so the series for $f(t)$ can be rewritten as
$$f(t) = \sum_{n=-\infty}^{\infty}\frac{1}{2\pi}\,\Delta\omega\left(\int_{-T/2}^{T/2} f(x)\,e^{-i\omega_n x}\,dx\right)e^{i\omega_n t}.$$
Now imagine the values $\omega_n$ for all $n \in \mathbb{Z}$ as the chosen representatives for small intervals $[\omega_n, \omega_{n+1}]$ of length $\Delta\omega$. Then, our expression in the big inner parentheses in the previous formula for $f(t)$ describes the summands of the Riemann sums for the improper integral
$$\frac{1}{2\pi}\int_{-\infty}^{\infty} g(\omega)\,e^{i\omega t}\,d\omega,$$
where $g(\omega)$ is a function which takes, at the points $\omega_n$, the values
$$g(\omega_n) = \int_{-T/2}^{T/2} f(x)\,e^{-i\omega_n x}\,dx.$$
We are working with piecewise continuous functions with a compact support, thus our function $f$ is integrable in absolute value over $\mathbb{R}$. Letting $T \to \infty$, the norm $\Delta\omega$ of our subintervals in the Riemann sum decreases to zero. We obtain the integral
$$g(\omega) = \int_{-\infty}^{\infty} f(x)\,e^{-i\omega x}\,dx.$$
The previous reasonings show that there is a large set of Riemann integrable functions $f$ on $\mathbb{R}$ for which we can define a pair of mutually inverse integral operators:
Fourier transform
For every piecewise continuous real or complex function $f$ on $\mathbb{R}$ with compact support, we define
$$\mathcal{F}(f)(\omega) = \tilde f(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t)\,e^{-i\omega t}\,dt.$$
This function $\tilde f$ is called the Fourier transform of the function $f$. The previous ideas show that
$$f(t) = \mathcal{F}^{-1}(\tilde f)(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\tilde f(\omega)\,e^{i\omega t}\,d\omega.$$
This says that the Fourier transform $\mathcal{F}$ just defined has an inverse operation $\mathcal{F}^{-1}$, which is called the inverse Fourier transform.
Notice that both the Fourier transform and its inverse are integral operators with almost identical kernels
$$k(\omega, t) = e^{\pm i\omega t}.$$
Of course, these transforms are meaningful for a much larger class of functions. Interested readers are referred to specialized literature.
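This proves the formulae
$$\mathcal{F}\circ\tau_a = \varphi_a\circ\mathcal{F}, \qquad \mathcal{F}\circ\varphi_a = \tau_{-a}\circ\mathcal{F}.$$
Similarly,
$$\mathcal{F}^{-1}\circ\tau_a = \varphi_{-a}\circ\mathcal{F}^{-1}, \qquad \mathcal{F}^{-1}\circ\varphi_a = \tau_a\circ\mathcal{F}^{-1}.$$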
□
The next problem displays the behaviour of the Fourier transform on the Gaussian function. This is a rare example, where the time and frequency forms are very similar. Again, we see the feature of exchanging the local and global properties in the time and frequency domains.
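7.C.7. Compute the Fourier transform $\mathcal{F}(f)$ of the function
$$f(t) = e^{-at^2}, \qquad t \in \mathbb{R},$$
where $a > 0$ is a fixed parameter.
Solution. The task is to calculate
$$\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-at^2}\,e^{-i\omega t}\,dt.$$
A standard trick is to transform the problem into one of solving a (simple) differential equation.
Differentiating (with respect to $\omega$) and then integrating by parts gives
$$\frac{d}{d\omega}\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}(-it)\,e^{-at^2}\,e^{-i\omega t}\,dt = \frac{i}{\sqrt{2\pi}}\left(\lim_{t\to\infty}\frac{1}{2a}\,e^{-at^2 - i\omega t} - \lim_{t\to-\infty}\frac{1}{2a}\,e^{-at^2 - i\omega t} - \int_{-\infty}^{\infty}\frac{-i\omega}{2a}\,e^{-at^2}\,e^{-i\omega t}\,dt\right) = -\frac{\omega}{2a}\,\mathcal{F}(f)(\omega).$$
Therefore $y(\omega) = \mathcal{F}(f)(\omega)$ satisfies the differential equation
$$\frac{dy}{d\omega} = -\frac{\omega}{2a}\,y, \qquad\text{i.e.}\qquad \frac{1}{y}\,dy = -\frac{\omega}{2a}\,d\omega,$$
unless $y$ equals zero ($y = 0$ is a solution of the equation). Integration yields
$$\ln|y| = -\frac{\omega^2}{4a} + C, \qquad\text{that is,}\qquad y = K\,e^{-\frac{\omega^2}{4a}},$$
where $C$ and $K$ are constants. All solutions (including the zero solution) of the differential equation are given by the function
$$y(\omega) = K\,e^{-\frac{\omega^2}{4a}}, \qquad K \in \mathbb{R}.$$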
7.2.6. Simple properties. The Fourier transform changes the local and global behaviour of functions in an interesting way. We begin with a simple example in which there is a function $f(t)$ which is transformed to the indicator function of the interval $[-\Omega, \Omega]$, i.e., $\tilde f(\omega) = 0$ for $|\omega| > \Omega$, and $\tilde f(\omega) = 1$ for $|\omega| < \Omega$. The inverse transform $\mathcal{F}^{-1}$ gives
$$f(t) = \frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega} e^{i\omega t}\,d\omega = \frac{1}{\sqrt{2\pi}\,it}\Big[e^{i\omega t}\Big]_{-\Omega}^{\Omega} = \frac{2}{\sqrt{2\pi}\,t}\cdot\frac{1}{2i}\big(e^{i\Omega t} - e^{-i\Omega t}\big) = \frac{2\Omega}{\sqrt{2\pi}}\,\frac{\sin(\Omega t)}{\Omega t}.$$
Thus, except for a multiplicative constant and the scaling of the input variable, it is the very important function $f(x) = \operatorname{sinc}(x) = \sin(x)/x$.
Calculation of the limit at zero, by l'Hospital's rule or otherwise, gives $f(0) = 2\Omega(2\pi)^{-1/2}$. The closest zero points are at $t = \pm\pi/\Omega$ and the function drops in value to zero quite rapidly away from the origin $t = 0$. This function is shown in the diagram by a wavy curve for $\Omega = 20$. Simultaneously, the area where our function $f(t)$ keeps waving more rapidly as $\Omega$ increases is also depicted by a curve.
The indicator function of the interval $[-\Omega, \Omega]$ is Fourier-transformed to the function $f$, which takes significant positive values near zero, and the value taken at zero is a fixed multiple of $\Omega$. Therefore, as $\Omega$ increases, the function $f$ concentrates more and more near the origin.
Now we derive the Fourier transform of the derivative $f'(t)$ for a function $f$. We continue to suppose that $f$ has compact support, so that both $\mathcal{F}(f')$ and $\mathcal{F}(f)$ exist. By integration by parts,
$$\mathcal{F}(f')(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f'(t)\,e^{-i\omega t}\,dt = \frac{1}{\sqrt{2\pi}}\Big[f(t)\,e^{-i\omega t}\Big]_{-\infty}^{\infty} + \frac{i\omega}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t)\,e^{-i\omega t}\,dt = i\omega\,\mathcal{F}(f)(\omega).$$
Thus the Fourier transform converts the (limit) operation of differentiation to the (algebraic) operation of a multiplication
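To find $K$, begin with the well known fact (proved in ??)
$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$
to obtain
$$\int_{-\infty}^{\infty} e^{-at^2}\,dt = \frac{1}{\sqrt{a}}\int_{-\infty}^{\infty} e^{-u^2}\,du = \frac{\sqrt{\pi}}{\sqrt{a}}.$$
Therefore, $\mathcal{F}(f)(0) = \frac{1}{\sqrt{2\pi}}\,\frac{\sqrt{\pi}}{\sqrt{a}} = \frac{1}{\sqrt{2a}}$ and $\mathcal{F}(f)(0) = K e^0 = K$. So $K = \frac{1}{\sqrt{2a}}$ and
$$\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2a}}\,e^{-\frac{\omega^2}{4a}}.$$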
□
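The image of the Gaussian function found in 7.C.7 can be confirmed by numerical integration. The sketch below is illustrative: the truncation of the real line at `cutoff`, the midpoint rule, and the function names are our assumptions. It compares the numerical transform with $\frac{1}{\sqrt{2a}}e^{-\omega^2/(4a)}$.

```python
import cmath
import math

def gaussian_image_numeric(a, omega, cutoff=12.0, samples=40000):
    """Midpoint-rule value of (1/sqrt(2*pi)) * int exp(-a t^2) exp(-i*omega*t) dt over [-cutoff, cutoff]."""
    h = 2 * cutoff / samples
    total = 0j
    for j in range(samples):
        t = -cutoff + (j + 0.5) * h
        total += math.exp(-a * t * t) * cmath.exp(-1j * omega * t)
    return total * h / math.sqrt(2 * math.pi)

def gaussian_image_formula(a, omega):
    """Result of 7.C.7: exp(-omega^2 / (4a)) / sqrt(2a)."""
    return math.exp(-omega * omega / (4 * a)) / math.sqrt(2 * a)

for a in (0.5, 1.0, 3.0):
    for omega in (0.0, 1.0, 2.5):
        print(a, omega, gaussian_image_numeric(a, omega), gaussian_image_formula(a, omega))
```

For $a = 1/2$ the function and its image coincide, which is the sense in which the time and frequency forms are very similar.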
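7.C.8. Determine the Fourier transform image of the Gaussian function
$$f(t) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(t - \mu)^2}{2\sigma^2}}.$$
Solution. Use the result of the previous problem with $a = \frac{1}{2\sigma^2}$ and the composition with the variable shift from the last but one problem (here the shift by $-\mu$). It follows that
$$\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2\pi}}\,e^{-i\mu\omega}\,e^{-\frac{\sigma^2\omega^2}{2}}.$$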
□
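As mentioned, the most typical use of the Fourier transform is to analyse the frequencies in a signal. The next problem reveals the reason. For technical reasons we cut the signal by multiplication with the characteristic function $h_\Omega$ of the interval $(-\Omega, \Omega)$.
7.C.9. Find the Fourier transform of the functions
$$f(t) = h_\Omega(t)\cos(nt), \qquad g(t) = h_\Omega(t)\sin(nt).$$
Solution. By definition
$$\mathcal{F}(f)(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\cos(nt)\,e^{-i\omega t}\,dt = \frac{1}{2\sqrt{2\pi}}\int_{-\Omega}^{\Omega}\big(e^{i(n-\omega)t} + e^{-i(n+\omega)t}\big)\,dt$$
$$= \frac{1}{2\sqrt{2\pi}}\left[\frac{1}{i(n - \omega)}\,e^{i(n-\omega)t}\right]_{-\Omega}^{\Omega} + \frac{1}{2\sqrt{2\pi}}\left[\frac{-1}{i(n + \omega)}\,e^{-i(n+\omega)t}\right]_{-\Omega}^{\Omega} = \frac{\Omega}{\sqrt{2\pi}}\big(\operatorname{sinc}((n - \omega)\Omega) + \operatorname{sinc}((n + \omega)\Omega)\big).$$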
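by the variable. Of course, this procedure can be iterated, to obtain $\mathcal{F}(f'')(\omega) = -\omega^2\,\mathcal{F}(f)(\omega)$, ..., $\mathcal{F}(f^{(n)})(\omega) = i^n\omega^n\,\mathcal{F}(f)(\omega)$.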
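7.2.7. The relation to convolutions. There is another extremely important property to consider, namely the relation between convolutions and Fourier transforms. Calculate the transform of the convolution $h = f * g$, where, as usual, the functions are assumed to have compact support. Recall that we may change the order of integration, see 6.3.13. Then we change variable by the substitution $t - x = u$. The result is
$$\mathcal{F}(h)(\omega) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} f(x)\,g(t - x)\,dx\right)e^{-i\omega t}\,dt = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(x)\left(\int_{-\infty}^{\infty} g(t - x)\,e^{-i\omega t}\,dt\right)dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(x)\left(\int_{-\infty}^{\infty} g(u)\,e^{-i\omega(u + x)}\,du\right)dx = \frac{1}{\sqrt{2\pi}}\left(\int_{-\infty}^{\infty} f(x)\,e^{-i\omega x}\,dx\right)\left(\int_{-\infty}^{\infty} g(u)\,e^{-i\omega u}\,du\right) = \sqrt{2\pi}\,\mathcal{F}(f)(\omega)\,\mathcal{F}(g)(\omega).$$
A similar calculation shows that the Fourier transform of a product is the convolution of the transforms, up to a multiplicative constant. In fact,
$$\mathcal{F}(f\cdot g) = \frac{1}{\sqrt{2\pi}}\,\mathcal{F}(f) * \mathcal{F}(g).$$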
As we mentioned above, the convolution f * g often models the process of the observation of some quantity /. Using the Fourier transform and its inverse, the original values of this quantity are easily recognised if the convolution kernel g is known. We just calculate T(f * g) and divide it by the image T(g). This yields the Fourier transform of the original function /, which can be obtained explicitly using the inverse Fourier transform. This is sometimes called deconvolution.
In real applications, the procedure often cannot be that straightforward since the Fourier image of the known convolution kernel might have zero values, and then we can hardly divide by it as above. For example, take the convolution kernel $\operatorname{sinc}(t)$, whose image is an indicator function of some finite interval. So we need some more cunning techniques and there is a vast literature on them.
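The relation $\mathcal{F}(f * g) = \sqrt{2\pi}\,\mathcal{F}(f)\,\mathcal{F}(g)$, on which the division step of deconvolution relies, can be tested on a simple example. The sketch below is an illustration under our own assumptions: both $f$ and $g$ are taken to be the indicator function of $[-1, 1]$, whose convolution is the triangle function $\max(0, 2 - |t|)$, and the function names are hypothetical.

```python
import math

def box_image(omega):
    """Fourier image of the indicator function of [-1, 1]: sqrt(2/pi) * sin(omega) / omega."""
    return math.sqrt(2 / math.pi) * math.sin(omega) / omega

def triangle(t):
    """Convolution of the indicator function of [-1, 1] with itself."""
    return max(0.0, 2 - abs(t))

def triangle_image(omega, samples=20000):
    """Midpoint-rule Fourier image of the triangle over its support [-2, 2].

    The triangle is even, so only the cosine part of the kernel contributes.
    """
    h = 4.0 / samples
    total = sum(triangle(t) * math.cos(omega * t)
                for t in (-2 + (j + 0.5) * h for j in range(samples)))
    return total * h / math.sqrt(2 * math.pi)

for omega in (0.4, 1.0, 2.7):
    print(omega, triangle_image(omega), math.sqrt(2 * math.pi) * box_image(omega) ** 2)
```

The same example shows the difficulty mentioned above: `box_image` vanishes at the nonzero multiples of $\pi$, so dividing by it there is impossible.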
7.2.8. The $L_2$-norm. As an illustration of the power of our simple results, look at the behaviour of the Fourier transform with respect to the $L_2$-norm. We write $\bar g$ for the function $\bar g(t) = \overline{g(-t)}$ and notice that
$$(f * \bar g)(t) = \int_{-\infty}^{\infty} f(x)\,\bar g(t - x)\,dx = \int_{-\infty}^{\infty} f(x)\,\overline{g(x - t)}\,dx.$$
In particular, the scalar product is given by the formula
$$\langle f, g\rangle = (f * \bar g)(0).$$
The same computation leads to the image of the sine signal, the only difference is one minus sign and an additional $i$ in the formula:
$$\mathcal{F}(g)(\omega) = i\,\frac{\Omega}{\sqrt{2\pi}}\big(-\operatorname{sinc}((n - \omega)\Omega) + \operatorname{sinc}((n + \omega)\Omega)\big).$$
□
7.C.10. Find the Fourier transform of the superposition of the $\cos(nt)$ signals over the interval $(-\Omega, \Omega)$,
$$
f(t) = h_\Omega(t)\bigl(3\cos(5t) + \cos(15t)\bigr).
$$
What happens if $\Omega \to \infty$?
Solution. The Fourier transform is linear, thus we simply add the corresponding images from the previous problem with $n = 5$ and $n = 15$, multiplied by the proper coefficients. The image is illustrated for $\Omega = 20$.
[Figure: the Fourier image of $f$ for $\Omega = 20$.]
Each of the peaks behaves like the Fourier image of the characteristic function $h_\Omega$, shifted to the corresponding frequency.
If $\Omega$ increases to infinity, the image $\mathcal F(f)$ still has four peaks at the same positions, corresponding to the frequencies $\pm 5$ and $\pm 15$. But they become narrower and sharper. In the limit, this is no longer a function, since the width of the peaks becomes zero. This is usually written
$$
\mathcal F(\cos(nt))(\omega) = \sqrt{\frac{\pi}{2}}\,\bigl(\delta(n-\omega) + \delta(n+\omega)\bigr)
$$
with the special case
$$
\mathcal F(1)(\omega) = \sqrt{2\pi}\,\delta(\omega).
$$
See 7.2.9 for comments on the Dirac delta function. □
7.C.11. Find the Fourier transform image of the convolution of the signals $f(t)$ and $g(t)$ from Problem 7.C.1. Recall that $f(t) = \sin(t) + 0.4\,[\sin(6t)]^2 - 0.2\sin(60t)$ and $g$ is the characteristic function of the interval $(-\varepsilon, \varepsilon)$. Assume that the signal is nonzero only in the interval $(-\Omega, \Omega)$.
Solution. Once we note that $\mathcal F(f * g) = \sqrt{2\pi}\,\mathcal F(f)\,\mathcal F(g)$, we have all the ingredients ready. Indeed, in 7.C.5 and in the last two problems above, we already computed the Fourier
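Now, with $\tilde g(t) = \overline{g(-t)}$ as above, the definition of the Fourier transform yields
$$
\langle f, g\rangle = \mathcal F^{-1}\bigl(\mathcal F(f * \tilde g)\bigr)(0)
= \frac{1}{\sqrt{2\pi}} \left[\int_{-\infty}^{\infty} \mathcal F(f * \tilde g)(\omega)\,e^{ix\omega}\,d\omega\right]_{x=0}
= \int_{-\infty}^{\infty} \mathcal F(f)(\omega)\,\mathcal F(\tilde g)(\omega)\,d\omega,
$$
while
$$
\mathcal F(\tilde g)(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \overline{g(-x)}\,e^{-i\omega x}\,dx = \overline{\mathcal F(g)(\omega)}.
$$
Consequently, $\langle f, g\rangle = \langle \mathcal F(f), \mathcal F(g)\rangle$. Thus, we have verified that the Fourier transform preserves the scalar product and so it also preserves the $L_2$-norm.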
This also explains our choice of the constants in the definition.
7.2.9. Dirac delta-function. We return to the first example of the inverse transform to the indicator function $f_\Omega$ of the interval $[-\Omega, \Omega]$. Let $\Omega$ approach infinity and denote by $\sqrt{2\pi}\,\delta(t)$ the desired "limit function" for $\mathcal F^{-1}(f_\Omega)(t)$. The inverse image of a product with an arbitrary image $\mathcal F(g)$ can be expressed using convolution:
$$
\mathcal F^{-1}\bigl(f_\Omega \cdot \mathcal F(g)\bigr)(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g(t)\,\mathcal F^{-1}(f_\Omega)(z - t)\,dt.
$$
As $\Omega$ increases to $\infty$, the left-hand expression should approach $\mathcal F^{-1}(\mathcal F(g))(z) = g(z)$, while on the right-hand side, we get
$$
g(z) = \int_{-\infty}^{\infty} g(t)\,\delta(z-t)\,dt.
$$
The desired $\delta(t)$ thus looks like a "function" which is zero everywhere except at the single point $t = 0$, where it "has an infinite value". Integrating the product of $\delta(t)$ with any integrable function $g$ gives just the value of $g$ at the point $t = 0$. Of course, this is strictly not a function at all. Nevertheless it is a useful concept. It is called the Dirac function $\delta$ and it can be described correctly as an example of what is known as a distribution. Since we do not have enough space and time, we do not pay further attention to distributions. We mention only that the Dirac $\delta$ can be imagined as a unit impulse at a single point. In fact, we saw similar concepts under the name "measure" when dealing with the Riemann-Stieltjes and Henstock-Kurzweil integrals, cf. 6.3.19, 6.3.15, and we shall come back to them in Chapter 10 in the context of probability. In this sense, the Dirac function is the (probability) measure concentrated in the origin and it can be realized by the Riemann-Stieltjes integral with the piecewise constant function $g$ with the single unit jump in the origin. Its Fourier transform is the constant function $\mathcal F(\delta)(\omega) = \frac{1}{\sqrt{2\pi}}$.
On the other hand, many functions which are not strictly integrable on $\mathbb R$ are Fourier-transformed to expressions with the Dirac $\delta$. For instance,
$$
\mathcal F(\cos(nt))(\omega) = \sqrt{\frac{\pi}{2}}\,\bigl(\delta(n - \omega) + \delta(n + \omega)\bigr),
$$
image of g and of the sine and cosine functions on the interval
Instead of writing the explicit formulae for the result, we display illustrations of the real components of $\mathcal F(f)$ and $\mathcal F(f * g)$ in the first line, and similarly the imaginary components in the second line, all with $\Omega = 5$, $\varepsilon = 1/10$.
The reader should compare the diagrams of $f$ and $f * g$ in 7.C.1 to see that the higher frequencies in $f$ are effectively canceled by this convolution, as expected. □
As discussed in 7.2.5, the Fourier transform has an inverse operation. This means that no information is lost when changing from the time behaviour of a signal to its frequency behaviour. This allows us to use the Fourier transform for the solution of functional equations involving differentiation or integration. We stay with elementary observations only and return to differential equations in one and more variables in the following chapters.
7.C.12. By using the inverse Fourier transform solve the integral equation
$$
\int_0^{\infty} f(t)\sin(xt)\,dt = e^{-x}, \qquad x > 0,
$$
for an unknown function $f$.
Solution. Multiply both sides of the equation by $\sqrt{2/\pi}$ to obtain the sine Fourier transform on the left-hand side. Apply the inverse transform to both sides of the equation to get
$$
f(t) = \frac{2}{\pi} \int_0^{\infty} e^{-x}\sin(xt)\,dx, \qquad t > 0.
$$
which can be seen from the calculation of the Fourier transform of the function $f_\Omega \cos(nx)$ and then letting $\Omega$ approach $\infty$, see the solution to problem 7.C.10.
We can obtain the Fourier transform of the sine function in a similar way. We can take advantage of the fact that the transform of the derivative of a function differs from the transform of the function itself only by a multiple of the imaginary unit and the new variable. Alternatively, we can also use the fact that the sine function is obtained from the cosine function by the phase shift of $\pi/2$.
These transforms are a basis for Fourier analysis of signals (see also problem 7.C.9): If a signal is a pure sinusoid of a given frequency, then this is recognized in the Fourier transform as two single-point impulses exactly at the positive and negative value of the frequency. If the signal is a linear combination of several such pure signals, then we obtain the same linear combination of single-point impulses. However, since we always process a signal in a finite time interval only, we do not get single-point impulses, but rather a wavy curve similar to the function sinc, with a strong maximum at the value of the corresponding frequency. The size of this maximum also yields information about the amplitude of the original signal.
Another good way to approximate the Dirac delta function is to exploit the Gaussian functions. As seen in the solution to problem 7.C.7, the Fourier image of the Gaussian function
$$
f(t) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{t^2}{2\sigma^2}}
$$
is again Gaussian, corresponding to the reciprocal value of $\sigma$. In the limit $\sigma \to 0$, the image converges fast to a multiple of the constant function, see the illustrations with $\sigma = 3$ and $\sigma = 1/10$.
Notice that the rather large $\sigma$ in the first illustration corresponds to a wide Gaussian, while the image is the slim one. The second illustration provides the opposite case. The preimage is the narrow Gaussian and the image is already reasonably close to the constant function. The Gaussians are chosen with $L_1$-norm equal to one, but the Fourier transform preserves the $L_2$-norm of the functions.
7.2.10. Fourier sine and cosine transform. If we apply the Fourier transform to an odd function $f(t)$, where $f(-t) = -f(t)$, the contribution in the integration of the product of
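Integrating by parts twice shows that
$$
\int e^{-x}\sin(xt)\,dx = \frac{e^{-x}}{1+t^2}\bigl[-\sin(xt) - t\cos(xt)\bigr] + C.
$$
Hence
$$
\int_0^{\infty} e^{-x}\sin(xt)\,dx = \lim_{x\to\infty} \frac{e^{-x}}{1+t^2}\bigl[-\sin(xt) - t\cos(xt)\bigr] - \frac{e^{0}}{1+t^2}\,(-t) = \frac{t}{1+t^2}.
$$
So
$$
f(t) = \frac{2}{\pi}\,\frac{t}{1+t^2}, \qquad t > 0.
$$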
□
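The improper integral used in the solution of 7.C.12 is easy to confirm numerically. The sketch below truncates the integral at a finite cutoff and applies Simpson's rule; the cutoff, the step count and the helper names `simpson`, `sine_integral` and `f` are illustrative assumptions.

```python
import math


def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3


def sine_integral(t, cutoff=40.0, n=20000):
    """Approximate int_0^infinity e^{-x} sin(x t) dx; the tail beyond cutoff is negligible."""
    return simpson(lambda x: math.exp(-x) * math.sin(x * t), 0.0, cutoff, n)


def f(t):
    """The solution of the integral equation found above."""
    return 2 / math.pi * t / (1 + t * t)


if __name__ == "__main__":
    for t in (0.5, 1.0, 3.0):
        print(t, sine_integral(t), t / (1 + t * t), f(t))
```

For each `t`, the numerical integral should match $t/(1+t^2)$ closely, so that $f(t)$ is $2/\pi$ times that value.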
7.C.13. Use the Fourier transform to find solutions of the non-homogeneous linear differential equation
(1) $y' = ay + f$
where $a \in \mathbb R$ is a non-zero constant and $f$ is a known function. Can all solutions be obtained in this way?
Solution. The key observation for this problem is the relation between the Fourier transform and the derivative
$$
\mathcal F(f')(\omega) = i\omega\,\mathcal F(f)(\omega),
$$
see 7.2.6. Thus, if the Fourier transform is applied to the equation (1), we get the algebraic equation for $\hat y = \mathcal F(y)$
$$
i\omega\,\hat y = a\,\hat y + \hat f.
$$
If it is assumed that $\mathcal F(f) = \hat f$ exists and there is a solution $y$ with the Fourier image $\hat y$, then
$$
\hat y = \frac{1}{i\omega - a}\,\hat f
$$
and using the general relation $\mathcal F^{-1}(g \cdot h) = \frac{1}{\sqrt{2\pi}}\,\mathcal F^{-1}(g) * \mathcal F^{-1}(h)$ between the product and convolutions from 7.2.7 we arrive at the final formula
$$
y = \frac{1}{\sqrt{2\pi}}\,\mathcal F^{-1}\!\left(\frac{1}{i\omega - a}\right) * f.
$$
So it is necessary to compute the inverse Fourier transform of the simple rational function $(i\omega - a)^{-1}$ in two steps.
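We guess the preimages and verify them directly. Assume first $a < 0$ and evaluate
$$
\int_0^{\infty} e^{at}\,e^{-i\omega t}\,dt = \left[\frac{e^{(a - i\omega)t}}{a - i\omega}\right]_0^{\infty} = \frac{1}{i\omega - a}.
$$
Similarly, for $a > 0$,
$$
\int_{-\infty}^{0} e^{at}\,e^{-i\omega t}\,dt = \left[\frac{e^{(a - i\omega)t}}{a - i\omega}\right]_{-\infty}^{0} = \frac{1}{a - i\omega}.
$$
This provides the two desired results. Indeed, if the equation (1) comes with $a > 0$, we rewrite our rational function as $-(a - i\omega)^{-1}$. Next, the function equal to $-\sqrt{2\pi}\,e^{at}$ for negative $t$ and to zero for positive $t$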
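$f(t)$ and the function $\cos(\pm\omega t)$ cancels for positive and negative values of $t$. Thus if $f$ is odd, then
$$
\mathcal F(f)(\omega) = \frac{-2i}{\sqrt{2\pi}} \int_0^{\infty} f(t)\sin(\omega t)\,dt.
$$
The resulting function is odd again, hence for the same reason, the inverse transform can be determined similarly:
$$
f(t) = \frac{2i}{\sqrt{2\pi}} \int_0^{\infty} \mathcal F(f)(\omega)\sin(\omega t)\,d\omega.
$$
Omitting the imaginary unit $i$ gives mutually inverse transforms, which are called the Fourier sine transform for odd functions:
$$
f_s(\omega) = \sqrt{\frac{2}{\pi}} \int_0^{\infty} f(t)\sin(\omega t)\,dt, \qquad
f(t) = \sqrt{\frac{2}{\pi}} \int_0^{\infty} f_s(\omega)\sin(\omega t)\,d\omega.
$$
Similarly, we can define the Fourier cosine transform for even functions:
$$
f_c(\omega) = \sqrt{\frac{2}{\pi}} \int_0^{\infty} f(t)\cos(\omega t)\,dt, \qquad
f(t) = \sqrt{\frac{2}{\pi}} \int_0^{\infty} f_c(\omega)\cos(\omega t)\,d\omega.
$$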
7.2.11. Laplace transforms. The Fourier transform cannot be applied to functions which are not integrable in absolute value over $\mathbb R$ (at least, we do not obtain true functions). The Laplace transform is similar to the Fourier transform:
$$
\mathcal L(f)(s) = \int_0^{\infty} f(t)\,e^{-st}\,dt.
$$
The integral operator $\mathcal L$ has a rapidly decaying kernel if $s$ is a positive real number. Therefore, the Laplace transform is usually perceived as a mapping of suitable functions on the interval $[0, \infty)$ to functions on the same or a shorter interval. The image $\mathcal L(p)$ exists, for example, for every polynomial $p(t)$ and all positive numbers $s$.
Analogously to the Fourier transform, we obtain the formula for the Laplace transform of a differentiated function for $s > 0$ by using integration by parts:
$$
\begin{aligned}
\mathcal L(f'(t))(s) &= \int_0^{\infty} f'(t)\,e^{-st}\,dt \\
&= \bigl[f(t)\,e^{-st}\bigr]_0^{\infty} + s\int_0^{\infty} f(t)\,e^{-st}\,dt \\
&= -f(0) + s\,\mathcal L(f)(s).
\end{aligned}
$$
The properties of the Laplace transform and many other transforms used in technical practice can be found in specialized literature. We provide a few examples in the other column starting with 7.D.1.
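Both the definition and the derivative rule can be tried out numerically. The following sketch approximates the Laplace integral by Simpson's rule on a truncated interval; the cutoff, the step count and the sample function $\cos(2t)$ are illustrative assumptions, and the helper names are hypothetical.

```python
import math


def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3


def laplace(func, s, cutoff=60.0, n=60000):
    """Approximate L(func)(s) = int_0^infinity func(t) e^{-s t} dt for s > 0."""
    return simpson(lambda t: func(t) * math.exp(-s * t), 0.0, cutoff, n)


def derivative_rule_gap(s):
    """Difference between L(f')(s) and -f(0) + s L(f)(s) for f(t) = cos(2t)."""
    f = lambda t: math.cos(2 * t)
    df = lambda t: -2 * math.sin(2 * t)
    return laplace(df, s) - (-f(0.0) + s * laplace(f, s))


if __name__ == "__main__":
    s = 1.5
    # A polynomial has a Laplace image: L(t^3)(s) = 3!/s^4.
    print(laplace(lambda t: t ** 3, s), 6 / s ** 4)
    print(derivative_rule_gap(s))
```

The first line printed compares the numerical image of $t^3$ with $3!/s^4$, and the second value should be close to zero.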
provides the requested Fourier image. Immediately it is seen that, for $a > 0$, the convolution
$$
y(t) = -\int_{-\infty}^{0} e^{ax} f(t-x)\,dx
$$
is a solution. (The multiples $\sqrt{2\pi}$ in the expression with the convolution cancel.) Similarly, if $a < 0$ then
$$
y(t) = \int_0^{\infty} e^{ax} f(t-x)\,dx
$$
is a solution.
Not all solutions can be obtained in this way. For example, $y' = y$ leads to $y(t) = C e^{t}$ with an arbitrary constant $C$, but this is not a function with a Fourier image. With $f(t) = 0$, our procedure produces the zero function, which is just one of the solutions. Similarly, if we deal with the equation $y' = y + t$, then the particular solution suggested above is
$$
y(t) = -\int_{-\infty}^{0} e^{x}(t-x)\,dx = -t - 1.
$$
□
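The convolution formula for the case $a < 0$, namely $y(t) = \int_0^\infty e^{ax} f(t-x)\,dx$, can be tested numerically by comparing a finite-difference derivative of $y$ with $ay + f$. This is only a sketch: the right-hand side $f(t) = e^{-t^2}$, the value $a = -1$, the cutoff and the step sizes are illustrative assumptions.

```python
import math


def simpson(func, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3


def solution(f, a, t, cutoff=40.0, n=8000):
    """y(t) = int_0^infinity e^{a x} f(t - x) dx for a < 0, truncated at cutoff."""
    return simpson(lambda x: math.exp(a * x) * f(t - x), 0.0, cutoff, n)


def residual(f, a, t, h=1e-3):
    """y'(t) - (a y(t) + f(t)), with y' replaced by a central difference."""
    dy = (solution(f, a, t + h) - solution(f, a, t - h)) / (2 * h)
    return dy - (a * solution(f, a, t) + f(t))


def bump(t):
    """Illustrative right-hand side f (an assumption, not from the text)."""
    return math.exp(-t * t)


if __name__ == "__main__":
    for t in (-0.5, 0.3, 1.0):
        print(t, residual(bump, -1.0, t))
```

The residuals should be small, limited mainly by the error of the central difference.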
7.C.14. Check directly that the two functions y(t) found above are indeed solutions to the equation y' = ay + f.
o
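7.C.15. As in the previous problem, solve the second order equation
$$
y'' = ay + f.
$$
Solution. Use the fact that $\mathcal F(y'')(\omega) = -\omega^2\,\mathcal F(y)(\omega)$ and deduce the algebraic relation $-\omega^2\hat y = a\hat y + \hat f$ for the Fourier images $\hat y$ and $\hat f$. Hence
$$
\hat y = \frac{-1}{\omega^2 + a}\,\hat f.
$$
In order to guess the correct preimage of the rational function in question, first assume $a > 0$ and compute
$$
\int_{-\infty}^{\infty} e^{-\sqrt{a}\,|t|}\,e^{-i\omega t}\,dt = \frac{1}{\sqrt{a} - i\omega} + \frac{1}{\sqrt{a} + i\omega} = \frac{2\sqrt{a}}{a + \omega^2}.
$$
Thus it is verified that
$$
\mathcal F\bigl(e^{-\sqrt{a}\,|t|}\bigr)(\omega) = \frac{1}{\sqrt{2\pi}}\,\frac{2\sqrt{a}}{a + \omega^2}.
$$
Immediately (the factors $\sqrt{2\pi}$ cancel)
$$
y(t) = -\frac{1}{2\sqrt{a}}\,e^{-\sqrt{a}\,|t|} * f(t) = -\frac{1}{2\sqrt{a}} \int_{-\infty}^{\infty} e^{-\sqrt{a}\,|x|} f(t-x)\,dx.
$$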
7.2.12. Discrete transforms. The Fourier analysis of signals mentioned in the previous paragraph is realized by special analog circuits in, for example, radio technology. Nowadays, we work only with discrete data when processing signals by computer circuits. Assume that there is a fixed (small) sampling interval $\tau$ given in a (discrete) time variable and that, for a large natural number $N$, the signal repeats with period $N\tau$, which is the maximal period which can be represented in our discrete model. We should not be surprised that our continuous models allow for a discrete analogy. Consider an $N$-dimensional vector, which can be imagined as the function $r \mapsto f(r) \in \mathbb C$, for $r = 0, 1, \dots, N-1$.
Denote $\Delta\omega = \frac{2\pi}{N}$ and $\omega_k = k\,\Delta\omega$. The simplest discrete approximation of the Fourier transform integral suggests that
$$
\hat f(k) = \sum_{r=0}^{N-1} f(r)\,e^{-i\omega_k r}
$$
should be a promising transformation $f \mapsto \hat f$, whose inverse should not be far from
$$
\check f(k) = \frac{1}{N}\sum_{r=0}^{N-1} f(r)\,e^{i\omega_k r}.
$$
Actually, these are already mutually inverse transformations:
Theorem. The transformations above satisfy $\check{\hat f}(k) = f(k)$ for all $k = 0, 1, \dots, N-1$.
Proof. Let
$$
T = \sum_{r=0}^{N-1} e^{i\frac{2\pi}{N}kr}.
$$
Then
$$
e^{i\frac{2\pi}{N}k}\,T = \sum_{r=1}^{N} e^{i\frac{2\pi}{N}kr},
$$
and so by subtraction,
$$
\bigl(1 - e^{i\frac{2\pi}{N}k}\bigr)\,T = 1 - e^{i2\pi k}.
$$
The right hand side is 0 for all integers $k$. On the left side, the coefficient of $T$ is not zero unless $k$ is a multiple of $N$. Hence
$$
\sum_{r=0}^{N-1} e^{i\frac{2\pi}{N}kr} =
\begin{cases}
N & \text{if } k \text{ is a multiple of } N,\\
0 & \text{otherwise.}
\end{cases}
$$
With $k$ and $s$ both confined to the range $\{0, 1, 2, \dots, N-1\}$, $k - s$ can only be a multiple of $N$ when $k = s$. It follows that for such $k$ and $s$
$$
\sum_{r=0}^{N-1} e^{i\frac{2\pi}{N}r(k-s)} = \delta_{ks}\,N,
$$
where the Kronecker delta $\delta_{ks} = 0$ for $k \neq s$ and $\delta_{ks} = 1$ if $k = s$.
Finally, we compute:
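The case $a < 0$ is a little more complicated. But we may ask Maple or look up in the literature that the function
$$
g(t) = \sin(b|t|)
$$
has the Fourier image
$$
\mathcal F(g)(\omega) = \frac{1}{\sqrt{2\pi}}\,\frac{2b}{b^2 - \omega^2}.
$$
We are nearly finished. The required preimage is
$$
h(t) = \frac{\sqrt{2\pi}}{2\sqrt{-a}}\,\sin\bigl(\sqrt{-a}\,|t|\bigr)
$$
and the resulting convolution is
$$
y(t) = \frac{1}{2\sqrt{-a}} \int_{-\infty}^{\infty} \sin\bigl(\sqrt{-a}\,|x|\bigr)\,f(t-x)\,dx.
$$
If we rewrite the equation as $y'' + b^2 y = f$ with $b > 0$, the result says
$$
y(t) = \frac{1}{2b} \int_{-\infty}^{\infty} \sin(b|x|)\,f(t-x)\,dx.
$$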
□
7.C.16. Check directly that the two functions y(t) found above are indeed solutions to the equation y" = ay + f.
o
D. Laplace Transform
The Laplace transform is another integral transform which interchanges differentiation and algebraic multiplication. As with the Fourier transform, this is based on the properties of the exponential function, but this time we take the real exponential, see 7.2.11 for the formula. One advantage is that every polynomial has its Laplace image.
7.D.1. Determine the Laplace transform $\mathcal L(f)(s)$ for each of the functions
(a) $f(t) = e^{at}$;
(b) $f(t) = c_1 e^{a_1 t} + c_2 e^{a_2 t}$;
(c) $f(t) = \cos(bt)$;
(d) $f(t) = \sin(bt)$;
(e) $f(t) = \cosh(bt)$;
(f) $f(t) = \sinh(bt)$;
(g) $f(t) = t^k$, $k \in \mathbb N$,
where the constants $b \in \mathbb R$ and $a, a_1, a_2, c_1, c_2 \in \mathbb C$ are arbitrary. It is assumed that the positive number $s \in \mathbb R$ is greater than the real parts of the numbers $a, a_1, a_2 \in \mathbb C$, and is greater than $b$ in the problems (e) and (f).
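$$
\begin{aligned}
\check{\hat f}(k) &= \frac{1}{N}\sum_{r=0}^{N-1}\Bigl(\sum_{s=0}^{N-1} f(s)\,e^{-i\frac{2\pi}{N}rs}\Bigr)e^{i\frac{2\pi}{N}rk} \\
&= \frac{1}{N}\sum_{s=0}^{N-1} f(s)\Bigl(\sum_{r=0}^{N-1} e^{i\frac{2\pi}{N}r(k-s)}\Bigr) \\
&= \frac{1}{N}\sum_{s=0}^{N-1} f(s)\,\delta_{ks}\,N = f(k).
\end{aligned}
$$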
□
The computations in the proof also verify that the Fourier image of a periodic complex valued function whose period is one of the chosen sampling periods is nonzero only at the corresponding frequency, where it gives the amplitude of the function (multiplied by $N$). Thus, if the signal has been created as a superposition of periodic signals with the sampling frequencies only, we obtain the absolutely optimal result. However, if the transformed signal has a frequency not exactly available among the sampling frequencies, there are nonzero amplitudes at all the sampling frequencies in the Fourier image. This is called frequency leaking in the technical literature. There is a vast amount of literature devoted to fast implementation and exploitation of the discrete Fourier transform, as well as other similar discrete tools. This is an extremely active area of current research.
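The discrete transform, its inverse and the leaking effect are easy to try out. The sketch below assumes the normalization in which the forward sum carries no factor and the inverse carries the factor $1/N$; the function names and the test signals are illustrative. It uses the direct $O(N^2)$ sums, not a fast algorithm.

```python
import cmath


def dft(f):
    """hat f(k) = sum_r f(r) e^{-2 pi i k r / N}, computed directly."""
    N = len(f)
    return [sum(f[r] * cmath.exp(-2j * cmath.pi * k * r / N) for r in range(N))
            for k in range(N)]


def inverse_dft(F):
    """f(k) = (1/N) sum_r F(r) e^{2 pi i k r / N}."""
    N = len(F)
    return [sum(F[r] * cmath.exp(2j * cmath.pi * k * r / N) for r in range(N)) / N
            for k in range(N)]


def pure_tone(N, freq):
    """Samples of e^{2 pi i freq r / N}; freq need not be an integer."""
    return [cmath.exp(2j * cmath.pi * freq * r / N) for r in range(N)]


if __name__ == "__main__":
    N = 16
    # A sampling frequency: a single nonzero coefficient (equal to N) at k = 3.
    print([round(abs(c), 6) for c in dft(pure_tone(N, 3))])
    # A frequency between two sampling frequencies: every coefficient is nonzero.
    print([round(abs(c), 6) for c in dft(pure_tone(N, 3.5))])
```

For the tone with frequency 3.5, the geometric sum from the proof shows that every coefficient has absolute value at least 1, which is the leaking described above.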
3. Metric spaces
At the end of the chapter, we will focus on the concepts of distance and convergence in a more abstract way. This also provides the conceptual background for some of the already derived properties of Fourier series and Fourier transform. We need these concepts in miscellaneous contexts later.
It is hoped that the subsequent pages are a very useful (and hopefully manageable) trip into the world of mathematics for the competent or courageous!
7.3.1. Metrics and norms. When we discussed Fourier series, the distance between functions on a space of functions was commonly referred to. Now we examine the concept of distance more thoroughly.
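Solution. The case (a). It follows directly from the definition of the Laplace transform that
$$
\mathcal L(f)(s) = \int_0^{\infty} e^{at}e^{-st}\,dt = \int_0^{\infty} e^{-(s-a)t}\,dt = \lim_{R\to\infty}\left[\frac{e^{-(s-a)t}}{-(s-a)}\right]_0^{R} = \frac{1}{s-a}.
$$
The case (b). Using the result of the above case and the linearity of improper integrals, we obtain
$$
\mathcal L(f)(s) = c_1\int_0^{\infty} e^{a_1 t}e^{-st}\,dt + c_2\int_0^{\infty} e^{a_2 t}e^{-st}\,dt = \frac{c_1}{s-a_1} + \frac{c_2}{s-a_2}.
$$
The case (c). Since
$$
\cos(bt) = \tfrac12\bigl(e^{ibt} + e^{-ibt}\bigr),
$$
the choice $c_1 = 1/2 = c_2$, $a_1 = ib$, $a_2 = -ib$ in the previous case gives
$$
\mathcal L(f)(s) = \int_0^{\infty}\bigl(\tfrac12 e^{ibt} + \tfrac12 e^{-ibt}\bigr)e^{-st}\,dt = \frac{1}{2(s-ib)} + \frac{1}{2(s+ib)} = \frac{s}{s^2+b^2}.
$$
The cases (d), (e), (f). Analogously, the choices
(d) $c_1 = -i/2$, $c_2 = i/2$, $a_1 = ib$, $a_2 = -ib$;
(e) $c_1 = 1/2 = c_2$, $a_1 = b$, $a_2 = -b$;
(f) $c_1 = 1/2$, $c_2 = -1/2$, $a_1 = b$, $a_2 = -b$
lead to
(d) $\mathcal L(f)(s) = \frac{b}{s^2+b^2}$;
(e) $\mathcal L(f)(s) = \frac{s}{s^2-b^2}$;
(f) $\mathcal L(f)(s) = \frac{b}{s^2-b^2}$.
Finally, the last one is obtained by a straightforward repetition of integration by parts:
$$
\mathcal L(t^k)(s) = \int_0^{\infty} t^k e^{-st}\,dt = \frac{k}{s}\int_0^{\infty} t^{k-1}e^{-st}\,dt = \dots = \frac{k!}{s^k}\,\mathcal L(t^0)(s) = \frac{k!}{s^{k+1}}.
$$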
Axioms of a metric and a norm
A set $X$ together with a mapping $d : X \times X \to \mathbb R$ such that for all $x, y, z \in X$, the following conditions are satisfied
(1) $d(x, y) \ge 0$; and $d(x, y) = 0$ if and only if $x = y$,
(2) $d(x, y) = d(y, x)$,
(3) $d(x, z) \le d(x, y) + d(y, z)$,
is called a metric space. The mapping $d$ is a metric on $X$.
The Euclidean distance in the vector spaces $\mathbb R^n$ satisfies the above three requirements.
If $X$ is a vector space over $\mathbb R$ and $\|\ \| : X \to \mathbb R$ is a function satisfying
(4) $\|x\| \ge 0$; and $\|x\| = 0$ if and only if $x = 0$,
(5) $\|\lambda x\| = |\lambda|\,\|x\|$, for all scalars $\lambda$,
(6) $\|x + y\| \le \|x\| + \|y\|$,
then the function $\|\ \|$ is called a norm on $X$, and the space $X$ is then a normed vector space.
The $L_1$ norm $\|\ \|_1$ and the $L_2$ norm $\|\ \|_2$ on functions, as well as the Euclidean norm on $\mathbb R^n$, satisfy these properties.
A norm always determines the metric $d(x, y) = \|x - y\|$, which is also the case with the Euclidean distance.
But not every metric can be defined by a norm in this way.
At the beginning of this chapter, we defined the distance between functions using the $L_1$-norm. In Euclidean vector spaces, it is the norm $\|x\|$ which is induced by the bilinear inner product by the relation $\|x\|^2 = \langle x, x\rangle$. Similarly, we work with the norm on unitary spaces. We obtained the $L_2$-norm on continuous functions in the same way.
Metrics given by a norm have very specific properties, since their behaviour on the whole space $X$ can be derived from the properties in an arbitrarily small neighbourhood of the zero element $x = 0 \in X$.
□
7.D.2. Use the definition of the Gamma function $\Gamma(t)$ in Chapter 6 in order to prove
$$
\mathcal L(t^a)(s) = \frac{\Gamma(a+1)}{s^{a+1}}
$$
for general $a > 0$. Compare the result to the one of 7.D.1(g).
o
7.D.3. For $s > -1$, calculate the Laplace transform $\mathcal L(g)(s)$ of the function
$$
g(t) = t\,e^{-t}.
$$
7.3.2. Convergence. The concepts of (close) neighbourhoods of particular elements, convergence of sequences of elements and the corresponding "topological" concepts can be defined on abstract metric spaces in much the same way as in the case of the real and complex numbers and their sequences. See the beginning of the fifth chapter, 5.2.3-5.2.8. We can almost copy these paragraphs, although the proof of the theorem 5.2.8 is much harder. We begin with the concept of convergent sequences in a metric space $X$ with metric $d$:
Further, for $s > 1$, calculate the Laplace transform $\mathcal L(h)(s)$ of the function
Cauchy sequences
$$
h(t) = t\sinh t.
$$
O
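The basic Laplace transforms are enumerated in the following table:

$y(t)$                                                          $\mathcal L(y)(s)$
$t^k$                                                           $\frac{k!}{s^{k+1}}$
$e^{at}$                                                        $\frac{1}{s-a}$
$t\,e^{at}$                                                     $\frac{1}{(s-a)^2}$
$t^n e^{at}$                                                    $\frac{n!}{(s-a)^{n+1}}$
$\sin\omega t$                                                  $\frac{\omega}{s^2+\omega^2}$
$\cos\omega t$                                                  $\frac{s}{s^2+\omega^2}$
$e^{at}\sin\omega t$                                            $\frac{\omega}{(s-a)^2+\omega^2}$
$e^{at}\bigl(\cos\omega t + \frac{a}{\omega}\sin\omega t\bigr)$   $\frac{s}{(s-a)^2+\omega^2}$
$t\sin\omega t$                                                 $\frac{2\omega s}{(s^2+\omega^2)^2}$
$\sin\omega t - \omega t\cos\omega t$                           $\frac{2\omega^3}{(s^2+\omega^2)^2}$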
7.D.4. Establish the 5th and 6th rows of the table above using Euler's formula $e^{i\omega t} = \cos\omega t + i\sin\omega t$.
o
As expected, using the features of the Laplace transform allows us to find explicit solutions to some differential equations. By 7.H.8, it is straightforward to incorporate the initial conditions into the solution. We present just two such examples in the problems at the conclusion of this Chapter, see 7.H.11. We return to this topic in Chapter 8.
E. Metric spaces
The concept of metric is an abstract version of what we understand as the distance in Euclidean geometry. It is always based on the triangle inequality. The axioms in Definition 7.3.1 follow the Euclidean experience, saying that our "distance" of two elements has to be strictly positive (except if the two elements coincide), should be symmetric in the arguments, and should satisfy the triangle inequality. Other concepts available in the literature are more abstract and might lead to more general objects (the most important ones being pseudometrics, ultrametrics, and semimetrics).
The first axiomatic definition of a "traditional" metric was given by Maurice Fréchet in 1906. However, the name of the metric comes from Felix Hausdorff, who used this word in his work from 1914.
Consider an arbitrary sequence of elements $x_0, x_1, \dots$ in $X$. Suppose that for any fixed positive real number $\varepsilon$,
$$
d(x_i, x_j) < \varepsilon
$$
for all but finitely many pairs of terms $x_i, x_j$ of the sequence.
In other words, for any given $\varepsilon > 0$, there is an index $N$ such that the above inequality holds for all $i, j > N$. Loosely put, the elements of the sequence are eventually arbitrarily close to each other.
Such a sequence is called a Cauchy sequence.
Just as in the case of the real or complex numbers, we would like every Cauchy sequence of terms $x_i \in X$ to converge to some $x$ in the following sense:
Convergent sequences
Let $x_0, x_1, \dots$ be a sequence in a metric space $X$ and let $x$ be an element of $X$. We say that the sequence $\{x_i\}$ converges to the element $x$ if, for every positive real number $\varepsilon$, there is an integer $N > 0$ so that $i > N$ implies
$$
d(x_i, x) < \varepsilon.
$$
By the triangle inequality, it follows that for each pair of terms $x_i, x_j$ from a convergent sequence with sufficiently large indices,
$$
d(x_i, x_j) \le d(x_i, x) + d(x, x_j) < 2\varepsilon.
$$
Therefore, every convergent sequence is a Cauchy sequence. Conversely however, not every Cauchy sequence is convergent. Metric spaces where every Cauchy sequence is convergent are called complete metric spaces.
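A standard illustration of a Cauchy sequence without a limit lives in the rational numbers with the metric $d(x, y) = |x - y|$. The sketch below is an illustrative example, not taken from the text: Newton's iteration for $\sqrt 2$ is computed in exact rational arithmetic, its terms get arbitrarily close to each other, yet no rational number squares to 2, so the sequence has no limit in $\mathbb Q$.

```python
from fractions import Fraction


def newton_sqrt2(steps):
    """Terms x_0, ..., x_steps of x_{n+1} = (x_n + 2/x_n)/2 with x_0 = 2, as exact rationals."""
    x = Fraction(2)
    terms = [x]
    for _ in range(steps):
        x = (x + 2 / x) / 2
        terms.append(x)
    return terms


if __name__ == "__main__":
    terms = newton_sqrt2(6)
    for i in range(len(terms) - 1):
        # The distances between consecutive terms shrink very quickly.
        print(i, float(abs(terms[i] - terms[i + 1])))
    # Every term is rational, so its square can never equal 2.
    print(all(x * x != 2 for x in terms))
```

In the complete space $\mathbb R$ the same sequence converges to $\sqrt 2$.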
7.3.3. Topology, convergence, and continuity. Just as in the case of the real numbers, we can formulate the convergence in terms of "open neighbourhoods".
Open and closed sets
Definition. The open $\varepsilon$-neighbourhood of an element $x$ in a metric space $X$ (or just $\varepsilon$-neighbourhood for short) is the set
$$
O_\varepsilon(x) = \{y \in X;\ d(x, y) < \varepsilon\}.
$$
A subset $U \subset X$ is open if and only if for all $x \in U$, $U$ contains some $\varepsilon$-neighbourhood of $x$.
We define a subset $W \subset X$ to be closed if and only if its complement $X \setminus W$ is an open set.
Instead of an $\varepsilon$-neighbourhood, we also talk about the (open) $\varepsilon$-ball centered at $x$. In the case of a normed space, it suffices to consider the $\varepsilon$-balls centered at zero: shifted by $x$, they give the $\varepsilon$-neighbourhoods of $x$.
The limit points of a subset $A \subset X$ are defined as those elements $a \in X$ such that there is a sequence of points in $A$ other than $a$ converging to $a$.
We prove that a set is closed if and only if it contains all of its limit points:
485
CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING
7.E.1. The discrete metric space $X$ is defined as the set $X$ with the function $d : X \times X \to \mathbb R$,
$$
d(x, y) =
\begin{cases}
1 & x \neq y,\\
0 & x = y.
\end{cases}
$$
Show that this is a metric space according to Definition 7.3.1.
Show how to introduce a metric on Cartesian products of metric spaces, so that the product of two discrete metric spaces is again discrete.
Solution. All three axioms of a metric from 7.3.1 are obviously satisfied in our definition of the discrete metric space.
Consider two metric spaces $X$ and $Y$ with metrics $d_X$ and $d_Y$. The first obvious idea seems to be to add the distances of the components, i.e.
$$
d((x_1, y_1), (x_2, y_2)) = d_X(x_1, x_2) + d_Y(y_1, y_2).
$$
Clearly this is a metric (verify in detail!), but if the metric spaces $X$ and $Y$ are discrete, then considering points $u = (x_1, y_1)$ and $w = (x_2, y_2)$ such that $x_1 \neq x_2$, $y_1 \neq y_2$, we arrive at $d(u, w) = 2$. Thus, this is not a discrete metric space.
But there is another simple possibility of introducing a metric on $X \times Y$ using the maximum of the distances:
$$
d((x_1, y_1), (x_2, y_2)) = \max\{d_X(x_1, x_2), d_Y(y_1, y_2)\}.
$$
We call this the product of the metric spaces X and Y. The triangle inequality as well as the other axioms are obvious (write down the explicit arguments!). Moreover, if both X and Y are discrete, then d is also a discrete metric. □
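The difference between the two candidate metrics on a product of discrete spaces can be seen by brute force on small finite sets. In the following sketch the sets, the helper `satisfies_axioms` and the other function names are illustrative assumptions.

```python
from itertools import product


def discrete(x, y):
    """The discrete metric: 0 for equal points, 1 otherwise."""
    return 0 if x == y else 1


def sum_metric(u, w):
    """Sum of the componentwise discrete distances."""
    return discrete(u[0], w[0]) + discrete(u[1], w[1])


def max_metric(u, w):
    """Maximum of the componentwise discrete distances."""
    return max(discrete(u[0], w[0]), discrete(u[1], w[1]))


def satisfies_axioms(points, d):
    """Check the three metric axioms on a finite set by exhaustive search."""
    for u in points:
        for w in points:
            if d(u, w) < 0 or (d(u, w) == 0) != (u == w) or d(u, w) != d(w, u):
                return False
            if any(d(u, w) > d(u, v) + d(v, w) for v in points):
                return False
    return True


points = list(product("abc", "xy"))

if __name__ == "__main__":
    for d in (sum_metric, max_metric):
        values = sorted({d(u, w) for u in points for w in points})
        print(d.__name__, satisfies_axioms(points, d), values)
```

Both functions pass the axioms, but only the maximum takes no value other than 0 and 1, so only it is again a discrete metric.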
7.E.2. Decide whether or not the following sets and mappings form a metric space:
i) $\mathbb N$, $d(m, n) = \gcd(m, n)$
ii) $\mathbb N$, $d(m, n) = \frac{\max(m, n)}{\gcd(m, n)} - 1$
iii) World population, $d(P_1, P_2) = n$, where $P_1 = X_0, X_1, \dots, X_{n+1} = P_2$ is the shortest sequence of people such that $X_i$ knows $X_{i+1}$ for $i = 0, \dots, n$.
Solution.
i) No. The "distance" $d$ does not satisfy $d(m, m) = 0$.
ii) No. The first and second conditions in the definition 7.3.1 are fulfilled, but the triangle inequality (property (3)) is not. The distance of 8 and 9 is 8, the distance of 8 and 6 is 3 and the distance of 6 and 9 is 2, thus $d(8, 9) > d(8, 6) + d(6, 9)$.
Suppose $A$ is closed and $x$ is a limit point of $A$ not belonging to $A$. Then $x \in X \setminus A$, which is open, so there is an $\varepsilon$-neighbourhood of $x$ not intersecting $A$. But in every $\varepsilon$-neighbourhood of $x$, there are infinitely many points of the set $A$, since $x$ is a limit point. This is a contradiction.
Conversely, suppose $A$ contains all of its limit points and suppose $x \in X \setminus A$. If in every $\varepsilon$-neighbourhood of the point $x$ there is a point $x_\varepsilon \in A$, then the choices $\varepsilon = 1/n$ provide a sequence of points $x_n \in A$ converging to $x$. But then the point $x$ would have to be a limit point, thus lying in $A$, which again leads to a contradiction. Hence some $\varepsilon$-neighbourhood of $x$ lies in $X \setminus A$, so $X \setminus A$ is open and $A$ is closed.
For every subset $A$ in a metric space $X$, we define its interior as the set of those points $x$ in $A$ for which a neighbourhood of $x$ also belongs to $A$. We define the closure $\bar A$ of a set $A$ as the union of the original set $A$ with the set of all limit points of $A$.
As easily as in the case of the real numbers, we can verify that the intersection of any system of closed sets as well as the union of any finite system of closed sets is also closed.
On the other hand, any union of open sets is again an open set. A finite intersection of open sets is again an open set. Prove these propositions by yourselves in detail!
We also advise the reader to verify that the interior of a set $A$ equals the union of all open sets contained in $A$ (alternatively put, the interior of $A$ is the largest open subset of $A$). The closure of $A$ is the intersection of all closed sets which contain $A$ (alternatively put, the closure of $A$ is the smallest closed superset of $A$).
The closed and open sets are the essential concepts of the mathematical discipline called topology. Without pursuing these ideas further, we have just familiarised ourselves with the topology of the metric spaces.
The concept of convergence can be reformulated now as follows. A sequence of elements $x_i$, $i = 0, 1, \dots$, in a metric space $X$ converges to $x \in X$ if and only if for every open set $U$ containing $x$, all but finitely many points of our sequence lie in $U$.
Just as in the case of the real numbers, we can define continuous mappings between metric spaces:
Let $W$ and $Z$ be metric spaces. A mapping $f : W \to Z$ is continuous if and only if the inverse image $f^{-1}(V)$ of every open set $V \subset Z$ is an open set in $W$. This is equivalent to the statement that $f$ is continuous if and only if for every $z = f(x) \in Z$ and positive real number $\varepsilon$, there is a positive real number $\delta$ such that for all elements $y \in W$ with distance $d_W(x, y) < \delta$, we have $d_Z(z, f(y)) < \varepsilon$.
Again, as in the case of real-valued functions, a mapping $f$ from one metric space to another is continuous if and only if it preserves the convergence of sequences (check this yourselves!).
7.3.4. $L_p$-norms. Now we have at our disposal the general tools with which we can look at examples of metric spaces created by finite-dimensional vectors or functions. We restrict ourselves to an extraordinarily useful class of norms.
iii) No. The "distance" is not symmetric. It would be a metric space if the definition of the word "knows" were changed to mean "know each other".
□
7.E.3. Consider the set of binary words of the length $n$. Define the distance between two words as the number of bits in which they differ. This is called the Hamming distance (see 12.4.2). Show that it defines a metric.
Solution. The first two axioms of a metric are satisfied. For the third one, let the words $x$ and $z$ differ in $k$ bits. Let $y$ be another word. Then consider just the $k$ bits in which $x$ and $z$ differ. Clearly $y$ differs in each of these bits from exactly one of $x$ and $z$. Thus considering only the parts $x_p, y_p, z_p$ of the words in these $k$ bits, we have $d(x_p, y_p) + d(y_p, z_p) = d(x_p, z_p)$. In the other bits, the words $x$ and $z$ are the same, while $x$ and $y$ or $y$ and $z$ may differ. Thus $d(x, y) + d(y, z) \ge d(x, z)$ and the third axiom is satisfied also. □
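For short words, the triangle inequality for the Hamming distance can also be confirmed by exhaustive search. A minimal sketch, with words represented as tuples of bits (an illustrative choice, as are the function names):

```python
from itertools import product


def hamming(x, y):
    """Number of positions in which two words of equal length differ."""
    return sum(a != b for a, b in zip(x, y))


def triangle_holds(n):
    """Check d(x, z) <= d(x, y) + d(y, z) for all binary words of length n."""
    words = list(product((0, 1), repeat=n))
    return all(hamming(x, z) <= hamming(x, y) + hamming(y, z)
               for x in words for y in words for z in words)


if __name__ == "__main__":
    print(hamming((0, 1, 1, 0), (1, 1, 0, 0)))
    print(triangle_holds(4))
```

The search over all triples of words of length 4 covers $16^3$ cases, which is instantaneous.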
7.E.4. Consider any connected subset $S \subset \mathbb R^n$ (any two points in $S$ can be connected with a path lying in $S$). Define the distance between two points as the length of the shortest path between the points. Is it a metric on $S$?
Solution. It is a metric. All the axioms of the metric are trivially satisfied. But this metric has a special significance. The principle of the "shortest way" is often met in reality. Recall for example Fermat's principle of the least time (see 5.F.10), where we measure the length of a path by the time it takes light to travel along it. Generally, shortest paths in a metric space are called geodesics. □
7.E.5. Consider a space of integrable functions on the interval $[a, b]$. Define the ($L_1$) distance of the functions $f$, $g$ as
$$
\int_a^b |f(x) - g(x)|\,dx.
$$
Why is it not a metric space?
Solution. The first axiom of the metric space in 7.3.1 is not satisfied. Any function which is nonzero only on a set of measure zero has distance 0 from the null function.
But if we consider an equivalence where two functions are equivalent if they differ only on a set of measure zero, then we get the space $S^0(a, b)$. The given distance considered on the equivalence classes of this equivalence is the $L_1$ metric.
□
We begin with the real or complex finite-dimensional vector spaces $\mathbb R^n$ and $\mathbb C^n$, and for a fixed real number $p \ge 1$ and any vector $z = (z_1, \dots, z_n)$, we define
$$
\|z\|_p = \Bigl(\sum_{i=1}^{n} |z_i|^p\Bigr)^{1/p}.
$$
We prove that this indeed defines a norm. The first two properties from the definition are clear. It remains to prove the triangle inequality. For that purpose, we use Hölder's inequality:
Hölder inequality
Lemma. For a fixed real number $p > 1$ and every pair of $n$-tuples of non-negative real numbers $x_i$ and $y_i$,
$$
\sum_{i=1}^{n} x_i y_i \le \Bigl(\sum_{i=1}^{n} x_i^p\Bigr)^{1/p} \cdot \Bigl(\sum_{i=1}^{n} y_i^q\Bigr)^{1/q},
$$
where $1/q = 1 - 1/p$.
Proof. Denote by $X$ and $Y$ the expressions in the product on the right-hand side of the inequality to be proved. If all of the numbers $x_i$ or all of the numbers $y_i$ are zero, then the statement is clearly true. Therefore, we can assume that $X \neq 0$ and $Y \neq 0$.
We need to use the fact that the exponential function is a convex function. (This can be stated: the graph of the exponential function lies below any of its chords.) Hence, for any $a$ and $b$, with $p$, $q$ as above,
$$
e^{(1/p)a + (1/q)b} \le (1/p)\,e^{a} + (1/q)\,e^{b}
$$
(in fact, this is an instance of Jensen's inequality). Terms with $x_k = 0$ or $y_k = 0$ satisfy the inequality below trivially; for the remaining indices, define the numbers $v_k$ and $w_k$ so that
$$
x_k = X e^{v_k/p}, \qquad y_k = Y e^{w_k/q}.
$$
Then
$$
e^{v_k/p + w_k/q} \le \frac{1}{p}\,e^{v_k} + \frac{1}{q}\,e^{w_k}.
$$
By substitution, it follows immediately that
$$
\frac{1}{XY}\,x_k y_k \le \frac{1}{p}\Bigl(\frac{x_k}{X}\Bigr)^{p} + \frac{1}{q}\Bigl(\frac{y_k}{Y}\Bigr)^{q}.
$$
Summing over $k = 1, \dots, n$ gives
$$
\frac{1}{XY}\sum_{k=1}^{n} x_k y_k \le \frac{1}{pX^p}\sum_{k=1}^{n} x_k^p + \frac{1}{qY^q}\sum_{k=1}^{n} y_k^q = \frac{1}{pX^p}\,X^p + \frac{1}{qY^q}\,Y^q = \frac{1}{p} + \frac{1}{q} = 1.
$$
Multiplying this inequality by $XY$ finishes the proof. □
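Hölder's inequality can be probed numerically on random data. The sketch below is illustrative only: the helper names, the random test vectors and the tolerances are assumptions. It also checks the case $y_i = x_i^{p-1}$, in which both sides coincide.

```python
import random


def p_norm(v, p):
    """(sum |v_i|^p)^(1/p) for p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1 / p)


def holder_gap(x, y, p):
    """Right-hand side minus left-hand side of Hoelder's inequality; should be >= 0."""
    q = p / (p - 1)
    return p_norm(x, p) * p_norm(y, q) - sum(a * b for a, b in zip(x, y))


def random_gaps(trials, n, p, seed=0):
    """Gaps for randomly chosen non-negative n-tuples."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [rng.uniform(0, 10) for _ in range(n)]
        gaps.append(holder_gap(x, y, p))
    return gaps


if __name__ == "__main__":
    print(min(random_gaps(1000, 5, 3.0)))
    x = [1.0, 2.0, 3.0]
    # Equality case: y_i = x_i^(p-1), here with p = 3.
    print(holder_gap(x, [a ** 2 for a in x], 3.0))
```

The smallest gap over the random trials should be non-negative up to rounding, and the gap in the equality case should vanish up to rounding.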
Now we can prove that $\|\ \|_p$ is indeed a norm:
7.E.6. Let $r$ be a non-zero rational number and $p$ a prime number. Then $r$ can be uniquely written in the form $r = p^k \frac{u}{v}$, where $k \in \mathbb{Z}$, and $u \in \mathbb{Z}$ and $v \in \mathbb{N}$ are coprime and $p$ divides neither the numerator $u$ nor the denominator $v$. Consider the map $\|\cdot\|_p : \mathbb{Q} \to \mathbb{R}$, $\|r\|_p = p^{-k}$ (with $\|0\|_p = 0$). Show that it is a norm on $\mathbb{Q}$ as a vector space over $\mathbb{Q}$. It is called the p-adic norm.
Solution. It is an exercise in elementary number theory. □
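To experiment with the p-adic norm, one can compute it exactly with rational arithmetic. This is a minimal sketch under the standard definition $\|r\|_p = p^{-k}$ for $r = p^k u/v$ and $\|0\|_p = 0$; the function name `padic_norm` is hypothetical, and the primality of `p` is assumed rather than checked.

```python
from fractions import Fraction


def padic_norm(r, p):
    """p-adic norm of a rational r: p**(-k), where r = p**k * u/v and p divides neither u nor v."""
    r = Fraction(r)
    if r == 0:
        return Fraction(0)  # convention: the norm of zero is zero
    k = 0
    num, den = r.numerator, r.denominator
    while num % p == 0:  # factors of p in the numerator raise k
        num //= p
        k += 1
    while den % p == 0:  # factors of p in the denominator lower k
        den //= p
        k -= 1
    return Fraction(1, p) ** k


print(padic_norm(12, 2), padic_norm(Fraction(5, 9), 3))
```

On such examples one can also observe the stronger form of the triangle inequality, $\|a + b\|_p \le \max(\|a\|_p, \|b\|_p)$, which is the key step of the exercise.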
7.E.7. Consider the power set (the set of all subsets) of a given finite set. Determine whether the functions $d_1$ and $d_2$, defined for all subsets $X$, $Y$ by
(a) $d_1(X, Y) := |(X \cup Y) \setminus (X \cap Y)|$,
(b) $d_2(X, Y) := \dfrac{|(X \cup Y) \setminus (X \cap Y)|}{|X \cup Y|}$ for $X \cup Y \ne \emptyset$, $d_2(\emptyset, \emptyset) := 0$,
are metrics. (By $|X|$ is meant the number of elements of a set $X$; thus the metric $d_1$ measures the size of the symmetric difference of the sets, while $d_2$ measures the relative symmetric difference.)
Solution. We omit verifications of the first and second conditions from the definition of a metric in exercises on deciding whether a particular mapping is a metric. We analyze the triangle inequality only.
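The case (a). For any sets $X$, $Y$, $Z$,
$$(1)\qquad (X \cup Z) \setminus (X \cap Z) \subseteq \big((X \cup Y) \setminus (X \cap Y)\big) \cup \big((Y \cup Z) \setminus (Y \cap Z)\big).$$
To show this, suppose first that $x$ is an element satisfying $x \in X$ and $x \notin Z$. Then either $x \in Y$, in which case $x \in (Y \cup Z) \setminus (Y \cap Z)$, or $x \notin Y$, in which case $x \in (X \cup Y) \setminus (X \cap Y)$. It follows that $x$ belongs to the union, that is, the right side of (1). By symmetry, the same result holds if $x$ is an element satisfying $x \notin X$ and $x \in Z$. Since then all possibilities when $x$ belongs to the left side of (1) are accounted for, the inclusion (1) is established. But then,
$$d_1(X, Z) = |(X \cup Z) \setminus (X \cap Z)| \le \big|\big((X \cup Y) \setminus (X \cap Y)\big) \cup \big((Y \cup Z) \setminus (Y \cap Z)\big)\big| \le |(X \cup Y) \setminus (X \cap Y)| + |(Y \cup Z) \setminus (Y \cap Z)| = d_1(X, Y) + d_1(Y, Z).$$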
The case (b). Proceed similarly to the case of $d_1$. Denote by $X'$ the complement of a set $X$. The equalities
$$(X \cup Y) \setminus (X \cap Y) = (X \cap Y' \cap Z) \cup (X \cap Y' \cap Z') \cup (X' \cap Y \cap Z) \cup (X' \cap Y \cap Z'),$$
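Minkowski inequality
For every $p > 1$ and all n-tuples of non-negative real numbers $(x_1, \dots, x_n)$ and $(y_1, \dots, y_n)$,
$$\Big(\sum_{i=1}^n (x_i + y_i)^p\Big)^{1/p} \le \Big(\sum_{i=1}^n x_i^p\Big)^{1/p} + \Big(\sum_{i=1}^n y_i^p\Big)^{1/p}.$$
To verify this inequality, we can use the following trick. By Hölder's inequality (recall $p > 1$),
$$\sum_{i=1}^n x_i (x_i + y_i)^{p-1} \le \Big(\sum_{i=1}^n x_i^p\Big)^{1/p} \Big(\sum_{i=1}^n (x_i + y_i)^{(p-1)q}\Big)^{1/q}$$
and
$$\sum_{i=1}^n y_i (x_i + y_i)^{p-1} \le \Big(\sum_{i=1}^n y_i^p\Big)^{1/p} \Big(\sum_{i=1}^n (x_i + y_i)^{(p-1)q}\Big)^{1/q}.$$
Adding the last two inequalities, and taking into account that $p + q = pq$, and so $(p - 1)q = pq - q = p$, we arrive at
$$\frac{\sum_{i=1}^n (x_i + y_i)^p}{\big(\sum_{i=1}^n (x_i + y_i)^p\big)^{1/q}} \le \Big(\sum_{i=1}^n x_i^p\Big)^{1/p} + \Big(\sum_{i=1}^n y_i^p\Big)^{1/p}$$
(if the sum on the left is zero, the Minkowski inequality is trivial, so we may divide by it), that is
$$\Big(\sum_{i=1}^n (x_i + y_i)^p\Big)^{1 - 1/q} \le \Big(\sum_{i=1}^n x_i^p\Big)^{1/p} + \Big(\sum_{i=1}^n y_i^p\Big)^{1/p},$$
or
$$\Big(\sum_{i=1}^n (x_i + y_i)^p\Big)^{1/p} \le \Big(\sum_{i=1}^n x_i^p\Big)^{1/p} + \Big(\sum_{i=1}^n y_i^p\Big)^{1/p},$$
since $1 - 1/q = 1/p$. This is the Minkowski inequality which we wanted to prove.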
Thus we have verified that on every finite-dimensional real or complex vector space, there is a class of norms $\|\ \|_p$ for all $p > 1$. The case $p = 1$ was considered earlier. We can also consider $p = \infty$ by setting
$$\|z\|_\infty = \max\{|z_i|,\ i = 1, \dots, n\}.$$
This is a norm.
We notice that Hölder's inequality can, in the context of these norms, be written for all $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$ as
$$\sum_{i=1}^n |x_i|\,|y_i| \le \|x\|_p \, \|y\|_q$$
for all $p > 1$ and $q$ satisfying $1/p + 1/q = 1$. For $p = 1$, we set $q = \infty$.
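The whole family of norms, including the limiting case $p = \infty$, fits in a few lines of code. The sketch below is only an illustration; the function name `p_norm` and the test vectors are assumptions made for the example.

```python
import math


def p_norm(z, p):
    """Norm ||z||_p of a real or complex tuple z for p >= 1; p = math.inf gives the maximum norm."""
    if p == math.inf:
        return max(abs(c) for c in z)
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(c) ** p for c in z) ** (1 / p)


# Illustrative check of the Minkowski (triangle) inequality for p = 3.
x, y = [1.0, 2.0, 3.0], [4.0, 0.5, 2.0]
s = [a + b for a, b in zip(x, y)]
print(p_norm(s, 3) <= p_norm(x, 3) + p_norm(y, 3))
```

Evaluating `p_norm(z, p)` for growing `p` shows the values approaching `p_norm(z, math.inf)`, which motivates the notation for the maximum norm.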
7.3.5. $L_p$-norms for sequences and functions. Now we can easily define norms on suitable infinite-dimensional vector spaces as well. We begin with sequences.
The vector space $\ell_p$, $p \ge 1$, is the set of all sequences of real or complex numbers $x_0, x_1, \dots$ such that
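$$(Y \cup Z) \setminus (Y \cap Z) = (X \cap Y \cap Z') \cup (X \cap Y' \cap Z) \cup (X' \cap Y \cap Z') \cup (X' \cap Y' \cap Z),$$
$$\big((X \cup Z) \setminus (X \cap Z)\big) \cup \big(Y \setminus (X \cup Z)\big) = (X \cap Y \cap Z') \cup (X \cap Y' \cap Z') \cup (X' \cap Y \cap Z) \cup (X' \cap Y' \cap Z) \cup (X' \cap Y \cap Z'),$$
which, again, can be proved by listing several possibilities, imply a stronger form of (1), namely
$$\big((X \cup Z) \setminus (X \cap Z)\big) \cup \big(Y \setminus (X \cup Z)\big) \subseteq \big((X \cup Y) \setminus (X \cap Y)\big) \cup \big((Y \cup Z) \setminus (Y \cap Z)\big).$$
Further, we invoke the inequality
$$\frac{|(X \cup Z) \setminus (X \cap Z)|}{|X \cup Z|} \le \frac{\big|\big((X \cup Z) \setminus (X \cap Z)\big) \cup \big(Y \setminus (X \cup Z)\big)\big|}{\big|X \cup Z \cup \big(Y \setminus (X \cup Z)\big)\big|}.$$
This is based on calculations with non-negative numbers only, since in general
$$\frac{x}{z} \le \frac{x + y}{z + y}, \qquad y \ge 0,\ z > 0,\ x \in [0, z].$$
Since
$$X \cup Z \cup \big(Y \setminus (X \cup Z)\big) = X \cup Y \cup Z,$$
we obtain
$$d_2(X, Z) = \frac{|(X \cup Z) \setminus (X \cap Z)|}{|X \cup Z|} \le \frac{\big|\big((X \cup Z) \setminus (X \cap Z)\big) \cup \big(Y \setminus (X \cup Z)\big)\big|}{\big|X \cup Z \cup \big(Y \setminus (X \cup Z)\big)\big|} \le \frac{\big|\big((X \cup Y) \setminus (X \cap Y)\big) \cup \big((Y \cup Z) \setminus (Y \cap Z)\big)\big|}{|X \cup Y \cup Z|}$$
$$\le \frac{|(X \cup Y) \setminus (X \cap Y)| + |(Y \cup Z) \setminus (Y \cap Z)|}{|X \cup Y \cup Z|} \le \frac{|(X \cup Y) \setminus (X \cap Y)|}{|X \cup Y|} + \frac{|(Y \cup Z) \setminus (Y \cap Z)|}{|Y \cup Z|} = d_2(X, Y) + d_2(Y, Z),$$
if $X \cup Z \ne \emptyset$ and $Y \ne \emptyset$. However, for $X = Z = \emptyset$ or $Y = \emptyset$, the triangle inequality clearly still holds.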
Therefore, both mappings are metrics. The metric d1 is quite elementary, but the other metric, the metric d2 has wider applications. In the literature, it is also known as Jaccard's metric.3 □
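For a small base set, the triangle inequality for both functions can also be confirmed by brute force over all triples of subsets. The following sketch is only a sanity check of the result above; the helper names `power_set` and `triangle_holds` and the four-element base set are assumptions made for the example.

```python
from itertools import combinations


def d1(X, Y):
    """Size of the symmetric difference of two sets."""
    return len(X ^ Y)


def d2(X, Y):
    """Relative symmetric difference (Jaccard's metric), with d2(empty, empty) = 0."""
    union = X | Y
    return len(X ^ Y) / len(union) if union else 0.0


def power_set(base):
    """All subsets of a finite set, as frozensets."""
    items = list(base)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def triangle_holds(d, subsets, tol=1e-12):
    """Check d(X, Z) <= d(X, Y) + d(Y, Z) for all triples (tol absorbs rounding)."""
    return all(d(X, Z) <= d(X, Y) + d(Y, Z) + tol
               for X in subsets for Y in subsets for Z in subsets)


subsets = power_set({1, 2, 3, 4})
print(triangle_holds(d1, subsets), triangle_holds(d2, subsets))
```

Such a check is no substitute for the proof, but it is a quick way to catch a mistaken candidate metric.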
7.E.8. Let
$$d(x, y) := \frac{|x - y|}{1 + |x - y|}, \qquad x, y \in \mathbb{R}.$$
Prove that $d$ is a metric on $\mathbb{R}$.
Solution. We prove the triangle inequality only (the rest is clear). Introduce an auxiliary function
$$(1)\qquad f(t) := \frac{t}{1 + t}, \qquad t \ge 0.$$
3It is named after the biologist Paul Jaccard, who described the measure of similarity in insect populations using the function $1 - d_2$ in 1908.
$$\sum_{i=0}^\infty |x_i|^p < \infty.$$
If $x = (x_0, x_1, \dots) \in \ell_p$, $p \ge 1$, then the norm is given by
$$\|x\|_p = \Big(\sum_{i=0}^\infty |x_i|^p\Big)^{1/p}.$$
That $\|x\|_p$ is a norm follows immediately from the Minkowski inequality by letting $n \to \infty$.
The vector space $\ell_\infty$ is the set of all bounded sequences of real or complex numbers $x_0, x_1, \dots$.
If $x = (x_0, x_1, \dots) \in \ell_\infty$, then its norm is given by
$$\|x\|_\infty = \sup\{|x_i|,\ i = 0, 1, 2, 3, \dots\}.$$
It is easily checked that this is indeed a norm.
Eventually, we return to the space of functions $S^0[a, b]$ on a finite interval $[a, b]$ or $S^0_c[a, b]$ on an unbounded interval. We have already met the $L_1$ norm $\|\ \|_1$. However, for every $p > 1$ and for all functions in such a space of functions, the Riemann integrals
$$\int_a^b |f(x)|^p\,dx$$
surely exist, so we can define
$$\|f\|_p = \Big(\int_a^b |f(x)|^p\,dx\Big)^{1/p}.$$
The Riemann integral was defined in terms of limits, using the Riemann sums which correspond to partitions $\Xi$ with representatives $\xi_i$. In our case, those are the finite sums
$$\sum_{i=1}^n |f(\xi_i)|^p (x_i - x_{i-1}).$$
Hölder's inequality applied to the Riemann sums of a product of two functions $f(x)$ and $g(x)$ gives
$$\sum_{i=1}^n |f(\xi_i)|\,|g(\xi_i)|\,(x_i - x_{i-1}) = \sum_{i=1}^n |f(\xi_i)|\,(x_i - x_{i-1})^{1/p}\, |g(\xi_i)|\,(x_i - x_{i-1})^{1/q}$$
$$\le \Big(\sum_{i=1}^n |f(\xi_i)|^p (x_i - x_{i-1})\Big)^{1/p} \Big(\sum_{i=1}^n |g(\xi_i)|^q (x_i - x_{i-1})\Big)^{1/q},$$
where on the right-hand side, there is the product of the Riemann sums for the integrals $\|f\|_p$ and $\|g\|_q$.
Moving to limits, we thus verify Hölder's inequality for integrals:
$$\int_a^b f(x) g(x)\,dx \le \Big(\int_a^b f(x)^p\,dx\Big)^{1/p} \Big(\int_a^b g(x)^q\,dx\Big)^{1/q},$$
which is valid for all non-negative real-valued functions $f$ and $g$ in our space of piecewise continuous functions with compact support.
Note that
$$f(s) - f(r) = \frac{s}{1 + s} - \frac{r}{1 + r} = \frac{s - r}{(1 + s)(1 + r)} > 0$$
whenever $s > r \ge 0$.
It follows that $f$ is increasing, a fact which can also be verified by examining the first derivative. Therefore,
$$d(x, z) = \frac{|x - z|}{1 + |x - z|} = \frac{|x - y + y - z|}{1 + |x - y + y - z|} \le \frac{|x - y| + |y - z|}{1 + |x - y| + |y - z|}$$
$$= \frac{|x - y|}{1 + |x - y| + |y - z|} + \frac{|y - z|}{1 + |x - y| + |y - z|} \le \frac{|x - y|}{1 + |x - y|} + \frac{|y - z|}{1 + |y - z|} = d(x, y) + d(y, z), \qquad x, y, z \in \mathbb{R}.$$
□
The metrics in the next problems are defined by norms on vector spaces of functions. See the definitions and discussion in 7.3.1.
7.E.9. Determine the distance between the functions
$$f(x) = x, \qquad g(x) = -\frac{x}{\sqrt{1 + x^2}}, \qquad x \in [1, 2],$$
as elements of the normed vector space $S^0[1, 2]$ of (piecewise) continuous functions on the interval $[1, 2]$ with norm
(a) $\|f\|_1 = \int_1^2 |f(x)|\,dx$;
(b) $\|f\|_\infty = \max\{|f(x)|;\ x \in [1, 2]\}$.
Solution. The case (a). We need only compute the norm of the difference of the functions
$$\int_1^2 |f(x) - g(x)|\,dx = \int_1^2 \Big(x + \frac{x}{\sqrt{1 + x^2}}\Big)dx = \Big[\frac{x^2}{2} + \sqrt{1 + x^2}\Big]_1^2 = \frac{3}{2} + \sqrt{5} - \sqrt{2}.$$
The case (b). It is necessary to compute
$$\max_{x \in [1, 2]} |f(x) - g(x)| = \max_{x \in [1, 2]} \Big(x + \frac{x}{\sqrt{1 + x^2}}\Big).$$
Since
$$\Big(x + \frac{x}{\sqrt{1 + x^2}}\Big)' = 1 + \frac{1}{\big(\sqrt{1 + x^2}\big)^3} > 0, \qquad x \in [1, 2],$$
it follows that $f - g$ is increasing, and so attains its maximum at the right end point of the interval, when $x = 2$. So
$$\max_{x \in [1, 2]} \Big(x + \frac{x}{\sqrt{1 + x^2}}\Big) = 2 + \frac{2}{\sqrt{5}}.$$
□
In just the same way as in the previous paragraph, we can derive the integral form of the Minkowski inequality from Hölder's inequality:
$$\|f + g\|_p \le \|f\|_p + \|g\|_p.$$
Thus $\|\ \|_p$ is indeed a norm on the vector space of all continuous functions having a compact support for all $p > 1$ (we verified this for $p = 1$ long ago).
We use the word "norm" for the entire space $S^0[a, b]$ of piecewise continuous functions in this context; however, we should bear in mind that we have to identify those functions which differ only by their values at points of discontinuity.
Among these norms, the case of $p = 2$ is special because of the existence of the inner product. In this case, we could have derived the triangle inequality much more easily using the Schwarz inequality.
For the functions from $S^0[a, b]$, we can define an analogy of the $L_\infty$-norm on n-dimensional vectors. Since our functions are piecewise continuous, they always have suprema of absolute values on a finite closed interval, so we can set
$$\|f\|_\infty = \sup\{|f(x)|,\ x \in [a, b]\}$$
for such a function $f$. If we considered both the one-sided limits (which always exist by our definition) and the value of the function itself to be the value $f(x)$ at points of discontinuity, we can work with maxima instead of suprema. It is apparent again that it is a norm (except for the problems with values at discontinuity points).
7.3.6. Completion of metric spaces. Both the real numbers $\mathbb{R}$ and the complex numbers $\mathbb{C}$ are (with the metric given by the absolute value) complete metric spaces. This is contained in the axiom of the existence of suprema. Recall that the real numbers were created as a "completion" of the space of rational numbers, which is not complete. It is evident that the closure of the set $\mathbb{Q} \subset \mathbb{R}$ is $\mathbb{R}$.
Dense and nowhere-dense subsets
We say that a subset $A \subset X$ in a metric space $X$ is dense if and only if the closure of $A$ is the whole space $X$. A set $A$ is said to be nowhere dense in $X$ if and only if the set $X \setminus \bar{A}$ is dense, where $\bar{A}$ denotes the closure of $A$.
Evidently, $A$ is dense in $X$ if and only if every non-empty open set in the whole space $X$ has a non-empty intersection with $A$.
In all cases of norms on functions from the previous paragraph, the metric spaces defined in this way are not complete, since it can happen that the limit of a Cauchy sequence of functions from our vector space $S^0[a, b]$ should be a function which does not belong to this space any more. Consider the interval $[0, 1]$ as the domain of functions $f_n$ which take zero on $[0, 1/n)$ and are equal to $\sin(1/x)$ on $[1/n, 1]$. They converge to the function $\sin(1/x)$ in all $L_p$ norms, but this function does not lie in the space.
The $L_1$ or $L_2$ distances, discussed in the beginning of this chapter (cf. 7.1.2), reflect the basic intuition about the distance between graphs of the functions. However, in practice we need to understand more subtle concepts of distances. The most obvious way is to include the derivatives in a way similar to the values of the functions.
7.E.10. Consider the space $S^1[a, b]$ of piecewise differentiable (real or complex) functions on the interval $[a, b]$ and show that the formula
$$\|f\| = \Big(\int_a^b |f(x)|^2 + \alpha^2 |f'(x)|^2\,dx\Big)^{1/2}$$
with any real $\alpha > 0$ is a norm on this vector space (up to the identification of functions differing only in the points of discontinuity).
Compute the distance between the functions $f(x) = \sin(x) + 0.1[\sin(6x)]^2 - 0.03\sin(60x)$ and $g(x) = \sin(x)$ on the interval $[-\pi, \pi]$ in this norm and explain its dependence on $\alpha$.
Solution. The formula
$$\langle f, g \rangle = \int_a^b f(x)\overline{g(x)} + \alpha^2 f'(x)\overline{g'(x)}\,dx$$
defines a scalar product on $S^1[a, b]$. The mapping is linear in the first argument $f$, provides the complex conjugate value if the arguments are exchanged, and clearly is positive if $f = g$ is non-zero on any interval (ignore the values in the points of discontinuity, cf. the discussion in 7.1.2). Thus the corresponding quadratic form defines a norm on the complex vector space $S^1[a, b]$.
The distance in this norm is easily computed to obtain
$$\sqrt{0.02639 + 11.3097\,\alpha^2}.$$
Its dependence on $\alpha$ can be seen in the illustration: the values of the function $f(x)$ are nearly equal to $\sin x$, but the very wiggly difference is well apparent in the derivatives.
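The value of the distance can be cross-checked by numerical integration. The sketch below is an illustration only: it uses a hand-written composite Simpson rule (the helper `simpson` and the number of subintervals are assumptions), the difference $f - g$ entered directly, and its derivative computed by hand as $0.6\sin(12x) - 1.8\cos(60x)$.

```python
import math


def simpson(func, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3


def diff(x):
    """The difference f(x) - g(x) from the exercise."""
    return 0.1 * math.sin(6 * x) ** 2 - 0.03 * math.sin(60 * x)


def diff_prime(x):
    """Derivative of the difference: 0.6 sin(12x) - 1.8 cos(60x)."""
    return 0.6 * math.sin(12 * x) - 1.8 * math.cos(60 * x)


def distance(alpha):
    """Distance of f and g in the norm with weight alpha on the derivative."""
    def integrand(x):
        return diff(x) ** 2 + alpha ** 2 * diff_prime(x) ** 2
    return math.sqrt(simpson(integrand, -math.pi, math.pi))


print(distance(0.0), distance(1.0))
```

The two integrals equal $0.0084\pi$ and $3.6\pi$, which agrees with the coefficients $0.02639$ and $11.3097$ in the formula above.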
Completion of a metric space
Let $X$ be a metric space with metric $d$ which is not complete. A metric space $\tilde{X}$ with metric $\tilde{d}$ such that $X \subset \tilde{X}$, $d$ is the restriction of $\tilde{d}$ to the subset $X$, and the closure of $X$ in $\tilde{X}$ is the whole space $\tilde{X}$, is called a completion of the metric space $X$.
The following theorem says that the completion of an arbitrary (incomplete) metric space $X$ can be found in essentially the same way as the real numbers were created from the rationals.
7.3.7. Theorem. Let $X$ be a metric space with metric $d$ which is not complete. Then there exists a completion $\tilde{X}$ of $X$.
Proof. The idea of the construction is identical to the one used when building the real numbers. Two Cauchy sequences $x_i$ and $y_i$ of points belonging to $X$ are considered equivalent if and only if $d(x_i, y_i)$ converges to zero for $i$ approaching infinity. This is a convergence of real numbers, thus the definition is correct.
From the properties of convergence on the real numbers, it is clear that the relation defined above is an equivalence relation. The reader is advised to verify this in detail. For instance, the transitivity follows from the fact that the sum of two sequences converging to zero converges to zero as well.
We define $\tilde{X}$ as the set of the classes of this equivalence of Cauchy sequences. The original points $x \in X$ can be identified with the class of sequences equivalent to the constant sequence $x_i = x$, $i = 0, 1, \dots$.
It is now easy to define the metric $\tilde{d}$. We put
$$\tilde{d}(\tilde{x}, \tilde{y}) = \lim_{i \to \infty} d(x_i, y_i)$$
for sequences $\tilde{x} = \{x_0, x_1, \dots\}$ and $\tilde{y} = \{y_0, y_1, \dots\}$.
First, we have to verify that this limit exists at all and is finite. Using the triangle inequality, and the fact that both the sequences $\tilde{x}$ and $\tilde{y}$ are Cauchy sequences, it follows that the considered sequence $d(x_i, y_i)$ is also a Cauchy sequence of real numbers, so its limit exists.
If we select different representatives $\tilde{x} = \{x'_0, x'_1, \dots\}$ and $\tilde{y} = \{y'_0, y'_1, \dots\}$, then from the triangle inequality for the distance of real numbers (we need to consider the consequences for differences of distances) we see that
$$|d(x'_i, y'_i) - d(x_i, y_i)| \le |d(x'_i, y'_i) - d(x'_i, y_i)| + |d(x'_i, y_i) - d(x_i, y_i)| \le d(y_i, y'_i) + d(x_i, x'_i),$$
and the right-hand side converges to zero.
Therefore, the definition is indeed independent of the choice of representatives.
We verify that $\tilde{d}$ is a metric on $\tilde{X}$. The first and second properties are clear, so it remains to prove the triangle inequality. For that purpose, choose three Cauchy representatives of
If $\alpha = 0$, the distance 0.162 is the usual $L_2$ distance. If $\alpha = 1$, the distance is 3.367.4 □
Now we move to more theoretical considerations. Though these exercises may not look particularly practical, they should be of help in understanding the basic concepts of metric spaces, the convergence as well as their links to the topological concepts.
7.E.11. Show that the definition of a metric as a function $d$ defined on $X \times X$ for a non-empty set $X$ and satisfying
(1) $d(x, y) = 0$ if and only if $x = y$, $x, y \in X$,
(2) $d(x, z) \le d(y, x) + d(y, z)$, $x, y, z \in X$,
is equivalent to the definition given in the theoretical part, in paragraph 7.3.1.
Solution. At first glance, it seems that this definition imposes fewer requirements on the metric than the definition from the theoretical part. The two definitions are equivalent if and only if the two conditions of non-degeneracy and triangle inequality imply
(3) $d(x, y) \ge 0$, $x, y \in X$,
(4) $d(x, y) = d(y, x)$, $x, y \in X$.
However, if we set $x = z$ in (2), we get $0 = d(x, x) \le 2d(y, x)$ from (1), that is, the non-negativity of the metric. Similarly, the choice $y = z$ in (2) together with (1) implies that $d(x, z) \le d(z, x)$ for all points $x, z \in X$. Interchanging the two variables then gives the opposite inequality, i.e. (4). Thus, it is proved that the definitions are equivalent. □
F. Convergence
7.F.1. Describe all sequences in a discrete metric space $X$ which are convergent or Cauchy.
Solution. Since the distance between two points $x$, $y$ in $X$ is either 1 or zero, the sequence $x_1, x_2, \dots$ is Cauchy if and only if all $x_i$ are equal, except for a finite number of them. But then, the sequence is convergent. □
This problem shows a behaviour quite different from the convergence of sequences in the metric spaces $X = \mathbb{R}$ or $X = \mathbb{C}$. But sequences of integers would behave in a very
4Here is an illustration of the very important concept of Sobolev spaces, where any number of derivatives can be involved. Moreover, we can use $L_p$, $p \ge 1$, in the definition of the norm instead of $p = 2$. There is much literature on this subject.
the elements $\tilde{x}$, $\tilde{y}$, $\tilde{z}$, and we obtain
$$\tilde{d}(\tilde{x}, \tilde{z}) = \lim_{i \to \infty} d(x_i, z_i) \le \lim_{i \to \infty} d(x_i, y_i) + \lim_{i \to \infty} d(y_i, z_i) = \tilde{d}(\tilde{x}, \tilde{y}) + \tilde{d}(\tilde{y}, \tilde{z}).$$
The restriction of the metric $\tilde{d}$ just defined to the original space $X$ is identical to the original metric because the original points are represented by constant sequences.
It is required to prove that $X$ is dense in $\tilde{X}$. Let $\tilde{x} = \{x_i\}$ be a fixed Cauchy sequence, and let $\varepsilon > 0$ be given. Since the sequence $x_i$ is a Cauchy sequence, all pairs of its terms $x_n$, $x_m$ for sufficiently large indices $m$ and $n$ become closer to each other than $\varepsilon$. Then the choice $y = x_n$ for one of those indices necessarily implies that the elements $y$ and $x_m$ are closer together than $\varepsilon$, and so $\tilde{d}(\tilde{y}, \tilde{x}) \le \varepsilon$, where $\tilde{y}$ is the class of the constant sequence of $y$'s. Hence there is an element $y$ of the original space such that the distance of the constant sequence of $y$'s from the chosen sequence $x_i$ does not exceed $\varepsilon$. This establishes the denseness of $X$.
It remains to prove that the constructed metric space is complete. That is, that Cauchy sequences of points of the extended space $\tilde{X}$ with respect to the metric $\tilde{d}$ are necessarily convergent to a point in $\tilde{X}$. This can be done by approximating the points of a Cauchy sequence $\tilde{x}_k$ by points $y_k$ from the original space $X$ so that the resulting sequence $\tilde{y} = \{y_i\}$ would be the limit of the original sequence with respect to the metric $\tilde{d}$.
Since $X$ is a dense subset in $\tilde{X}$, we can choose, for every element $\tilde{x}_k$ of our fixed sequence, an element $z_k \in X$ so that the constant sequence $\tilde{z}_k$ would satisfy $\tilde{d}(\tilde{x}_k, \tilde{z}_k) < 1/k$. Now consider the sequence $\tilde{z} = \{z_0, z_1, \dots\}$. The original sequence $\tilde{x}_k$ is Cauchy. So for a fixed real number $\varepsilon > 0$, there is an index $n(\varepsilon)$ such that $\tilde{d}(\tilde{x}_n, \tilde{x}_m) < \varepsilon/2$ whenever both $m$ and $n$ are greater than $n(\varepsilon)$. Without loss of generality, the index $n(\varepsilon)$ is greater than or equal to $4/\varepsilon$. Now, for $m$ and $n$ greater than $n(\varepsilon)$, we get:
$$d(z_m, z_n) = \tilde{d}(\tilde{z}_m, \tilde{z}_n) \le \tilde{d}(\tilde{z}_m, \tilde{x}_m) + \tilde{d}(\tilde{x}_m, \tilde{x}_n) + \tilde{d}(\tilde{x}_n, \tilde{z}_n) < \frac{1}{m} + \frac{\varepsilon}{2} + \frac{1}{n} \le 2\,\frac{\varepsilon}{4} + \frac{\varepsilon}{2} = \varepsilon.$$
Hence $z_i$ is a Cauchy sequence of elements in $X$, and so $\tilde{z} \in \tilde{X}$. From the triangle inequality,
$$\tilde{d}(\tilde{z}, \tilde{x}_n) \le \tilde{d}(\tilde{z}, \tilde{z}_n) + \tilde{d}(\tilde{z}_n, \tilde{x}_n).$$
From the previous bounds, both terms on the right-hand side converge to zero. Hence the distances $\tilde{d}(\tilde{x}_n, \tilde{z})$ approach zero, thereby finishing the proof. □
We consider now the uniqueness of the completion of metric spaces.
A mapping $\varphi : X_1 \to X_2$ between metric spaces with metrics $d_1$ and $d_2$, respectively, is called an isometry if and
similar way. On the other hand, we deal mostly with metrics on spaces of functions, where intuition gained in the real line R may be useful.
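7.F.2. Determine whether or not the sequence $\{x_n\}_{n \in \mathbb{N}}$, where
$$x_1 = 1, \qquad x_n = 1 + \frac{1}{2} + \dots + \frac{1}{n}, \quad n \in \mathbb{N} \setminus \{1\},$$
is a Cauchy sequence in $\mathbb{R}$ using the standard metric.
Solution. Recall that
$$(1)\qquad \sum_{k=1}^\infty \frac{1}{k} = \infty, \quad \text{i.e.} \quad \sum_{k=m}^\infty \frac{1}{k} = \infty, \quad m \in \mathbb{N}.$$
Therefore,
$$\lim_{n \to \infty} |x_n - x_m| = \sum_{k=m+1}^\infty \frac{1}{k} = \infty, \qquad m \in \mathbb{N}.$$
Hence the sequence $\{x_n\}$ is not a Cauchy sequence.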
Alternatively, {xn} is not a Cauchy sequence, since if it is, then it is convergent in the complete metric space R, which contradicts the divergence shown in (1). □
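A quick numerical experiment makes the failure of the Cauchy property visible. The sketch below relies on a standard alternative estimate, which is not the argument used in the solution above: the difference $x_{2n} - x_n$ consists of $n$ terms, each at least $1/(2n)$, so it never drops below $1/2$. The function names are chosen only for illustration.

```python
def harmonic(n):
    """Partial sum x_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def gap(n):
    """Difference x_{2n} - x_n, which is bounded below by n * 1/(2n) = 1/2."""
    return harmonic(2 * n) - harmonic(n)


for n in (1, 10, 100, 1000):
    print(n, gap(n))
```

Since terms arbitrarily far along the sequence stay at least $1/2$ apart, no index $n(\varepsilon)$ can work for $\varepsilon < 1/2$.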
7.F.3. Repeat the question from the previous problem with the metric $d$ given by (cf. 7.E.8)
$$d(x, y) := \frac{|x - y|}{1 + |x - y|}, \qquad x, y \in \mathbb{R}.$$
Solution. Instead of repeating the arguments, we point out the difference between the given metric and the standard one. The difference is expressed by the function $f$ introduced in (1) of 7.E.8. This is a continuous function and, moreover, a bijection between the sets $[0, \infty)$ and $[0, 1)$, having the property that $f(0) = 0$. Further, the property of a sequence being Cauchy or convergent in a metric space is defined by being Cauchy or convergent for the real numbers describing the distances between the elements in the sequence. But the continuous mappings preserve convergence or the property of being Cauchy, and hence the solution for the new metric is the same as with the standard one. □
7.F.4. Determine whether or not the metric space $C[-1, 1]$ of continuous functions on the interval $[-1, 1]$ with metric given by the norm
(a) $\|f\|_p = \big(\int_{-1}^1 |f(x)|^p\,dx\big)^{1/p}$ for $p \ge 1$;
(b) $\|f\|_\infty = \max\{|f(x)|;\ x \in [-1, 1]\}$
is complete.
Solution. The case (a). For every $n \in \mathbb{N}$, define a function
$$f_n(x) = 0,\ x \in [-1, 0), \qquad f_n(x) = 1,\ x \in (\tfrac{1}{n}, 1], \qquad f_n(x) = nx,\ x \in [0, \tfrac{1}{n}].$$
only if all elements $x, y \in X_1$ satisfy $d_2(\varphi(x), \varphi(y)) = d_1(x, y)$.
Of course, every isometry is a bijection onto its image (this follows from the property that the distance between distinct elements is non-zero) and the corresponding inverse mapping is an isometry as well.
Now, consider two inclusions of a dense subset, $\iota_1 : X \to \tilde{X}_1$ and $\iota_2 : X \to \tilde{X}_2$, into two completions of the space $X$, and denote the corresponding metrics by $d$, $\tilde{d}_1$, and $\tilde{d}_2$, respectively. The mapping
$$\varphi = \iota_2 \circ \iota_1^{-1} : \iota_1(X) \to \tilde{X}_2$$
is well-defined on the dense subset $\iota_1(X) \subset \tilde{X}_1$. Its image is the dense subset $\iota_2(X) \subset \tilde{X}_2$ and, moreover, this mapping is clearly an isometry. The dual mapping $\iota_1 \circ \iota_2^{-1}$ works in the same way.
Every isometric mapping maps, of course, Cauchy sequences to Cauchy sequences. At the same time, such Cauchy sequences converge to the same element in the completion if and only if this holds for their images under the isometry $\varphi$. Thus if such a mapping $\varphi$ is defined on a dense subset $X$ of a metric space $X_1$, then it has a unique extension to the whole $X_1$ with values lying in the closure of the image $\varphi(X)$, i.e. $X_2$.
By using the previous ideas, there is a unique extension of $\varphi$ to the mapping $\tilde{\varphi} : \tilde{X}_1 \to \tilde{X}_2$ which is both a bijection and an isometry. Thus, the completions $\tilde{X}_1$ and $\tilde{X}_2$ are indeed identical in this sense.
Thus it is proved:
7.3.8. Theorem. Let $X$ be a metric space with metric $d$ which is not complete. Then the completion $\tilde{X}$ of $X$ with metric $\tilde{d}$ is unique up to bijective isometries.
In the following three paragraphs, we introduce three theorems about complete metric spaces. They are highly applicable in both mathematical analysis and verifying convergence of numerical methods.
7.3.9. Banach's contraction principle. A mapping $F : X \to X$ on a metric space $X$ with metric $d$ is called a contraction mapping if and only if there is a real constant $0 \le C < 1$ such that for all elements $x$, $y$ in $X$,
$$d(F(x), F(y)) \le C\,d(x, y).$$
Theorem. If $F$ is a contraction mapping on a complete metric space $X$, then it has a fixed point, i.e., there is a $z \in X$ such that $F(z) = z$.
Proof. The proof naturally follows the intuitive idea that iterative application of a contraction mapping starting from an initial value $z_0 \in X$ should "accumulate" to some point. The metric space $X$, of course, needs to be complete; otherwise it could happen that the limit point does not exist in it.
For every $m \ge n$, $m, n \in \mathbb{N}$, we compute the inequality
$$\Big(\int_{-1}^1 |f_m(x) - f_n(x)|^p\,dx\Big)^{1/p} \le \Big(\int_0^{1/n} 1\,dx\Big)^{1/p} = n^{-1/p}.$$
It follows that the sequence $\{f_n\}_{n \in \mathbb{N}} \subset C[-1, 1]$ is a Cauchy sequence of functions.
Suppose the sequence $\{f_n\}$ has a $\|\cdot\|_p$ limit $f$ in $C[-1, 1]$. We show that this limit cannot be continuous at $x = 0$. For every $\varepsilon \in (0, 1)$, there exists an $n(\varepsilon) \in \mathbb{N}$ such that
$$f_n(x) = 0,\ x \in [-1, 0], \qquad f_n(x) = 1,\ x \in [\varepsilon, 1]$$
for all $n \ge n(\varepsilon)$. Suppose that $f(y) \ne 1$ at some $y \ge \varepsilon$. Since $f$ is continuous, $f \ne 1$ on some interval containing $y$, and then $\|f - f_n\|_p \ge \delta > 0$ for all $n \ge n(\varepsilon)$ and some $\delta$, which contradicts the convergence. The same argument applies on $[-1, 0]$. Therefore $f$ must satisfy
$$f(x) = 0,\ x \in [-1, 0], \qquad f(x) = 1,\ x \in [\varepsilon, 1]$$
for an arbitrarily small $\varepsilon > 0$. Thus, necessarily,
$$f(x) = 0,\ x \in [-1, 0], \qquad f(x) = 1,\ x \in (0, 1].$$
But this function is not continuous on $[-1, 1]$, so it does not belong to the considered metric space. Therefore, the sequence $\{f_n\}$ does not have a limit in $C[-1, 1]$, so this space is not complete.
The case (b). Let an arbitrary Cauchy sequence $\{f_n\}_{n \in \mathbb{N}} \subset C[-1, 1]$ be given. The terms of this sequence are continuous functions $f_n$ on $[-1, 1]$ having the property that for every $\varepsilon > 0$ (or for every $\varepsilon/2$ if you want) there is an $n(\varepsilon) \in \mathbb{N}$ such that
$$(1)\qquad \max_{x \in [-1, 1]} |f_m(x) - f_n(x)| < \frac{\varepsilon}{2}, \qquad m, n \ge n(\varepsilon).$$
In particular, for every $x \in [-1, 1]$, we get a Cauchy sequence $\{f_n(x)\}_{n \in \mathbb{N}} \subset \mathbb{R}$ of numbers. Since the metric space $\mathbb{R}$ with the usual metric is complete, the sequence $\{f_n(x)\}$ is convergent for every $x \in [-1, 1]$. Set
$$f(x) := \lim_{n \to \infty} f_n(x), \qquad x \in [-1, 1].$$
Letting $m \to \infty$ in (1), we obtain
$$\max_{x \in [-1, 1]} |f(x) - f_n(x)| \le \frac{\varepsilon}{2} < \varepsilon, \qquad n \ge n(\varepsilon).$$
It follows that the sequence $\{f_n\}_{n \in \mathbb{N}}$ converges uniformly (that is, with respect to the given norm) to the function $f$ on $[-1, 1]$. Since the uniform limit of continuous functions is continuous, so is $f$, so $f \in C[-1, 1]$, see 6.3.4. Therefore, the metric space is complete. □
The same reasoning as above, and hence the same results, apply to the more general metric space $C[a, b]$ of continuous
Choose an arbitrary $z_0 \in X$ and consider the sequence $z_i$, $i = 0, 1, \dots$,
$$z_1 = F(z_0),\ z_2 = F(z_1),\ \dots,\ z_{i+1} = F(z_i),\ \dots$$
From the assumptions, we have
$$d(z_{i+1}, z_i) = d(F(z_i), F(z_{i-1})) \le C\,d(z_i, z_{i-1}) \le \dots \le C^i d(z_1, z_0).$$
The triangle inequality then implies that for all natural numbers $j$,
$$d(z_{i+j}, z_i) \le \sum_{k=1}^j d(z_{i+k}, z_{i+k-1}) \le \sum_{k=1}^j C^{i+k-1} d(z_1, z_0) = C^i d(z_1, z_0) \sum_{k=1}^j C^{k-1}$$
$$\le C^i d(z_1, z_0) \sum_{k=1}^\infty C^{k-1} = \frac{C^i}{1 - C}\,d(z_1, z_0).$$
Now, since $0 \le C < 1$, $\lim_{i \to \infty} C^i = 0$, so for every positive (no matter how small) $\varepsilon$, the right-hand expression is surely less than $\varepsilon$ for sufficiently large indices $i$, that is,
$$d(z_i, z_{i+j}) \le \frac{C^i}{1 - C}\,d(z_1, z_0) < \varepsilon.$$
However, this ensures that the sequence $z_i$ is a Cauchy sequence. Since $X$ is complete, the sequence has a limit $z$, and all that remains to be proved is $F(z) = z$.
Every contraction mapping is continuous. Therefore,
$$F(z) = F(\lim_{n \to \infty} z_n) = \lim_{n \to \infty} F(z_n) = \lim_{n \to \infty} z_{n+1} = z.$$
This finishes the proof.
□
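The proof is constructive, and the estimate $d(z_i, z) \le \frac{C^i}{1 - C} d(z_1, z_0)$ (obtained from the bound above by letting $j \to \infty$) gives a stopping rule. The following is a minimal sketch for real-valued maps; the function name `fixed_point`, the tolerance and the example $F = \cos$ on $[0, 1]$ (a contraction there with $C = \sin 1$, by the mean value theorem) are assumptions made for illustration.

```python
import math


def fixed_point(F, z0, C, tol=1e-10, max_iter=100_000):
    """Iterate z_{i+1} = F(z_i) until the a priori bound C**i / (1 - C) * d(z_1, z_0) drops below tol.

    F is assumed to be a contraction with constant C on a complete subset of the reals.
    Returns the last iterate and the number of iterations used.
    """
    if not 0 <= C < 1:
        raise ValueError("contraction constant must satisfy 0 <= C < 1")
    step0 = abs(F(z0) - z0)  # d(z_1, z_0)
    z, i = z0, 0
    while C ** i / (1 - C) * step0 >= tol:
        if i >= max_iter:
            raise RuntimeError("tolerance not reached")
        z = F(z)
        i += 1
    return z, i


z, steps = fixed_point(math.cos, 1.0, C=math.sin(1.0))
print(z, steps)
```

The a priori bound is pessimistic: it only uses the global constant $C$, while the actual error typically shrinks faster near the fixed point.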
The next two theorems extend the intuitive understanding of "density" of closed intervals $[a, b] \subset \mathbb{R}$, not allowing for any "holes" there. They are essential for the understanding of compactness of metric spaces. In fact, they are both special cases of more general theorems on topological spaces.
7.3.10. Cantor intersection theorem. For any set $A$ in a metric space $X$ with metric $d$, the real number
$$\operatorname{diam} A = \sup_{x, y \in A} d(x, y)$$
is called the diameter of the set $A$. The set $A$ is said to be bounded if and only if $\operatorname{diam} A < \infty$.
Theorem. If $A_1 \supseteq A_2 \supseteq \dots \supseteq A_i \supseteq \dots$ is a non-increasing sequence of non-empty closed subsets in a complete metric space $X$ and if $\operatorname{diam} A_i \to 0$, then there is exactly one point $x \in X$ belonging to the intersection of all the sets $A_i$.2
2Georg Cantor is considered the founder of set theory, which he introduced and developed in the last quarter of the 19th century. At this time, the new abstract approach to the fundamentals of Mathematics caused fierce objections. It also led to the severe internal crises of Mathematics in the beginning of the 20th century. This part of the history of Mathematics is fascinating.
functions on any closed bounded interval $[a, b]$ or on the space $C_c$ of continuous functions with compact support.
7.F.5. Prove that the metric space $\ell_2$ is complete.
Solution. Recall that $\ell_2$ is the space of sequences of real numbers with the $L_2$-norm, see 7.3.5.
Consider an arbitrary Cauchy sequence $\{x_n\}_{n \in \mathbb{N}}$ in the space $\ell_2$. Every term of this sequence is again a sequence, i.e., $x_n = \{x_n^k\}_{k \in \mathbb{N}}$, $n \in \mathbb{N}$. Of course, the range of indices does not matter: there is no difference whether $n, k \in \mathbb{N}$ or $n, k \in \mathbb{N} \cup \{0\}$. Introduce auxiliary sequences $y_k$ for $k \in \mathbb{N}$ so that
$$y_k = \{y_k^n\}_{n \in \mathbb{N}} = \{x_n^k\}_{n \in \mathbb{N}}.$$
If $\{x_n\}$ is a Cauchy sequence in $\ell_2$, then each of the sequences $y_k$ is a Cauchy sequence in $\mathbb{R}$ (the sequences $y_k$ are sequences of real numbers). It follows from the completeness of $\mathbb{R}$ (with respect to the usual metric) that all of the sequences $y_k$ are convergent. Denote their limits by $z_k$, $k \in \mathbb{N}$.
It suffices to prove that $z = \{z_k\}_{k \in \mathbb{N}} \in \ell_2$ and that the sequence $\{x_n\}$ converges for $n \to \infty$ in $\ell_2$ just to the sequence $z$. The sequence $\{x_n\}_{n \in \mathbb{N}} \subset \ell_2$ is a Cauchy sequence; therefore, for every $\varepsilon > 0$, there is an $n(\varepsilon) \in \mathbb{N}$ with the property that
$$\sum_{k=1}^\infty (x_m^k - x_n^k)^2 < \varepsilon, \qquad m, n \ge n(\varepsilon),\ m, n \in \mathbb{N}.$$
In particular,
$$\sum_{k=1}^l (x_m^k - x_n^k)^2 < \varepsilon, \qquad m, n \ge n(\varepsilon),\ m, n, l \in \mathbb{N},$$
whence, letting $m \to \infty$,
$$\sum_{k=1}^l (z_k - x_n^k)^2 \le \varepsilon, \qquad n \ge n(\varepsilon),\ n, l \in \mathbb{N},$$
i.e. (this time $l \to \infty$)
$$(1)\qquad \sum_{k=1}^\infty (z_k - x_n^k)^2 \le \varepsilon, \qquad n \ge n(\varepsilon),\ n \in \mathbb{N}.$$
Especially,
$$\sum_{k=1}^\infty (z_k - x_n^k)^2 < \infty, \qquad n \ge n(\varepsilon),\ n \in \mathbb{N},$$
and, at the same time,
$$\sum_{k=1}^\infty (x_n^k)^2 < \infty, \qquad n \in \mathbb{N},$$
which follows straight from $\{x_n\}_{n \in \mathbb{N}} \subset \ell_2$.
Since (cf. the special case of Hölder's inequality for $p = 2$ in 7.3.4)
$$\sum_{k=1}^\infty |z_k x_n^k| \le \sqrt{\sum_{k=1}^\infty z_k^2} \cdot \sqrt{\sum_{k=1}^\infty (x_n^k)^2}, \qquad n \in \mathbb{N},$$
and
Proof. Select one point $x_i$ for each set $A_i$. Since $\operatorname{diam} A_i \to 0$, for every positive real number $\varepsilon$, we can find an index $n(\varepsilon)$ such that for all $A_i$ with indices $i \ge n(\varepsilon)$, their diameters are less than $\varepsilon$. Since the sets are nested, for indices $i, j \ge n(\varepsilon)$ both points $x_i$, $x_j$ lie in $A_{n(\varepsilon)}$, so $d(x_i, x_j) < \varepsilon$, and thus our sequence is a Cauchy sequence. Therefore, it has a limit point $x \in X$. For each $i$, the point $x$ is a limit of the points $x_j \in A_i$, $j \ge i$, thus it belongs to $A_i$ (since $A_i$ is closed). So $x$ belongs to the intersection of all the sets $A_i$. This proves the existence of $x$.
Assume there are two points $x$ and $y$, both belonging to the intersection of all the sets $A_i$. Then $d(x, y)$ is at most the diameter of each of the sets $A_i$. But $\operatorname{diam} A_i \to 0$, so $d(x, y) = 0$, hence $x = y$. This proves the uniqueness of $x$. □
7.3.11. Theorem (Baire theorem). If $X$ is a complete metric space, then the intersection of every countable system of open dense sets $A_i$ is a dense set in the metric space $X$.3
Proof. Suppose $X$ contains a system of dense open sets $A_i$, $i = 1, 2, \dots$. It is required to show that the set $A = \bigcap_{i=1}^\infty A_i$ has a non-empty intersection with any non-empty open set $U \subset X$. Proceed inductively, invoking the previous theorem.
Since $A_1$ is dense in $X$, there is a point $z_1 \in A_1 \cap U$. Since the set $A_1 \cap U$ is open, the closure of an $\varepsilon_1$-neighbourhood $U_1$ of the point $z_1$ (for sufficiently small $\varepsilon_1 > 0$) is contained in $A_1 \cap U$ as well. Denote the closure of this $\varepsilon_1$-ball $U_1$ by $B_1$.
Further, suppose that the points $z_i$ and their open $\varepsilon_i$-neighbourhoods $U_i$ are already chosen for $i = 1, \dots, n$. Since the set $A_{n+1}$ is open and dense in $X$, there is a point $z_{n+1} \in A_{n+1} \cap U_n$; however, since $A_{n+1} \cap U_n$ is open, the point $z_{n+1}$ belongs to it together with a sufficiently small $\varepsilon_{n+1}$-neighbourhood $U_{n+1}$.
Then (shrinking $\varepsilon_{n+1}$ if necessary) the closure $B_{n+1}$ of $U_{n+1}$ satisfies $B_{n+1} \subset U_n$, and so the closed set $B_{n+1}$ is contained in $A_{n+1} \cap U_n$. Moreover, we can assume that $\varepsilon_n < 1/n$.
If we proceed in this inductive way from the original point $z_1$ and the set $B_1$, we obtain a non-increasing sequence of non-empty closed sets $B_n$ whose diameter approaches zero. Therefore, there is a point $z$ common to all of these sets. That is,
$$z \in \bigcap_{i=1}^\infty B_i \subseteq \Big(\bigcap_{i=1}^\infty A_i\Big) \cap U,$$
3This theorem is a part of considerations by René-Louis Baire in his 1899 doctoral thesis. More generally, a topological space satisfying the property as in the theorem is called a Baire space and the theorem simply says that every complete metric space is a Baire space.
495
CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING
E (** - 4) = E K - 2^4 + K) ),  n e N.
fc=l k=l
which is the statement to be proved.
□
Hence
$$\sum_{k=1}^{\infty} z_k^2 < \infty.$$
It is proved that $z \in \ell_2$. The fact that $\{x_n\}$ converges for $n \to \infty$ to $z$ in $\ell_2$ follows from (1). □
The next problem addresses the question of the strength of different metrics on the same space of functions in terms of convergence. We deal with the space $\mathcal{S}_c$ of piecewise continuous functions with compact support, equipped with the $L_p$ metrics. We write briefly $\mathcal{L}_p$ for these metric spaces. In particular, we show that convergence in $\mathcal{L}_p$ for some positive $p$ does not always imply convergence in $\mathcal{L}_q$ for another positive $q \neq p$.
7.F.6. Let $0 < p < \infty$. For each positive integer $n$, define the function
$$f_n(x) = \begin{cases} n^{1/p} & \text{for } -1/n < x < 1/n, \\ 0 & \text{otherwise.} \end{cases}$$
Decide for which $q$ the sequence $f_n$ converges in $\mathcal{L}_q$.
Solution. Let $q$ be any positive real number. Then
$$\int_{-\infty}^{\infty} |f_n(x)|^q \, dx = \int_{-1/n}^{1/n} n^{q/p} \, dx = (2/n)\, n^{q/p} = 2 n^{q/p - 1}.$$
If $0 < q < p$ and if $n \to \infty$, then
$$\int_{-\infty}^{\infty} |f_n(x)|^q \, dx \to 0.$$
So $\|f_n\|_q \to 0$, and the sequence converges to the zero function. Similarly, if $0 < p < q$ and if $n \to \infty$, then
$$\int_{-\infty}^{\infty} |f_n(x)|^q \, dx \to \infty.$$
So $\|f_n\|_q$ diverges, and in particular, $f_n$ cannot converge to any limit.
Finally, for $q = p$ we have $\int |f_n(x)|^p \, dx = 2$ for all positive integers $n$, and so $\|f_n\|_p = 2^{1/p}$ for all $n$. At the same time, for any $g \in \mathcal{S}_c$, if $g(x) \neq 0$ at some $x \neq 0$ where $g$ is continuous, its distance from $f_n$ cannot converge to zero.
It follows that $f_n$ converges to $0$ in $\mathcal{L}_q$ for $0 < q < p$, but it does not converge in $\mathcal{L}_q$ with $q \ge p$. □
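The closed formula $2n^{q/p-1}$ can be tabulated directly. The following is a minimal Python sketch, assuming $p = 2$ for illustration; the helper name `lq_integral` is ours and does not come from the text. It evaluates the integral of $|f_n|^q$ from the height and width of the step function.

```python
def lq_integral(n, p, q):
    """Integral of |f_n|^q, where f_n = n**(1/p) on (-1/n, 1/n) and 0 elsewhere."""
    height = n ** (1.0 / p)   # value of f_n on its support
    width = 2.0 / n           # length of the interval (-1/n, 1/n)
    return width * height ** q

p = 2.0  # illustrative choice, not fixed by the problem
for q in (1.0, 2.0, 3.0):
    values = [lq_integral(n, p, q) for n in (1, 10, 100, 1000)]
    # q < p: values shrink; q = p: values stay at 2; q > p: values grow
    print(q, values)
```

The printed rows illustrate the three regimes discussed in the solution.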
The next problem deals with the extremely useful Banach fixed point theorem, showing the necessity of all the requirements in Theorem 7.3.9.
7.3.12. Bounded and compact sets. The following concepts facilitated our discussions when dealing with the real and complex numbers. They can be reformulated for general metric spaces with almost no change:
An interior point of a subset $A$ in a metric space is an element of $A$ which belongs to it together with some $\varepsilon$-neighbourhood.
A boundary point of a set $A$ is an element $x \in X$ such that each of its neighbourhoods has a non-empty intersection with both $A$ and the complement $X \setminus A$. A boundary point may or may not belong to the set $A$ itself.
A limit point of a set $A$ is an element $x$ equal to the limit of a sequence $x_i \in A$ such that $x_i \neq x$ for all $i$. Clearly a limit point may or may not belong to the set $A$.
An isolated point of a set $A$ is an element $a \in A$ such that one of its $\varepsilon$-neighbourhoods in $X$ has the singleton intersection $\{a\}$ with $A$.
An open cover of a set $A$ is a system of open sets $U_i \subset X$, $i \in I$, such that their union contains $A$.
Compact sets
A metric space $X$ is called compact if every sequence of points $x_i \in X$ has a subsequence converging to some point $x \in X$.
Any subset $A \subset X$ in a metric space is called compact if it is compact as a metric space with the restricted metric.
Clearly, the compact subsets in discrete metric spaces $X$ are exactly the finite subsets of $X$.
In the case of the real numbers $\mathbb{R}$, our definition yields exactly the compact subsets discussed there, and we would like to arrive at properties as useful as those obtained for real numbers in the paragraphs 5.2.7-5.2.8. It is surprisingly easy to see that continuous functions behave similarly on compact sets in general:
Theorem. Let $f : X \to Y$ be a continuous mapping between metric spaces. Then the images of compact sets are compact.
Proof. Recall that any convergent sequence of points $x_i \to x$ in $X$ is mapped onto the convergent sequence $f(x_i) \to f(x)$ in $Y$. Thus, the statement follows immediately from our definition of compactness via convergent subsequences. □
In particular we obtain the most useful consequence on the minima and maxima of continuous functions on compact subsets:
Corollary. Let $f : X \to \mathbb{R}$ be a continuous real function defined on a compact metric space. Then there are points $x_0$ and $y_0$ in $X$ such that
$$f(x_0) = \max_{x \in X} f(x), \qquad f(y_0) = \min_{x \in X} f(x).$$
7.F.7. Show that the mapping $f : [0, \infty) \to [0, \infty)$ given by
$$f(x) = x + e^{-x}$$
satisfies, for all $x \neq y$, the condition
$$|f(x) - f(y)| < |x - y|,$$
but it does not have any fixed point, i.e. $f(x) \neq x$ for any $x$. (Thus the condition $|f(x) - f(y)| \le C\,|x - y|$, with constant $C < 1$, in the Banach fixed point Theorem 7.3.9 is essential.)
Solution. Clearly the function $f$ is strictly increasing on the entire domain. Assume $y < x$. Then $e^{-x} < e^{-y}$, and since $f(x) - f(y) > 0$,
$$|f(x) - f(y)| = |x - y + e^{-x} - e^{-y}| < |x - y|.$$
Finally, $f(x) = x$ implies $e^{-x} = 0$, which is impossible. □
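The behaviour of this map can be observed numerically. The sketch below is only an illustration (the helper `iterate` and the sample points are our own choices): it checks the strict inequality on a few pairs and shows that the iterates $x_{n+1} = f(x_n)$ keep growing instead of settling at a fixed point.

```python
import math

def f(x):
    """The map from problem 7.F.7."""
    return x + math.exp(-x)

def iterate(x0, steps):
    """Apply f repeatedly, starting from x0."""
    x = x0
    for _ in range(steps):
        x = f(x)
    return x

# |f(x) - f(y)| < |x - y| on a few sample pairs
for x, y in [(0.0, 1.0), (0.5, 3.0), (2.0, 5.0)]:
    print(abs(f(x) - f(y)), "<", abs(x - y))

# the iterates increase without bound (slowly), so no fixed point is reached
for steps in (10, 100, 1000, 10000):
    print(steps, iterate(0.0, steps))
```

Since each step adds the positive amount $e^{-x_n}$, the sequence is strictly increasing, in agreement with the solution.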
G. Topology
Dealing with convergence of real numbers, we observe that the topological concepts of open neighbourhoods and compactness are most useful. This is of course even more true for metric spaces, where we can work with open balls of radius $r$ etc. The definitions remain essentially the same.
7.G.1. We have already seen the discrete metric space $X$ with the metric $d : X \times X \to \mathbb{R}$ defined by the formula
$$d(x, y) := 1, \ x \neq y, \qquad d(x, y) := 0, \ x = y.$$
(a) Decide whether (X, d) is complete.
(b) Describe all open, closed, and bounded sets in (X, d).
(c) Describe the interior, boundary, limit, and isolated points of an arbitrary set in (X, d).
(d) Describe all compact sets in (X, d).
Solution. The case (a) was essentially dealt with in 7.F.1. For an arbitrary sequence $\{x_n\}_{n \in \mathbb{N}}$ to be a Cauchy sequence, it is necessary in this space that there is an index $n \in \mathbb{N}$ such that $x_n = x_{n+m}$ for all $m \in \mathbb{N}$. Any sequence with this property then necessarily converges to the common value $x_n = x_{n+1} = \cdots$ (we talk about almost stationary sequences). So the metric space $(X, d)$ is complete.
The case (b). The open 1-neighbourhood of any element contains this element only. Therefore, every singleton set is open. Since the union of any number of open sets is an open set, every set is open in (X, d). By complements,
Proof. The image f(X) must be a compact subset in R, thus it must achieve both maximum and minimum (which are the supremum and the infimum of the bounded and closed image). □
The concept of boundedness is a little more complicated in the case of general metric spaces. For any point $x$ and subset $B \subset X$ in a metric space $X$ with metric $d$, we define their distance$^4$
$$\operatorname{dist}(x, B) = \inf_{y \in B} d(x, y).$$
We say that a metric space $X$ is totally bounded if for every positive real number $\varepsilon$, there is a finite set $A$ such that
$$\operatorname{dist}(x, A) < \varepsilon$$
for all points $x \in X$. We call such an $A$ an $\varepsilon$-net of $X$.
Note that a metric space $X$ is bounded if it has a finite diameter. We can immediately see that a totally bounded space is always bounded. Indeed, the diameter of a finite set is finite, and if $A$ is the set corresponding to $\varepsilon$ from the definition of total boundedness, then the distance $d(x, y)$ of two points can always be bounded by the sum of $\operatorname{dist}(x, A)$, $\operatorname{dist}(y, A)$, and $\operatorname{diam} A$, which is a finite number.
In the case of a metric on a subset of a finite-dimensional Euclidean space, these concepts coincide since the boundedness of a set guarantees the boundedness of all coordinates in a fixed orthonormal basis, and this implies total boundedness. (Verify this in detail by yourselves!)
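To make the notion of an $\varepsilon$-net concrete, here is a small Python sketch for a finite sample of points in the unit square with the Euclidean metric. The greedy selection rule, the function names and the sample are our own illustrative assumptions, not a construction from the text.

```python
import math
import random

def greedy_epsilon_net(points, eps):
    """Keep a point only if it is at distance >= eps from all points kept so far.
    Every discarded point is then closer than eps to some kept point."""
    net = []
    for p in points:
        if all(math.dist(p, a) >= eps for a in net):
            net.append(p)
    return net

def dist_to_set(x, net):
    """dist(x, A): the infimum of d(x, a) over a in A (a minimum for finite A)."""
    return min(math.dist(x, a) for a in net)

random.seed(0)
sample = [(random.random(), random.random()) for _ in range(500)]
net = greedy_epsilon_net(sample, 0.25)
# size of the net, and the largest distance of a sample point from the net
print(len(net), max(dist_to_set(x, net) for x in sample))
```

The second printed number stays below the chosen $\varepsilon = 0.25$, which is exactly the defining property $\operatorname{dist}(x, A) < \varepsilon$ for the sampled points.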
The next theorem provides the promised very useful alternative characterisations of compactness:
7.3.13. Theorem. The following statements about a metric space X are equivalent:
(1) X is compact;
(2) every open covering of $X$, $X = \bigcup_{i \in I} U_i$, contains a finite covering $X = \bigcup_{k=1}^{n} U_{j_k}$, where all $j_k \in I$;
(3) X is complete and totally bounded.
Proof. We show consecutively the implications (1) => (3) => (2) => (1).
(1) ⇒ (3). Assume $X$ is compact. Then for each Cauchy sequence of points $x_i$, there is a subsequence $x_{i_n}$ converging to a point $x \in X$. We just have to verify that the initial sequence also converges to the same limit, $x_i \to x$. This is easy and we leave it to the reader. So $X$ is complete.
Suppose $X$ is not totally bounded. Then there is $\varepsilon > 0$ such that no finite $\varepsilon$-net exists in $X$. Then there is a sequence of points $x_i$ such that $d(x_i, x_j) \ge \varepsilon$ for all $i \neq j$. (Verify this almost obvious claim - look at the definition of $\varepsilon$-nets!) Then this is a sequence of points, where no sub-sequence can be a
Notice that the distance between two subsets $A, B \subset X$ should express how "different" they are. Thus we define the (Hausdorff) distance as follows: $\operatorname{dist}(A, B) = \max\{\sup\{\operatorname{dist}(x, B),\ x \in A\},\ \sup\{\operatorname{dist}(y, A),\ y \in B\}\}$.
This distance is finite for bounded sets and it is easy to see that it vanishes if and only if the closures of $A$ and $B$ coincide.
this also means that every set is closed. The fact that the 2-neighbourhood of any element coincides with the whole space implies that every set is bounded in (X, d).
The case (c). Once again, we use the fact that the open 1-neighbourhood of any element contains this element only. It follows that every point of any set is both its interior and its isolated point and that the sets have neither boundary nor limit points.
The case (d). Every finite set in an arbitrary metric space is compact (it defines a compact metric space by restricting the domain of $d$). It follows from the classification of convergent sequences (see (a)) that no infinite set can be compact in $(X, d)$. □
7.G.2. In the metric space $\mathcal{S}[-1, 1]$ with metric given by the norm $\|\cdot\|_\infty$, consider the sets
$$A = \{f \in \mathcal{S}[-1, 1];\ f(0) \in (0, 2)\},$$
$$B = \Bigl\{f \in \mathcal{S}[-1, 1];\ \int_{-1}^{1} f(x)\, dx = 0\Bigr\}.$$
Are these sets open? Are these sets closed?
Solution. The interior of a set $M$ is the set of all interior points of $M$ and it is usually denoted by $M^\circ$. A set $M$ is then open if and only if $M = M^\circ$. Similarly, we define the closure of a set $M$ as the set of all points having zero distance from $M$; it is denoted by $\overline{M}$. A set $M$ is closed if and only if $M = \overline{M}$. Since
$$A^\circ = A, \quad \overline{A} = \{f \in \mathcal{S}[-1, 1];\ f(0) \in [0, 2]\}, \quad B^\circ = \emptyset, \quad \overline{B} = B$$
(especially, $\overline{A}$ contains functions $f$ for which $f(0)$ can attain values from the whole closed interval $[0, 2]$), the set $A$ is open and not closed, and, the other way around, the set $B$ is closed and not open. □
One of the most important concepts related to complete metric spaces is given by the principle of nested balls (cf. the theorem 7.3.10). Under some additional conditions, it says that a metric space $(X, d)$ is complete if and only if every sequence $\{A_n\}_{n \in \mathbb{N}}$ of nested (i.e., $A_{n+1} \subset A_n$, $n \in \mathbb{N}$) non-empty closed sets $A_n$ has non-empty intersection. That is,
(1) $\bigcap_{n=1}^{\infty} A_n \neq \emptyset.$
7.G.3. Verify that the additional condition in the theorem
(1) $\lim_{n \to \infty} \sup\{d(x, y);\ x, y \in A_n\} = 0$
cannot be omitted.
Cauchy sequence, so X is not compact. This contradicts (1), so we conclude that X is totally bounded.
The next implication, namely (3) ⇒ (2), is more demanding. So assume $X$ is complete and totally bounded, but $X$ does not satisfy (2).
Then there is an open covering $U_\alpha$, $\alpha \in I$, of $X$, which does not contain any finite covering. Choose a sequence of positive real numbers $\varepsilon_k \to 0$ and consider the finite $\varepsilon_k$-nets from the definition of total boundedness. Further, for each $k$, consider the system $\mathcal{A}_k$ of closed balls with centres in the points of the $\varepsilon_k$-net and diameters at most $2\varepsilon_k$. Clearly each such system $\mathcal{A}_k$ covers the entire space $X$. Altogether, there must be at least one closed ball $C$ in the system $\mathcal{A}_1$ which is not covered by a finite number of the sets $U_\alpha$. Call it $C_1$ and notice that $\operatorname{diam} C_1 \le 2\varepsilon_1$.
Next, consider the sets $C_1 \cap C$, with balls $C \in \mathcal{A}_2$, which cover the entire set $C_1$. Again, at least one of them cannot be covered by a finite number of the $U_\alpha$; we call it $C_2$. This way, we inductively construct a sequence of closed sets $C_k$ satisfying $C_{k+1} \subset C_k$, $\operatorname{diam} C_k \le 2\varepsilon_k$, $\varepsilon_k \to 0$, and none of them can be covered by a finite number of the open sets $U_\alpha$.
Finally we choose one point $x_k \in C_k$ in each of these sets. By construction, this must be a Cauchy sequence. Consequently, this sequence of points has a limit $x$ since $X$ is complete, and $x$ lies in every $C_k$ because these sets are closed and nested. Thus there is $U_{\alpha_0}$ containing $x$ and containing also some $\delta$-neighbourhood $B_\delta(x)$. But now, if $\operatorname{diam} C_k \le 2\varepsilon_k < \delta$, then $C_k \subset B_\delta(x) \subset U_{\alpha_0}$, which is a contradiction.
The remaining step is to show the implication (2) ⇒ (1). Assume (2) and consider any sequence of points $x_i \in X$. We set $C_n = \overline{\{x_k;\ k \ge n\}}$. The intersection of these sets must be non-empty by the following general lemma:
Lemma. Let $X$ be a metric space such that property (2) in the Theorem holds. Consider a system of closed sets $D_\alpha$, $\alpha \in I$, such that each of its finite subsystems $D_{\alpha_1}, \dots, D_{\alpha_k}$ has non-empty intersection. Then also
$$\bigcap_{\alpha \in I} D_\alpha \neq \emptyset.$$
This simple lemma is proved by contradiction, again. If the latter intersection is empty, then
$$X = X \setminus \Bigl(\bigcap_{\alpha \in I} D_\alpha\Bigr) = \bigcup_{\alpha \in I} (X \setminus D_\alpha) = \bigcup_{\alpha \in I} V_\alpha,$$
where $V_\alpha = X \setminus D_\alpha$ are open sets. Thus, there must be a finite number of them, $\{V_{\alpha_1}, \dots, V_{\alpha_n}\}$, covering $X$ too. Thus, we obtain
$$X = \bigcup_{i=1}^{n} V_{\alpha_i} = \bigcup_{i=1}^{n} (X \setminus D_{\alpha_i}) = X \setminus \Bigl(\bigcap_{i=1}^{n} D_{\alpha_i}\Bigr).$$
This is a contradiction with our assumptions on $D_\alpha$ and the lemma is proved.
Now, let $x \in \bigcap_{n=1}^{\infty} C_n$. By construction, there is a subsequence $x_{n_k}$ in our sequence of points $x_n \in X$, so that $d(x_{n_k}, x) < 1/k$. This is a converging subsequence, and so the proof is complete. □
As an immediate corollary of the latter theorem, each closed subset in a compact metric space is again compact.
Solution. That the requirement (1) cannot be omitted probably runs contrary to many readers' expectations. For a counterexample, consider the set $X = \mathbb{N}$ with metric
$$d(m, n) = 1 + \frac{1}{m + n}, \ m \neq n, \qquad d(m, n) = 0, \ m = n.$$
It is indeed a metric. The first and second properties are clearly satisfied. To prove the triangle inequality, it suffices to observe that $d(m, n) \in (1, 4/3]$ if $m \neq n$. Hence the only Cauchy sequences are those which are constant from some index on, that is, constant except for finitely many terms (sometimes called almost stationary sequences). Thus, every Cauchy sequence is convergent, so the metric space is complete. Define
$$A_n := \Bigl\{m \in \mathbb{N};\ d(m, n) \le 1 + \frac{1}{2n}\Bigr\}, \quad n \in \mathbb{N}.$$
As the inequality in their definition is not strict, it is guaranteed that they are closed sets. Since $A_n = \{n, n + 1, \dots\}$, it follows that the sets $A_n$ are nested, but with empty intersection. If the principle of nested sets held without the additional requirement, the empty intersection would mean that the metric space is not complete, contradicting what we have just shown. Of course, in this case the condition (1) is not met, as
$$\lim_{n \to \infty} \sup\{d(x, y);\ x, y \in A_n\} = \lim_{n \to \infty} \Bigl(1 + \frac{1}{2n + 1}\Bigr) = 1 \neq 0.$$
□
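The counterexample is easy to probe on a finite range of integers. The following Python sketch uses exact rational arithmetic; the truncation bound `limit` and the helper names are assumptions made only for this illustration.

```python
from fractions import Fraction

def d(m, n):
    """The metric on N from the solution: 1 + 1/(m+n) for m != n, and 0 for m == n."""
    return Fraction(0) if m == n else 1 + Fraction(1, m + n)

def A(n, limit):
    """The set A_n, truncated to the integers 1..limit."""
    return [m for m in range(1, limit + 1) if d(m, n) <= 1 + Fraction(1, 2 * n)]

def diameter(points):
    """sup of d(x, y) over the given finite set of points."""
    return max(d(x, y) for x in points for y in points)

N = 30
triangle_ok = all(d(a, c) <= d(a, b) + d(b, c)
                  for a in range(1, N) for b in range(1, N) for c in range(1, N))
print(triangle_ok)          # triangle inequality on the tested range
print(A(5, 12))             # A_5 consists of the integers from 5 on
print(diameter(A(5, 12)))   # equals 1 + 1/(2*5 + 1), which does not tend to 0
```

The diameters stay above 1 for every $n$, so the nested closed sets $A_n$ violate the additional condition while their intersection is empty.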
7.G.4. Determine whether the set (known as the Hilbert cube)
$$A = \bigl\{\{x_n\}_{n \in \mathbb{N}} \in \ell_2;\ |x_n| \le \tfrac{1}{n},\ n \in \mathbb{N}\bigr\}$$
is compact in $\ell_2$. Then determine the compactness of the set
$$B = \bigl\{\{x_n\}_{n \in \mathbb{N}} \in \ell_\infty;\ |x_n| < \tfrac{1}{n},\ n \in \mathbb{N}\bigr\}$$
in the space $\ell_\infty$.
Solution. The space $\ell_2$ is complete (see 7.F.5). Every closed subset of a complete metric space defines a complete metric space. The set $A$ is evidently closed in $\ell_2$, so it suffices to show that it is totally bounded; it is then compact by Theorem 7.3.13(3). To do that, construct an $\varepsilon$-net of $A$ for any given $\varepsilon > 0$: Begin with the well-known series
$$\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}$$
(see (1)).
For every $\varepsilon > 0$, there is an $n(\varepsilon) \in \mathbb{N}$ satisfying
$$\sqrt{\sum_{k=n(\varepsilon)+1}^{\infty} \frac{1}{k^2}} < \frac{\varepsilon}{2}.$$
For subsets of a totally bounded set are totally bounded, and closed subsets of a complete metric space are also complete.
Another consequence is an alternative proof that a subset $K \subset \mathbb{R}^n$ is compact if and only if it is closed and bounded.
Notice also that while the conditions (1) and (3) are given in terms of the metric, the equivalent condition (2) is purely topological.
7.3.14. Continuous functions. We revisit the questions related to continuity of mappings between metric spaces. In fact, many ideas understood for the functions of one real variable generalize naturally.
In particular, every continuous function $f : X \to \mathbb{R}$ on a compact set $X$ is bounded and achieves its maximum and minimum. Indeed, consider the open intervals $U_n = (n - 1, n + 1) \subset \mathbb{R}$, $n \in \mathbb{Z}$, covering $\mathbb{R}$. Then their preimages $f^{-1}(U_n)$ cover $X$, so that there is a finite number of them covering $X$ as well. Thus $f$ is bounded and the supremum and infimum of its values exist. Consider sequences $f(x_n)$ and $f(y_n)$ converging to the supremum and infimum, respectively. Then there must be convergent subsequences of the points $x_n$ and $y_n$ in $X$ and their limits $x$ and $y$ are in $X$ too. But then $f(x)$ and $f(y)$ are the supremum and infimum of the values of $f$ since $f$ is continuous and thus respects convergence.
It is also worth noticing the difference between the "purely topological" concepts, such as continuity (which can be defined merely by means of open sets), and the following stronger concepts, which are "metric" properties.
Uniformly continuous mappings
A mapping $f : X \to Y$ between metric spaces is called uniformly continuous if for each $\varepsilon > 0$ there is a $\delta > 0$ such that $d_Y(f(x), f(y)) < \varepsilon$ for all $x, y \in X$ with $d_X(x, y) < \delta$.
Notice that this requirement on the uniform continuity of $f$ is equivalent to the condition that for each pair of sequences $x_k$ and $y_k$ in $X$, $d_X(x_k, y_k) \to 0$ implies $d_Y(f(x_k), f(y_k)) \to 0$.
This observation leads to the following generalization of the behavior of real functions:
Lemma. Each continuous mapping f : X —> Y on a compact metric space X is uniformly continuous.
Proof. Assume $f$ is a continuous function. Consider any two sequences $x_k$ and $y_k$ with $d_X(x_k, y_k) \to 0$.
Since $X$ is compact, there is a subsequence of $x_k$ converging to some point $x \in X$ and so we may assume $x_k \to x$, without loss of generality. Now, $d_X(x, y_k) \le d_X(x, x_k) + d_X(x_k, y_k) \to 0$ and so $\lim_{k \to \infty} y_k = x$, too.
Next, notice that the metric $d_Y : Y \times Y \to \mathbb{R}$ is always a continuous function (cf. the problem 7.E.1 in the other column). But then the continuity of $f$ ensures
$$\lim_{k \to \infty} d_Y(f(x_k), f(y_k)) = d_Y(f(x), f(x)) = 0.$$
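From each of the intervals $[-1/n, 1/n]$ for $n \in \{1, \dots, n(\varepsilon)\}$, choose finitely many points $x_1^n, \dots, x_{m(n)}^n$ so that for any $x \in [-1/n, 1/n]$,
$$\min_{j \in \{1, \dots, m(n)\}} |x - x_j^n| < \frac{\varepsilon}{\sqrt{5\, n(\varepsilon)}}.$$
Consider such sequences $\{y_n\}_{n \in \mathbb{N}}$ from $\ell_2$ whose terms with indices $n > n(\varepsilon)$ are zero, and at the same time,
$$y_1 \in \{x_1^1, \dots, x_{m(1)}^1\}, \ \dots, \ y_{n(\varepsilon)} \in \{x_1^{n(\varepsilon)}, \dots, x_{m(n(\varepsilon))}^{n(\varepsilon)}\}.$$
There are only finitely many such sequences and they create the desired $\varepsilon$-net for $A$: let $\{x_n\} \in A$ be arbitrary. According to our choice of the sequences $\{y_n\}$, there is $\{y_n\}$ such that
$$d(\{x_n\}, \{y_n\}) = \sqrt{\sum_{k=1}^{\infty} (x_k - y_k)^2} \le \sqrt{\sum_{k=1}^{n(\varepsilon)} (x_k - y_k)^2} + \sqrt{\sum_{k=n(\varepsilon)+1}^{\infty} \frac{1}{k^2}} < \sqrt{n(\varepsilon) \frac{\varepsilon^2}{5\, n(\varepsilon)}} + \frac{\varepsilon}{2} \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon,$$
where the tail is estimated by the choice of $n(\varepsilon)$.
Since $\varepsilon > 0$ is arbitrary, the set $A$ is totally bounded, which implies compactness.
The closure of the set $B$ is
$$\overline{B} = \bigl\{\{x_n\}_{n \in \mathbb{N}} \in \ell_\infty;\ |x_n| \le \tfrac{1}{n},\ n \in \mathbb{N}\bigr\}.$$
Hence $B$ is not closed, and so it is not compact. The set $\overline{B}$ is compact. The proof of this fact is much simpler than for the set $A$, thus we leave it as an exercise for the reader. □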
7.G.5. Prove that on each metric space $X$, the given metric $d$ is a continuous function $X \times X \to \mathbb{R}$. ○
7.G.6. Show that if $F$ is a continuous mapping on a compact metric space $X$, then the inequality
$$d(F(x), F(y)) < d(x, y) \quad \text{for all } x \neq y$$
implies the existence of a fixed point.
Solution. The infimum $\alpha$ of the values of the continuous function $d(x, F(x))$ must be achieved in a point $x_0 \in X$ (see 7.3.12 for the concepts and main results and use the previous result 7.G.5). Since distances are non-negative, $\alpha \ge 0$. If $\alpha \neq 0$, then $F(x_0) \neq x_0$ and
$$d(F(x_0), F(F(x_0))) < d(x_0, F(x_0)) = \alpha,$$
which is a contradiction with the minimality of $\alpha$. Hence $\alpha = 0$ and $F(x_0) = x_0$. □
By the latter observation, this is equivalent to the uniform continuity of /. □
A very useful variation on the theme of continuity is the following definition.
Lipschitz continuity
A function $f : X \to Y$ between metric spaces is called Lipschitz continuous if there is a constant $C > 0$ such that for all points $x, y \in X$
$$d_Y(f(x), f(y)) \le C\, d_X(x, y).$$
Every Lipschitz continuous function is uniformly continuous.
7.3.15. Arzela-Ascoli Theorem. To conclude, we consider some spaces of functions. These provide examples of how much they may differ from the usual Euclidean spaces. First, we introduce some terminology. Basically we want to deal with functions, which are all uniformly continuous in the very same way:
Equicontinuous functions
Consider a space $M$ of mappings $f : X \to Y$ between metric spaces. We say that the functions in $M$ are equicontinuous if for each $\varepsilon > 0$ there is a $\delta > 0$ such that
$$d_Y(f(x), f(y)) < \varepsilon \quad \text{for all } x, y \in X \text{ with } d_X(x, y) < \delta,$$
for all functions $f \in M$.
Consider the metric space $C(X)$ of all continuous (real or complex) functions on a compact metric space $X$, with its $\|\cdot\|_\infty$ norm. This means that the distance between two functions $f$, $g$ is the maximum of the distance between their values $f(x)$ and $g(x)$ for $x \in X$.
We say that a set $M \subset C(X)$ of functions is uniformly bounded if there is a constant $K \in \mathbb{R}$ such that $|f(x)| \le K$ for all functions $f \in M$ and points $x \in X$. Of course, bounded sets $M$ of functions in $C(X)$ are always uniformly bounded, by the definition of the norm.
Theorem. Consider a compact metric space $X$. A set $M \subset C(X)$ in the space of continuous functions with the supremum norm $\|\cdot\|_\infty$ is compact if and only if it is bounded, closed, and equicontinuous.$^5$
Proof. Suppose $M$ is compact. Then $M$ is totally bounded (and thus also uniformly bounded, as noticed above). Since every compact subset is closed, it remains to verify the equicontinuity.
A weaker version providing a sufficient condition was first published by Ascoli in 1883, the complete exposition was given by Arzela in 1895. Again, there are much more general versions of this theorem in the realm of topological spaces.
Given $\varepsilon > 0$, consider the corresponding $\varepsilon$-net $(f_1, f_2, \dots, f_k) \subset M$ from the definition of the total boundedness of $M$. Recall that all the functions $f_i$ are uniformly continuous (as continuous functions on a compact set). Thus there is a $\delta_i$ for each $f_i$, such that $d_X(x, y) < \delta_i$ implies
$$|f_i(x) - f_i(y)| < \varepsilon.$$
Of course, we take $\delta$ to be the minimum of the finitely many $\delta_i$, $i = 1, \dots, k$. Then the same inequality holds for all $f_i$ in the $\varepsilon$-net. But now, considering an arbitrary function $f \in M$, there is a function $f_j$ in our $\varepsilon$-net with $\|f - f_j\|_\infty < \varepsilon$, and so $d_X(x, y) < \delta$ implies that $|f(x) - f(y)|$ is at most
$$|f(x) - f_j(x)| + |f_j(x) - f_j(y)| + |f_j(y) - f(y)| < 3\varepsilon,$$
and the equicontinuity has been proved.
Conversely, suppose that $M$ is a bounded, closed, and equicontinuous subset of $C(X)$, with $X$ a compact metric space. First we show that $M$ is complete. This was shown in the case when $X$ is a closed bounded interval of reals in the problem 7.F.4. Exploit the equicontinuity to see that the limit function $f$ is again continuous. The same argument works in general.
Thus, we need to find a Cauchy (sub)sequence within any sequence of functions $f_n \in M$.
The compact space $X$ itself is totally bounded and therefore it contains a countable dense set $A \subset X$ (we may take the points in all $1/k$-nets for $k \in \mathbb{N}$). Write $A = \{a_1, a_2, \dots\}$ as a sequence.
Choose a subsequence of functions $f_{1j}$, $j = 1, 2, \dots$, within the functions $f_n$, so that the sequence of values $f_{1j}(a_1)$ converges. (This is possible, since the set $M$ is bounded in the $\|\cdot\|_\infty$ norm.) Similarly, the subsequence $f_{2j}$ can be chosen from $f_{1k}$, so that $f_{2j}(a_2)$ converges. In general, the $m$-th subsequence $f_{mj}$ is chosen from $f_{(m-1)k}$ and has the values $f_{mj}(a_m)$ converging (and by our construction, it converges at all $a_i$, $i < m$, too).
As a result, we can choose the sequence of functions $g_k = f_{kk}$ for all positive integers $k$, with the hope that this is a Cauchy sequence. This is where the equicontinuity helps.
Start with any $\varepsilon > 0$ and find $\delta_\varepsilon > 0$ such that $|f(x) - f(y)| < \varepsilon$ for all $f \in M$ whenever the arguments $x$ and $y$ are closer than $\delta_\varepsilon$. Let $A_\varepsilon \subset A$ be a finite subset forming a $\delta_\varepsilon$-net. Since this set is finite and the values $g_k(a)$ converge for each $a \in A$, there must be an $n \in \mathbb{N}$ such that for all $i, j > n$ and all $a \in A_\varepsilon$, we know $|g_i(a) - g_j(a)| < \varepsilon$. But then, for every $x \in X$, there is some $a \in A_\varepsilon$ with $d_X(x, a) < \delta_\varepsilon$ and so $|g_i(x) - g_j(x)|$ can be at most
$$|g_i(x) - g_i(a)| + |g_i(a) - g_j(a)| + |g_j(a) - g_j(x)| < 3\varepsilon.$$
Thus, the sequence $g_k$ is a Cauchy sequence in $C(X)$, and so $M$ is compact. □
H. Additional exercises to the whole chapter
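7.H.1. Expand the function $\sin^2(x)$ on the interval $[-\pi, \pi]$ into a Fourier series. ○
7.H.2. Expand the function $\cos^2(x)$ on the interval $[-\pi, \pi]$ into a Fourier series. ○
7.H.3. Sum the two series
$$\sum_{n=1}^{\infty} \frac{1}{n^4}, \qquad \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^4}.$$
Solution. We hint at the procedure by which the series
$$\sum_{n=1}^{\infty} \frac{1}{n^{2k}}, \qquad \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^{2k}}$$
for general $k \in \mathbb{N}$ can be calculated. Use the identities
(1) $\displaystyle \sum_{n=1}^{\infty} \frac{\sin(nx)}{n} = \frac{\pi - x}{2}, \quad x \in (0, 2\pi),$
(2) $\displaystyle x^2 = \frac{4\pi^2}{3} + 4 \sum_{n=1}^{\infty} \frac{\cos(nx)}{n^2} - 4\pi \sum_{n=1}^{\infty} \frac{\sin(nx)}{n}, \quad x \in (0, 2\pi),$
which follow from the constructions of the Fourier series for the functions $g(x) = x$ and $g(x) = x^2$, respectively, on the interval $[0, 2\pi)$. By (1),
$$\sum_{n=1}^{\infty} \frac{\sin(nx)}{n} = \frac{\pi - x}{2}, \quad x \in (0, 2\pi).$$
Substituting into (2) gives
$$\sum_{n=1}^{\infty} \frac{\cos(nx)}{n^2} = \frac{3x^2 - 6\pi x + 2\pi^2}{12}, \quad x \in (0, 2\pi).$$
Since the values of the series
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}, \qquad \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2} = \frac{\pi^2}{12}$$
have been already determined, substitution then proves the validity of this last equation at the marginal points $x = 0$, $x = 2\pi$. The left-hand series is evidently bounded from above by $\sum_{n=1}^{\infty} \frac{1}{n^2}$, thus it converges absolutely and uniformly on $[0, 2\pi]$. Therefore, it can be integrated term by term:
$$\sum_{n=1}^{\infty} \frac{\sin(nx)}{n^3} = \sum_{n=1}^{\infty} \Bigl[\frac{\sin(ny)}{n^3}\Bigr]_0^x = \int_0^x \sum_{n=1}^{\infty} \frac{\cos(ny)}{n^2}\, dy = \int_0^x \frac{3y^2 - 6\pi y + 2\pi^2}{12}\, dy = \frac{x^3 - 3\pi x^2 + 2\pi^2 x}{12}, \quad x \in [0, 2\pi].$$
In fact, every Fourier series may be integrated term by term. Further integration gives
$$\sum_{n=1}^{\infty} \frac{1 - \cos(nx)}{n^4} = \sum_{n=1}^{\infty} \Bigl[-\frac{\cos(ny)}{n^4}\Bigr]_0^x = \int_0^x \sum_{n=1}^{\infty} \frac{\sin(ny)}{n^3}\, dy = \int_0^x \frac{y^3 - 3\pi y^2 + 2\pi^2 y}{12}\, dy = \frac{x^4 - 4\pi x^3 + 4\pi^2 x^2}{48}, \quad x \in [0, 2\pi].$$
Substituting $x = \pi$ leads to
$$\sum_{n=1}^{\infty} \frac{1 + (-1)^{n+1}}{n^4} = \sum_{n=1}^{\infty} \frac{1 - \cos(n\pi)}{n^4} = \frac{\pi^4}{48}.$$
Since the numerator on the left-hand side is zero for even numbers $n$ and is 2 for odd numbers $n$, the series can be written
(3) $\displaystyle \sum_{n=1}^{\infty} \frac{2}{(2n - 1)^4} = \frac{\pi^4}{48}.$
From the expression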
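$$\sum_{n=1}^{\infty} \frac{1}{n^4} = \sum_{n=1}^{\infty} \frac{1}{(2n)^4} + \sum_{n=1}^{\infty} \frac{1}{(2n - 1)^4} = \frac{1}{16} \sum_{n=1}^{\infty} \frac{1}{n^4} + \sum_{n=1}^{\infty} \frac{1}{(2n - 1)^4},$$
it follows that
$$\sum_{n=1}^{\infty} \frac{1}{n^4} = \frac{16}{15} \sum_{n=1}^{\infty} \frac{1}{(2n - 1)^4} = \frac{16}{15} \cdot \frac{1}{2} \cdot \frac{\pi^4}{48} = \frac{\pi^4}{90},$$
thereby having summed up the first series. As for the second one,
$$\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^4} = \sum_{n=1}^{\infty} \frac{1}{(2n - 1)^4} - \sum_{n=1}^{\infty} \frac{1}{(2n)^4} = \sum_{n=1}^{\infty} \frac{1}{(2n - 1)^4} - \frac{1}{16} \sum_{n=1}^{\infty} \frac{1}{n^4} = \frac{1}{2} \cdot \frac{\pi^4}{48} - \frac{1}{16} \cdot \frac{\pi^4}{90} = \frac{7\pi^4}{720}.$$
One can proceed similarly to sum the series
$$\sum_{n=1}^{\infty} \frac{1}{n^{2k}}, \qquad \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^{2k}}$$
for other $k \in \mathbb{N}$.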
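The three closed forms obtained above can be compared with partial sums. This is only a numerical sanity check in Python; the helper `partial_sum` and the number of terms are our own choices.

```python
import math

def partial_sum(term, terms=10000):
    """Sum term(n) for n = 1, ..., terms."""
    return sum(term(n) for n in range(1, terms + 1))

all_terms = partial_sum(lambda n: 1.0 / n**4)
odd_terms = partial_sum(lambda n: 1.0 / (2 * n - 1)**4)
alternating = partial_sum(lambda n: (-1.0)**(n + 1) / n**4)

# each pair should agree to many decimal places
print(all_terms, math.pi**4 / 90)
print(odd_terms, math.pi**4 / 96)
print(alternating, 7 * math.pi**4 / 720)
```

The middle line corresponds to formula (3) divided by two.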
It is natural to ask for the value of the series $\sum_{n=1}^{\infty} \frac{1}{n^3}$. This problem has been tackled by mathematicians for centuries without success. The reader may justifiably be surprised by this, since the procedure above seems to be applicable to the odd powers as well.
For instance, one can start with the identity
$$\sum_{n=1}^{\infty} \frac{\cos(nx)}{n} = -\ln\Bigl(2 \sin \frac{x}{2}\Bigr), \quad x \in (0, 2\pi),$$
which, by the way, can be proved by expanding the right-hand function into a Fourier series. If, similarly to the above, we integrate the left-hand series term by term twice and take the limit $x \to 0^+$, we get the series $\sum_{n=1}^{\infty} \frac{1}{n^3}$. Thus, it should suffice to integrate the right-hand function twice and calculate one limit. However, the integration of the right-hand side leads to a non-elementary integral. That is, the antiderivative cannot be expressed in terms of the elementary functions.$^5$
□
7.H.4. In problem 7.45 there occurs the following integral:
00 —00
There, the integral was evaluated by converting the complex exponential to real trigonometric functions. Evaluate it by converting the real function $\sin(x)$ to its complex exponential form. ○
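7.H.5. Determine the convolution of the functions $f_1$ and $f_2$, where
$$f_1(x) = \begin{cases} 1 & \text{for } x \in [-1, 0], \\ 0 & \text{otherwise,} \end{cases} \qquad f_2(x) = \begin{cases} x & \text{for } x \in [0, 1], \\ 0 & \text{otherwise.} \end{cases}$$
○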
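A hand computation of this convolution can be checked numerically. The following Python sketch approximates $(f_1 * f_2)(t) = \int f_1(x) f_2(t - x)\, dx$ by the midpoint rule; the integration window $[-2, 2]$ and the step count are assumptions chosen only for this illustration.

```python
def f1(x):
    """Indicator function of the interval [-1, 0]."""
    return 1.0 if -1.0 <= x <= 0.0 else 0.0

def f2(x):
    """The function x on [0, 1], zero elsewhere."""
    return x if 0.0 <= x <= 1.0 else 0.0

def convolution(t, steps=2000, lo=-2.0, hi=2.0):
    """Midpoint-rule approximation of the integral of f1(x) * f2(t - x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += f1(x) * f2(t - x)
    return total * h

for t in (-1.5, -0.5, 0.0, 0.5, 1.5):
    print(t, convolution(t))
```

The printed values vanish outside $[-1, 1]$ and peak at $t = 0$, which any closed-form answer has to reproduce.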
7.H.6. Determine the function $f$ whose Fourier transform is the function
$$\tilde{f}(\omega) = \frac{1}{\sqrt{2\pi}}\, \frac{\sin \omega}{\omega}.$$
Solution. We might have noticed that the sinc function appeared as the image of the characteristic function $h_\Omega$ of the interval $(-\Omega, \Omega)$ in one of the previous problems:
$$\mathcal{F}(h_\Omega)(\omega) = \frac{2\Omega}{\sqrt{2\pi}} \operatorname{sinc}(\omega \Omega).$$
The function $\zeta(p) = \sum_{n=1}^{\infty} \frac{1}{n^p}$ is called the Riemann zeta function.
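In this case $\Omega = 1$ and the function $f$ is one half of $h_1$.
The result can be computed directly. With $\tilde{f}(\omega) = \frac{1}{\sqrt{2\pi}} \frac{\sin \omega}{\omega}$, the inverse Fourier transform is
$$f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\sin \omega}{\omega} e^{i\omega t}\, d\omega = \frac{1}{2\pi} \Bigl(\int_{-\infty}^{0} \frac{\sin \omega}{\omega} e^{i\omega t}\, d\omega + \int_{0}^{\infty} \frac{\sin \omega}{\omega} e^{i\omega t}\, d\omega\Bigr).$$
Substitute $-\omega$ for $\omega$ in the first integral, to obtain
$$f(t) = \frac{1}{2\pi} \Bigl(\int_{0}^{\infty} \frac{\sin \omega}{\omega} e^{-i\omega t}\, d\omega + \int_{0}^{\infty} \frac{\sin \omega}{\omega} e^{i\omega t}\, d\omega\Bigr) = \frac{1}{2\pi} \int_{0}^{\infty} \frac{\sin \omega}{\omega} \bigl(e^{i\omega t} + e^{-i\omega t}\bigr)\, d\omega = \frac{1}{\pi} \int_{0}^{\infty} \frac{\sin \omega}{\omega} \cos(\omega t)\, d\omega.$$
Continue via trigonometric identities to obtain
$$f(t) = \frac{1}{2\pi} \Bigl(\int_{0}^{\infty} \frac{\sin(\omega(1 + t))}{\omega}\, d\omega + \int_{0}^{\infty} \frac{\sin(\omega(1 - t))}{\omega}\, d\omega\Bigr).$$
The substitutions $u = \omega(1 + t)$, $v = \omega(1 - t)$ then give
$$f(t) = \begin{cases} \dfrac{1}{2\pi} \Bigl(\displaystyle\int_{0}^{\infty} \frac{\sin u}{u}\, du - \int_{0}^{\infty} \frac{\sin v}{v}\, dv\Bigr) = 0, & t > 1, \\[2ex] \dfrac{1}{2\pi} \Bigl(\displaystyle\int_{0}^{\infty} \frac{\sin u}{u}\, du + \int_{0}^{\infty} \frac{\sin v}{v}\, dv\Bigr) = \dfrac{1}{\pi} \displaystyle\int_{0}^{\infty} \frac{\sin u}{u}\, du, & |t| < 1, \\[2ex] \dfrac{1}{2\pi} \Bigl(-\displaystyle\int_{0}^{\infty} \frac{\sin u}{u}\, du + \int_{0}^{\infty} \frac{\sin v}{v}\, dv\Bigr) = 0, & t < -1. \end{cases}$$
Thus the function $f$ is zero for $|t| > 1$ and constant (necessarily non-zero) for $|t| < 1$. (Throughout, we assume that the inverse Fourier transform exists.)
The constant is $f(t) = 1/2$ for $|t| < 1$, from the standard result
$$\int_{0}^{\infty} \frac{\sin u}{u}\, du = \frac{\pi}{2}.$$
Alternatively, we can 'guess' that the constant is one, i.e.
$$g(t) = 1, \ |t| < 1; \qquad g(t) = 0, \ |t| > 1,$$
and compute
$$\mathcal{F}(g)(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-1}^{1} e^{-i\omega t}\, dt = \frac{1}{\sqrt{2\pi}} \int_{-1}^{1} \cos(\omega t)\, dt = \frac{2}{\sqrt{2\pi}}\, \frac{\sin \omega}{\omega}.$$
So $f(0) = g(0)/2 = 1/2$, which also establishes
$$\int_{0}^{\infty} \frac{\sin u}{u}\, du = \frac{\pi}{2}.$$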
□
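7.H.7. Using the relation
(1) $\displaystyle \mathcal{L}(f')(s) = s\mathcal{L}(f)(s) - \lim_{t \to 0^+} f(t),$
derive the Laplace transforms of both the functions $y = \cos t$ and $y = \sin t$.
Solution. Notice first that from (1), it follows that
$$\mathcal{L}(f'')(s) = s\mathcal{L}(f')(s) - \lim_{t \to 0^+} f'(t) = s\Bigl(s\mathcal{L}(f)(s) - \lim_{t \to 0^+} f(t)\Bigr) - \lim_{t \to 0^+} f'(t) = s^2\mathcal{L}(f)(s) - s \lim_{t \to 0^+} f(t) - \lim_{t \to 0^+} f'(t).$$
Therefore,
$$-\mathcal{L}(\sin t)(s) = \mathcal{L}(-\sin t)(s) = \mathcal{L}((\sin t)'')(s) = s^2\mathcal{L}(\sin t)(s) - s \lim_{t \to 0^+} \sin t - \lim_{t \to 0^+} \cos t = s^2\mathcal{L}(\sin t)(s) - 1,$$
whence we get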
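$$-\mathcal{L}(\sin t)(s) = s^2\mathcal{L}(\sin t)(s) - 1, \quad \text{i.e.} \quad \mathcal{L}(\sin t)(s) = \frac{1}{s^2 + 1}.$$
Now, invoking (1), we determine
$$\mathcal{L}(\cos t)(s) = \mathcal{L}((\sin t)')(s) = s \cdot \frac{1}{s^2 + 1} - \lim_{t \to 0^+} \sin t = \frac{s}{s^2 + 1}.$$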
□
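Both transforms can be cross-checked by approximating the defining integral $\int_0^\infty f(t) e^{-st}\, dt$ numerically. The Python sketch below truncates the integral at an upper limit and uses the midpoint rule; the cut-off, the step count and the function name `laplace` are assumptions for illustration, and the check is meaningful only for $s$ well above zero.

```python
import math

def laplace(f, s, upper=40.0, steps=200000):
    """Midpoint-rule approximation of the integral of f(t) * exp(-s*t) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += f(t) * math.exp(-s * t)
    return total * h

for s in (1.0, 2.0):
    print(s, laplace(math.sin, s), 1.0 / (s**2 + 1.0))
    print(s, laplace(math.cos, s), s / (s**2 + 1.0))
```

Each printed pair should agree to several decimal places.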
7.H.8. Using the discussion from the previous problem, prove that for a continuous function $y$ with sufficiently many higher order derivatives
$$\mathcal{L}(y^{(n)})(s) = s^n \mathcal{L}(y)(s) - \sum_{i=1}^{n} s^{n-i} y^{(i-1)}(0).$$
Solution. Clearly
$$\mathcal{L}(y')(s) = s\mathcal{L}(y)(s) - y(0),$$
$$\mathcal{L}(y'')(s) = s^2\mathcal{L}(y)(s) - s\,y(0) - y'(0),$$
and the claim is verified by induction. □
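7.H.9. Find the Laplace transform of Heaviside's function $H(t)$ and, for real $a$, of the shifted Heaviside's function $H_a(t) = H(t - a)$, where
$$H(t) = \begin{cases} 0 & \text{for } t < 0, \\ \tfrac{1}{2} & \text{for } t = 0, \\ 1 & \text{for } t > 0. \end{cases}$$
Solution. For $s > 0$,
$$\mathcal{L}(H(t))(s) = \int_0^{\infty} H(t) e^{-st}\, dt = \int_0^{\infty} e^{-st}\, dt = \Bigl[-\frac{e^{-st}}{s}\Bigr]_0^{\infty} = -\frac{1}{s}(0 - 1) = \frac{1}{s},$$
and, assuming $a \ge 0$,
$$\mathcal{L}(H_a(t))(s) = \mathcal{L}(H(t - a))(s) = \int_0^{\infty} H(t - a) e^{-st}\, dt = \int_a^{\infty} e^{-st}\, dt = \int_0^{\infty} e^{-s(t + a)}\, dt = e^{-as} \mathcal{L}(H(t))(s) = \frac{e^{-as}}{s}.$$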
□
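7.H.10. Show that for real $a$,
(1) $\displaystyle \mathcal{L}(f(t) \cdot H_a(t))(s) = e^{-as} \mathcal{L}(f(t + a))(s).$
Solution.
$$\mathcal{L}(f(t) H_a(t))(s) = \int_0^{\infty} f(t) H(t - a) e^{-st}\, dt = \int_a^{\infty} f(t) e^{-st}\, dt = \int_0^{\infty} f(t + a) e^{-s(t + a)}\, dt = e^{-as} \int_0^{\infty} f(t + a) e^{-st}\, dt = e^{-as} \mathcal{L}(f(t + a))(s).$$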
□
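7.H.11. Find a function $y(t)$ satisfying the differential equation
$$y''(t) + 4y(t) = \sin 2t$$
and the initial conditions $y(0) = 0$ and $y'(0) = 0$.
Solution. From the example 7.H.8:
$$s^2 \mathcal{L}(y)(s) + 4\mathcal{L}(y)(s) = \mathcal{L}(\sin 2t)(s).$$
Now, by 7.D.1(d),
$$\mathcal{L}(\sin 2t)(s) = \frac{2}{s^2 + 4}.$$
It follows that
$$\mathcal{L}(y)(s) = \frac{2}{(s^2 + 4)^2}.$$
The inverse transform then gives
$$y(t) = \tfrac{1}{8} \sin 2t - \tfrac{1}{4}\, t \cos 2t.$$
□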
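The solution can be verified without any transform by substituting it back into the equation. The Python sketch below does this numerically with a central second difference; the step size `h`, the sample points and the helper names are our own assumptions for illustration.

```python
import math

def y(t):
    """Candidate solution y(t) = sin(2t)/8 - t*cos(2t)/4."""
    return math.sin(2 * t) / 8 - t * math.cos(2 * t) / 4

def residual(t, h=1e-4):
    """Approximate y''(t) + 4*y(t) - sin(2t) using a central second difference."""
    second = (y(t + h) - 2 * y(t) + y(t - h)) / h**2
    return second + 4 * y(t) - math.sin(2 * t)

print([residual(t) for t in (0.3, 1.0, 2.7, 5.0)])   # all close to zero
print(y(0.0), (y(1e-4) - y(-1e-4)) / 2e-4)           # y(0) and an estimate of y'(0)
```

Residuals of the order of the discretization error confirm both the equation and the initial conditions.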
7.H.12. Find a function $y(t)$ satisfying the differential equation and the initial conditions
$$y''(t) + 4y(t) = f(t), \quad y(0) = 0, \quad y'(0) = -1,$$
where $f(t)$ is the piecewise continuous function
$$f(t) = \begin{cases} \cos(2t) & \text{for } 0 \le t < \pi, \\ 0 & \text{for } t \ge \pi. \end{cases}$$
Solution. This problem is a model for the undamped oscillations of a spring (excluding friction and other phenomena like non-linearities in the stiffness of the spring and so on). The motion is driven by an exterior force during the initial period only; the force then ceases.
The function $f(t)$ can be written as a linear combination of Heaviside's function $H(t)$ and its shift. That is, $f(t) = \cos(2t)\,(H(t) - H_\pi(t))$, up to the values at $t = 0$ and $t = \pi$. Since
$$\mathcal{L}(y'')(s) = s^2\mathcal{L}(y) - s\,y(0) - y'(0) = s^2\mathcal{L}(y) + 1,$$
we get, making use of the previous example 7.H.10 for the right-hand side,
$$s^2\mathcal{L}(y) + 1 + 4\mathcal{L}(y) = \mathcal{L}(\cos(2t)(H(t) - H_\pi(t))) = \mathcal{L}(\cos(2t) \cdot H(t)) - \mathcal{L}(\cos(2t) \cdot H_\pi(t)) = \mathcal{L}(\cos(2t)) - e^{-\pi s} \mathcal{L}(\cos(2(t + \pi))) = (1 - e^{-\pi s}) \frac{s}{s^2 + 4}.$$
Hence,
$$\mathcal{L}(y) = -\frac{1}{s^2 + 4} + \frac{s}{(s^2 + 4)^2} - e^{-\pi s} \frac{s}{(s^2 + 4)^2}.$$
The inverse transform then yields the solution in the form
$$y(t) = -\tfrac{1}{2} \sin(2t) + \tfrac{1}{4}\, t \sin(2t) - \mathcal{L}^{-1}\Bigl(e^{-\pi s} \frac{s}{(s^2 + 4)^2}\Bigr).$$
506
CHAPTER 7. CONTINUOUS TOOLS FOR MODELLING
According to (1),
^(e^ 7^4)2)       = K-He-™£(*sin(2i)))
= (i - tt) sin(2(i - 71-)) ■ /^(i). Since the Heaviside function H^t) is zero for t < it and equals 1 for t > it, we get the solution in the form
t=Z sin(2i) for 0 < t < ir
(^=2 - tt) sin(2i) forf>7r.
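A closed-form solution such as the one in 7.H.12 can be cross-checked by integrating the initial value problem numerically. The sketch below is an illustration only: it uses the classical fourth-order Runge-Kutta scheme (a technique swapped in for the check, not the method of the text), and the names `forcing`, `closed_form`, `rk4` as well as the step counts are assumptions.

```python
import math

def forcing(t):
    """Right-hand side f(t) of 7.H.12: cos(2t) on [0, pi), zero afterwards."""
    return math.cos(2 * t) if t < math.pi else 0.0

def closed_form(t):
    """Piecewise solution obtained via the Laplace transform."""
    if t < math.pi:
        return (t - 2) / 4 * math.sin(2 * t)
    return (math.pi - 2) / 4 * math.sin(2 * t)

def rk4(t_end, steps):
    """Integrate y'' + 4y = forcing(t), y(0) = 0, y'(0) = -1 as a first-order system."""
    h = t_end / steps
    t, y, v = 0.0, 0.0, -1.0

    def rhs(t, y, v):
        return v, forcing(t) - 4 * y

    for _ in range(steps):
        k1y, k1v = rhs(t, y, v)
        k2y, k2v = rhs(t + h / 2, y + h / 2 * k1y, v + h / 2 * k1v)
        k3y, k3v = rhs(t + h / 2, y + h / 2 * k2y, v + h / 2 * k2v)
        k4y, k4v = rhs(t + h, y + h * k3y, v + h * k3v)
        y += h / 6 * (k1y + 2 * k2y + 2 * k3y + k4y)
        v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
    return y

before_switch = (rk4(2.0, 20000), closed_form(2.0))   # a time before t = pi
after_switch = (rk4(5.0, 50000), closed_form(5.0))    # a time after t = pi
print(before_switch, after_switch)
```

The numerical and closed-form values should agree on both sides of $t = \pi$; the small remaining difference comes from the step that straddles the jump of the forcing term.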
Key to the exercises
7.A.4. We have just to check the orthogonality of the couples $L_m(x)$, $L_n(x)$ with respect to the inner product $\langle f, g\rangle_\omega = \int_0^\infty f(t)g(t)\,e^{-t}\,dt$. This can be done by integration by parts.
7.A.5. The claim follows from the fact that the power $x^k$ appears for the first time in the polynomial $T_k$ or $L_k$. Thus the linear hulls of the first $k$ functions always coincide.
7A.6. x, — ^ix + sin(x); the projection does not change the function \ sin(x) since it already lies in the space.
7A.7. cos(x), ^ cos(x) + x. The projection is 3ir/(ir4 — 24)(4cos(x) +irx). Notice that this is a very bad approximation.
7.B.1. We have already checked the orthogonality of the cosine terms in the solution to the example 7.A.3. The sine terms are obtained the same way since they are just shifts of cosines by $\pi/2$ in the argument. The mixed couples provide an odd function to be integrated on a symmetric interval around the origin and so the integral also vanishes.
7.B.2. Look at the previous example and use the substitution $y = \omega x$.
7.B.7. Compute exactly as in the exercise 7.B.6 and check the result against the real versions of the same Fourier series, as computed before. The complex coefficients for real functions are always related by complex conjugation. That is, $c_{-n} = \overline{c_n}$, see 7.1.7.
7.B.8. This is again a straightforward computation.
7.C.4.
t-f-+A forte [-2,-1]
l-t+5 forte [-1,1]
^-2t + 2 forte [1,2]
0 otherwise.
7.C.14. It is a good exercise on differentiation and integration by parts. We may differentiate with respect to $t$ inside the integral, and $\frac{\partial}{\partial t} f(t-x)$ can be interpreted as $-\frac{\partial}{\partial x} f(t-x)$.
7.C.16. Another good exercise on differentiation and integration by parts. We may differentiate twice with respect to $t$ inside the integral, and $f''(t-x)$ can be interpreted either as a derivative with respect to $t$ or $x$.
7.D.2. The definition of $\Gamma$ reveals, after the substitution $x = st$,
\[ \mathcal{L}(t^a)(s) = \int_0^\infty e^{-st}\,t^a\,dt = \frac{1}{s^{a+1}}\int_0^\infty e^{-x}x^a\,dx = \frac{\Gamma(a+1)}{s^{a+1}}. \]
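7.D.3. Integrate by parts to obtain
\[ \mathcal{L}(g)(s) = \int_0^\infty t\,e^{-t}e^{-st}\,dt = \int_0^\infty t\,e^{-(s+1)t}\,dt = \lim_{t\to\infty}\left(\frac{-t\,e^{-(s+1)t}}{s+1}\right) - 0 + \int_0^\infty \frac{e^{-(s+1)t}}{s+1}\,dt = -\left(\lim_{t\to\infty}\frac{e^{-(s+1)t}}{(s+1)^2} - \frac{1}{(s+1)^2}\right) = \frac{1}{(s+1)^2}. \]
Differentiating the Laplace transform of a general function $-f$ (i.e., an improper integral) with respect to the parameter $s$ gives
\[ \left(\int_0^\infty -f(t)\,e^{-st}\,dt\right)' = \int_0^\infty -f(t)\left(e^{-st}\right)'\,dt = \int_0^\infty t f(t)\,e^{-st}\,dt. \]
This means that the derivative of the Laplace transform $\mathcal{L}(-f)(s)$ is the Laplace transform of the function $t f(t)$. The Laplace transform of the function $y = \sinh t$ has already been determined as the function $y = \frac{1}{s^2-1}$. Therefore,
\[ \mathcal{L}(h)(s) = \left(-\frac{1}{s^2-1}\right)' = \frac{2s}{(s^2-1)^2}. \]
We could also have determined $\mathcal{L}(g)(s)$ this way.
7.D.4.
\[ \mathcal{L}(\cos\omega t)(s) + i\,\mathcal{L}(\sin\omega t)(s) = \mathcal{L}(e^{i\omega t})(s) = \int_0^\infty e^{-(s-i\omega)t}\,dt = \frac{1}{s-i\omega} - \frac{1}{s-i\omega}\lim_{t\to\infty}\frac{e^{i\omega t}}{e^{st}} = \frac{1}{s-i\omega} = \frac{s+i\omega}{(s-i\omega)(s+i\omega)} = \frac{s}{s^2+\omega^2} + i\,\frac{\omega}{s^2+\omega^2}. \]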
7.G.5. Recall the definition of the product metric on the Cartesian product $X \times X$, where the distance is given as the maximum of the distances of the components. The claim follows directly from the triangle inequality and the topological definition of continuity.
7.H.1. $\tfrac12 - \tfrac12\cos(2x)$.
7.H.2. $\tfrac12 + \tfrac12\cos(2x)$.
7.H.5. Determine the convolution of the functions $f_1$ and $f_2$, where
fori 6 [-1,0] for t e [0,1]
otherwise.
CHAPTER 8
Calculus with more variables
one variable is not enough?
- never mind, just recall vectors!
A. Multivariate functions
We start this chapter with a couple of easy examples to get a first grasp of multivariate functions.
8.A.1. Solve the system of inequalities. Mark the resulting area in the plane.
x2 + y2 < 4
a)
b)
c)
V >
y < arctanx
y < 1
,2 + (y-l)2 >4 y + x2 - 2x > 0 y>0
At the beginning of our journey through the mathematical landscape, we saw that vectors can be manipulated nearly as easily as scalars. Now, we return to situations where the relations between objects are expressed with the help of more (yet still finitely many) parameters. This is really necessary when modeling processes or objects in practice, where functions R —> R of one variable are seldom adequate. At least, functions dependent on finitely many parameters are necessary, and the dependence of the change of the results on the parameters is often more important than the result itself. There is little need for brand new ideas. Many problems we encounter can be reduced to ones we can solve already. We return to the discussion of situations when the values of functions are described in terms of instantaneous changes. That is, we consider ordinary differential equations. In the next chapter, we consider partial differential equations and provide a gentle introduction to variational problems.
1. Functions and mappings on $\mathbb{R}^n$
8.1.1. The world of functions. In the sequel, the main objects are mappings between Euclidean spaces, $f : \mathbb{R}^n \to \mathbb{R}^m$. We have seen many such examples already. The complex valued real functions correspond to $n = 1$, $m = 2$, while the power series converging inside a circle in the complex plane provide examples of $f : \mathbb{R}^2 \to \mathbb{R}^2$. We have also dealt with vector valued real functions, representing parametrized curves $c : \mathbb{R} \to \mathbb{R}^n$ (see e.g. the paragraphs on curvatures and Frenet frames in 6.1.15 on page 388).
In linear algebra and geometry, we saw the linear and affine maps $f : \mathbb{R}^n \to \mathbb{R}^m$ defined with the help of matrices $A \in \operatorname{Mat}_{m,n}(\mathbb{R})$ and constant vectors $y \in \mathbb{R}^m$:
\[ \mathbb{R}^n \ni x \mapsto y + Ax \in \mathbb{R}^m. \]
In coordinates, the value is given by the expression $\sum_j a_{ij}x_j + y_i$, where $A = (a_{ij})$ and $y = (y_i)$.
Finally, the quadratic forms were mappings $\mathbb{R}^n \to \mathbb{R}$ given by symmetric matrices $S = (s_{ij})$ and the formula
\[ \mathbb{R}^n \ni x \mapsto x^T S x \in \mathbb{R}. \]
Solution. Whenever you have to solve an inequality of the form $f(x, y) > 0$ (where $f : \mathbb{R}^2 \to \mathbb{R}$ is a continuous function of two variables, but the same method is valid for inequalities with more variables), you just consider the border curve $f(x, y) = 0$. This curve divides the plane into some areas. Then either all points of such an area satisfy the inequality, or none of them does. If we have a system of inequalities, we solve each inequality separately and then intersect the results.
In our cases we get
8.A.2. Determine the domain of the function $\mathbb{R}^2 \to \mathbb{R}$ given by the formula
a) $\dfrac{xy}{y(x^3 + x^2 + x + 1)}$,
b) $\ln(x^2 - y^2)$,
c) $\ln(-x^2 - y^2)$,
d) $\arcsin(2\chi_{\mathbb{Q}}(x))$, where $\chi_{\mathbb{Q}}$ denotes the indicator function of the rational numbers,
e) $f(x, y, z) = \sqrt{\ln x \cdot \arcsin(y^2 z)}$.
Solution. a) The formula correctly expresses a value iff the denominator of the fraction is non-zero. Since $x^3 + x^2 + x + 1 = (x+1)(x^2+1)$, this means $y \ne 0$ and $x \ne -1$. Therefore, the formula defines a function on the set $\mathbb{R}^2 \setminus \{[x, 0], [-1, y];\ x, y \in \mathbb{R}\}$.
b) The formula is correct iff the argument of the logarithm is positive, i.e., $|x| > |y|$. Therefore, the domain of this function is $\{(x, y) \in \mathbb{R}^2;\ |x| > |y|\}$. You can see the graph of this function in the picture.
In general, all such mappings $f : \mathbb{R}^n \to \mathbb{R}^m$ are composed of $m$ component functions $f_i : \mathbb{R}^n \to \mathbb{R}$. So we start with this case.
8.1.2. Multivariate functions. We can stress the dependence on the variables $x_1, \dots, x_n$ by writing the functions as
\[ f(x_1, x_2, \dots, x_n) : \mathbb{R}^n \to \mathbb{R}. \]
The goal is to extend the methods for monitoring the values of functions and their changes to this situation.
We speak about functions of more variables or, more compactly, multivariate functions.
We often work with the cases $n = 2$ or $n = 3$ so that the concepts being introduced are easier to understand. In these cases, letters like $x, y, z$ are used instead of numbered variables. This means that a function $f$ defined in the "plane" $\mathbb{R}^2$ is denoted
\[ \mathbb{R}^2 \ni (x, y) \mapsto f(x, y) \in \mathbb{R}, \]
and, similarly, in the "space" $\mathbb{R}^3$
\[ \mathbb{R}^3 \ni (x, y, z) \mapsto f(x, y, z) \in \mathbb{R}. \]
Just as in the case of univariate functions, the domain $A \subset \mathbb{R}^n$ on which the function in question is defined needs to be considered. When examining a function given by a concrete formula, the first task is often to find the largest domain on which the formula makes sense.
It is also useful to consider the graph of a multivariate function, i.e., the subset $G_f \subset \mathbb{R}^n \times \mathbb{R} = \mathbb{R}^{n+1}$, defined by
\[ G_f = \{(x_1, \dots, x_n, f(x_1, \dots, x_n));\ (x_1, \dots, x_n) \in A\}, \]
where $A$ is the domain of $f$. For instance, the graph of the function defined in the plane by the formula
\[ f(x, y) = \frac{x + y}{x^2 + y^2} \]
is quite a smooth surface, see the illustration below. The maximal domain of this function consists of all the points of the plane except for the origin $(0, 0)$.
When defining the function, and especially when drawing its graph, fixed coordinates are used in the plane. Fixing the value of either of the coordinates leaves only one variable. Fixing the value of $x$, for example, gives the mapping
\[ \mathbb{R} \to \mathbb{R}^3, \qquad y \mapsto (x, y, f(x, y)), \]
i.e., a curve in the space $\mathbb{R}^3$. Curves are vector functions of one variable, already worked with in chapter six (see 6.1.13). The images of the curves for some fixed values of the coordinates $x$ and $y$ are depicted by lines in the illustration.
c) This formula is again a composition of a logarithm and a polynomial of two variables. However, the polynomial $-x^2 - y^2$ takes on only non-positive real values, where the logarithm is undefined (as a function $\mathbb{R} \to \mathbb{R}$). Hence the domain is empty.
d) This formula correctly defines a value iff the argument of the arc sine lies in the interval $[-1, 1]$, which is violated by exactly those pairs $[x, y] \in \mathbb{R}^2$ whose first component is rational. The formula thus defines a function on the set
\[ \{[x, y];\ x \in \mathbb{R} \setminus \mathbb{Q}\}. \]
e) The argument of the square root must be non-negative, that is, either the values of the logarithm and of the arc sine are both non-negative, or both are non-positive. Thus we get that the domain is the set
\[ \left\{[x, y, z] \in \mathbb{R}^3;\ \left(x \ge 1 \wedge y \ne 0 \wedge 0 \le z \le \tfrac{1}{y^2}\right) \vee \left(x \in (0, 1) \wedge y \ne 0 \wedge -\tfrac{1}{y^2} \le z \le 0\right) \vee (x > 0 \wedge y = 0)\right\}. \]
□
In the following examples k([x,y];r) means a circle with the center [x, y] and the radius r.
8.A.3. Determine the domain of the function / and mark the resulting area in the plane:
i) f(x,y) = y/(x2+y2-l)(4-
y
ii) f(x,y) = Vl-x2 + x/l - y2,
iü) f(x,v) = y/g±äEfr, iv) f(x,y) = arcsin f - ,
V
v) f(x,y) = y/T^
vi) f(x,y,z) = ,Jl-£-y^--§.
Solution. a) It has to hold ($x^2 + y^2 - 1 \ge 0$ and $4 - x^2 - y^2 \ge 0$) or ($x^2 + y^2 - 1 \le 0$ and $4 - x^2 - y^2 \le 0$), that is, ($x^2 + y^2 \ge 1$ and $x^2 + y^2 \le 4$) or ($x^2 + y^2 \le 1$ and $x^2 + y^2 \ge 4$). The second alternative cannot occur, so the domain is the annulus between the circles $k([0,0]; 1)$ and $k([0,0]; 2)$.
b) It is a square with the center $[0,0]$ and vertices $[\pm 1, \pm 1]$.
c) The area between the circles $k([\tfrac12, 0]; \tfrac12)$ and $k([1, 0]; 1)$; the smaller circle belongs to the area, the bigger one does not.
d) The area between the lines $y = x$ and $y = -x$ (without these lines).
e) The ellipse (together with its interior) with the center $[0,0]$, with the major axis lying on the $x$-axis with the major radius $a = 1$, and the minor axis on the $y$-axis with the minor radius $b = \tfrac12$.
f) The ellipsoid (with its interior) with the center $[0, 0, 0]$ and semiaxes lying on the $x$, $y$, $z$ axes respectively, with radii $a$, $b$, and $c$. □
8.1.3. Euclidean spaces. In the case of functions of one variable, the entire differential and integral calculus is based on the concepts of convergence, open neighbourhoods, continuity, and so on. In the last part of chapter seven, these concepts were generalized for the metric spaces, rather than only for the Euclidean spaces $\mathbb{R}^n$. Before proceeding it is appropriate to revise these ideas, and do further reading if necessary. We present a brief summary:
The Euclidean space $E_n$ is perceived as a set of points in $\mathbb{R}^n$ without any choice of coordinates, and its modelling vector space $\mathbb{R}^n$ is considered to be the vector space of all increments that can be added to the points of the space $E_n$.
Moreover, the standard scalar product
\[ u \cdot v = \sum_{i=1}^n x_i y_i \]
is selected on $\mathbb{R}^n$, where $u = (x_1, \dots, x_n)$ and $v = (y_1, \dots, y_n)$ are arbitrary vectors. This gives a metric on $E_n$, i.e. a function describing the distance $\|P - Q\|$ between pairs of points $P$, $Q$ by the formula
\[ \|P - Q\|^2 = \|u\|^2 = u \cdot u, \]
where $u$ is the vector which yields the point $P$ when added to the point $Q$. In the plane $E_2$, for instance, the distance between the points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ is given by
\[ \|P_1 - P_2\|^2 = (x_1 - x_2)^2 + (y_1 - y_2)^2. \]
A metric defined in this manner satisfies the triangle inequality for every triple of points $P$, $Q$, $R$:
\[ \|P - R\| = \|(P - Q) + (Q - R)\| \le \|P - Q\| + \|Q - R\|. \]
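The Euclidean distance of paragraph 8.1.3, built from the standard scalar product, and its triangle inequality can be illustrated on sample points. This is only a sketch: the helper names `dot` and `dist`, the dimension 3, and the randomly generated points are assumptions made for the illustration, and a finite sample of course proves nothing in general.

```python
import math
import random

def dot(u, v):
    """Standard scalar product u . v = sum of x_i * y_i."""
    return sum(x * y for x, y in zip(u, v))

def dist(p, q):
    """Distance ||P - Q||, where u = P - Q is the vector taking Q to P."""
    u = [a - b for a, b in zip(p, q)]
    return math.sqrt(dot(u, u))

random.seed(0)  # fixed seed so that the run is reproducible
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(30)]

# Check ||P - R|| <= ||P - Q|| + ||Q - R|| on all sampled triples
# (a tiny tolerance absorbs floating point rounding).
triangle_ok = all(
    dist(p, r) <= dist(p, q) + dist(q, r) + 1e-12
    for p in points for q in points for r in points
)

print(dist([0.0, 0.0], [3.0, 4.0]), triangle_ok)
```

The planar formula from the text is recovered as the special case of two coordinates.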
B. The topology of En
In the previous chapter, we defined general metric spaces and studied especially metric spaces consisting of sets of functions. As we have already seen there, many metrics can be defined on the space $\mathbb{R}^n$ (or on its subsets). For instance, considering a map of a state as a subset of $\mathbb{R}^2$, the distance of two points may be defined as the time necessary to get from one of the points to the other by public transport or on foot. In France, for example, the shortest paths between most pairs of points in this metric are far from line segments. In this chapter we will focus on the space $E_n$, that is, $\mathbb{R}^n$ with the usual metric (distance) known to mankind for a long time. The property that the shortest path between any two points of this space is the line segment connecting them could be seen as its defining property (the transport metric above, for example, does not satisfy it). Let us examine the space $E_n$ in more detail.
8.B.1. Show that every non-empty proper subset of $E_n$ has a boundary point (which need not lie in it).
Solution. Let $U \subset E_n$ be a non-empty proper subset. Consider a point $X \in U$, a point $Y \in U' := E_n \setminus U$, and the line segment $XY \subset E_n$. Intuitively, going from $X$ to $Y$ along this segment, we must at some moment get from $U$ to $U'$, and this can happen only at a boundary point (everyone who has ever been to a foreign country is surely well acquainted with this fact). Formally, let $A$ be the point of $XY$ for which $|XA| = \sup\{|XZ|;\ Z \in XY,\ XZ \subset U\}$ (there is exactly one such point on the segment $XY$). This point is a boundary point of $U$: it follows from the definition of $A$ that any line segment $XB$ (with $B \in XA$, $B \ne A$) is contained in $U$; in particular, $B \in U$. However, if there were a neighbourhood of $A$ contained in $U$, then there would exist a part of the line segment $XY$ longer than $XA$ which would be contained in $U$, which contradicts the definition of the point $A$. Therefore, any neighbourhood of the point $A$ contains a point from $U$ as well as a point from $E_n \setminus U$. □
See 3.4.3(1) in geometry, or the axioms of metrics in 7.3.1, or the same inequality 5.2.2(2) for complex scalars. The concepts defined for real and complex scalars and discussed for metric spaces in detail can be carried over (extended) with no problem to the points $P_i$ of any Euclidean space:
Topology and metric in Euclidean spaces
(1) a Cauchy sequence: a sequence of points $P_i$ such that for every fixed $\varepsilon > 0$, $\|P_i - P_j\| < \varepsilon$ holds for all indices but for finitely many exceptional values $i$, $j$;
(2) a convergent sequence: a sequence of points $P_i$ converges to a point $P$ if and only if for every fixed $\varepsilon > 0$, $\|P_i - P\| < \varepsilon$ holds for all but finitely many indices $i$; the point $P$ is then called the limit of the sequence $P_i$;
(3) a limit point $P$ of a set $A \subset E_n$: there exists a sequence of points in $A$ converging to $P$ and different from $P$;
(4) a closed set: contains all of its limit points;
(5) an open set: its complement is closed;
(6) an open $\delta$-neighbourhood of a point $P$: the set $O_\delta(P) = \{Q \in E_n;\ \|P - Q\| < \delta\}$, $\delta \in \mathbb{R}$, $\delta > 0$;
(7) a boundary point $P$ of a set $A$: every $\delta$-neighbourhood of $P$ has non-empty intersection with both $A$ and the complement $E_n \setminus A$;
(8) an interior point $P$ of a set $A$: there exists a $\delta$-neighbourhood of $P$ which lies inside $A$;
(9) a bounded set: lies inside some $\delta$-neighbourhood of one of its points (for a sufficiently large $\delta$);
(10) a compact set: both closed and bounded;
(11) limit of a mapping: $a \in \mathbb{R}^m$ is the limit of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ at a limit point $x_0$ of its domain $A$, if for each $\varepsilon > 0$ there is a $\delta$-neighbourhood $U$ of $x_0$ such that $\|f(x) - a\| < \varepsilon$ for all $x \in U \cap A$ different from $x_0$; this happens if and only if for each sequence $x_n \in A$ converging to $x_0$ and different from $x_0$, the values $f(x_n)$ converge to $a$;
(12) continuity: a mapping $f : A \subset \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $x_0 \in A$ if the limit $\lim_{x \to x_0} f(x)$ exists and equals $f(x_0)$; the mapping $f$ is continuous on $A$ if it is continuous at all points of $A$.
Both the first and second items deal with norms of differences of points approaching zero. Since the square of the norm is the sum of squares of the individual components, it is clear that this happens if and only if the individual components approach zero. In particular, the sequences of points $P_i$ are Cauchy or convergent if and only if these properties are possessed by the real sequences obtained from the particular coordinates of the points $P_i$ in every Cartesian coordinate system. Therefore, it also follows from Lemma 5.2.3 that every Cauchy sequence of points in $E_n$ is convergent. In particular, $E_n$ is a complete metric space.
Similarly, the mappings from the item (11) are $m$-tuples of the component functions and the limits are given as the $m$-tuples of limits of these components.
Recall some further results already discussed at the more general level of the metric spaces in chapter seven:
8.B.2. Prove that the only non-empty clopen (both closed and open) subset of En is En itself.
Solution. It follows from the above exercise 8.B.1 that every non-empty proper subset U of En has a boundary point. If U is closed, then it is equal to its closure; therefore, it contains all of its boundary points. However, an open set (by definition) cannot contain a boundary point. □
8.B.3. Show that the space $E_n$ cannot be written as the union of (at least two) disjoint non-empty open sets.
Solution. Suppose that $E_n$ can be expressed thus, i.e., $E_n = \bigcup_{i \in I} U_i$, where $I$ is an index set. Let us fix a set $U$ from this union and let $V$ be the union of the remaining sets. Then, we can write $E_n = U \cup V$, where both $U$ and $V$ (being a union of open sets) are open. However, they are also complements of open sets; therefore, they are closed as well. This contradicts the result of the previous exercise 8.B.2. □
8.B.4. Prove or disprove: a union of (even infinitely many) closed subsets of $E_n$ is a closed subset of $E_n$.
Solution. The proposition does not hold. As a counterexample, consider the union
\[ \bigcup_{j=3}^{\infty} \left[\frac{1}{j},\ 1 - \frac{1}{j}\right] \]
of closed subsets of $\mathbb{R}$, which is equal to the open interval $(0, 1)$. □
8.B.5. Prove or disprove: an intersection of (even infinitely many) open subsets of $E_n$ is an open subset of $E_n$.
Solution. The proposition does not hold in general. As a counterexample, consider the intersection
\[ \bigcap_{j=1}^{\infty} \left(1 - \frac{1}{j},\ 1 + \frac{1}{j}\right) \]
of open subsets of $\mathbb{R}$, which is equal to the closed singleton $\{1\}$. □
A mapping is continuous if and only if the preimages of open sets are open (check this carefully!). Further, each continuous function on a compact set $A$ is uniformly continuous, bounded and attains its maximum and minimum, cf. the paragraph 7.3.14 on the page 499.
The reader should make an appropriate effort to read the paragraphs 3.4.3, 5.2.5-5.2.8, 7.3.3-7.3.5, and 7.3.12, as well as try to think out or recall the definitions and connections of all these concepts.
8.1.4. Compact sets. Working with general open, closed, or compact sets could seem useless in the case of the real line $E_1$ since intervals are almost always used.
In the case of metric spaces in the last part of chapter seven, the ideas are complicated at first sight. However, the same approach is easy in the case of Euclidean spaces $\mathbb{R}^n$. It is also very useful and important (and it is, of course, a special case of general metric spaces).
Just as in the case of $E_1$, we consider the open cover of a set (i.e., a system of open sets whose union contains the given set), and Theorem 5.2.8 is also true (with mere reformulations):
Theorem. Subsets $A \subset E_n$ of Euclidean spaces satisfy:
(1) $A$ is open if and only if it is a union of a countable (or finite) system of $\delta$-neighbourhoods,
(2) every point $a \in A$ is either interior or boundary,
(3) every boundary point of $A$ is either an isolated or a limit point of $A$,
(4) $A$ is compact if and only if every infinite sequence contained in it has a subsequence converging to a point in $A$,
(5) $A$ is compact if and only if each of its open covers contains a finite subcover.
Proof. The proof from 5.2.8 can be reused without changes in the case of claims (1)-(3), yet now the concepts have to be perceived in a different way, and the "open intervals" are substituted with multidimensional $\delta$-neighbourhoods of appropriate points.
However, the proof of the fourth and fifth claims has to be adjusted properly. Therefore, it is a good idea to write out the proof of the corresponding propositions for general metric spaces in 7.3.12, while noticing the parts which can be simplified for Euclidean spaces. □
8.1.5. Curves in $E_n$. Almost all the discussion about limits, derivatives, and integrals of functions in chapters 5 and 6 concerned functions of a real variable with real or complex values, since only the triangle inequality valid for the magnitudes of the real and complex numbers is used. This argument can be carried over to any function of a real variable with values in a Euclidean space $\mathbb{R}^n$. Several tools for the work with curves are introduced in paragraphs 6.1.13-6.1.16.
8.B.6. Consider the graph of a continuous function $f : \mathbb{R}^2 \to \mathbb{R}$ as a subset of $E_3$. Determine whether this subset is open, closed, and compact, respectively.
Solution. The subset is not open since any neighbourhood of a point $[x_0, y_0, f(x_0, y_0)]$ contains a segment of the line $x = x_0$, $y = y_0$. However, there is a unique point of the graph of the function on this segment, and that is the point $[x_0, y_0, f(x_0, y_0)]$.
The continuity of $f$ implies that the subset is closed. We will show that every convergent sequence of points of the graph of $f$ converges to a point which also lies in the graph: if such a sequence is convergent in $E_3$, then it must converge in every component, so the sequence $\{[x_n, y_n]\}_{n=1}^\infty$ is convergent in $\mathbb{R}^2$. Let us denote this limit by $[a, b]$. Then, it follows from the definition of continuity that the function values at the points $[x_n, y_n]$ must converge to the value $f(a, b)$. However, this means that the sequence $\{[x_n, y_n, f(x_n, y_n)]\}_{n=1}^\infty$ converges to the point $[a, b, f(a, b)]$, which belongs to the graph of the function $f$. Therefore, the graph is a closed set.
The subset is closed, yet it is not compact since it is not bounded (its orthogonal projection onto the coordinate plane $xy$ is the whole $\mathbb{R}^2$). (A subset of $E_n$ is compact iff it is both closed and bounded.) □
And now let us study the limits of functions (a limit is defined thanks to the topology of $E_n$, see 8.1.3).
C. Limits and continuity of multivariate functions
When we deal with limits of multivariate functions, there is one fact we have to cope with:
Let us emphasize that there is no analogy of L'Hospital's rule for multivariate functions. When computing limits of the type $\frac{0}{0}$ or $\frac{\infty}{\infty}$, we have to be "clever".
In one dimension we can approach a point either from the right or from the left (and the limit at the point exists if both one-sided limits exist and are equal to each other). In more dimensions we can approach a point from infinitely many directions, and the limit at the point exists iff the limits of the function restricted to any path leading to the point exist and are all equal to each other.
The easiest way to obtain a limit is (as with functions of one variable) to plug the given point into the formula of the function; if we get a meaningful expression, we are done. Otherwise we get an "indeterminate" expression. There are some tricks we can use to compute such a limit:
For every (parametrized) curve¹, that is, a mapping $c = (c_1(t), \dots, c_n(t)) : \mathbb{R} \to \mathbb{R}^n$ in an $n$-dimensional space, the concepts simply extend the ideas from the univariate functions with some extra thoughts:
First note that both the limit and the derivative of curves make sense in an affine space even without selecting the coordinates (where the limit is again a point in the original space, while the derivative is a vector in the modelling vector space!).
In the case of an integral, curves are considered in the vector space $\mathbb{R}^n$. The reason for this can be seen even in the case of one dimension, where the origin is needed to be able to see the "area under the graph of a function".
It is apparent that limits, derivatives, and integrals have to be considered via the $n$ individual coordinate components in $\mathbb{R}^n$. In particular, their existence is determined in the same way:
Basic concepts for curves
(1) limit:
\[ \lim_{t \to t_0} c(t) = \left(\lim_{t \to t_0} c_1(t), \dots, \lim_{t \to t_0} c_n(t)\right) \in \mathbb{R}^n \]
(2) derivative:
\[ c'(t_0) = \lim_{t \to t_0} \frac{1}{t - t_0}\,(c(t) - c(t_0)) = (c_1'(t_0), \dots, c_n'(t_0)) \in \mathbb{R}^n \]
(3) integral:
\[ \int_a^b c(t)\,dt = \left(\int_a^b c_1(t)\,dt, \dots, \int_a^b c_n(t)\,dt\right) \in \mathbb{R}^n. \]
We can directly formulate the analogy of the connection between the Riemann integral and the antiderivative for curves in $\mathbb{R}^n$ (see 6.2.9):
Proposition. Let $c$ be a curve in $\mathbb{R}^n$, continuous on an interval $[a, b]$. Then its Riemann integral $\int_a^b c(t)\,dt$ exists. Moreover, the curve
\[ C(t) = \int_a^t c(s)\,ds \in \mathbb{R}^n \]
is well-defined, differentiable, and $C'(t) = c(t)$ for all values $t \in [a, b]$.
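The componentwise definitions and the proposition $C'(t) = c(t)$ can be tried out numerically. The following sketch rests on assumptions not taken from the text: the helix used as the sample curve, the helper names `integrate_curve` and `C`, Simpson's rule for the integrals, and a central difference quotient for the derivative.

```python
import math

def c(t):
    """Sample curve in R^3 (a helix); an illustrative choice only."""
    return (math.cos(t), math.sin(t), t)

def integrate_curve(curve, a, b, n=2000):
    """Integrate a curve over [a, b] component by component (Simpson's rule, n even)."""
    h = (b - a) / n
    dim = len(curve(a))
    total = [0.0] * dim
    for k in range(n + 1):
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        value = curve(a + k * h)
        for i in range(dim):
            total[i] += weight * value[i]
    return tuple(x * h / 3 for x in total)

def C(t):
    """The 'integral curve' C(t), i.e. the integral of c over [0, t]."""
    return integrate_curve(c, 0.0, t)

t0, eps = 1.0, 1e-4
# Central difference quotient of C at t0, again taken component by component.
C_prime = tuple((p - m) / (2 * eps) for p, m in zip(C(t0 + eps), C(t0 - eps)))

print(C(t0))          # compare with (sin 1, 1 - cos 1, 1/2)
print(C_prime, c(t0))
```

For this helix the integral is known in closed form, so both the integration and the relation $C'(t_0) = c(t_0)$ can be compared against exact values.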
¹ In geometry, one often makes a distinction between a curve as a subset of $E_n$ and its parametrization $\mathbb{R} \to \mathbb{R}^n$. The word "curve" means exclusively the parametrized curve here.
It is not simple to extend the mean value theorem and, in general, the Taylor expansion with remainder, see 5.3.9 and 6.1.3. They can be applied in a selected coordinate system to the particular coordinate functions of a differentiable function $c(t) = (c_1(t), \dots, c_n(t))$ on a finite interval $[a, b]$. In the case of the mean value theorem, for instance, there are numbers $t_i$ such that
\[ c_i(b) - c_i(a) = (b - a) \cdot c_i'(t_i), \qquad i = 1, \dots, n. \]
These numbers $t_i$ are distinct in general, so we cannot express the difference vector of the boundary points
(1) factorize the numerator or the denominator according to some known formula and then cancel,
(2) expand the numerator and the denominator by an appropriate term and then cancel,
(3) use $\frac{\text{bounded expression}}{\infty} = 0$ and $0 \cdot (\text{bounded expression}) = 0$,
(4) use an appropriate substitution to get a limit of a function of one variable,
(5) try polar coordinates
\[ x = r\cos\varphi, \qquad y = r\sin\varphi \]
(it usually works with the expression $x^2 + y^2$; we have $x^2 + y^2 = r^2\cos^2\varphi + r^2\sin^2\varphi = r^2(\cos^2\varphi + \sin^2\varphi) = r^2$, which is independent of $\varphi$),
(6) try $y = kx$ or $y = kx^2$ or generally $x = f(k)$ and $y = g(k)$ (to prove the non-existence of the limit: if the limit after the substitution depends on $k$, the original limit does not exist).
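Trick (5) above can be explored numerically by looking at the largest absolute value of a function on small circles around the origin. The two sample functions, the helper name `sup_on_circle`, and the sampling density are assumptions for this sketch; sampling only suggests the behaviour and does not replace the estimate in polar coordinates.

```python
import math

def sup_on_circle(f, r, samples=360):
    """Largest |f| over equally spaced sample points of the circle of radius r."""
    return max(
        abs(f(r * math.cos(2 * math.pi * k / samples),
              r * math.sin(2 * math.pi * k / samples)))
        for k in range(samples)
    )

# In polar coordinates f equals r * cos(phi)^2 * sin(phi), so |f| <= r.
f = lambda x, y: x * x * y / (x * x + y * y)
# In polar coordinates g equals cos(phi) * sin(phi), independent of r.
g = lambda x, y: x * y / (x * x + y * y)

radii = [10.0 ** (-k) for k in range(1, 6)]
f_bounds = [sup_on_circle(f, r) for r in radii]
g_bounds = [sup_on_circle(g, r) for r in radii]

print(f_bounds)   # shrinks together with r
print(g_bounds)   # stays the same for every r
```

The first function is squeezed to zero by the factor $r$, so its limit at the origin is 0. The second keeps taking all values of $\cos\varphi\sin\varphi$ on every circle, so its limit at the origin does not exist.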
8.C.I. lim(_i3/)^.(e2il) ^ O
8.C.2. \mHx^(4A) O
8.C.3. lini(Xij/)^(ii00) O
8.C.4. 101(^)^(0,2) s-!r1 O
8.C.5. lini(Xij/)_j.(00i00) fi^4 O
8.C.6. lim^^^o) 5^ O
8.C.7. 101(^^)^(0,0) O
8.C.8. Hm^^^iJ^r2 O
8.C.9. lim(_,I/)^(1>1) -^±jL- O
8.C.10. lim^^^o) O
8.C.11. 101(^,^)^(0,0) xy2 cos -±_ O
8.C.12. lim(x,^(o,0) ^ O
8.C.13. 101(^^)^(0,0) O
S.C./4. lio(x,J/)^(00,00) (x2 + y2)e-(-+«) O
«.C75. lio(x,j/)^(oo,1)(l + i)^7 O
8.C.16. 011(^)^(0,0) ji^2 O
S.C./7. 011(^)^(0,0) 1~(™$j?y) O
8.C.18. Prove that lim(_i3/)^.(0io) t^t^ does not exists. 0
$c(b) - c(a)$ as a multiple of the derivative of the curve at a single point.
For example, for a differentiable curve $c(t) = (x(t), y(t))$ in the plane $E_2$,
\[ c(b) - c(a) = \bigl(x'(\xi)(b - a),\ y'(\eta)(b - a)\bigr) \]
for two (in general different) values $\xi, \eta \in [a, b]$. However, this reasoning is still sufficient for the following estimate:
Lemma. If $c$ is a curve in $E_n$ with continuous derivative on a compact interval $[a, b]$, then for all $a \le s \le t \le b$
\[ \|c(t) - c(s)\| \le \sqrt{n}\,\Bigl(\max_{r \in [a,b]} \|c'(r)\|\Bigr)\,|t - s|. \]
Proof. Direct application of the mean value theorem gives, for appropriate points $r_i$ inside the interval $[s, t]$, the following:
\[ \|c(t) - c(s)\|^2 = \sum_{i=1}^n (c_i(t) - c_i(s))^2 = \sum_{i=1}^n \bigl(c_i'(r_i)(t - s)\bigr)^2 \le (t - s)^2 \sum_{i=1}^n \max_{r \in [s,t]} c_i'(r)^2 \le n\,\Bigl(\max_{r \in [s,t],\ i = 1, \dots, n} |c_i'(r)|\Bigr)^2 (t - s)^2 \le n \max_{r \in [s,t]} \|c'(r)\|^2\,(t - s)^2. \]
□
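The estimate of the lemma can be observed on a concrete curve. The sketch below assumes a helix in $E_3$ on the interval $[0, 6]$, a sampled maximum of $\|c'\|$, and randomly chosen pairs $s \le t$; all names are illustrative, and the check is an experiment, not a proof.

```python
import math
import random

def c(t):
    """Sample curve in E_3 (a helix); an illustrative choice only."""
    return (math.cos(t), math.sin(t), t)

def c_prime(t):
    """Its derivative, computed component by component."""
    return (-math.sin(t), math.cos(t), 1.0)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

a, b, n = 0.0, 6.0, 3
# Maximum of ||c'(r)|| over [a, b], approximated on a grid (here it is constant).
max_speed = max(norm(c_prime(a + (b - a) * k / 1000)) for k in range(1001))

random.seed(1)  # fixed seed so that the run is reproducible
pairs = [sorted(random.uniform(a, b) for _ in range(2)) for _ in range(1000)]

lemma_holds = all(
    norm([x - y for x, y in zip(c(t), c(s))])
    <= math.sqrt(n) * max_speed * (t - s) + 1e-12
    for s, t in pairs
)
print(max_speed, lemma_holds)
```

For the helix the speed $\|c'\|$ equals $\sqrt{2}$ everywhere, so the bound of the lemma is far from sharp; the factor $\sqrt{n}$ is the price paid for applying the mean value theorem separately in each coordinate.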
Another important concept is the tangent vector to a curve $c : \mathbb{R} \to E_n$ at a point $c(t_0) \in E_n$. It is defined as the vector in the modelling vector space $\mathbb{R}^n$ given by the derivative $c'(t_0) \in \mathbb{R}^n$.
Consider $c$ to be the path of an object moving in the space in time. Then the tangent vector at a point $t_0$ can be perceived physically as the instantaneous velocity at this point. The straight line $T$ given parametrically as
\[ T : \quad c(t_0) + t \cdot c'(t_0) \]
is called the tangent line to the curve $c$ at the point $t_0$. Unlike the tangent vector, the (unparametrized) tangent line $T$ is independent of the parametrization of the curve $c$. The chain rule ensures that changing the parametrization leads to the same tangent vector, up to a multiple.
8.1.6. Partial derivatives. If we look at the multivariate function $f(x_1, \dots, x_n) : \mathbb{R}^n \to \mathbb{R}$ as a function of one real variable $x_i$ while the other variables are assumed constant, we can consider the derivative of this function. This is called the partial derivative of the function $f$ with respect to $x_i$, and it is denoted as $\frac{\partial f}{\partial x_i}$, $i = 1, \dots, n$, or (without referring to the particular function) as the operator $\frac{\partial}{\partial x_i}$ on the functions.
For every function $f : \mathbb{R}^n \to \mathbb{R}$ and an arbitrary curve $c : \mathbb{R} \to \mathbb{R}^n$, their composition $(f \circ c)(t) : \mathbb{R} \to \mathbb{R}$ can be considered. This composite function $F = f \circ c$ expresses the behaviour of the function $f$ along the curve $c$. The simplest case is using parametrized straight lines $c$ and choosing the lines $c_i(t) = (x_1, \dots, x_i + t, \dots, x_n)$, the derivative of
8.C.19. Prove that
\[ \lim_{(x,y) \to (1,-2)} \frac{2x + xy - y - 2}{x^2 + y^2 - 2x + 4y + 5} \]
does not exist.
Solution. The non-vertical lines through $[1, -2]$ have the equation $y = kx - k - 2$. Since $2x + xy - y - 2 = (x - 1)(y + 2)$ and $x^2 + y^2 - 2x + 4y + 5 = (x - 1)^2 + (y + 2)^2$, approaching $[1, -2]$ along one of these lines gives the limit $\frac{k}{1 + k^2}$, which is different for different $k$. Thus the limit does not exist. □
Let us recall that a function is continuous at the points where the limit exists and is equal to the function value.
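The path test used in 8.C.19 is easy to reproduce numerically: evaluate the function close to $[1, -2]$ along lines of several slopes and compare with $k/(1+k^2)$. The helper name `along_line`, the chosen slopes, and the step $h = 10^{-3}$ are assumptions of this sketch.

```python
def f(x, y):
    """The function from 8.C.19."""
    return (2 * x + x * y - y - 2) / (x * x + y * y - 2 * x + 4 * y + 5)

def along_line(k, h):
    """Value of f at the point of the line y = k*x - k - 2 with x = 1 + h."""
    x = 1 + h
    return f(x, k * x - k - 2)

slopes = [0.0, 1.0, 2.0, -1.0]
values = {k: along_line(k, 1e-3) for k in slopes}
expected = {k: k / (1 + k * k) for k in slopes}

for k in slopes:
    print(k, values[k], expected[k])
```

Along each line the function is in fact constant (equal to $k/(1+k^2)$) away from the point itself, which is why the values do not change when $h$ is made smaller, and different slopes give different values.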
8.C.20. Find the discontinuity points of f(x,y) = x2x^V-\-
o
8.C.21.  Find the discontinuity points of f(x,y) =
sin(x2y+xy2) p.
8.C.22. Find the discontinuity points of
f(x,y)
x +y x2-\-y2
0
pro[a;,y] ^ [0,0], pro [x,y] = [0,0].
O
D. Tangent lines, tangent planes, graphs of multivariate functions
8.D.1. A car is moving at a velocity which at the initial time $t = 0$ is given by the vector $(0, 1, 1)$. At this time, the car is situated at the point $[1, 0, 0]$. The acceleration of the car at time $t$ is given by the vector $(-\cos t, -\sin t, 0)$. Describe the dependency of the position of the car upon the time $t$.
Solution. As mentioned in paragraph 8.1.5, the means of solving this type of problem were introduced as early as in chapter 6. Notice that the "integral curve" $C(t)$ from the proposition of paragraph 8.1.5 starts at the point $(0, 0, 0)$ (in other words, $C(0) = (0, 0, 0)$). In the affine space $\mathbb{R}^n$, we can move it so that it starts at an arbitrary point, and this does not change its derivative (this is performed by adding a constant to every component in the parametric equation of the curve). Therefore, up to the movement, this integral curve is determined uniquely (nothing else than constants can be added to the components without changing the derivative). When we integrate the curve of acceleration, we get the curve $(-\sin t, \cos t - 1, 0)$. Considering the initial velocity as well, we obtain the velocity curve of the car: $(-\sin t, \cos t, 1)$ (we shifted the curve by the vector $(0, 1, 1)$, i.e., so that now the velocity curve at time $t = 0$ agrees with the given initial velocity). Further integration leads to the curve
$f \circ c_j$ yields just the partial derivatives. More generally, derivatives can be defined in any direction:
Directional and partial derivatives
Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ has a derivative in the direction of a vector $v \in \mathbb{R}^n$ at a point $x \in E_n$ if and only if the derivative $d_v f(x)$ of the composite mapping
$$t \mapsto f(x + tv)$$
exists at the point $t = 0$, i.e.
$$d_v f(x) = \lim_{t\to 0} \frac{1}{t}\bigl(f(x + tv) - f(x)\bigr).$$
The partial derivatives are the values $\frac{\partial f}{\partial x_i} = d_{e_i} f$, where $e_i$ are the elements of the standard basis of $\mathbb{R}^n$.
In other words, the directional derivative expresses the infinitesimal increment of the function $f$ in the direction $v$. For functions in the plane,
$$\frac{\partial}{\partial x} f(x,y) = \lim_{t\to 0}\frac{1}{t}\bigl(f(x+t,y) - f(x,y)\bigr), \qquad \frac{\partial}{\partial y} f(x,y) = \lim_{t\to 0}\frac{1}{t}\bigl(f(x,y+t) - f(x,y)\bigr).$$
In particular, partial differentiation with respect to a given variable is just the usual one-variable differentiation, with the other variables treated as constants.
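The definition can be tried out numerically. The following is only a minimal sketch: the sample function $f(x,y) = x^2 \sin y$, the point, and the step size are assumptions chosen for illustration, and a symmetric difference quotient is used in place of the one-sided quotient from the definition (both have the same limit whenever the directional derivative exists).

```python
import math

def directional_derivative(f, x, v, t=1e-6):
    """Symmetric difference estimate of d_v f(x) for a small step t."""
    forward = [xi + t * vi for xi, vi in zip(x, v)]
    backward = [xi - t * vi for xi, vi in zip(x, v)]
    return (f(forward) - f(backward)) / (2 * t)

def f(p):
    # Hypothetical sample function f(x, y) = x^2 * sin(y)
    x, y = p
    return x * x * math.sin(y)

point = [1.0, 0.5]
# Partial derivatives are directional derivatives along the standard basis.
df_dx = directional_derivative(f, point, [1.0, 0.0])
df_dy = directional_derivative(f, point, [0.0, 1.0])
print(df_dx, 2 * math.sin(0.5))  # estimate vs. the exact value 2x sin(y)
print(df_dy, math.cos(0.5))      # estimate vs. the exact value x^2 cos(y)
```

Differentiating by hand with the other variable held constant gives $2x\sin y$ and $x^2\cos y$, which is what the two estimates are compared against.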
8.1.7. The differential of a function. Partial or directional derivatives are not always good enough to obtain a fair approximation of the behaviour of a function by linear expressions. There are three concerns for a function $f : \mathbb{R}^n \to \mathbb{R}$. First, the directional derivatives at a point may not exist in all directions, although the partial derivatives are well defined. Second, the dependence of the directional derivatives $d_v f(x)$ on the direction $v$ need not be linear. Third, even if $d_v f(x)$ is a linear mapping in the argument $v$, the function still may not be 'well behaved' around the point $x$.
As an example, consider the functions in the plane with coordinates $(x, y)$ given by the formulae
$$g(x,y) = \begin{cases} 1 & \text{if } xy = 0 \\ 0 & \text{otherwise,} \end{cases} \qquad h(x,y) = \begin{cases} x & \text{if } y = 0 \\ y & \text{if } x = 0 \\ 0 & \text{otherwise,} \end{cases} \qquad k(x,y) = \begin{cases} x & \text{if } y = x^2 \neq 0 \\ 0 & \text{otherwise.} \end{cases}$$
Both partial derivatives of $g$ at $(0,0)$ exist and no other directional derivatives do, and $g$ is not even continuous at the origin. The functions $h$ and $k$ are continuous at $(0,0)$, and $h$ has all its directional derivatives at the origin equal to zero, except for the partial derivatives, which are equal to 1. In particular, $d_v h(0,0)$ is not a linear mapping in the argument
$(\cos t - 1, \sin t, t)$. Shifting this by the vector $(1,0,0)$ then fits the initial position of the car. Therefore, the car moves along the curve $[\cos t, \sin t, t]$ (this curve is called a helix).
□
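The result of 8.D.1 can be checked numerically. The sketch below is only an illustration under stated assumptions (the helper names and step sizes are ours): it compares finite-difference derivatives of the helix $[\cos t, \sin t, t]$ with the prescribed acceleration and initial data.

```python
import math

def position(t):
    # The curve found in the solution of 8.D.1
    return (math.cos(t), math.sin(t), t)

def acceleration(t):
    # The acceleration prescribed in the problem
    return (-math.cos(t), -math.sin(t), 0.0)

def first_difference(curve, t, h=1e-6):
    """Symmetric estimate of the velocity curve'(t)."""
    a, c = curve(t - h), curve(t + h)
    return tuple((r - p) / (2 * h) for p, r in zip(a, c))

def second_difference(curve, t, h=1e-4):
    """Symmetric estimate of the acceleration curve''(t)."""
    a, b, c = curve(t - h), curve(t), curve(t + h)
    return tuple((p - 2 * q + r) / (h * h) for p, q, r in zip(a, b, c))

initial_position = position(0.0)
initial_velocity = first_difference(position, 0.0)
max_error = max(
    abs(u - w)
    for t in (0.0, 0.7, 2.0)
    for u, w in zip(second_difference(position, t), acceleration(t))
)
print(initial_position, initial_velocity, max_error)
```

The initial position and velocity should come out close to $[1,0,0]$ and $(0,1,1)$, and the discrepancy in the acceleration should be small.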
8.D.2. Determine both the parametric and implicit equations of the tangent line to the curve $c : \mathbb{R} \to \mathbb{R}^3$,
$c(t) = (c_1(t), c_2(t), c_3(t)) = (t, t^2, t^3)$, at the point which corresponds to the parameter value $t = 1$.
Solution. The value $t = 1$ corresponds to the point $c(1) = [1,1,1]$. The derivatives of the particular components are $c_1'(t) = 1$, $c_2'(t) = 2t$, $c_3'(t) = 3t^2$. The values of the derivatives at $t = 1$ are 1, 2, 3. Therefore, the parametric equations of the tangent line are:
$$x = c_1'(1)s + c_1(1) = s + 1, \quad y = c_2'(1)s + c_2(1) = 2s + 1, \quad z = c_3'(1)s + c_3(1) = 3s + 1.$$
In order to get the implicit equations (which are not given canonically), we eliminate the parameter $s$, thereby obtaining:
$$2x - y = 1,$$
$$3x - z = 2. \quad □$$
8.D.3.   Determine the tangent line p to the curve
c(t) = (\nt, aidant, esin^) at the point t0   = 1.
O
8.D.4. Find a point on the curve $c(t) = (t^2 - 1, -2t^2 + 5t, t - 5)$ such that the tangent line passing through it is parallel to the plane $\varrho: 3x + y - z + 7 = 0$.
Solution. The direction $c'(t_0)$ of the curve $c(t)$ at $t_0$ has to be perpendicular to the normal of $\varrho$, that is, the scalar product of these two vectors is 0. The tangent vector at the point $c(t)$ is $(2t, -4t + 5, 1)$, and the normal vector of the plane $\varrho$ is $(3, 1, -1)$ (just read off the coefficients of $x$, $y$ and $z$ in the equation of $\varrho$). That is, $3 \cdot 2t + 1 \cdot (-4t + 5) - 1 \cdot 1 = 0$, which gives $t = -2$ and the point $[3,-18,-7]$. □
8.D.5. Find the parametric equation of the tangent line to the curve given as the intersection of the surfaces $x^2 + y^2 + z^2 = 4$ and $x^2 + y^2 - 2x = 0$ at the point $[1, 1, \sqrt{2}]$.
Solution. $p = \{[1 - \sqrt{2}s, 1, \sqrt{2} + s];\ s \in \mathbb{R}\}$. □
$v$. More generally, consider a function $f$ which, along the lines $(r\cos\theta, r\sin\theta)$ with a fixed angle $\theta$, takes the values $a(\theta) r$, where $a(\theta)$ is a periodic odd function of the angle $\theta$, with period $2\pi$. All of its directional derivatives $d_v f$ at $(0,0)$ exist, yet these are not linear expressions depending on the directions $v$ for general functions $a(\theta)$. The graph of $f$ can be visualized as a "deformed cone" and we can hardly hope for a good linear approximation at its vertex.
Finally, $k$ has all directional derivatives zero, i.e. $d_v k(0,0) = 0$ for all directions $v$, which is a linear dependence on $v \in \mathbb{R}^2$. But still, the zero mapping is a very bad approximation of $k$ along the parabola $y = x^2$. Check all these claims in detail yourselves!
Therefore, we imitate the case of univariate functions as thoroughly as possible, and avoid such a pathological behaviour of functions directly by defining and using the concept of differential:
Differential of a function
Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ has a differential at a point $x$ if and only if all of the following three conditions hold:
(1) the directional derivatives $d_v f(x)$ at the point $x$ exist for all vectors $v \in \mathbb{R}^n$,
(2) $d_v f(x)$ depends linearly on the argument $v$,
(3) $\lim_{v\to 0} \frac{1}{\|v\|}\bigl(f(x + v) - f(x) - d_v f(x)\bigr) = 0$.
The linear expression $d_v f$ (in a vector variable $v$) is then called the differential $df$ of the function $f$ evaluated at the increment $v$.
In words, it is required that the behaviour of the function / at the point x is well approximated by linear functions of increments of the variable quantities.
It follows directly from the definition of directional derivatives that the differential can be defined solely by the property (3). If there is a linear form $df(x)$ such that the increments $v$ at the point $x$ satisfy the property (3) with $d_v f(x) = df(x)(v)$, then $df(x)(v)$ is clearly just the directional derivative of the function $f$ at the point $x$, so the properties (1) and (2) are automatically satisfied.
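To see why property (3) is a genuine extra requirement, one can evaluate the quotient from (3) for the function $k$ of paragraph 8.1.7 (equal to $x$ on the parabola $y = x^2$, $x \neq 0$, and zero elsewhere), taking the zero mapping as the candidate differential. This is a hedged numerical sketch; the helper names and the sample values of $t$ are assumptions made here.

```python
import math

def k(x, y):
    # k(x, y) = x on the parabola y = x^2 (x != 0), zero elsewhere
    return x if (y == x * x and x != 0) else 0.0

def ratio(vx, vy):
    """|k(v) - k(0) - 0| / ||v||, with the zero map as candidate differential."""
    return abs(k(vx, vy) - k(0.0, 0.0)) / math.hypot(vx, vy)

steps = (0.1, 0.01, 0.001)
along_parabola = [ratio(t, t * t) for t in steps]   # increments v = (t, t^2)
along_line = [ratio(t, 2 * t) for t in steps]       # increments v = (t, 2t)
print(along_parabola)  # stays close to 1, so the limit in (3) is not 0
print(along_line)      # identically 0 along a straight line
```

Along every straight line through the origin the quotient vanishes, which matches the zero directional derivatives, while along the parabola it tends to 1, so $k$ has no differential at the origin.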
8.1.8. Examine what can be said about the differential of a function $f(x, y)$ in the plane, supposing both partial derivatives $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$ exist and are continuous in a neighbourhood of a point $(x_0, y_0)$. To this purpose, consider any smooth curve $t \mapsto (x(t), y(t))$ with $x_0 = x(0)$, $y_0 = y(0)$.
The idea is to use the mean value theorem for univariate functions for differences of function values, where only one of the variables changes: $f(x,y) - f(x_0,y) = \frac{\partial f}{\partial x}(x_1,y)(x - x_0)$ for a suitable $x_1$ between $x_0$ and $x$.
8.D.6. The set of differentiable functions. We can notice that multivariate polynomials are differentiable on the whole of their domain. Similarly, the composition of a differentiable univariate function and a differentiable multivariate function leads to a differentiable multivariate function. For instance, the function $\sin(x + y)$ is differentiable on the whole $\mathbb{R}^2$; $\ln(x + y)$ is a differentiable function on the set of points with $x > -y$ (an open half-plane, i.e., without the boundary line). The proofs of these propositions are left as an exercise on limits of compositions.
Remark. Notation of partial derivatives. The partial derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in variables $x_1, \dots, x_n$ with respect to the variable $x_1$ will be denoted by both $\frac{\partial f}{\partial x_1}$ and the shorter expression $f_{x_1}$. In the exercise part of the book, we will rather keep to the latter notation. On the other hand, the notation $\frac{\partial f}{\partial x_1}$ better captures the fact that this is a derivative of $f$ in the direction of the vector field $\frac{\partial}{\partial x_1}$ (you will learn what a vector field is in paragraph 9.1.1).
8.D.7. Determine the domain of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = x^2\sqrt{y}$. Calculate the partial derivatives where they are defined on this domain.
Solution. The domain of the function in question in $\mathbb{R}^2$ is the half-plane $\{(x,y),\ y \ge 0\}$. In order to determine the partial derivative with respect to a given variable, we consider the other variables to be constants in the formula that defines the function. Then, we simply differentiate the expression as a univariate function. We thus get:
$$f_x = 2x\sqrt{y}, \qquad f_y = \frac{x^2}{2\sqrt{y}}.$$
The partial derivatives exist at all points of the domain except for the boundary line $y = 0$. □
8.D.8. Determine the derivative of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = x^2yz$ at the point $[1, -1, 2]$ in the direction $v = (3, 2, -1)$.
Solution. The directional derivative can be calculated in two ways. The first one is to derive it directly from the definition (see paragraph 8.1.6). The second one is to use the differential of the function; see 8.1.7 and theorem 8.1.8. Since the given function is a polynomial, it is differentiable on the whole $\mathbb{R}^3$.
Apply this in both summands of the following expression separately, to obtain
$$\frac{1}{t}\bigl(f(x(t),y(t)) - f(x_0,y_0)\bigr) = \frac{1}{t}\bigl(f(x(t),y(t)) - f(x_0, y(t))\bigr) + \frac{1}{t}\bigl(f(x_0,y(t)) - f(x_0,y_0)\bigr)$$
$$= \frac{1}{t}\bigl(x(t) - x_0\bigr)\frac{\partial f}{\partial x}\bigl(x(\xi), y(t)\bigr) + \frac{1}{t}\bigl(y(t) - y_0\bigr)\frac{\partial f}{\partial y}\bigl(x_0, y(\eta)\bigr)$$
for suitable numbers $\xi$ and $\eta$ between 0 and $t$. Indeed, since the curve $(x(t), y(t))$ is continuous, there must be such values $\xi$ and $\eta$.
In particular, for every sequence of numbers $t_n$ converging to zero, the corresponding sequences of numbers $\xi_n$ and $\eta_n$ also converge to zero (by the squeeze theorem) and they all satisfy the above equality.
If $t$ converges to 0, the continuity of the partial derivatives, together with the test for convergence of functions using subsequences of the input values (cf. 5.2.15), as well as the properties of the limits of sums and products of functions (cf. Theorem 5.2.13) imply
$$\frac{d}{dt} f(x(t),y(t))\Big|_{t=0} = x'(0)\frac{\partial f}{\partial x}(x_0,y_0) + y'(0)\frac{\partial f}{\partial y}(x_0,y_0),$$
which is a pleasant extension of the theorem on differentiation of composite functions of one variable to the case $f \circ c$.
Of course, with the special choice of parametrized straight lines with direction vector $v = (\xi, \eta)$,
$$(x(t),y(t)) = (x_0 + t\xi,\ y_0 + t\eta),$$
the calculation leads to the derivative in the direction $v = (\xi, \eta)$ and the equality
$$d_v f(x_0,y_0) = \frac{\partial f}{\partial x}(x_0,y_0)\,\xi + \frac{\partial f}{\partial y}(x_0,y_0)\,\eta.$$
This formula can be expressed in a neat way, using the coordinate expressions of linear functions on vector spaces:
$$df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy,$$
where $dx$ stands for the differential of the function $(x, y) \mapsto x$, i.e. $dx(v) = \xi$, and similarly for $dy$. In other words, the directional derivative $d_v f$ is a linear function $\mathbb{R}^n \to \mathbb{R}$ on the increments, with coordinates given by the partial derivatives.
Now we could similarly prove that the assumption of continuous partial derivatives at a given point guarantees the approximation property of the differential as well. In particular, note that the computation for $f \circ c$ above excludes phenomena like the function $k(x, y)$ above (there $d_v k(0,0) = 0$, but the derivative along the curve $(t, t^2)$ was one). It is better to do this for general multivariate functions straightaway.
8.1.9. The following theorem provides a crucial and very useful observation.
Let us follow the definition:
$$f_v(x,y,z) = \lim_{t\to 0}\frac{1}{t}\bigl[f(x + 3t, y + 2t, z - t) - f(x,y,z)\bigr]$$
$$= \lim_{t\to 0}\frac{1}{t}\bigl[(x + 3t)^2(y + 2t)(z - t) - x^2yz\bigr]$$
$$= \lim_{t\to 0}\frac{1}{t}\bigl[t(6xyz + 2x^2z - x^2y) + t^2(\dots)\bigr] = 6xyz + 2x^2z - x^2y.$$
We have thus derived the derivative in the direction of the vector $(3, 2, -1)$ as a function of three real variables which determine the point at which we are interested in the value of the derivative. Evaluating this at the desired point leads to $f_v(1,-1,2) = -7$.
In order to compute the directional derivative from the differential of the function, we first have to determine the partial derivatives of the function:
$$f_x = 2xyz, \quad f_y = x^2z, \quad f_z = x^2y.$$
It follows from the note following theorem 8.1.8 that we can express
$$f_v(1,-1,2) = 3f_x(1,-1,2) + 2f_y(1,-1,2) + (-1)f_z(1,-1,2) = 3\cdot(-4) + 2\cdot 2 + (-1)\cdot(-1) = -7. \quad □$$
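Both routes of 8.D.8 are easy to compare numerically. The following sketch (variable names and the step size $t$ are our assumptions) evaluates the directional derivative once from the partial derivatives and once from the difference quotient in the definition.

```python
def f(x, y, z):
    return x * x * y * z

def grad_f(x, y, z):
    # Partial derivatives f_x, f_y, f_z computed by hand
    return (2 * x * y * z, x * x * z, x * x * y)

point = (1.0, -1.0, 2.0)
v = (3.0, 2.0, -1.0)

# Route 1: linear combination of partial derivatives.
via_gradient = sum(g * vi for g, vi in zip(grad_f(*point), v))

# Route 2: difference quotient from the definition, with a small step t.
t = 1e-6
shifted = [p + t * vi for p, vi in zip(point, v)]
via_definition = (f(*shifted) - f(*point)) / t

print(via_gradient, via_definition)
```

The first value is exactly $-7$; the second only approaches $-7$ as $t \to 0$, since the neglected terms are of order $t$.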
8.D.9. Determine the derivative of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = \frac{\cos(x^2y)}{z}$ at the point $[0,0,2]$ in the direction of the vector $(1,2,3)$.
Solution. The domain of this function is $\mathbb{R}^3$ except for the plane $z = 0$. The following calculations will be considered only on this domain. The function in question is differentiable at the point $[0,0,2]$ (this follows from the note 8.D.6). We can determine the value of the examined directional derivative by 8.1.7, using partial derivatives.
First, we determine the partial derivatives of the given function (as we have already mentioned in exercise 8.D.7, in order to determine the partial derivative with respect to $x$, we differentiate it as a univariate function (in $x$) and use the chain rule; similarly for other partial derivatives):
$$f_x = -\frac{2xy\sin(x^2y)}{z}, \quad f_y = -\frac{x^2\sin(x^2y)}{z}, \quad f_z = -\frac{\cos(x^2y)}{z^2}.$$
Evaluating the expression gives
$$1\cdot f_x(0,0,2) + 2\cdot f_y(0,0,2) + 3\cdot f_z(0,0,2) = 1\cdot 0 + 2\cdot 0 + 3\cdot\left(-\tfrac{1}{4}\right) = -\tfrac{3}{4}.$$
Continuity of partial derivatives
Theorem. Let $f : E_n \to \mathbb{R}$ be a function of $n$ variables with continuous partial derivatives in a neighbourhood of the point $x \in E_n$. Then its differential $df$ at the point $x$ exists and its coordinate expression is given by the formula
$$df = \frac{\partial f}{\partial x_1}dx_1 + \frac{\partial f}{\partial x_2}dx_2 + \dots + \frac{\partial f}{\partial x_n}dx_n.$$
□
Proof. This theorem can be derived analogously to the procedure described above for the case $n = 2$. Care is needed in details to finish the reasoning about the approximation property. As above, consider a curve
$$c(t) = (c_1(t), \dots, c_n(t)),$$
$c(0) = (0, \dots, 0)$, and a point $x \in \mathbb{R}^n$, and express the difference $f(x + c(t)) - f(x)$ for the composite function $f(x + c(t))$ as follows:
$$f(x_1 + c_1(t), \dots, x_n + c_n(t)) - f(x_1, x_2 + c_2(t), \dots, x_n + c_n(t))$$
$$+\ f(x_1, x_2 + c_2(t), \dots, x_n + c_n(t)) - f(x_1, x_2, x_3 + c_3(t), \dots, x_n + c_n(t))$$
$$+\ \dots\ +\ f(x_1, x_2, \dots, x_n + c_n(t)) - f(x_1, x_2, \dots, x_n).$$
Now, apply the mean value theorem to all of the $n$ summands, obtaining (similarly to the case of two variables)
$$(c_1(t) - c_1(0))\frac{\partial f}{\partial x_1}(x_1 + c_1(\theta_1), x_2 + c_2(t), \dots, x_n + c_n(t))$$
$$+\ (c_2(t) - c_2(0))\frac{\partial f}{\partial x_2}(x_1, x_2 + c_2(\theta_2), \dots, x_n + c_n(t))$$
$$+\ \dots\ +\ (c_n(t) - c_n(0))\frac{\partial f}{\partial x_n}(x_1, x_2, \dots, x_n + c_n(\theta_n)),$$
for appropriate values $\theta_i$, $0 \le \theta_i \le t$. This is a finite sum, so the same reasoning as in the case of two variables verifies that
$$\frac{d}{dt} f(x + c(t))\Big|_{t=0} = c_1'(0)\frac{\partial f}{\partial x_1}(x) + \dots + c_n'(0)\frac{\partial f}{\partial x_n}(x).$$
The special choice of the curves $c(t) = tv$ for a direction vector $v$ verifies the statement about existence and linearity of the directional derivatives at $x$.
Finally, apply the mean value theorem in the same way to the difference
$$f(x + v) - f(x) = d_v f(x + \theta v) = v_1\frac{\partial f}{\partial x_1}(x + \theta v) + \dots + v_n\frac{\partial f}{\partial x_n}(x + \theta v)$$
with an appropriate $\theta$, $0 \le \theta \le 1$, where the latter equality holds according to the formula for directional derivatives derived above, for sufficiently small $v$.
Since all the partial derivatives are continuous at the point $x$, for an arbitrarily small $\varepsilon > 0$, there is a
8.D.10. Given a function $f : \mathbb{R}^n \to \mathbb{R}$ with differential $df(x)$ and a point $x \in \mathbb{R}^n$, determine a unit direction $v \in \mathbb{R}^n$ in which the directional derivative $d_v f(x)$ is maximal.
Solution. According to the note following theorem 8.1.5, we are maximizing the function $f_v(x) = v_1 f_{x_1}(x) + v_2 f_{x_2}(x) + \dots + v_n f_{x_n}(x)$ in dependence on the variables $v_1, \dots, v_n$, which are bound by the condition $v_1^2 + \dots + v_n^2 = 1$. We have already solved this type of problem in chapter 3, when we talked about linear optimization (see 3.A.1). The value $f_v(x)$ can be interpreted as the scalar product of the vectors $(f_{x_1}, \dots, f_{x_n})$ and $(v_1, \dots, v_n)$. This product is maximal if the vectors have the same direction. The vector $v$ can thus be obtained by normalizing the vector $(f_{x_1}, \dots, f_{x_n})$. In general, we say that the function grows maximally in the direction $(f_{x_1}, \dots, f_{x_n})$. This vector is called the gradient of the function $f$. In paragraph 8.1.25, we will recall this idea and go into further details. □ Computing the differential of a function is technically very easy: just substitute into the definition.
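The claim of 8.D.10 can be illustrated in the plane by brute force. The sketch below makes several assumptions for illustration only: the sample function $f(x,y) = x^2y + y^3$, the point $(1,2)$, and a grid of 3600 unit directions; it compares the largest sampled directional derivative with the value attained in the normalized gradient direction.

```python
import math

def grad(x, y):
    # Gradient of the hypothetical sample function f(x, y) = x^2 y + y^3
    return (2 * x * y, x * x + 3 * y * y)

gx, gy = grad(1.0, 2.0)            # equals (4, 13)
norm = math.hypot(gx, gy)
best = (gx / norm, gy / norm)      # normalized gradient, a unit vector

def d_v(angle):
    """Directional derivative in the unit direction (cos(angle), sin(angle))."""
    return gx * math.cos(angle) + gy * math.sin(angle)

# Sample unit directions on a fine grid and take the largest value.
sampled_max = max(d_v(2 * math.pi * i / 3600) for i in range(3600))
value_at_best = gx * best[0] + gy * best[1]
print(sampled_max, value_at_best, norm)
```

No sampled direction beats the normalized gradient, and the maximal value equals the length of the gradient.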
8.D.11.   Find the differential of a function / in a point P:
i) /(a,y) = arctan-^,P=[v/3,l]
ii) f(x,y) = arcsin ^J+yl, P = [1, V^].
iii) f(x,y)=xy+z,P= [1,1].
Solution.
i) df(V3,l) = \dx + \dy,
ii) df(l,y/3) = &dx-\dy,
iii) df(l, 1) = 2dx.
□
Recall that the differential of a function is a linear map:
8.D.12. Compute the differential of the function $f(x,y,z) = 2^x \sin y \arctan z$ at the point $[-4, \frac{\pi}{2}, 0]$, evaluated at $dx = 0.05$, $dy = 0.06$, and $dz = 0.08$.
Solution. $df(-4, \frac{\pi}{2}, 0) = 0\,dx + 0\,dy + \frac{1}{16}dz = 0.005$. □ The differential can thus be used to approximate the values of a function.
8.D.13. Approximate $\sqrt{2.98^2 + 4.05^2}$ with the use of the differential (and not with a calculator).
Solution. We use the differential of the function $f(x,y) = \sqrt{x^2 + y^2}$ at $[3,4]$. Then $f_x = \frac{x}{\sqrt{x^2+y^2}}$,
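$\delta$-neighbourhood $U$ of the origin in $\mathbb{R}^n$ such that for $w \in U$, all partial derivatives $\frac{\partial f}{\partial x_i}(x + w)$ differ from $\frac{\partial f}{\partial x_i}(x)$ by less than $\varepsilon$. Hence the estimate
$$\frac{1}{\|w\|}\bigl|f(x + w) - f(x) - d_w f(x)\bigr|$$
$$\le \frac{1}{\|w\|}\bigl|f(x + w) - f(x) - d_w f(x + \theta w)\bigr| + \frac{1}{\|w\|}\bigl|d_w f(x + \theta w) - d_w f(x)\bigr|$$
$$= \frac{1}{\|w\|}\Bigl|w_1\Bigl(\frac{\partial f}{\partial x_1}(x + \theta w) - \frac{\partial f}{\partial x_1}(x)\Bigr) + \dots + w_n\Bigl(\frac{\partial f}{\partial x_n}(x + \theta w) - \frac{\partial f}{\partial x_n}(x)\Bigr)\Bigr| \le \frac{1}{\|w\|}\,\|w\|\,n\,\varepsilon = n\varepsilon,$$
where $\theta$ is the parameter for which the first expression on the second line vanishes. Thus, the approximation property of the differential is satisfied as well. □
The approximation property of the differential can be written as
$$f(x + v) = f(x) + df(x)(v) + a(v),$$
where the function $a(v)$ satisfies $\lim_{v\to 0}\frac{a(v)}{\|v\|} = 0$, i.e. $a(v) = o(\|v\|)$ in the asymptotic terminology introduced in 6.1.16 on page 391.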
8.1.10. A plane tangent to the graph of a function. The linear approximation of the behaviour of a function by its differential can be expressed in terms of its graph, similarly to the case of univariate functions. We work with hyperplanes instead of tangent lines.
In the case of a function on $E_2$ and a fixed point $(x_0, y_0) \in E_2$, consider the plane in $E_3$ given by the following equation in the three coordinates $(x, y, z)$:
$$z = f(x_0,y_0) + df(x_0,y_0)(x - x_0, y - y_0) = f(x_0,y_0) + \frac{\partial f}{\partial x}(x_0,y_0)(x - x_0) + \frac{\partial f}{\partial y}(x_0,y_0)(y - y_0).$$
We have already seen that the increment of the values of a differentiable function $f : E_n \to \mathbb{R}$ between the points $x$ and $x + tv$ is always expressed in terms of the directional derivative $d_v f$ at a suitable point between them. Therefore, this is the only plane containing the point $(x_0, y_0, f(x_0,y_0))$ with the property that the derivatives, and so the tangent lines, of all curves
$$c(t) = \bigl(x(t), y(t), f(x(t),y(t))\bigr)$$
lie in this plane, too. It is called the tangent plane to the graph of the function $f$.
Two tangent planes to the graph of the function
$$f(x,y) = \sin(x)\cos(y)$$
are shown in the illustration. The diagonal line is the image of the curve $c(t) = (t, t, f(t, t))$.
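$f_y = \frac{y}{\sqrt{x^2+y^2}}$, thus
$$\sqrt{2.98^2 + 4.05^2} \approx f(3,4) + df(3,4)(2.98 - 3,\ 4.05 - 4) = \sqrt{3^2 + 4^2} + \frac{3}{\sqrt{3^2 + 4^2}}(-0.02) + \frac{4}{\sqrt{3^2 + 4^2}}(0.05) = 5 - \frac{0.06}{5} + \frac{0.2}{5} = 5.028.$$
□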
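The quality of the approximation in 8.D.13 can be checked with a few lines of code. This is a hedged sketch (the variable names are ours) that recomputes the linear estimate and compares it with the directly evaluated square root.

```python
import math

def f(x, y):
    return math.hypot(x, y)   # sqrt(x^2 + y^2)

x0, y0 = 3.0, 4.0
dx, dy = 2.98 - x0, 4.05 - y0

# Partial derivatives of f at (3, 4): x / sqrt(x^2 + y^2) and y / sqrt(x^2 + y^2)
fx, fy = x0 / f(x0, y0), y0 / f(x0, y0)

approx = f(x0, y0) + fx * dx + fy * dy   # value via the differential
exact = f(2.98, 4.05)                    # direct evaluation
print(approx, exact, abs(approx - exact))
```

The linear estimate reproduces 5.028, and the difference from the directly computed value is a small fraction of the size of the increments.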
8.D.14. With the help of a differential calculate
i) arctan tVtt!,
y 0.95'
ii) $\ln(0.97^2 + 0.05^2)$,
iii) arcsin
iv) $1.04^{2.02}$.
o
8.D.15. What is approximately the change (in cm³) in the volume of the cone with base radius $r = 10$ cm and height $h = 10$ cm, if we increase the radius by 5 mm and decrease the height by 5 mm?
Solution. The volume is (as a function of the radius $r$ and the height $h$) $V(r,h) = \frac{1}{3}\pi r^2 h$. The change is approximately given by the differential of $V$ at $[10,10]$, that is $dV = \frac{2}{3}\pi r h\,dr + \frac{1}{3}\pi r^2\,dh$, evaluated at
$dr = 10.5 - 10 = 0.5$ and $dh = 9.5 - 10 = -0.5$. We get $\frac{50}{3}\pi$ cm³. □
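For 8.D.15 it is instructive to compare the estimate from the differential with the true change in volume. The sketch below is illustrative only; the names `dV` and `true_change` are our own.

```python
import math

def volume(r, h):
    """Volume of a cone with base radius r and height h."""
    return math.pi * r * r * h / 3

r, h = 10.0, 10.0
dr, dh = 0.5, -0.5

# Differential of V at (r, h): V_r = 2*pi*r*h/3, V_h = pi*r^2/3
dV = (2 * math.pi * r * h / 3) * dr + (math.pi * r * r / 3) * dh
true_change = volume(r + dr, h + dh) - volume(r, h)
print(dV, true_change)
```

The differential gives $\frac{50}{3}\pi$ cm³, while the true change is $\frac{47.375}{3}\pi$ cm³; the gap reflects the second-order terms that the linear approximation ignores, which are noticeable here because the increments are 5 % of the dimensions.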
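8.D.16. Find the tangent plane to the graph of a function $f : \mathbb{R}^2 \to \mathbb{R}$ at a point $P = [x_0, y_0, f(x_0,y_0)]$:
i) $f(x,y) = \sqrt{1 - x^2 - y^2}$, $P = [x_0,y_0,z_0] = [\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, ?]$,
ii) $f(x,y) = e^{x^2+y^2}$, $P = [x_0,y_0,z_0] = [0,0,?]$,
iii) $f(x,y) = x^2 + xy + 2y^2$, $P = [x_0,y_0,z_0] = [1,1,?]$,
iv) $f(x,y) = \arctan\frac{y}{x}$, $P = [x_0,y_0,z_0] = [1,-1,?]$.
Solution.
i) $z_0 = f(x_0,y_0) = \frac{1}{\sqrt{3}}$. Further, $f_x = -\frac{x}{\sqrt{1-x^2-y^2}}$, thus $f_x(x_0,y_0) = -\frac{1/\sqrt{3}}{1/\sqrt{3}} = -1$, and $f_y = -\frac{y}{\sqrt{1-x^2-y^2}}$, thus $f_y(x_0,y_0) = -\frac{1/\sqrt{3}}{1/\sqrt{3}} = -1$. The equation of the tangent plane at $[\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}]$ is
$$z = \frac{1}{\sqrt{3}} - \Bigl(x - \frac{1}{\sqrt{3}}\Bigr) - \Bigl(y - \frac{1}{\sqrt{3}}\Bigr), \quad\text{or}\quad x + y + z = \sqrt{3},$$
ii) $z_0 = 1$, $z = 1$,
iii) $z_0 = 4$, $3x + 5y - z = 4$,
iv) $z_0 = -\frac{\pi}{4}$, $x + y - 2z = \frac{\pi}{2}$.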
For the case of functions of $n$ variables, the tangent plane is defined by analogy with the tangent plane to a surface in the three-dimensional space. Instead of being overwhelmed by many indices, it is useful to recall affine geometry, where hyperplanes can be used, see paragraph 4.1.3.
Tangent (hyper)planes
Definition. A tangent hyperplane to the graph of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point $x \in \mathbb{R}^n$ is the hyperplane containing the point $(x, f(x))$ with the modelling vector space which is the graph of the linear mapping $df(x) : \mathbb{R}^n \to \mathbb{R}$, i.e. the differential at the point $x \in E_n$.
The definition takes advantage of the fact that the directional derivative $d_v f$ is given by the increment in the tangent (hyper)plane corresponding to the increment $v$.
Many analogies with the univariate functions follow from the latter fact. In particular, a differentiable function $f$ on $E_n$ has zero differential at a point $x \in E_n$ if and only if its composition with any curve going through this point has a stationary point there, i.e., is neither increasing nor decreasing in the linear approximation.
In other words, the tangent plane at such a point is parallel to the hyperplane of the variables (i.e., its modelling space is $E_n \subset E_{n+1}$, with the last coordinate set to zero). Of course, this does not mean that $f$ must have a local extremum at such a point. Just as in the case of univariate functions, this depends on the values of higher derivatives. But it is a necessary condition for the existence of extrema.
8.1.11. Derivatives of higher orders. The operation of differentiation can be iterated similarly to the case of univariate functions. This time, choose new directions for each iteration. Fix an increment $v \in \mathbb{R}^n$. The evaluation of the differentials at this argument defines a (differential) operation on differentiable functions $f : E_n \to \mathbb{R}$
$$f \mapsto d_v f = df(v),$$
and the result is again a function $df(v) : E_n \to \mathbb{R}$. If this function is differentiable as well, repeat this procedure with another increment, and so on. In particular, work with iterations of partial derivatives. For second-order partial derivatives, write
$$\Bigl(\frac{\partial}{\partial x_j} \circ \frac{\partial}{\partial x_i}\Bigr) f = \frac{\partial^2}{\partial x_i\,\partial x_j} f = \frac{\partial^2 f}{\partial x_i\,\partial x_j}.$$
□
8.D.17. Find all points on the conic $k : x^2 + 3y^2 - 2x + 6y - 8 = 0$ at which the normal to the conic is parallel to the $y$-axis. For each such point, write the equation of the tangent at that point.
Solution. The normal to $k$ at a point $[x_0, y_0] \in k$ is parallel to the $y$-axis if and only if the tangent line at that point is parallel to the $x$-axis, and this happens if and only if $y'(x_0) = 0$, where $y$ is the function given implicitly by $k$ in a neighbourhood of $[x_0, y_0]$. Differentiating the equation of $k$ gives $2x + 6yy' - 2 + 6y' = 0$, that is, $y' = \frac{1 - x}{3(1 + y)}$. Thus $y'(x_0) = 0$ if and only if $x_0 = 1$. Substituting into the equation of $k$, we get $1 + 3y_0^2 - 2 + 6y_0 - 8 = 0$, thus $y_0 = 1$ or $y_0 = -3$. The sought points are $[1,1]$ and $[1,-3]$, and the equations of the tangents at these points are $y = 1$ and $y = -3$, respectively. □
8.D.18. On the conic given by the equation $3x^2 + 6y^2 - 3x + 3y - 2 = 0$ find all points where the normal to the conic is parallel to the line $y = x$. For each point give the equation of the tangent at that point. O
8.D.19. On the conic given by the equation x2 + xy + 2y2 — x + 3y — 54 = 0 find all points where the normal to the conic is parallel to the x axis. For each point give the equation of the tangent in the point. O
8.D.20. On the graph of the function $u(x,y,z) = x\sqrt{y^2 + z^2}$ find all points where the tangent plane is parallel to the plane $x + y - z - w = 0$. O
8.D.21. Find the points on the ellipsoid $x^2 + 2y^2 + z^2 = 1$ where the tangent planes are parallel to the plane $x - y + 2z = 0$.
Solution. The equation of the tangent plane is determined by the partial derivatives of $z = z(x,y)$ given implicitly by the equation $x^2 + 2y^2 + z^2 = 1$ of the ellipsoid. The normal vector at $[x_0, y_0, z_0]$ is $(z_x(x_0,y_0), z_y(x_0,y_0), -1)$. This vector has to be parallel to the normal $(1,-1,2)$ of the plane, thus $(-2z_x(x_0,y_0), -2z_y(x_0,y_0), 2) = (1,-1,2)$. Since $z_x = -\frac{x}{z}$ and $z_y = -\frac{2y}{z}$, this yields $2x_0 = z_0$, $4y_0 = -z_0$, and after substituting into the ellipsoid's equation we get the sought points: $\bigl[\sqrt{\tfrac{2}{11}}, -\tfrac{1}{2}\sqrt{\tfrac{2}{11}}, 2\sqrt{\tfrac{2}{11}}\bigr]$ and $\bigl[-\sqrt{\tfrac{2}{11}}, \tfrac{1}{2}\sqrt{\tfrac{2}{11}}, -2\sqrt{\tfrac{2}{11}}\bigr]$.
Another solution. It is useful to realize that the normal vector at $[x_0,y_0,z_0]$ of the surface
In the case of the repeated choice $i = j$, write
$$\Bigl(\frac{\partial}{\partial x_i} \circ \frac{\partial}{\partial x_i}\Bigr) f = \frac{\partial^2}{\partial x_i^2} f = \frac{\partial^2 f}{\partial x_i^2}.$$
Proceed in the same way with further iterations and talk about partial derivatives of order $k$
$$\frac{\partial^k f}{\partial x_{i_1} \cdots \partial x_{i_k}}.$$
More generally, one can iterate (assuming the function is sufficiently differentiable) any directional derivatives; for instance, $d_v \circ d_w f$ for two fixed increments $v, w \in \mathbb{R}^n$.
k-times differentiable functions
Definition. A function $f : E_n \to \mathbb{R}$ is $k$-times (continuously) differentiable at a point $x$ if and only if all its partial derivatives up to order $k$ (inclusive) exist in a neighbourhood of the point $x$ and are continuous at this point.
$f$ is $k$-times differentiable if it is $k$-times (continuously) differentiable at all points of its domain.
From now on, we work with continuously differentiable functions unless explicitly stated otherwise.
To show the basic features of the higher derivatives in the simplest form, work in the plane $E_2$, supposing the second-order partial derivatives are continuous. In the plane as well as in the space, iterated derivatives are often denoted by mere indices referring to the variable names, for example:
$$f_x = \frac{\partial f}{\partial x}, \quad f_{xx} = \frac{\partial^2 f}{\partial x^2}, \quad f_{xy} = \frac{\partial^2 f}{\partial x\,\partial y}, \quad f_{yx} = \frac{\partial^2 f}{\partial y\,\partial x}.$$
We show that continuous partial derivatives commute. That is, the order in which differentiation is carried out does not matter.
Suppose that the partial derivatives exist and are continuous, i.e., the limits
$$f_{xy}(x,y) = \lim_{t\to 0}\frac{1}{t}\bigl(f_x(x, y+t) - f_x(x,y)\bigr) = \lim_{t\to 0}\frac{1}{t}\Bigl(\lim_{s\to 0}\frac{1}{s}\bigl(f(x+s, y+t) - f(x, y+t) - f(x+s, y) + f(x,y)\bigr)\Bigr)$$
exist. However, since the limits can be expressed by any choice of values $t_n \to 0$ and $s_n \to 0$ and the limits of the corresponding sequences,
$$f_{xy}(x,y) = \lim_{t\to 0}\frac{1}{t^2}\Bigl(\bigl(f(x+t, y+t) - f(x, y+t)\bigr) - \bigl(f(x+t, y) - f(x,y)\bigr)\Bigr),$$
and this limit value is continuous at $(x, y)$.
Consider the expression from which the last limit is taken as the function $\varphi(x,y,t)$, and try to express it in terms of partial derivatives. For a temporarily fixed $t$, denote $g(x, y) =$
given implicitly by $F(x,y,z) = 0$ is the vector $(F_x(x_0,y_0,z_0), F_y(x_0,y_0,z_0), F_z(x_0,y_0,z_0))$. □
8.D.22. Determine whether the tangent plane to the graph of the function $f : \mathbb{R} \times \mathbb{R}^+ \to \mathbb{R}$, $f(x,y) = x \cdot \ln(y)$ at the point $[1, \frac{1}{e}]$ goes through the point $[1,2,3] \in \mathbb{R}^3$.
Solution. First of all, we calculate the partial derivatives: $f_x(x,y) = \ln(y)$, $f_y(x,y) = \frac{x}{y}$; their values at the point $[1, \frac{1}{e}]$ are $-1$, $e$; further, $f(1, \frac{1}{e}) = -1$. Therefore, the equation of the tangent plane is
$$z = f\bigl(1, \tfrac{1}{e}\bigr) + f_x\bigl(1, \tfrac{1}{e}\bigr)(x - 1) + f_y\bigl(1, \tfrac{1}{e}\bigr)\bigl(y - \tfrac{1}{e}\bigr) = -1 - x + ey.$$
The given point does not satisfy this equation, so it does not lie in the tangent plane. □
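The computation in 8.D.22 follows the general tangent-plane formula $z = f(x_0,y_0) + f_x(x_0,y_0)(x - x_0) + f_y(x_0,y_0)(y - y_0)$, which can be sketched in code as follows (the helper name `tangent_plane` is an assumption made for illustration).

```python
import math

def f(x, y):
    return x * math.log(y)

x0, y0 = 1.0, 1.0 / math.e
z0 = f(x0, y0)        # equals -1 up to rounding
fx = math.log(y0)     # f_x = ln(y), equals -1 at the point
fy = x0 / y0          # f_y = x / y, equals e at the point

def tangent_plane(x, y):
    """Height of the tangent plane above the point (x, y)."""
    return z0 + fx * (x - x0) + fy * (y - y0)

height_at_point = tangent_plane(1.0, 2.0)
print(height_at_point, 2 * math.e - 2)
```

Above $[1,2]$ the plane has height $2e - 2 \approx 3.44$, not 3, which confirms that $[1,2,3]$ does not lie in the tangent plane.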
8.D.23. Determine the parametric equation of the tangent line to the intersection of the graphs of the functions $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 + xy - 6$; $g : \mathbb{R} \times \mathbb{R}^+ \to \mathbb{R}$, $g(x,y) = x \cdot \ln(y)$ at the point $[2,1]$.
Solution. The tangent line to the intersection is the intersection of the tangent planes at the given point. The plane that is tangent to the graph of $f$ at the point $[x_0, y_0] = [2,1]$ is
$$z = f(2,1) + f_x(2,1)(x - x_0) + f_y(2,1)(y - y_0) = 5x + 2y - 12.$$
The tangent plane to the graph of $g$ is then
$$z = g(2,1) + g_x(2,1)(x - x_0) + g_y(2,1)(y - y_0) = 2y - 2.$$
The intersection line of these two planes is given parametrically as $[2, t, 2t - 2]$, $t \in \mathbb{R}$.
Another solution. The normal to the surface given by the equation $f(x,y) - z = 0$ at the point $b = [2,1,0]$ is $(f_x(b), f_y(b), -1) = (5, 2, -1)$; the normal to the surface given by $g(x,y) - z = 0$ at the same point is $(0, 2, -1)$. The tangent line is perpendicular to both normals; we can thus obtain a vector parallel to it as the vector product of the normals, which is $(0, 5, 10)$. Since the tangent line goes through the point $[2,1,0]$, its parametric equation is $[2, 1 + t, 2t]$, $t \in \mathbb{R}$. □
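The vector product used in the second solution of 8.D.23 is quick to verify. In this minimal sketch (the helper `cross` and the sample parameter values are ours), the direction of the tangent line is computed from the two normals, and a few points of the line $[2, 1+t, 2t]$ are tested against both tangent planes.

```python
def cross(a, b):
    """Vector product of two vectors in R^3."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

normal_f = (5, 2, -1)   # normal of the tangent plane z = 5x + 2y - 12
normal_g = (0, 2, -1)   # normal of the tangent plane z = 2y - 2
direction = cross(normal_f, normal_g)

# Points [2, 1 + t, 2t] should satisfy both plane equations.
on_both_planes = all(
    2 * t == 5 * 2 + 2 * (1 + t) - 12 and 2 * t == 2 * (1 + t) - 2
    for t in (-1, 0, 1, 3)
)
print(direction, on_both_planes)
```

The direction $(0,5,10)$ is a multiple of $(0,1,2)$, the direction used in the parametrization $[2, 1+t, 2t]$.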
8.D.24. Compute all first- and second-order partial derivatives of the function $f(x,y,z) = x^{y/z}$.
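Solution. $f_x = \frac{y}{z}x^{\frac{y}{z}-1}$, $f_y = x^{\frac{y}{z}} \ln x \cdot \frac{1}{z}$, $f_z = x^{\frac{y}{z}} \ln x \cdot \frac{-y}{z^2}$,
$f_{xx} = \frac{y}{z}\bigl(\frac{y}{z} - 1\bigr)x^{\frac{y}{z}-2}$, $f_{yy} = x^{\frac{y}{z}} \ln^2 x \cdot \frac{1}{z^2}$,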
$f(x + t, y) - f(x, y)$. Then the expression in the last large parentheses is, by the mean value theorem, equal to
$$g(x, y + t) - g(x, y) = t \cdot g_y(x, y + t_0)$$
for a suitable $t_0$ which lies between 0 and $t$ (the value of $t_0$ depends on $t$).
Now, $g_y(x,y) = f_y(x + t, y) - f_y(x,y)$, so we may rewrite $\varphi$ as
$$\varphi(x,y,t) = \frac{1}{t}\,g_y(x, y + t_0) = \frac{1}{t}\bigl(f_y(x + t, y + t_0) - f_y(x, y + t_0)\bigr).$$
Another application of the mean value theorem yields
$$\varphi(x,y,t) = f_{yx}(x + t_1, y + t_0)$$
for a suitable $t_1$ between 0 and $t$. However, if the large parentheses are split into $(f(x + t, y + t) - f(x + t, y)) - (f(x, y + t) - f(x, y))$, we get, by the same procedure with the function
$h(x, y) = f(x, y + t) - f(x, y)$, the expression
$$\varphi(x,y,t) = f_{xy}(x + s_0, y + s_1)$$
with (in general) different constants $s_0$ and $s_1$. Since it is assumed that the second-order partial derivatives are continuous, the limit for $t \to 0$ guarantees the desired equality
$$f_{xy}(x,y) = f_{yx}(x,y)$$
at all points $(x, y)$.
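The second-order difference quotient used in this argument can be evaluated numerically. The sketch below rests on assumptions chosen only for illustration (the smooth sample function $f(x,y) = e^{xy} + x^3\sin y$, the point, and the step $t$); it compares the quotient with the mixed derivative computed by hand.

```python
import math

def f(x, y):
    # Hypothetical smooth sample function
    return math.exp(x * y) + x ** 3 * math.sin(y)

def phi(x, y, t):
    """Second-order difference quotient from the commutativity argument."""
    return ((f(x + t, y + t) - f(x, y + t)) - (f(x + t, y) - f(x, y))) / (t * t)

x, y = 0.3, 0.7
estimate = phi(x, y, 1e-4)

# Mixed derivative computed by hand: f_xy = f_yx = (1 + xy) e^{xy} + 3 x^2 cos(y)
exact = (1 + x * y) * math.exp(x * y) + 3 * x * x * math.cos(y)
print(estimate, exact)
```

The quotient is symmetric in the roles of $x$ and $y$, which is the numerical counterpart of the fact that it converges to both $f_{xy}$ and $f_{yx}$.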
8.1.12. The same procedure for functions of $n$ variables proves the following fundamental result:
Commutativity of partial derivatives
Theorem. Let $f : E_n \to \mathbb{R}$ be a $k$-times differentiable function with continuous partial derivatives up to order $k$ (inclusive) in a neighbourhood of a point $x \in \mathbb{R}^n$. Then all partial derivatives of the function $f$ at the point $x$ up to order $k$ (inclusive) are independent of the order of differentiation.
Proof. The proof for the second order is illustrated above in the special case when $n = 2$. In fact, it yields the general case as well.
Indeed, notice that for every fixed choice of a pair of coordinates $x_i$ and $x_j$, the discussion of their interchanging takes place in a two-dimensional affine subspace (all the other variables are considered to be constant and do not affect the discussion). So neighbouring partial derivatives may be interchanged. This solves the problem in order two.
In the case of higher-order derivatives, the proof can be completed by induction on the order. Every order of the indices $i_1, \dots, i_k$ can be obtained from a fixed one by several interchanges of adjacent pairs of indices. □
E^-Hnx ■ =%,f" = x* In2 2 ■      + 2* lri2 ■
+
□
8.D.25. Find all first- and second-order partial derivatives of $z = f(x, y)$ at $[1, \sqrt{2}, 2]$ defined in a neighbourhood of the point by $x^2 + y^2 + z^2 - xz - \sqrt{2}yz = 1$.
8.D.26. Find all first- and second-order partial derivatives of $z = f(x, y)$ at $[-2, 0, 1]$ defined in a neighbourhood of the point by $2x^2 + 2y^2 + z^2 + 8xz - z + 8 = 0$. O
8.D.27. Determine all second partial derivatives of the function $f$ given by $f(x, y, z) = \sqrt{xy\ln z}$.
Solution. First, we determine the domain of the given function: the argument of the square root must be non-negative, and the argument of the natural logarithm must be positive. Therefore, $D_f = \{(x, y, z) \in \mathbb{R}^3,\ ((z \ge 1)\ \&\ (xy \ge 0)) \vee ((0 < z \le 1)\ \&\ (xy \le 0))\}$.
Now, we calculate the first partial derivatives with respect to each of the three variables:
$$f_x = \frac{y\ln z}{2\sqrt{xy\ln z}}, \quad f_y = \frac{x\ln z}{2\sqrt{xy\ln z}}, \quad f_z = \frac{xy}{2z\sqrt{xy\ln z}}.$$
Each of these three partial derivatives is again a function of three variables, so we can consider (first) partial derivatives of these functions. Those are the second partial derivatives of the function $f$. We will write the variable with respect to which we differentiate as a subscript of the function $f$.
$$f_{xx} = -\frac{y^2\ln^2 z}{4(xy\ln z)^{3/2}},$$
$$f_{xy} = -\frac{xy\ln^2 z}{4(xy\ln z)^{3/2}} + \frac{\ln z}{2\sqrt{xy\ln z}},$$
$$f_{xz} = -\frac{xy^2\ln z}{4z(xy\ln z)^{3/2}} + \frac{y}{2z\sqrt{xy\ln z}},$$
$$f_{yy} = -\frac{x^2\ln^2 z}{4(xy\ln z)^{3/2}},$$
$$f_{yz} = -\frac{x^2y\ln z}{4z(xy\ln z)^{3/2}} + \frac{x}{2z\sqrt{xy\ln z}},$$
$$f_{zz} = -\frac{x^2y^2}{4z^2(xy\ln z)^{3/2}} - \frac{xy}{2z^2\sqrt{xy\ln z}}.$$
8.1.13. Hessian. The differential was introduced as the linear form df(x) which approximates the function / at a point 2 in the best possible way.
Similarly, a quadratic approximation of a function / : En —> R is possible.
Hessian
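Definition. If $f : \mathbb{R}^n \to \mathbb{R}$ is a twice differentiable function, the symmetric matrix of functions
$$Hf(x) = \left(\frac{\partial^2 f}{\partial x_i \partial x_j}(x)\right) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n}(x) \end{pmatrix}$$
is called the Hessian of the function $f$.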
It has already been seen in the previous reasoning that the vanishing of the differential at a point $(x, y) \in E_2$ guarantees stationary behaviour along all curves going through this point. The Hessian
$$Hf(x, y) = \begin{pmatrix} f_{xx}(x, y) & f_{xy}(x, y) \\ f_{yx}(x, y) & f_{yy}(x, y) \end{pmatrix}$$
plays the role of the second derivative. For every parametrized straight line
$$c(t) = (x(t), y(t)) = (x_0 + \xi t,\ y_0 + \eta t),$$
the derivative of the univariate function $\alpha(t) = f(x(t), y(t))$ can be computed by means of the formula
$$\alpha'(t) = f_x(x(t), y(t))\,x'(t) + f_y(x(t), y(t))\,y'(t)$$
(derived in 8.1.8), and so the function
$$\beta(t) = f(x_0, y_0) + t\,\frac{\partial f}{\partial x}(x_0, y_0)\,\xi + t\,\frac{\partial f}{\partial y}(x_0, y_0)\,\eta + \frac{t^2}{2}\Bigl(f_{xx}(x_0, y_0)\,\xi^2 + 2 f_{xy}(x_0, y_0)\,\xi\eta + f_{yy}(x_0, y_0)\,\eta^2\Bigr)$$
shares the same derivatives with $\alpha$ up to the second order (inclusive) at the point $t = 0$ (calculate this on your own!). The function $\beta$ can be written in terms of vectors as
$$\beta(t) = f(x_0, y_0) + df(x_0, y_0)(tv) + \tfrac{1}{2} Hf(x_0, y_0)(tv, tv),$$
where $v = (\xi, \eta)$ is the increment given by the derivative of the curve $c(t)$, and the Hessian is used as a symmetric 2-form.
This is an expression which looks like Taylor's theorem for univariate functions, namely the quadratic approximation of a function by Taylor's polynomial of degree two. The following illustration shows both the tangent plane and this quadratic approximation for two distinct points and the function $f(x, y) = \sin(x) \cos(y)$.
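The quality of the quadratic approximation can also be checked numerically. The following Python sketch is an added illustration, not part of the original text; the base point (0.3, 0.7), the increments, and all function names are arbitrary choices made here. It evaluates $f(x, y) = \sin(x)\cos(y)$ and its second-order approximation built from the differential and the Hessian computed by hand.

```python
import math

def f(x, y):
    return math.sin(x) * math.cos(y)

def gradient(x, y):
    # (f_x, f_y), differentiated by hand for f = sin(x) cos(y)
    return (math.cos(x) * math.cos(y), -math.sin(x) * math.sin(y))

def hessian(x, y):
    # symmetric matrix ((f_xx, f_xy), (f_xy, f_yy)); here f_yy equals f_xx
    fxx = -math.sin(x) * math.cos(y)
    fxy = -math.cos(x) * math.sin(y)
    return ((fxx, fxy), (fxy, fxx))

def quadratic_approx(x0, y0, xi, eta):
    # f(x0, y0) + df(x0, y0)(v) + (1/2) Hf(x0, y0)(v, v) for v = (xi, eta)
    fx, fy = gradient(x0, y0)
    (fxx, fxy), (_, fyy) = hessian(x0, y0)
    linear = fx * xi + fy * eta
    quadratic = fxx * xi**2 + 2 * fxy * xi * eta + fyy * eta**2
    return f(x0, y0) + linear + 0.5 * quadratic

x0, y0 = 0.3, 0.7
for h in (0.1, 0.01):
    error = abs(f(x0 + h, y0 + h) - quadratic_approx(x0, y0, h, h))
    print(h, error)
```

Shrinking the increment by a factor of ten should reduce the printed error by roughly a factor of a thousand, as expected from a third-order remainder.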
By the theorem about interchangeability of partial derivatives (see 8.1.12), we know that $f_{xy} = f_{yx}$, $f_{xz} = f_{zx}$, $f_{yz} = f_{zy}$. Therefore, it suffices to compute the mixed partial derivatives (the word "mixed" means that we differentiate with respect to more than one variable) just for one order of differentiation. □
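E. Taylor polynomials
8.E.1. Write the second-order Taylor expansion of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = \ln(x^2 + y^2 + 1)$ at the point $[1, 1]$.
Solution. First, we compute the first partial derivatives:
$$f_x = \frac{2x}{x^2 + y^2 + 1}, \qquad f_y = \frac{2y}{x^2 + y^2 + 1},$$
then the Hessian:
$$Hf(x, y) = \begin{pmatrix} \dfrac{2 - 2x^2 + 2y^2}{(x^2 + y^2 + 1)^2} & \dfrac{-4xy}{(x^2 + y^2 + 1)^2} \\[2ex] \dfrac{-4xy}{(x^2 + y^2 + 1)^2} & \dfrac{2 + 2x^2 - 2y^2}{(x^2 + y^2 + 1)^2} \end{pmatrix}.$$
The value of the Hessian at the point $[1, 1]$ is
$$Hf(1, 1) = \begin{pmatrix} \frac{2}{9} & -\frac{4}{9} \\ -\frac{4}{9} & \frac{2}{9} \end{pmatrix}.$$
Altogether, we get that the second-order Taylor expansion at the point $[1, 1]$ is
$$\begin{aligned}
T_2(x, y) &= f(1, 1) + f_x(1, 1)(x - 1) + f_y(1, 1)(y - 1) + \tfrac{1}{2}\,(x - 1,\ y - 1)\, Hf(1, 1) \begin{pmatrix} x - 1 \\ y - 1 \end{pmatrix} \\
&= \ln 3 + \tfrac{2}{3}(x - 1) + \tfrac{2}{3}(y - 1) + \tfrac{1}{9}(x - 1)^2 - \tfrac{4}{9}(x - 1)(y - 1) + \tfrac{1}{9}(y - 1)^2 \\
&= \tfrac{1}{9}\bigl(x^2 + y^2 + 8x + 8y - 4xy - 14\bigr) + \ln 3.
\end{aligned}$$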
□
Remark. In particular, we can see that the second-order Taylor expansion of an arbitrary twice differentiable function at a given point is a polynomial of degree at most two.
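A quick numerical sanity check of the polynomial obtained in 8.E.1 is sketched below in Python. This is an added illustration; the function names and the sample point near $[1, 1]$ are choices made here, not part of the exercise. The polynomial should reproduce the function value at $[1, 1]$ exactly and differ from the function only by a third-order error nearby.

```python
import math

def f(x, y):
    return math.log(x * x + y * y + 1)

def t2(x, y):
    # second-order Taylor polynomial of f at [1, 1], as derived in 8.E.1
    return (x * x + y * y + 8 * x + 8 * y - 4 * x * y - 14) / 9 + math.log(3)

def approximation_error(dx, dy):
    # difference between f and its Taylor polynomial at the point [1 + dx, 1 + dy]
    return abs(f(1 + dx, 1 + dy) - t2(1 + dx, 1 + dy))

print(approximation_error(0.0, 0.0))
print(approximation_error(0.01, -0.01))
```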
8.E.2. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = xy \cos y$ at the point $[\pi, \pi]$. Decide whether the tangent plane to the graph of this function at the point $[\pi, \pi, f(\pi, \pi)]$ goes through the point $[0, \pi, 0]$.
Solution. As in the above exercises, we find out that
$$T(x, y) = \tfrac{1}{2}\pi^2 y^2 - xy - \pi^3 y + \tfrac{1}{2}\pi^4.$$
8.1.14. Taylor's expansion. The multidimensional version of Taylor's theorem is an example of a mathematical statement where the most difficult part is finding the right formulation. The proof is then quite simple. The discussion on the Hessians continues. Write $D^k f$ for the $k$-th order approximations of the function $f : E_n \to \mathbb{R}$. It is always a $k$-linear expression in the increments.
The differential $D^1 f = df$ (the first order) and the Hessian $D^2 f = Hf$ (the second order) have already been discussed. For functions $f : E_n \to \mathbb{R}$, points $x = (x_1, \dots, x_n) \in E_n$, and increments $v = (\xi_1, \dots, \xi_n)$, set
$$D^k f(x)(v) = \sum_{1 \le i_1, \dots, i_k \le n} \frac{\partial^k f}{\partial x_{i_1} \cdots \partial x_{i_k}}(x_1, \dots, x_n)\, \xi_{i_1} \cdots \xi_{i_k}.$$
An illustrative example (making use of the symmetry of the partial derivatives) is, for $E_2$, the third-order expression
$$D^3 f(x, y)(\xi, \eta) = \frac{\partial^3 f}{\partial x^3}\,\xi^3 + 3\,\frac{\partial^3 f}{\partial x^2 \partial y}\,\xi^2 \eta + 3\,\frac{\partial^3 f}{\partial x \partial y^2}\,\xi \eta^2 + \frac{\partial^3 f}{\partial y^3}\,\eta^3,$$
and, in general,
$$D^k f(x, y)(\xi, \eta) = \sum_{\ell = 0}^{k} \binom{k}{\ell} \frac{\partial^k f}{\partial x^{k - \ell} \partial y^{\ell}}\, \xi^{k - \ell} \eta^{\ell}.$$
Taylor expansion with remainder
Theorem. Let $f : E_n \to \mathbb{R}$ be a $k$-times differentiable function in a neighbourhood $O_\delta(x)$ of a point $x \in E_n$. For every increment $v \in \mathbb{R}^n$ of size $\|v\| < \delta$, there exists a number $\theta$, $0 \le \theta \le 1$, such that
$$f(x + v) = f(x) + D^1 f(x)(v) + \frac{1}{2!} D^2 f(x)(v) + \cdots + \frac{1}{(k - 1)!} D^{k - 1} f(x)(v) + \frac{1}{k!} D^k f(x + \theta \cdot v)(v).$$
Proof. Given an increment $v \in \mathbb{R}^n$, consider the parametrized straight line $c(t) = x + tv$ in $E_n$, and examine the function $\varphi : \mathbb{R} \to \mathbb{R}$ defined by the composition $\varphi(t) = f \circ c(t)$. Taylor's theorem for univariate functions claims that (see Theorem 6.1.3)
The tangent plane to the graph of the given function at the point $[\pi, \pi, f(\pi, \pi)]$ is given by the first-order Taylor polynomial at the point $[\pi, \pi]$; its general equation is thus
$$z = -\pi y - \pi x + \pi^2,$$
and this equation is satisfied by the given point $[0, \pi, 0]$. □
8.E.3. Determine the third-order Taylor polynomial of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = x^3 y + xz^2 + xy + 1$ at the point $[0, 0, 0]$. ○
8.E.4. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 \sin y + y^2 \cos x$ at the point $[0, 0]$. Decide whether the tangent plane to the graph of this function at the point $[0, 0, 0]$ goes through the point $[\pi, \pi, \pi]$. ○
8.E.5. Determine the second-order Taylor polynomial of the function $\ln(x^2 y)$ at the point $[1, 1]$. ○
8.E.6. Determine the second-order Taylor polynomial of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = \tan(xy + y)$ at the point $[0, 0]$. ○
F. Extrema of multivariate functions
8.F.1. Determine the stationary points of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 y + y^2 x - xy$ and decide which of these points are local extrema and of which type.
Solution. The first derivatives are $f_x = 2xy + y^2 - y$, $f_y = x^2 + 2xy - x$. If we set both partial derivatives equal to zero simultaneously, the system has the following solutions: $\{x = y = 0\}$, $\{x = 0, y = 1\}$, $\{x = 1, y = 0\}$, $\{x = 1/3, y = 1/3\}$, which are the four stationary points of the given function.
The Hessian of the function $f$ is
$$Hf(x, y) = \begin{pmatrix} 2y & 2x + 2y - 1 \\ 2x + 2y - 1 & 2x \end{pmatrix}.$$
Its values at the stationary points are, respectively,
$$\begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}, \quad \begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix}, \quad \begin{pmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{2}{3} \end{pmatrix}.$$
Therefore, the first three Hessians are indefinite, and the last one is positive definite. The point $[1/3, 1/3]$ is thus a local minimum. □
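The classification in 8.F.1 can be reproduced mechanically. The Python sketch below is an added illustration (the helper names are chosen here); it uses exact rational arithmetic and the standard fact that a symmetric 2×2 matrix is indefinite when its determinant is negative, and definite, with the sign of its top-left entry, when the determinant is positive.

```python
from fractions import Fraction as Fr

def grad(x, y):
    # (f_x, f_y) for f(x, y) = x^2 y + y^2 x - x y
    return (2 * x * y + y * y - y, x * x + 2 * x * y - x)

def hessian(x, y):
    mixed = 2 * x + 2 * y - 1
    return ((2 * y, mixed), (mixed, 2 * x))

def classify(h):
    # determinant test for a symmetric 2x2 matrix
    (a, b), (_, c) = h
    det = a * c - b * b
    if det < 0:
        return "indefinite"
    if det > 0:
        return "positive definite" if a > 0 else "negative definite"
    return "degenerate"

points = [(Fr(0), Fr(0)), (Fr(0), Fr(1)), (Fr(1), Fr(0)), (Fr(1, 3), Fr(1, 3))]
for p in points:
    print(p, grad(*p), classify(hessian(*p)))
```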
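$$\varphi(t) = \varphi(0) + \varphi'(0)\,t + \cdots + \frac{1}{(k - 1)!}\varphi^{(k - 1)}(0)\,t^{k - 1} + \frac{1}{k!}\varphi^{(k)}(\theta t)\,t^k, \qquad 0 \le \theta \le 1.$$
It remains to verify that computing the derivatives $\varphi^{(k)}$ yields the desired relation. This can be done quite easily by induction on the order $k$.
For $k = 1$, Taylor's theorem coincides with the corollary of the mean value theorem applied to the directional derivative, which has already been used several times. When deriving it, the formula
$$\frac{d}{dt}\varphi(t) = \frac{\partial f}{\partial x_1}(x(t)) \cdot x_1'(t) + \cdots + \frac{\partial f}{\partial x_n}(x(t)) \cdot x_n'(t)$$
is used. It holds for every continuously differentiable curve and function $f$, cf. 8.1.8 and 8.1.9. This means that
$$\varphi'(t) = D^1 f(c(t))(c'(t)) = D^1 f(c(t))(v)$$
for all $t$ in a neighbourhood of zero. Proceed similarly for the higher-order expressions $D^\ell f$. Write $c'(t)$ instead of the increment $v$, and recall that further differentiation of $c(t)$ leads identically to zero everywhere, i.e. $c''(t) = 0$ for all $t$ (since it is a parametrized straight line). Suppose
$$\varphi^{(\ell)}(t) = D^\ell f(x(t))(v) = \sum_{1 \le i_1, \dots, i_\ell \le n} \frac{\partial^\ell f}{\partial x_{i_1} \cdots \partial x_{i_\ell}}(x_1(t), \dots, x_n(t))\, x_{i_1}'(t) \cdots x_{i_\ell}'(t)$$
and calculate $\varphi^{(\ell + 1)}(t)$. By the above formula for first-order differentiation in a given direction and the rule for the derivative of a product (see Theorem 5.3.4), the differentiation of the composite function gives
$$\frac{d}{dt}\varphi^{(\ell)}(t) = \sum_{1 \le i_1, \dots, i_\ell \le n}\ \sum_{j = 1}^{n} \frac{\partial^{\ell + 1} f}{\partial x_{i_1} \cdots \partial x_{i_\ell} \partial x_j}(x_1(t), \dots, x_n(t))\, x_j'(t)\, x_{i_1}'(t) \cdots x_{i_\ell}'(t) + 0,$$
which is the required formula for order $\ell + 1$. Taylor's theorem now follows by evaluating at the point $t = 0$ and substituting into the equality for $\varphi$ at the beginning of this proof. □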
8.1.15. Formula with multi-indices. To simplify the notation, introduce the multi-index notation for the polynomials with more variables.
8.F.2. Determine the point in the plane $x + y + 3z = 5$ lying in $\mathbb{R}^3$ which is closest to the origin of the coordinate system. First, do this by applying the methods of linear algebra; then, using the methods of differential calculus.
Solution. It is the intersection point of the plane with the perpendicular to the plane going through the point $[0, 0, 0]$. The normal line to the plane is $(t, t, 3t)$, $t \in \mathbb{R}$. Substituting into the equation of the plane, we get $11t = 5$, hence the intersection point $[5/11, 5/11, 15/11]$.
Alternatively, we can minimize the distance (or its square) of the plane's points from the origin, i.e., the function
$$(5 - y - 3z)^2 + y^2 + z^2.$$
Setting the partial derivatives equal to zero, we get the system
$$3y + 10z - 15 = 0, \qquad 2y + 3z - 5 = 0,$$
whose solution is as above. Since we know that the minimum exists and is the only stationary point, we need not calculate the Hessian any more. □
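Both approaches of 8.F.2 can be compared in a few lines of Python. The sketch below is an added illustration with helper names chosen here: the first function scales the normal vector so that it reaches the plane, the second solves the 2×2 linear system from the calculus approach by Cramer's rule.

```python
from fractions import Fraction as Fr

def closest_point_on_plane(normal, d):
    # plane: normal . p = d; the closest point to the origin is t * normal
    # with t = d / |normal|^2
    t = Fr(d) / sum(Fr(c) ** 2 for c in normal)
    return tuple(t * c for c in normal)

def solve_2x2(a11, a12, b1, a21, a22, b2):
    # Cramer's rule for a11*u + a12*w = b1, a21*u + a22*w = b2 (integer input)
    det = a11 * a22 - a12 * a21
    return (Fr(b1 * a22 - a12 * b2, det), Fr(a11 * b2 - b1 * a21, det))

# linear algebra: foot of the perpendicular from the origin
print(closest_point_on_plane((1, 1, 3), 5))

# calculus: stationary point of (5 - y - 3z)^2 + y^2 + z^2
y, z = solve_2x2(3, 10, 15, 2, 3, 5)
x = 5 - y - 3 * z
print((x, y, z))
```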
Multi-indices
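A multi-index $\alpha$ of length $n$ is an $n$-tuple of non-negative integers $(\alpha_1, \dots, \alpha_n)$. The integer $|\alpha| = \alpha_1 + \cdots + \alpha_n$ is called the size of the multi-index $\alpha$.
Monomials are written shortly as $x^\alpha$ instead of $x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}$. Real polynomials in $n$ variables can be symbolically expressed in a similar way as univariate polynomials:
$$f = \sum_{|\alpha| \le k} a_\alpha x^\alpha, \qquad g = \sum_{|\beta| \le \ell} b_\beta x^\beta \quad \in \mathbb{R}[x_1, \dots, x_n].$$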
8.F.3. Determine the local extrema of the function $f(x, y) = x^2 + \arctan^2 x + |y^3 + y|$, $x, y \in \mathbb{R}$.
The polynomial $f$ is said to have total degree $k$ if at least one coefficient with a multi-index $\alpha$ of size $k$ is non-zero, while all the coefficients with multi-indices of larger sizes vanish.
Nice formulae express addition and multiplication of multivariate polynomials of degrees $k$ and $\ell$ respectively:
$$f + g = \sum_{|\alpha| \le \max(k, \ell)} (a_\alpha + b_\alpha)\, x^\alpha, \qquad fg = \sum_{|\gamma| = 0}^{k + \ell} \Bigl(\sum_{\alpha + \beta = \gamma} a_\alpha b_\beta\Bigr) x^\gamma,$$
where the multi-indices are added componentwise, and the formally non-existing coefficients are assumed to be zero. Moreover, we write shortly
$$\partial^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}, \qquad \alpha! = \alpha_1! \cdots \alpha_n!.$$
In particular, $\partial^\alpha f = f$ if $\alpha = 0$.
Solution. The function $f$ can be written as the sum $f_1 + f_2$, where $f_1(x) = x^2 + \arctan^2 x$, $f_2(y) = |y^3 + y|$, $x, y \in \mathbb{R}$.
If the function $f$ has a local extremum at a point, then it does so with respect to an arbitrary subset of its domain. In other words, if the function has, for instance, a maximum at a point $[a, b]$ and we set $y = b$, then the univariate function $f(x, b)$ of $x$ must have a maximum at the point $x = a$. Let us thus fix an arbitrary $y \in \mathbb{R}$. For this fixed value of $y$, we get a univariate function, which is a shift of the function $f_1$. This means that its maxima and minima are at the same points. However, it is easy to find the extrema of the function $f_1$. We can just realize that this function is even (it is the sum of two even functions, and the function $y = \arctan^2 x$ is the product of two odd functions) and increasing for $x > 0$ (the composition as well as the sum of increasing functions is again an increasing function). Therefore, it has a unique extremum, and that is a minimum at the point $x = 0$. Similarly, for any fixed value of $x$, $f$ is a shift of the function $f_2$, and $f_2$ has a minimum at the point $y = 0$, which is its only extremum. We have thus proved that $f$ can have a local extremum only at the origin. Since
$$f(0, 0) = 0, \qquad f(x, y) > 0, \quad [x, y] \in \mathbb{R}^2 \setminus \{[0, 0]\},$$
Taylor polynomials via multi-indices
Taylor expansion up to order $k$ of a function $f : \mathbb{R}^n \to \mathbb{R}$, for an increment $v \in \mathbb{R}^n$, is the polynomial
$$(1) \qquad f(x + v) = f(x) + \sum_{1 \le |\alpha| \le k} \frac{1}{\alpha!}\, \partial^\alpha f(x)\, v^\alpha,$$
quite as the formula in dimension one.
If the multidimensional power series $F(v) = \sum_{|\alpha| \ge 0} \frac{1}{\alpha!}\, \partial^\alpha f(x)\, v^\alpha$ converges on some neighborhood of $v = 0$, we call the function $f$ (real) analytic on a neighborhood of $x$. For instance, this happens if all the partial derivatives are uniformly bounded, i.e. $|\partial^\alpha f(x)| \le C$ for all $\alpha$, since then we easily find a convergent single variable majorant power series $\sum_{|\alpha| \ge 0} \frac{C}{\alpha!}\, |v|^{|\alpha|}$. Think about the details.
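The multi-index notation is easy to experiment with. The Python sketch below is an added illustration (the function names are chosen here); it enumerates all multi-indices of a given size and checks, for one sample case, the identity $\sum_{|\alpha| = m} \frac{1}{\alpha!} = \frac{n^m}{m!}$, which follows from the multinomial theorem and is one possible way to reduce a bound of the above type to a single variable exponential series.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def multi_indices(n, size):
    # all n-tuples of non-negative integers whose entries sum to `size`
    return [a for a in product(range(size + 1), repeat=n) if sum(a) == size]

def multi_factorial(alpha):
    # alpha! = alpha_1! * ... * alpha_n!
    result = 1
    for a in alpha:
        result *= factorial(a)
    return result

n, m = 3, 4
total = sum(Fraction(1, multi_factorial(a)) for a in multi_indices(n, m))
print(len(multi_indices(n, m)), total, Fraction(n**m, factorial(m)))
```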
8.1.16. Local extrema. We examine the local maxima and minima of functions on $E_n$ using the differential and the Hessian. Just as in the case of univariate functions, an interior point $x_0 \in E_n$ of the domain of a function $f$ is said to be a (local) maximum or minimum if and only if there is a neighbourhood $U$ of $x_0$ such that for all points $x \in U$, the function value satisfies $f(x) \le f(x_0)$ or $f(x) \ge f(x_0)$, respectively. If strict inequalities hold for all $x \ne x_0$, there is a strict extremum.
the function $f$ has a strict local (even global) minimum at the point $[0, 0]$. □
8.F.4. Examine the local extrema of the function
$$f(x, y) = (x + y^2)\, e^{x/2}, \qquad x, y \in \mathbb{R}.$$
Solution. This function has partial derivatives of all orders on the whole of its domain. Therefore, local extrema can occur only at stationary points, where both the partial derivatives fx, fy are zero. Then, it can be determined whether the local extremum occurs by computing the second derivatives. We can easily determine that
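$$f_x(x, y) = e^{x/2} + \tfrac{1}{2}(x + y^2)\, e^{x/2}, \qquad f_y(x, y) = 2y\, e^{x/2}, \qquad x, y \in \mathbb{R}.$$
A stationary point $[x, y]$ must satisfy
$$f_y(x, y) = 0, \quad \text{i.e.} \quad y = 0,$$
and, further,
$$f_x(x, y) = f_x(x, 0) = e^{x/2}\bigl(1 + \tfrac{1}{2}x\bigr) = 0, \quad \text{i.e.} \quad x = -2.$$
We can see that there is a unique stationary point, namely $[-2, 0]$.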
Now, we calculate the Hessian Hf at this point. If this matrix (the corresponding quadratic form) is positive definite, the extremum is a strict local minimum. If it is negative definite, the extremum is a strict local maximum. Finally, if the matrix is indefinite, there will be no extremum at the point. We have
$$f_{xx}(x, y) = \tfrac{1}{2}\, e^{x/2}\bigl(2 + \tfrac{1}{2}(x + y^2)\bigr), \qquad f_{yy}(x, y) = 2 e^{x/2}, \qquad f_{xy}(x, y) = f_{yx}(x, y) = y\, e^{x/2}, \qquad x, y \in \mathbb{R}.$$
Therefore,
$$Hf(-2, 0) = \begin{pmatrix} f_{xx}(-2, 0) & f_{xy}(-2, 0) \\ f_{yx}(-2, 0) & f_{yy}(-2, 0) \end{pmatrix} = \begin{pmatrix} \frac{1}{2e} & 0 \\ 0 & \frac{2}{e} \end{pmatrix}.$$
We should recall that the eigenvalues of a diagonal matrix are exactly the values on the diagonal. Further, positive definiteness means that all the eigenvalues are positive. Hence it follows that there is a strict local minimum at the point $[-2, 0]$. □
8.F.5. Find the local extrema of the function
$$f(x, y, z) = x^3 + y^2 + \tfrac{1}{2}z^2 - 3xz - 2y + 2z, \qquad x, y, z \in \mathbb{R}.$$
Solution. The function $f$ is a polynomial; therefore, it has partial derivatives of all orders. It thus suffices to look for its stationary points (the extrema cannot be elsewhere). In order
To simplify, suppose that $f$ has continuous first-order and second-order partial derivatives on its domain. A necessary condition for the existence of an extremum at a point $x_0$ is that the differential be zero at this point, i.e., $df(x_0) = 0$. If $df(x_0) \ne 0$, then there is a direction $v$ in which $d_v f(x_0) \ne 0$. However, then the function value is increasing on one side of the point $x_0$ along the line $x_0 + tv$ and decreasing on the other side, see 5.3.2.
An interior point $x \in E_n$ of the domain of a function $f$ at which the differential $df(x)$ is zero is called a stationary point of the function $f$.
To illustrate the concept on a simple function in $E_2$, consider $f(x, y) = \sin(x) \cos(y)$.
The shape of this function resembles the well-known egg plates, so it is evident that there are many extrema, and also many stationary points which are not extrema ("saddles" are visible in the picture).
Calculate the first derivatives, and then the necessary second-order ones:
fx(x, y) = cos(x) cos(y), fy(x, y) = - sin(x) sin(y),
and both derivatives are zero for two sets of points
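(1) $\cos(x) = 0$, $\sin(y) = 0$, that is $(x, y) = \bigl(\frac{2k + 1}{2}\pi,\ \ell\pi\bigr)$, for any $k, \ell \in \mathbb{Z}$;
(2) $\cos(y) = 0$, $\sin(x) = 0$, that is $(x, y) = \bigl(k\pi,\ \frac{2\ell + 1}{2}\pi\bigr)$, for any $k, \ell \in \mathbb{Z}$.
The second partial derivatives are
$$Hf(x, y) = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}(x, y) = \begin{pmatrix} -\sin(x)\cos(y) & -\cos(x)\sin(y) \\ -\cos(x)\sin(y) & -\sin(x)\cos(y) \end{pmatrix}.$$
So the following Hessians are obtained in the two sets of stationary points:
(1) $Hf\bigl(k\pi + \frac{\pi}{2},\ \ell\pi\bigr) = \pm\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, where the minus sign occurs when $k$ and $\ell$ have the same parity (remainder on division by two), and the sign $+$ occurs in the other case;
(2) $Hf\bigl(k\pi,\ \ell\pi + \frac{\pi}{2}\bigr) = \pm\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, where again the minus sign occurs when $k$ and $\ell$ have the same parity, and the sign $+$ occurs in the other case.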
From the proposition of Taylor's theorem for order k = 2, there is, in a neighbourhood of one of the stationary points
$(x_0, y_0)$,
to find them, we differentiate $f$ with respect to each of the three variables $x$, $y$, $z$ and set the derivatives equal to zero. We thus obtain
$$3x^2 - 3z = 0, \quad \text{i.e.,} \quad z = x^2,$$
$$2y - 2 = 0, \quad \text{i.e.,} \quad y = 1,$$
and (utilizing the first equation)
$$z - 3x + 2 = 0, \quad \text{i.e.,} \quad x \in \{1, 2\}.$$
Therefore, there are two stationary points, namely [1,1,1] and [2,1,4]. Now, we compute all second-order partial derivatives:
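$$f_{xx} = 6x, \quad f_{xy} = f_{yx} = 0, \quad f_{xz} = f_{zx} = -3, \quad f_{yy} = 2, \quad f_{yz} = f_{zy} = 0, \quad f_{zz} = 1.$$
Having this, we are able to evaluate the Hessian at the stationary points:
$$Hf(1, 1, 1) = \begin{pmatrix} 6 & 0 & -3 \\ 0 & 2 & 0 \\ -3 & 0 & 1 \end{pmatrix}, \qquad Hf(2, 1, 4) = \begin{pmatrix} 12 & 0 & -3 \\ 0 & 2 & 0 \\ -3 & 0 & 1 \end{pmatrix}.$$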
Now, we need to know whether these matrices are positive definite, negative definite, or indefinite in order to determine whether and which extrema occur at the corresponding points. Clearly, the former matrix (for the point $[1, 1, 1]$) has eigenvalue $\lambda = 2$. Since its determinant equals $-6$ and it is a symmetric matrix (all eigenvalues are real), the matrix must have a negative eigenvalue as well (because the determinant is the product of the eigenvalues). Therefore, the matrix $Hf(1, 1, 1)$ is indefinite, and there is no extremum at the point $[1, 1, 1]$.
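We will use the so-called Sylvester's criterion for the latter matrix $Hf(2, 1, 4)$. According to this criterion, a real symmetric matrix
$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{12} & a_{22} & a_{23} & \cdots & a_{2n} \\ a_{13} & a_{23} & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & a_{3n} & \cdots & a_{nn} \end{pmatrix}$$
is positive definite if and only if all of its leading principal minors, i.e. the determinants
$$d_1 = |a_{11}|, \quad d_2 = \begin{vmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{vmatrix}, \quad d_3 = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{vmatrix}, \quad \dots, \quad d_n = |A|,$$
are positive. Further, it is negative definite iff
$$f(x, y) = f(x_0, y_0) + \tfrac{1}{2} Hf\bigl(x_0 + \theta(x - x_0),\ y_0 + \theta(y - y_0)\bigr)(\xi, \eta).$$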
Here, $Hf$ is considered to be a quadratic form evaluated at the increment $(x - x_0, y - y_0) = (\xi, \eta)$. In the case (1), $Hf(x_0, y_0)(\xi, \eta) = \pm(\xi^2 + \eta^2)$, while in the case (2), $Hf(x_0, y_0)(\xi, \eta) = \pm 2\xi\eta$. While in the first case the quadratic form is either always positive or always negative on all nonzero arguments, in the second case there are always arguments with positive values and other arguments with negative values.
Since the Hessian of the function is continuous (i.e. all the partial derivatives up to order two are continuous), the Hessians at the nearby points are small perturbations of those at $(x_0, y_0)$, and so these properties of the quadratic form $Hf(x, y)$ remain true on some neighbourhood of $(x_0, y_0)$. This is obvious in cases (1) and (2), since a small perturbation of the matrices clearly does not change the latter properties of the quadratic forms in question. A general formal proof is presented below.
The local maximum occurs if and only if the point $(x_0, y_0)$ belongs to the case (1) with $k$ and $\ell$ of the same parity. On the other hand, if the parities are different, then the point from the case (1) happens to be a point of a local minimum.
On the other hand, in the case (2) the entire function $f$ behaves similarly to the Hessian and so the "saddle" points are not extrema.
8.1.17. The decision rules. In order to formulate the general statement about the Hessian and the local extrema at stationary points, it is necessary to remember the discussion about quadratic forms from the paragraphs 4.2.6–4.2.7 in the chapter on affine geometry. The following types of quadratic forms $h : E_n \to \mathbb{R}$ were introduced there:
• positive definite if and only if $h(u) > 0$ for all $u \ne 0$
• positive semidefinite if and only if $h(u) \ge 0$ for all $u \in E_n$
• negative definite if and only if $h(u) < 0$ for all $u \ne 0$
• negative semidefinite if and only if $h(u) \le 0$ for all $u \in E_n$
• indefinite if and only if $h(u) > 0$ and $h(v) < 0$ for appropriate $u, v \in E_n$.
There are methods to allow determining whether or not a given form has any of these properties.
The Taylor expansion with remainder immediately yields the following rules:
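Local extrema
$$d_1 < 0, \quad d_2 > 0, \quad d_3 < 0, \quad \dots, \quad (-1)^n d_n > 0.$$
The inequalities
$$|12| = 12 > 0, \qquad \begin{vmatrix} 12 & 0 \\ 0 & 2 \end{vmatrix} = 24 > 0, \qquad \begin{vmatrix} 12 & 0 & -3 \\ 0 & 2 & 0 \\ -3 & 0 & 1 \end{vmatrix} = 6 > 0$$
imply that the matrix $Hf(2, 1, 4)$ is positive definite; there is a strict local minimum at the point $[2, 1, 4]$. □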
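Sylvester's criterion is straightforward to automate for small matrices. The following Python sketch is an added illustration; the function names are chosen here, and the determinant is computed by Laplace expansion, which is adequate only for small sizes. It reports a symmetric matrix as indefinite when the criterion fails but the determinant is non-zero, which is justified because a non-degenerate symmetric form that is not definite must be indefinite.

```python
def det(m):
    # Laplace expansion along the first row (fine for small matrices)
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def leading_minors(m):
    # d_1, ..., d_n: determinants of the upper-left k x k blocks
    return [det([row[:k] for row in m[:k]]) for k in range(1, len(m) + 1)]

def sylvester(m):
    d = leading_minors(m)
    if all(x > 0 for x in d):
        return "positive definite"
    if all((-1) ** k * x > 0 for k, x in enumerate(d, start=1)):
        return "negative definite"
    return "indefinite" if d[-1] != 0 else "inconclusive"

h1 = [[6, 0, -3], [0, 2, 0], [-3, 0, 1]]    # Hf(1, 1, 1) from 8.F.5
h2 = [[12, 0, -3], [0, 2, 0], [-3, 0, 1]]   # Hf(2, 1, 4) from 8.F.5
print(leading_minors(h1), sylvester(h1))
print(leading_minors(h2), sylvester(h2))
```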
8.F.6. Find the local extrema of the function
$$z = (x^2 - 1)(1 - x^4 - y^2), \qquad x, y \in \mathbb{R}.$$
Solution. Once again, we calculate the partial derivatives $z_x$, $z_y$ and set them equal to zero. This leads to the equations
$$-6x^5 + 4x^3 + 2x - 2xy^2 = 0, \qquad (x^2 - 1)(-2y) = 0,$$
whose solutions are $[x, y] = [0, 0]$, $[x, y] = [1, 0]$, $[x, y] = [-1, 0]$. (In order to find these solutions, it suffices to find the real roots $1, -1$ of the polynomial $-6x^4 + 4x^2 + 2$ using the substitution $u = x^2$.) Now, we compute the second-order partial derivatives
$$z_{xx} = -30x^4 + 12x^2 + 2 - 2y^2, \qquad z_{xy} = z_{yx} = -4xy, \qquad z_{yy} = -2(x^2 - 1)$$
and evaluate the Hessian at the stationary points:
$$Hz(0, 0) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}, \qquad Hz(1, 0) = Hz(-1, 0) = \begin{pmatrix} -16 & 0 \\ 0 & 0 \end{pmatrix}.$$
We can see that the first matrix is positive definite, so the function has a strict local minimum at the origin.
However, the second and third matrices are negative semidefinite. Therefore, the knowledge of the second partial derivatives is insufficient for deciding whether there is an extremum at the points $[1, 0]$ and $[-1, 0]$. On the other hand, we can examine the function values near these points. We have $z(1, 0) = z(-1, 0) = 0$ and $z(x, 0) < 0$ for $x \in (-1, 1)$. Further, consider $y$ dependent on $x \in (-1, 1)$ by the formula $y = \sqrt{2(1 - x^4)}$, so that $y \to 0$ for $x \to \pm 1$. For this choice, we get $z\bigl(x, \sqrt{2(1 - x^4)}\bigr) = (x^2 - 1)(x^4 - 1) > 0$, $x \in (-1, 1)$. We have thus shown that in arbitrarily small neighborhoods of the points $[1, 0]$ and $[-1, 0]$, the function $z$ takes on both higher and lower values than the function value at the corresponding point. Therefore, these are not extrema.
□
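The argument used for the degenerate points in 8.F.6 can be illustrated numerically. The Python sketch below is an added illustration with names chosen here: it evaluates $z$ along the $x$-axis and along the curve $y = \sqrt{2(1 - x^4)}$ for values of $x$ approaching 1, where the two sequences of values should have opposite signs.

```python
import math

def z(x, y):
    return (x * x - 1) * (1 - x**4 - y * y)

def values_near_stationary_point(x):
    # z on the x-axis and on the curve y = sqrt(2 (1 - x^4)), valid for |x| < 1
    y_curve = math.sqrt(2 * (1 - x**4))
    return z(x, 0.0), z(x, y_curve)

for x in (0.9, 0.99, 0.999):
    on_axis, on_curve = values_near_stationary_point(x)
    print(x, on_axis, on_curve)
```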
Theorem. Let $f : E_n \to \mathbb{R}$ be a twice continuously differentiable function and $a \in E_n$ be a stationary point of the function $f$. Then
(1) $f$ has a strict local minimum at $a$ if $Hf(a)$ is positive definite,
(2) $f$ has a strict local maximum at $a$ if $Hf(a)$ is negative definite,
(3) $f$ does not have an extremum at $a$ if $Hf(a)$ is indefinite.
Proof. The Taylor second-order expansion with remainder applied to our function $f(x_1, \dots, x_n)$, an arbitrary point $a = (a_1, \dots, a_n)$, and any increment $v = (v_1, \dots, v_n)$ such that all points $a + \theta v$, $\theta \in [0, 1]$, lie in the domain of the function $f$, says that
$$f(a + v) = f(a) + df(a)(v) + \tfrac{1}{2} Hf(a + \theta \cdot v)(v)$$
for an appropriate real number $\theta$, $0 \le \theta \le 1$. Since it is supposed that the differential is zero at $a$, we obtain
$$f(a + v) = f(a) + \tfrac{1}{2} Hf(a + \theta \cdot v)(v).$$
By assumption, the quadratic form $Hf(x)$ depends continuously on the point $x$, and the definiteness or indefiniteness of quadratic forms can be determined by the signs of the leading principal minors of the matrix $Hf$, see Sylvester's criterion in paragraph 4.2.7. However, the determinant itself is a polynomial expression in the coefficients of the matrix, hence a continuous function. Therefore, the non-vanishing and signs of the examined determinants are the same in a sufficiently small neighbourhood of the point $a$ as at the point $a$ itself.
In particular, for a positive definite $Hf(a)$, it is guaranteed that at a stationary point $a$, $f(a + v) > f(a)$ for all sufficiently small non-zero $v$. So this is a strict minimum of the function $f$ at the point $a$. The case of negative definiteness is analogous. If $Hf(a)$ is indefinite, then there are directions $v$, $w$ in which $f(a + v) > f(a)$ and $f(a + w) < f(a)$, so there is no extremum at the stationary point in question. □
The theorem yields no result if the Hessian of the function is degenerate, yet not indefinite at the point in question. The reason is the same as in the case of univariate functions. In these cases, there are directions in which both the first and second derivatives vanish, so at this level of approximation, it cannot be determined whether the function behaves like $t^3$ or $\pm t^4$ until higher-order derivatives in the necessary directions are calculated.
At the same time, even at those points where the differential is non-zero, the definiteness of the Hessian $Hf(x)$ has similar consequences as the non-vanishing of the second derivative of a univariate function. For a function $f : \mathbb{R}^n \to \mathbb{R}$, the expression $z(x + v) = f(x) + df(x)(v)$ defines the tangent hyperplane to the graph of the function $f$ in the space
8.F.7. Decide whether the polynomial
$$p(x, y) = x^6 + y^8 + y^4 x^4 - x^6 y^5$$
has a local extremum at the stationary point $[0, 0]$.
Solution. We can easily verify that the partial derivatives $p_x$ and $p_y$ are, indeed, zero at the origin. However, each of the partial derivatives $p_{xx}$, $p_{xy}$, $p_{yy}$ is also equal to zero at the point $[0, 0]$. The Hessian $Hp(0, 0)$ is thus both positive and negative semidefinite at the same time. However, a simple idea can lead us to the result: we can notice that $p(0, 0) = 0$ and
$$p(x, y) = x^6(1 - y^5) + y^8 + y^4 x^4 > 0 \quad \text{for } [x, y] \in \mathbb{R} \times (-1, 1) \setminus \{[0, 0]\}.$$
Therefore, the given polynomial has a strict local minimum at the origin. □
8.F.8. Determine the local extrema of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = x^2 y + y^2 z + x - z$ on $\mathbb{R}^3$. ○
8.F.9. Determine the local extrema of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = x^2 y - y^2 z + 4x + z$ on $\mathbb{R}^3$. ○
8.F.10. Determine the local extrema of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = xz^2 + y^2 z - x + y$ on $\mathbb{R}^3$. ○
8.F.11. Determine the local extrema of the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = y^2 z - xz^2 + x + 4y$ on $\mathbb{R}^3$. ○
8.F.12. Determine the local extrema of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 y + x^2 + 2y^2 + y$ on $\mathbb{R}^2$. ○
8.F.13. Determine the local extrema of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 y + 2y^2 + 2y$ on $\mathbb{R}^2$. ○
8.F.14. Determine the local extrema of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 + xy + 2y^2 + y$ on $\mathbb{R}^2$. ○
8.F.15. Determine the local extrema of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) = x^2 + xy - 2y^2 + y$ on $\mathbb{R}^2$. ○
G. Implicitly given functions and mappings
8.G.1. Let $F : \mathbb{R}^2 \to \mathbb{R}$ be a function, $F(x, y) = xy \sin\bigl(\frac{\pi}{2} x y^2\bigr)$. Show that the equality $F(x, y) = 1$ implicitly defines a function $f : U \to \mathbb{R}$ on a neighborhood $U$ of the point $1$, with $f(1) = 1$, so that $F(x, f(x)) = 1$ for $x \in U$. Determine $f'(1)$.
Solution. The function is differentiable on the whole $\mathbb{R}^2$, so it is such on any neighborhood of the point $[1, 1]$. Let us evaluate $F_y$ at $[1, 1]$:
$$F_y(x, y) = x \sin\bigl(\tfrac{\pi}{2} x y^2\bigr) + \pi x^2 y^2 \cos\bigl(\tfrac{\pi}{2} x y^2\bigr),$$
$\mathbb{R}^{n+1}$. Taylor's theorem of order two with remainder, as used in the proof above, provides the expression
$$f(x + v) = z(x + v) + \tfrac{1}{2} Hf(x + \theta v)(v).$$
If the Hessian is positive definite, all the values of the function $f$ lie above the values of the tangent hyperplane for arguments in a sufficiently small neighbourhood of the point $x$, i.e. the whole graph is above the tangent hyperplane in a sufficiently small neighbourhood. In the case of negative definiteness, it is the other way round.
Finally, when the Hessian is indefinite, the graph of the function has values on both sides of the hyperplane. This happens, in general, along objects of lower dimensions in the tangent hyperplane, so there is no straightforward generalization of inflection points.
8.1.18. The differential of mappings. The concepts of derivative and differential can be easily extended to mappings $F : E_n \to E_m$. Having selected the Cartesian coordinate system on both sides, this mapping is an ordinary $m$-tuple
$$F(x_1, \dots, x_n) = \bigl(f_1(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)\bigr)$$
of functions $f_i : E_n \to \mathbb{R}$. $F$ is defined to be a differentiable or $k$-times differentiable mapping if and only if the corresponding property is shared by all the functions $f_1, \dots, f_m$.
The differentials $df_i(x)$ of the particular functions $f_i$ give a linear approximation of the increments of their values for the mapping $F$. Therefore, we can expect that they also give a coordinate expression of the linear mapping $D^1 F(x) : \mathbb{R}^n \to \mathbb{R}^m$ between the modelling spaces which linearly approximates the increments of the mapping $F$.
Differential and Jacobi matrix
Consider a differentiable mapping $F : \mathbb{R}^n \to \mathbb{R}^m$ with components $\bigl(f_1(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)\bigr)$ and $x$ in its domain. The matrix
$$D^1 F(x) = \begin{pmatrix} df_1(x) \\ df_2(x) \\ \vdots \\ df_m(x) \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}(x)$$
is called the Jacobi matrix of the mapping $F$ at the point $x$.
The linear mapping $D^1 F(x)$ defined on the increments $v = (v_1, \dots, v_n)$ by the Jacobi matrix is called the differential of the mapping $F$ at a point $x$ in the domain if and only if
$$\lim_{v \to 0} \frac{1}{\|v\|}\bigl(F(x + v) - F(x) - D^1 F(x)(v)\bigr) = 0.$$
Recall that the definition of Euclidean distance guarantees that the limits of values in En exist if and only if the limits of the particular coordinate components do. Direct application of Theorem 8.1.6 about the existence of the differential for functions of n variables to the particular coordinate
so $F_y(1, 1) = 1 \ne 0$. Therefore, it follows from Theorem 8.1.24 that the equation $F(x, y) = 1$ implicitly determines on a neighborhood of the point $(1, 1)$ a function $f : U \to \mathbb{R}$ defined on a neighborhood $U$ of the point (number) $1$. Moreover, we have
$$F_x(x, y) = y \sin\bigl(\tfrac{\pi}{2} x y^2\bigr) + \tfrac{\pi}{2}\, x y^3 \cos\bigl(\tfrac{\pi}{2} x y^2\bigr),$$
so the derivative of the function $f$ at the point $1$ satisfies
$$f'(1) = -\frac{F_x(1, 1)}{F_y(1, 1)} = -\frac{1}{1} = -1.$$
□
Remark. Notice that although we are unable to explicitly express the function $f$ from the equation $F(x, f(x)) = 1$, we are able to determine its derivative at the point $1$.
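The value of the derivative in 8.G.1 can be cross-checked numerically without an explicit formula for $f$. The Python sketch below is an added illustration; the names, the Newton iteration used to solve $F(x, y) = 1$ for $y$ near 1, and the step size are all choices made here. It approximates $f'(1)$ by a central difference of the numerically computed implicit function.

```python
import math

def F(x, y):
    return x * y * math.sin(0.5 * math.pi * x * y * y)

def F_y(x, y):
    a = 0.5 * math.pi * x * y * y
    return x * math.sin(a) + math.pi * x * x * y * y * math.cos(a)

def implicit_f(x, y_start=1.0, steps=50):
    # Newton's method in the variable y for the equation F(x, y) = 1;
    # assumes x is close to 1, so that the start y = 1 is close to the solution
    y = y_start
    for _ in range(steps):
        y -= (F(x, y) - 1.0) / F_y(x, y)
    return y

def derivative_at_1(h=1e-4):
    # central difference approximation of f'(1)
    return (implicit_f(1 + h) - implicit_f(1 - h)) / (2 * h)

print(implicit_f(1.0), derivative_at_1())
```

The printed derivative should be close to the value obtained above from the quotient of the partial derivatives.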
8.G.2. Considering the function $F : \mathbb{R}^2 \to \mathbb{R}$,
$$F(x, y) = e^x \sin(y) + y - \pi/2 - 1,$$
show that the equation $F(x, y) = 0$ implicitly defines the variable $y$ to be a function of $x$, $y = f(x)$, on a neighborhood of the point $[0, \pi/2]$. Compute $f'(0)$.
Solution. The function is differentiable in a neighborhood of the point $[0, \pi/2]$; moreover, $F_y = e^x \cos y + 1$, $F_y(0, \pi/2) = 1 \ne 0$, so the equation indeed defines a function $f : U \to \mathbb{R}$ on a neighborhood $U$ of the point $0$. Further, we have $F_x = e^x \sin y$, $F_x(0, \pi/2) = 1$, and the derivative of $f$ at the point $0$ satisfies
$$f'(0) = -\frac{F_x(0, \pi/2)}{F_y(0, \pi/2)} = -\frac{1}{1} = -1.$$
8.G.3. Let
$$F(x, y, z) = \sin(xy) + \sin(yz) + \sin(xz).$$
Show that the equation $F(x, y, z) = 0$ implicitly defines a function $z(x, y) : \mathbb{R}^2 \to \mathbb{R}$ on a neighborhood of the point $[\pi, 1, 0] \in \mathbb{R}^3$ so that $F(x, y, z(x, y)) = 0$. Determine $z_x(\pi, 1)$ and $z_y(\pi, 1)$.
Solution. We calculate $F_z = y\cos(yz) + x\cos(xz)$, $F_z(\pi, 1, 0) = \pi + 1 \ne 0$, so the function $z(x, y)$ is defined by the equation $F(x, y, z(x, y)) = 0$ on a neighborhood of the point $[\pi, 1, 0]$. In order to find the values of the wanted partial derivatives, we first need to calculate the values of the remaining partial derivatives of the function $F$ at the point $[\pi, 1, 0]$:
$$F_x(x, y, z) = y\cos(xy) + z\cos(xz), \qquad F_x(\pi, 1, 0) = -1,$$
$$F_y(x, y, z) = x\cos(xy) + z\cos(yz), \qquad F_y(\pi, 1, 0) = -\pi,$$
functions of the mapping $F$ thus leads to the following generalization (prove this in detail by yourselves!):
Existence of the differential
Corollary. Let $F : E_n \to E_m$ be a mapping such that all of its coordinate functions have continuous partial derivatives in a neighbourhood of a point $x \in E_n$. Then the differential $D^1 F(x)$ exists, and it is given by the Jacobi matrix $D^1 F(x)$.
8.1.19. Lipschitz continuity. Continuous differentiability of mappings allows good control of their variability in the following sense. Assume that estimates of the difference $F(y) - F(x)$ for all $x$ and $y$ from a convex compact subset $K$ in the domain of $F$ are of interest. Applying the Taylor theorem with remainder of order one to each of the components of $F = (f_1, \dots, f_m)$ separately gives the estimate (write $v = y - x$)
$$\begin{aligned}
\|F(y) - F(x)\|^2 &= \sum_{i=1}^{m} \bigl(f_i(y) - f_i(x)\bigr)^2 = \sum_{i=1}^{m} \bigl|D^1 f_i(x + \theta_i v)(v)\bigr|^2 \\
&= \sum_{i=1}^{m} \Bigl|\sum_{j=1}^{n} \frac{\partial f_i}{\partial x_j}(x + \theta_i v)\, v_j\Bigr|^2 \le \Bigl(\max_{z \in K,\, i,\, j} \Bigl|\frac{\partial f_i}{\partial x_j}(z)\Bigr|\Bigr)^2 n\, m\, \|v\|^2 \le C^2 \|v\|^2
\end{aligned}$$
for an appropriate constant $C > 0$. The fact that continuous functions are bounded over each compact set is used.
This is the property of Lipschitz continuity of $F$ on the compact set $K$:
$$\|F(y) - F(x)\| \le C\,\|y - x\|, \quad \text{for all } x, y \in K,$$
which was considered in 7.3.14 at the end of chapter 7.
Lemma. Each continuously differentiable mapping F : Rn —> Rm is Lipschitz continuous over convex compact sets.
8.1.20. Differential of composite mappings. The following theorem formulates a very useful generalization of the chain rule for univariate functions. Except for the concept of the differential itself, which is mildly complicated, it is actually the same as the one already seen in the case of one variable.
The Jacobi matrix for univariate functions is a single number, namely the derivative of the function at a given point, so the multiplication of Jacobi matrices is simply the multiplication of the derivatives of the outer and inner components of the function. There is, of course, another special case: the formula derived and used several times for the derivative of a composition of multivariate functions with curves. There, the differential is the one-form expressed via the partial derivatives of the outer components, evaluated on the vector of the derivative of the inner component, again given by the product of one row (the form) and one column (the vector).
CHAPTER 8. CALCULUS WITH MORE VARIABLES
whence
$$z_x(\pi, 1) = -\frac{F_x(\pi,1,0)}{F_z(\pi,1,0)} = \frac{1}{\pi+1}, \qquad z_y(\pi, 1) = -\frac{F_y(\pi,1,0)}{F_z(\pi,1,0)} = \frac{\pi}{\pi+1}.$$
□
The chain rule
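8.G.4. Having the mapping $F : \mathbb{R}^3 \to \mathbb{R}^2$, $F(x, y, z) = (f(x,y,z), g(x,y,z)) = (e^{x\sin y}, xyz)$, show that the equation $F(x, c_1(x), c_2(x)) = (0, 0)$ defines a curve $c : \mathbb{R} \to \mathbb{R}^2$ on a neighborhood of the point $[1, \pi, 1]$. Determine the tangent vector to this curve at the point 1.
Solution. We will calculate the square matrix of the partial derivatives of the mapping $F$ with respect to $y$ and $z$:
$$H(x,y,z) = \begin{pmatrix} f_y & f_z \\ g_y & g_z \end{pmatrix} = \begin{pmatrix} x\cos y\, e^{x\sin y} & 0 \\ xz & xy \end{pmatrix}.$$
Hence,
$$H(1,\pi,1) = \begin{pmatrix} -1 & 0 \\ 1 & \pi \end{pmatrix}$$
and $\det H(1,\pi,1) = -\pi \neq 0$. Now, it follows from the implicit mapping theorem (see 8.1.24) that the equation $F(x, c_1(x), c_2(x)) = (0, 0)$ on a neighborhood of the point $[1, \pi, 1]$ determines a curve $(c_1(x), c_2(x))$ defined on a neighborhood of the point $[1, \pi]$. In order to find its tangent vector at this point, we need to determine the (column) vector $(f_x, g_x)$ at this point:
$$\begin{pmatrix} f_x \\ g_x \end{pmatrix} = \begin{pmatrix} \sin y\, e^{x\sin y} \\ yz \end{pmatrix}, \qquad \begin{pmatrix} f_x(1,\pi,1) \\ g_x(1,\pi,1) \end{pmatrix} = \begin{pmatrix} 0 \\ \pi \end{pmatrix}.$$
The wanted tangent vector is thus
$$\begin{pmatrix} (c_1)_x(1) \\ (c_2)_x(1) \end{pmatrix} = -\begin{pmatrix} f_y(1,\pi,1) & f_z(1,\pi,1) \\ g_y(1,\pi,1) & g_z(1,\pi,1) \end{pmatrix}^{-1} \begin{pmatrix} f_x(1,\pi,1) \\ g_x(1,\pi,1) \end{pmatrix} = -\begin{pmatrix} -1 & 0 \\ 1 & \pi \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ \pi \end{pmatrix} = -\begin{pmatrix} -1 & 0 \\ \frac{1}{\pi} & \frac{1}{\pi} \end{pmatrix} \begin{pmatrix} 0 \\ \pi \end{pmatrix} = \begin{pmatrix} 0 \\ -1 \end{pmatrix}.$$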
□
H. Constrained optimization
We will begin with a somewhat atypical optimization problem.
8.H.1. A betting office accepts bets on the outcome of a tennis match. Let the odds laid against player A winning be $a : 1$ (i.e., if a bettor bets $x$ dollars on the event that player A wins and this really happens, then the bettor wins $ax$ dollars) and, similarly, let the odds laid against player B winning be $b : 1$ (fees are neglected). What is the necessary and sufficient condition on the (positive real) numbers $a$ and $b$ so that a bettor cannot guarantee any profit regardless of the actual outcome of the match? (For instance, if the odds were laid 1.5 : 1 against
Theorem. Let $F : E_n \to E_m$ and $G : E_m \to E_r$ be two differentiable mappings, where the domain of $G$ contains the whole image of $F$. Then, the composite mapping $G \circ F$ is also differentiable, and its differential at any point $x$ in the domain of $F$ is given by the composition of differentials
$$D^1(G \circ F)(x) = D^1G(F(x)) \circ D^1F(x).$$
The Jacobi matrix on the left hand side is the product of the corresponding Jacobi matrices on the right hand side.
Proof. In paragraph 8.1.6 and in the proof of Taylor's theorem, it was derived how the differentiation of mappings composed of functions and curves behaves. This proved the theorem in the special case of $n = r = 1$. The general case can be proved analogously, one just has to work with more vectors.
Fix an arbitrary increment $v$ and calculate the directional derivative for the composition $G \circ F$ at a point $x \in E_n$. This means to determine the differentials for the particular coordinate functions of the mapping $G$ composed with $F$. To simplify, write $g \circ F$ for any one of them.
$$d_v(g \circ F)(x) = \lim_{t\to 0} \frac{1}{t}\bigl(g(F(x + tv)) - g(F(x))\bigr).$$
The expression in parentheses can, from the definition of the differential of $g$, be expressed as
$$g(F(x + tv)) - g(F(x)) = dg(F(x))\bigl(F(x + tv) - F(x)\bigr) + \alpha\bigl(F(x + tv) - F(x)\bigr),$$
where $\alpha$ is a continuous function, defined for all sufficiently small increments $w$ at the point $F(x)$, which satisfies $\lim_{w\to 0} \frac{1}{\|w\|}\alpha(w) = 0$. Substitution into the equality for the directional derivative yields
$$d_v(g \circ F)(x) = \lim_{t\to 0} \frac{1}{t}\Bigl(dg(F(x))\bigl(F(x + tv) - F(x)\bigr) + \alpha\bigl(F(x + tv) - F(x)\bigr)\Bigr)$$
$$= dg(F(x))\Bigl(\lim_{t\to 0} \frac{1}{t}\bigl(F(x + tv) - F(x)\bigr)\Bigr) + \lim_{t\to 0} \frac{1}{t}\,\alpha\bigl(F(x + tv) - F(x)\bigr) = dg(F(x)) \circ D^1F(x)(v) + 0.$$
The fact that linear mappings between finite-dimensional spaces are always continuous was used. In the last step, the Lipschitz continuity of $F$, i.e. $\|F(x + tv) - F(x)\| \le C\|v\|\,|t|$, was exploited together with the properties of the function $\alpha$.
So the theorem for the particular functions $g_1, \dots, g_r$ of the mapping $G$ is proved. The theorem in general now follows from the definition of matrix multiplication and its links to linear mappings. □
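The statement about Jacobi matrices can be checked numerically. The following is only a sketch: the two mappings `F` and `G` are hypothetical examples chosen for illustration (they do not come from the text), and the Jacobi matrices are approximated by central differences rather than computed exactly.

```python
import math

def jacobian(func, point, h=1e-6):
    """Approximate the Jacobi matrix of func at point by central differences."""
    rows = len(func(point))
    cols = len(point)
    jac = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        plus = list(point)
        minus = list(point)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(plus), func(minus)
        for i in range(rows):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical sample mappings, chosen only for illustration.
def F(p):                      # F : R^2 -> R^3
    x, y = p
    return [x * y, math.sin(x), x + y * y]

def G(q):                      # G : R^3 -> R^2
    u, v, w = q
    return [u * w, math.exp(v) + u]

def G_after_F(p):
    return G(F(p))

x0 = [0.7, -0.3]
lhs = jacobian(G_after_F, x0)                       # D^1(G o F)(x0)
rhs = matmul(jacobian(G, F(x0)), jacobian(F, x0))   # D^1 G(F(x0)) . D^1 F(x0)
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_gap)  # expected to be tiny (finite-difference and rounding error only)
```

The two matrices agree up to the error of the finite-difference approximation, which is what the theorem predicts for any differentiable pair of mappings.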
the win of A and 5 : 1 against the win of B, then the bettor could bet 3 dollars on B winning and 7 dollars on A winning and profit from this bet in either case).
Solution. Let the bettor have $P$ dollars. The bet amount can be divided into $kP$ and $(1 - k)P$ dollars, where $k \in (0,1)$. The bettor then receives $akP$ dollars (if player A wins) or $b(1 - k)P$ dollars (if B does). The bettor is always guaranteed to receive the lesser of these two amounts; the total profit (or loss) is then obtained by subtracting the bet $P$. Since each of $a$, $b$, $P$ is a positive real number, the function $akP$ is increasing, and the function $b(1 - k)P$ is decreasing with respect to $k$. For $k = 0$, $b(1-k)P$ is greater; for $k = 1$, $akP$ is. The minimum of the two numbers $akP$ and $b(1 - k)P$ is thus maximal for some $k \in (0,1)$, namely for the value $k_0$ which satisfies $ak_0P = b(1 - k_0)P$, whence $k_0 = \frac{b}{a+b}$. Therefore, the betting office must choose $a$, $b$ so that $ak_0P = b(1 - k_0)P \le P$, which is equivalent to $ak_0 \le 1$, i.e., $ab \le a + b$. □
We managed to solve this constrained optimization problem even without using the differential calculus. However, we will not be able to do so in the following problems.
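The condition $ab \le a + b$ is easy to test on concrete odds. The sketch below uses hypothetical helper names (`best_split`, `profit_can_be_guaranteed`) and reproduces the example from the problem statement, odds 1.5 : 1 and 5 : 1.

```python
def best_split(a, b):
    """Return (k0, payout per dollar staked) for odds a : 1 and b : 1."""
    k0 = b / (a + b)          # fraction of the stake placed on player A
    return k0, a * k0         # a*k0 equals b*(1 - k0): the guaranteed payout

def profit_can_be_guaranteed(a, b):
    """True exactly when ab > a + b, i.e. when the office set the odds badly."""
    return a * b > a + b

k0, payout = best_split(1.5, 5.0)
print(k0, payout)                           # payout per dollar exceeds 1 here
print(profit_can_be_guaranteed(1.5, 5.0))   # the example from the statement
print(profit_can_be_guaranteed(2.0, 2.0))   # borderline case ab = a + b
```

With 7 dollars on A and 3 dollars on B, as in the statement, the bettor receives at least `min(1.5 * 7, 5 * 3)` dollars for a stake of 10, which is a profit in either case.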
8.H.2. Find the extremal values of the function
$$h(x, y, z) = x^3 + y^3 + z^3$$
on the unit sphere $S$ in $\mathbb{R}^3$ given by the equation
$$F(x, y, z) = x^2 + y^2 + z^2 - 1 = 0$$
as well as on the circle which is the intersection of this sphere with the plane
$$G(x, y, z) = x + y + z = 0.$$
Solution. First, we will look for stationary points of the function $h$ on the sphere $S$. Computing the corresponding gradients (for instance, $\operatorname{grad} h(x, y, z) = (3x^2, 3y^2, 3z^2)$), we get the system
$$0 = 3x^2 - 2\lambda x, \quad 0 = 3y^2 - 2\lambda y, \quad 0 = 3z^2 - 2\lambda z, \quad 0 = x^2 + y^2 + z^2 - 1$$
consisting of four equations in four variables. Before trying to solve this system, we can estimate how many local constrained extrema we should anticipate the function to have. Surely, $h(P)$ is in absolute value equal to at most 1, and this happens at all intersection points of the coordinate axes with
8.1.21. Transformation of coordinates. A mapping $F : E_n \to E_n$ which has an inverse mapping $G : E_n \to E_n$ defined on the entire image of $F$ is called a transformation. Such a mapping can be perceived as a change of coordinates. It is usually required that both $F$ and $G$ be (continuously) differentiable mappings.
Just as in the case of vector spaces, the choice of "point of view", i.e. the choice of coordinates, can simplify or complicate the comprehension of the examined object. The change of coordinates is now being discussed in a much more general form than in the case of affine mappings in the fourth chapter. Sometimes, the term "curvilinear coordinates" is used in this general sense. An illustrative example is the change of the most usual coordinates in the plane to polar coordinates. That is, the position of a point $P$ is given by its distance $r = \sqrt{x^2 + y^2}$ from the origin and the angle $\varphi = \arctan(y/x)$ between the ray from the origin to it and the $x$-axis (if $x \neq 0$).
The illustration shows the "line" $r = \varphi$ drawn in the Cartesian coordinates.
The change from the polar coordinates to the standard ones is
$$P_{\text{polar}} = (r, \varphi) \mapsto (r\cos\varphi, r\sin\varphi) = P_{\text{Cartesian}}.$$
It is apparent that it is necessary to limit the polar coordinates to an appropriate subset of points $(r, \varphi)$ in the plane so that the inverse mapping exists. The Cartesian image of lines in polar coordinates with constant coordinates $r$ or $\varphi$ is also shown in the illustration above.
We illustrate the concept of transformation and the theorem about differentiation of composite mappings by an example. The inverse to the above is the transformation $F : \mathbb{R}^2 \to \mathbb{R}^2$ (for instance, on the domain of all points in the first quadrant except for the points having $x = 0$):
$$r = \sqrt{x^2 + y^2}, \qquad \varphi = \arctan\frac{y}{x}.$$
Consider now the function $g_t : E_2 \to \mathbb{R}$, with free parameter $t \in \mathbb{R}$,
$$g(r, \varphi, t) = \sin(r - t)$$
in polar coordinates. Such a function can approximate the waves on a water surface after a point impulse in the origin at the time $t$, see the illustration (there, $t = -\pi/2$). While it
S. Therefore, we are likely to get 6 local extrema. Further, inside every eighth of the sphere given by the coordinate planes, there may or may not be another extremum. The particular quadrants can be easily parametrized, and the function h (considered a function of two parameters) can be analyzed by standard means (or we can have it drawn in Maple, for example).
Actually, solving the system (no matter whether algebraically or in Maple again) leads to a great many stationary points. Besides the six points we have already talked about (two of the coordinates equal to zero and the other to $\pm 1$), which have $\lambda = \pm\frac{3}{2}$, there are also the points
$$P_\pm = \pm\Bigl(\frac{\sqrt{3}}{3}, \frac{\sqrt{3}}{3}, \frac{\sqrt{3}}{3}\Bigr),$$
for example, where a local extremum indeed occurs.
If we restrict our interest to the points of the circle $K$, we must add the function $G$ together with another free parameter $\eta$ representing its gradient coefficient. This leads to the bigger system
$$0 = 3x^2 - 2\lambda x - \eta, \quad 0 = 3y^2 - 2\lambda y - \eta, \quad 0 = 3z^2 - 2\lambda z - \eta, \quad 0 = x^2 + y^2 + z^2 - 1, \quad 0 = x + y + z.$$
However, since a circle is also a compact set, h must have both a global minimum and maximum on it. Further analysis is left to the reader. □
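As a sanity check of the stationary points on the sphere mentioned in this solution, one can substitute them back into the system $0 = 3x^2 - 2\lambda x$, $0 = 3y^2 - 2\lambda y$, $0 = 3z^2 - 2\lambda z$, $0 = x^2 + y^2 + z^2 - 1$. The sketch below does only that; the helper name `residual` is an assumption made for this illustration, and the list is not claimed to contain all stationary points.

```python
import math

def h(p):
    """The examined function h(x, y, z) = x^3 + y^3 + z^3."""
    return sum(c ** 3 for c in p)

def residual(p, lam):
    """Largest absolute value among the four equations of the system."""
    x, y, z = p
    return max(abs(3 * x * x - 2 * lam * x),
               abs(3 * y * y - 2 * lam * y),
               abs(3 * z * z - 2 * lam * z),
               abs(x * x + y * y + z * z - 1))

s = math.sqrt(3) / 3
# Pairs (point, lambda): the six axis points and the two points P+ and P-.
candidates = [
    ((1, 0, 0), 1.5), ((-1, 0, 0), -1.5),
    ((0, 1, 0), 1.5), ((0, -1, 0), -1.5),
    ((0, 0, 1), 1.5), ((0, 0, -1), -1.5),
    ((s, s, s), 3 * s / 2), ((-s, -s, -s), -3 * s / 2),
]
worst = max(residual(p, lam) for p, lam in candidates)
values = [h(p) for p, _ in candidates]
print(worst)    # close to zero: every listed point solves the system
print(values)   # values of h at the listed points
```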
8.H.3. Determine whether the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = x^2y$ has any extrema on the surface $2x^2 + 2y^2 + z^2 = 1$. If so, find these extrema and determine their types.
Solution. Since we are interested in extrema of a continuous function on a compact set (an ellipsoid, which is both closed and bounded in $\mathbb{R}^3$), the given function must have both a minimum and maximum on it. Moreover, since the constraint is given by a continuously differentiable function and the examined function is differentiable, the extrema must occur at stationary points of the function in question on the given set. We can build the following system for the stationary points:
$$2xy = 4kx, \quad x^2 = 4ky, \quad 0 = 2kz.$$
was easy to define the function in polar coordinates, it would have been much harder to guess with Cartesian coordinates.
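Compute the derivative of this function in Cartesian coordinates. Using the theorem,
$$\frac{\partial g}{\partial x}(x,y,t) = \frac{\partial g}{\partial r}(r,\varphi)\,\frac{\partial r}{\partial x}(x,y) + \frac{\partial g}{\partial \varphi}(r,\varphi)\,\frac{\partial \varphi}{\partial x}(x,y) = \cos\bigl(\sqrt{x^2+y^2} - t\bigr)\frac{x}{\sqrt{x^2+y^2}} + 0$$
and, similarly,
$$\frac{\partial g}{\partial y}(x,y,t) = \frac{\partial g}{\partial r}(r,\varphi)\,\frac{\partial r}{\partial y}(x,y) + \frac{\partial g}{\partial \varphi}(r,\varphi)\,\frac{\partial \varphi}{\partial y}(x,y) = \cos\bigl(\sqrt{x^2+y^2} - t\bigr)\frac{y}{\sqrt{x^2+y^2}}.$$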
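The two formulas obtained from the chain rule can be compared with difference quotients of $g$ written directly in Cartesian coordinates. This is a minimal sketch; the evaluation point and the helper `central_difference` are arbitrary choices made for illustration.

```python
import math

def g(x, y, t):
    """g = sin(r - t) written in Cartesian coordinates, r = sqrt(x^2 + y^2)."""
    return math.sin(math.hypot(x, y) - t)

def dg_dx(x, y, t):
    """Partial derivative with respect to x, as given by the chain rule."""
    r = math.hypot(x, y)
    return math.cos(r - t) * x / r

def dg_dy(x, y, t):
    """Partial derivative with respect to y, as given by the chain rule."""
    r = math.hypot(x, y)
    return math.cos(r - t) * y / r

def central_difference(func, s, step=1e-6):
    return (func(s + step) - func(s - step)) / (2 * step)

x0, y0, t0 = 0.8, 1.3, -math.pi / 2   # arbitrary sample point, t as in the illustration
numeric_dx = central_difference(lambda s: g(s, y0, t0), x0)
numeric_dy = central_difference(lambda s: g(x0, s, t0), y0)
print(numeric_dx, dg_dx(x0, y0, t0))
print(numeric_dy, dg_dy(x0, y0, t0))
```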
8.1.22. The inverse mapping theorem. If the first derivative of a differentiable univariate function is non-zero, its sign determines whether the function is increasing or decreasing. Then, the function has this property in a neighbourhood of the point in question, and so an inverse function exists in the selected neighbourhood. The derivative of the inverse function $f^{-1}$ is then the reciprocal value of the derivative of the function $f$ (i.e. the inverse with respect to multiplication of real numbers).
Interpreting this situation for a mapping $E_1 \to E_1$ and linear mappings $\mathbb{R} \to \mathbb{R}$ as their differentials, the nonvanishing is a necessary and sufficient condition for the differential to be invertible as a linear mapping. In this way, a statement is obtained which is valid for all finite-dimensional spaces in general:
The inverse mapping theorem
Theorem. Let $F : E_n \to E_n$ be a differentiable mapping on a neighbourhood of a point $x_0 \in E_n$, and let the Jacobi matrix $D^1F(x_0)$ be invertible.
Then in some neighbourhood of $x_0$, the inverse mapping $F^{-1}$ exists, it is differentiable, and its differential at the point $F(x_0)$ is the inverse mapping to the differential $D^1F(x_0)$.
Hence, $D^1(F^{-1})(F(x_0))$ is given by the inverse matrix to the Jacobi matrix of the mapping $F$ at the point $x_0$.
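The last statement can be illustrated on the polar transformation from 8.1.21. In the sketch below, `jacobi_F` and `jacobi_G` are hand-computed Jacobi matrices of $F(x,y) = (r, \varphi)$ and of its inverse $G(r,\varphi) = (r\cos\varphi, r\sin\varphi)$; the function names and the sample point (in the first quadrant) are assumptions of this example. Their product should be the identity matrix.

```python
import math

def jacobi_F(x, y):
    """Jacobi matrix of F(x, y) = (sqrt(x^2 + y^2), arctan(y/x))."""
    r2 = x * x + y * y
    r = math.sqrt(r2)
    return [[x / r, y / r],
            [-y / r2, x / r2]]

def jacobi_G(r, phi):
    """Jacobi matrix of the inverse G(r, phi) = (r cos phi, r sin phi)."""
    return [[math.cos(phi), -r * math.sin(phi)],
            [math.sin(phi), r * math.cos(phi)]]

def matmul2(a, b):
    """Product of two 2x2 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

x0, y0 = 1.2, 0.5                                  # sample point, first quadrant
r0, phi0 = math.hypot(x0, y0), math.atan(y0 / x0)  # F(x0, y0)
product = matmul2(jacobi_G(r0, phi0), jacobi_F(x0, y0))
print(product)   # approximately the identity matrix
```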
This system is satisfied by the points $\bigl[\pm\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{6}}, 0\bigr]$ and $\bigl[\pm\frac{1}{\sqrt{3}}, -\frac{1}{\sqrt{6}}, 0\bigr]$. The function takes on only two values at these four stationary points. It follows from the above that the first and second stationary points are maxima of the function on the given ellipsoid, while the other two are minima. □
Remark. Note that we have used the variable $k$ instead of $\lambda$ from the theorem 8.1.28.
8.H.4. Decide whether the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = z - xy^2$ has any minima and maxima on the sphere
$$x^2 + y^2 + z^2 = 1.$$
If so, determine them.
Solution. We are looking for solutions of the system
$$kx = -y^2, \quad ky = -2xy, \quad kz = 1.$$
The second equation implies that either $y = 0$ or $x = -\frac{k}{2}$. The first possibility leads to the points $[0,0,1]$, $[0,0,-1]$. The second one cannot be satisfied. Note that because of the third equation $k \neq 0$, and substituting into the equation of the sphere, we get the equation
$$\frac{k^2}{4} + \frac{k^2}{2} + \frac{1}{k^2} = 1,$$
which has no solution in real numbers (it is a quadratic equation in $k^2$ with negative discriminant). The function has a maximum and minimum, respectively, at the two computed points on the given sphere. □
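8.H.5. Determine whether the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x, y, z) = xyz$, has any extrema on the ellipsoid given by the equation
$$g(x, y, z) = kx^2 + ly^2 + z^2 = 1, \qquad k, l \in \mathbb{R}^+.$$
If so, calculate them.
Solution. First, we build the equations which must be satisfied by the stationary points of the given function on the ellipsoid:
$$\frac{\partial f}{\partial x} = \lambda\frac{\partial g}{\partial x}: \quad yz = 2\lambda kx,$$
$$\frac{\partial f}{\partial y} = \lambda\frac{\partial g}{\partial y}: \quad xz = 2\lambda ly,$$
$$\frac{\partial f}{\partial z} = \lambda\frac{\partial g}{\partial z}: \quad xy = 2\lambda z.$$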
Proof. First, verify that the theorem makes sense and is as expected. If it is supposed that the inverse mapping exists and is differentiable at $F(x_0)$, then differentiating the composite mapping $F^{-1} \circ F$ enforces the formula
$$\mathrm{id}_{\mathbb{R}^n} = D^1(F^{-1} \circ F)(x_0) = D^1(F^{-1})(F(x_0)) \circ D^1F(x_0),$$
which verifies the formula at the conclusion of the theorem. Therefore, it is known at the beginning which differential for $F^{-1}$ to find.
Next, suppose that the inverse mapping $F^{-1}$ exists in a neighbourhood of the point $F(x_0)$ and that it is continuous. Since $F$ is differentiable in a neighbourhood of $x_0$, it follows that
$$(1) \qquad F(x) - F(x_0) - D^1F(x_0)(x - x_0) = \alpha(x - x_0)$$
with a function $\alpha$ satisfying $\lim_{v\to 0} \frac{1}{\|v\|}\alpha(v) = 0$. To verify the approximation properties of the linear mapping $(D^1F(x_0))^{-1}$, it suffices to calculate the following limit for $y = F(x)$ approaching $y_0 = F(x_0)$:
$$\lim_{y\to y_0} \frac{1}{\|y - y_0\|}\Bigl(F^{-1}(y) - F^{-1}(y_0) - (D^1F(x_0))^{-1}(y - y_0)\Bigr).$$
Substituting (1) for $y - y_0$ into the latter expression yields
$$\lim_{y\to y_0} \frac{1}{\|y - y_0\|}\Bigl(x - x_0 - (D^1F(x_0))^{-1}\bigl(D^1F(x_0)(x - x_0) + \alpha(x - x_0)\bigr)\Bigr)$$
$$= \lim_{y\to y_0} \frac{-1}{\|y - y_0\|}\,(D^1F(x_0))^{-1}\bigl(\alpha(x - x_0)\bigr) = (D^1F(x_0))^{-1}\Bigl(\lim_{y\to y_0} \frac{-1}{\|y - y_0\|}\,\alpha(x - x_0)\Bigr),$$
where the last equality follows from the fact that linear mappings between finite-dimensional spaces are always continuous. Hence performing this linear mapping commutes with the limit process.
The proof is almost finished. The limit at the end of the expression is, using the properties of $\alpha$, zero if the values $\|F(x) - F(x_0)\|$ are greater than $C\|x - x_0\|$ for some constant $C > 0$. This can be translated in terms of the inverse as $C\|F^{-1}(y) - F^{-1}(y_0)\| < \|y - y_0\|$, i.e.
$$\|F^{-1}(y) - F^{-1}(y_0)\| < D\|y - y_0\|$$
for the constant $D = C^{-1} > 0$. This is Lipschitz continuity, which is a stronger property than $F^{-1}$ being continuous. So, now it remains "merely" to prove the existence of a Lipschitz-continuous inverse mapping to the mapping $F$.
To simplify, reduce the general case slightly. First, without loss of generality, apply shifts of the coordinates by constant vectors. In particular, it can be assumed that $x_0 = 0 \in \mathbb{R}^n$, $y_0 = F(x_0) = 0 \in \mathbb{R}^n$. So assume this property of the mapping $F$.
We can easily see that the equation can only be satisfied by a triple of non-zero numbers. Dividing pairs of equations and substituting into the ellipsoid's equation, we get eight solutions, namely the stationary points $x = \pm\frac{1}{\sqrt{3k}}$, $y = \pm\frac{1}{\sqrt{3l}}$, $z = \pm\frac{1}{\sqrt{3}}$. However, the function $f$ takes on only two distinct values at these eight points. Since it is continuous and the given ellipsoid is compact, $f$ must have both a maximum and minimum on it. Moreover, since both $f$ and $g$ are continuously differentiable, these extrema must occur at stationary points. Therefore, it must be that four of the computed stationary points are local maxima of the function (of value $\frac{1}{3\sqrt{3kl}}$) and the other four are minima (of value $-\frac{1}{3\sqrt{3kl}}$). □
8.H.6. Determine the global extrema of the function
$$f(x, y) = x^2 - 2y^2 + 4xy - 6x - 1$$
on the set of points $[x, y]$ that satisfy the inequalities
$$(1) \qquad x \ge 0, \quad y \ge 0, \quad y \le -x + 3.$$
Solution. We are given a polynomial with continuous partial derivatives on a compact (i. e. closed and bounded) set. Such a function necessarily has both a minimum and a maximum on this set, and this can happen only at stationary points or on the boundary. Therefore, it suffices to find stationary points inside the set and the ones on a finite number of open (or singleton) parts of the boundary, then evaluate / at these points and choose the least and the greatest values. Notice that the set of points determined by the inequalities (1) is clearly a triangle with vertices at [0, 0], [3,0], [0,3].
Let us determine the stationary points inside this triangle as the solution of the equations $f_x = 0$, $f_y = 0$. Since
$$f_x(x,y) = 2x + 4y - 6, \qquad f_y(x,y) = 4x - 4y,$$
these equations are satisfied only by the point $[1,1]$. The boundary suggests itself to be expressed as the union of three line segments given by the choice of pairs of vertices. First, we consider $x = 0$, $y \in [0,3]$, when $f(x,y) = -2y^2 - 1$. However, we know the graph of this (univariate) function on the interval $[0,3]$. It is thus not difficult to find the points at which global extrema occur. They are the marginal points $[0,0]$, $[0,3]$. Similarly, we can consider $y = 0$, $x \in [0,3]$, also obtaining the marginal points $[0,0]$, $[3,0]$. Finally, we get to the line segment $y = -x + 3$, $x \in [0,3]$. Making some rearrangements, we get
$$f(x, y) = f(x, -x + 3) = -5x^2 + 18x - 19, \qquad x \in [0,3].$$
Further, composing the mapping $F$ with any linear mapping $G$ yields a differentiable mapping again, and it is known how the differential changes. The choice
$$G(y) = (D^1F(0))^{-1}(y)$$
gives $D^1(G \circ F)(0) = \mathrm{id}_{\mathbb{R}^n}$, and thus we may assume that
$$D^1F(0) = \mathrm{id}_{\mathbb{R}^n}.$$
With these assumptions, consider the mapping K(x) = F(x) — x. This mapping is also differentiable, and its differential at 0 is zero.
It is already known that each continuously differentiable mapping is Lipschitz continuous over every $\delta$-neighbourhood $U_\delta$ of the origin (in its domain),
$$\|K(x) - K(y)\| \le C\|x - y\|,$$
where $C$ is bounded by the maximum of all absolute values of the partial derivatives in the Jacobi matrix of the mapping $K$ in the neighbourhood $U_\delta$, cf. 8.1.19.
Since the differential of the mapping $K$ at the point $x_0 = 0$ is zero, one can, by selecting a sufficiently small neighbourhood $U$ of the origin, achieve the bound
$$\|K(x) - K(y)\| \le \tfrac{1}{2}\|x - y\|.$$
It follows by the triangle inequality that
$$\|x - y\| = \|(F(x) - K(x)) - (F(y) - K(y))\| \le \|F(x) - F(y)\| + \|K(x) - K(y)\| \le \|F(x) - F(y)\| + \tfrac{1}{2}\|x - y\|$$
and hence
$$\tfrac{1}{2}\|x - y\| \le \|F(x) - F(y)\|.$$
With this estimate, if $x \neq y$ are both in the neighbourhood $U = U_\delta$, then also $F(x) \neq F(y)$. Therefore, the mapping is bijective onto its image $V = F(U)$. Write $F^{-1}$ for its inverse defined on $V$. For this mapping, the latter estimate says
$$\|F^{-1}(x) - F^{-1}(y)\| \le 2\|x - y\|,$$
so this mapping is not only continuous (as we assumed in our first step in the proof), but also Lipschitz-continuous, as requested at the end of the previous part of the proof.
It could seem that the proof is complete, but this is not so. To finish, it is necessary to show that the mapping $F$ restricted to a sufficiently small neighbourhood $U_\delta$ is not only bijective onto its image, but also that it maps open neighbourhoods of zero onto open neighbourhoods of zero.²
Decrease the latter neighbourhood $U = U_\delta$ so that the above estimates are true for the boundary of $U$ as well and at the same time the Jacobi matrix of the mapping is invertible on all of $U$. This can be done since the determinant is a continuous mapping. Let $B$ denote the boundary of the set $U$,
² In the literature, there are examples of mappings which continuously and bijectively map a line segment onto a square. So this is not an obvious requirement.
We thus need to find the stationary points of the polynomial $p(x) = -5x^2 + 18x - 19$ from the interval $[0,3]$. The equation $p'(x) = 0$, i.e., $-10x + 18 = 0$, is satisfied by $x = 9/5$. This means that in the last case, we obtained one more point (besides the marginal points), namely $[9/5, 6/5]$, where a global extremum may occur. Altogether, we have these points as "suspects":
$$[1,1], \quad [0,0], \quad [0,3], \quad [3,0], \quad \bigl[\tfrac{9}{5}, \tfrac{6}{5}\bigr]$$
with function values
$$-4, \quad -1, \quad -19, \quad -10, \quad -\tfrac{14}{5},$$
respectively. We can see that the function $f$ takes on the greatest value $-1$ at the point $[0,0]$ and the least value $-19$ at the point $[0,3]$. □
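The list of "suspects" and their function values can be re-evaluated in exact arithmetic. The sketch below assumes nothing beyond the formula for $f$ and the five candidate points from the solution; the variable names are chosen for this illustration.

```python
from fractions import Fraction

def f(x, y):
    return x * x - 2 * y * y + 4 * x * y - 6 * x - 1

candidates = [
    (Fraction(1), Fraction(1)),        # stationary point inside the triangle
    (Fraction(0), Fraction(0)),        # vertices of the triangle
    (Fraction(0), Fraction(3)),
    (Fraction(3), Fraction(0)),
    (Fraction(9, 5), Fraction(6, 5)),  # stationary point on the side y = -x + 3
]
values = {point: f(*point) for point in candidates}
arg_max = max(values, key=values.get)
arg_min = min(values, key=values.get)
print(values)
print(arg_max, values[arg_max])   # point and value of the global maximum
print(arg_min, values[arg_min])   # point and value of the global minimum
```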
8.H.7. Determine whether the function $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x,y,z) = y^2z$ has any extrema on the line segment given by the equations $2x + y + z = 1$, $x - y + 2z = 0$ and the constraint $x \in [-1, 2]$. If so, find these extrema and determine their types. Justify all of your decisions.
Solution. We are looking for the extrema of a continuous function on a compact set. Therefore, the function must have both a minimum and a maximum on this set, and this will happen either at the marginal points of the segment or at those where the gradient of the examined function is a linear combination of the gradients of the functions that give the constraints. First, let us look for the points which satisfy the gradient condition:
$$0 = 2k + l, \quad 2yz = k - l, \quad y^2 = k + 2l, \quad 2x + y + z = 1, \quad x - y + 2z = 0.$$
The solution of the system is $[x,y,z] = \bigl[\tfrac{2}{3}, 0, -\tfrac{1}{3}\bigr]$ and $[x,y,z] = \bigl[\tfrac{4}{9}, \tfrac{2}{9}, -\tfrac{1}{9}\bigr]$ (of course, the variables $k$ and $l$ can also be computed, but we are not interested in them). The marginal points of the given line segment are $\bigl[-1, \tfrac{5}{3}, \tfrac{4}{3}\bigr]$ and $\bigl[2, -\tfrac{4}{3}, -\tfrac{5}{3}\bigr]$. Considering these four points, the function takes on the greatest value at the first marginal point ($f(x, y, z) = \tfrac{100}{27}$), which is its maximum on the given segment, and it takes the least value at the second marginal point ($f(x, y, z) = -\tfrac{80}{27}$), which is thus its minimum there. □
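The four candidate points can be recomputed by parametrizing the segment by $x$: solving the two linear constraints gives $y = \tfrac{2}{3} - x$ and $z = \tfrac{1}{3} - x$. The sketch below (with the hypothetical helper `point_on_segment`) evaluates $f$ there in exact arithmetic.

```python
from fractions import Fraction

def point_on_segment(x):
    """Solve 2x + y + z = 1 and x - y + 2z = 0 for y and z, given x."""
    x = Fraction(x)
    return (x, Fraction(2, 3) - x, Fraction(1, 3) - x)

def f(p):
    x, y, z = p
    return y * y * z

# x-coordinates of the two interior candidates and of the two marginal points
candidate_x = [Fraction(2, 3), Fraction(4, 9), Fraction(-1), Fraction(2)]
values = {x: f(point_on_segment(x)) for x in candidate_x}
for x in candidate_x:
    print(point_on_segment(x), values[x])
```

The largest and smallest of the printed values occur at the marginal points $x = -1$ and $x = 2$, respectively.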
that is, the corresponding sphere. Since $B$ is compact and $F$ is continuous, the function
$$\rho(x) = \|F(x)\|$$
achieves both the maximum and the minimum on $B$. Denote $a = \frac{1}{2}\min_{x\in B}\rho(x)$ and consider any fixed $y \in O_a(0)$. Of course, $a > 0$ because $x = 0$ is the only point with $F(x) = 0$ within $U_\delta$. It is necessary to show that there is at least one $x \in U$ such that $y = F(x)$, which completes the proof of the inverse mapping theorem.
For this purpose, consider the function ($y$ is a fixed point)
$$h(x) = \|F(x) - y\|^2.$$
Again, the image $h(U) \cup h(B)$ must have a minimum. This minimum cannot occur for $x \in B$.
Notice that $F(0) = 0$, hence $h(0) = \|y\|^2 < a^2$. At the same time, the distance of $y$ from $F(x)$ for $x \in B$ is at least $a$ for all $y \in O_a(0)$ (since $a$ is selected to be half the minimum of the magnitude of $F(x)$ on the boundary), so $h(x) \ge a^2$ on $B$. Therefore, the minimum occurs inside $U$, and it is a stationary point $z$ of the function $h$. Fixing such $z$ means that for all $j = 1, \dots, n$,
$$\frac{\partial h}{\partial x_j}(z) = \sum_{i=1}^n 2\bigl(f_i(z) - y_i\bigr)\frac{\partial f_i}{\partial x_j}(z) = 0.$$
This is a system of linear equations with variables $\xi_i = f_i(z) - y_i$ and coefficients given by twice the Jacobi matrix $D^1F(z)$. In particular, for $z \in U$, such a system has a unique solution, and this is zero since the Jacobi matrix is invertible.
In this way the desired point $x = z \in U$ is found, satisfying, for all $i = 1, \dots, n$, the equality $f_i(z) = y_i$, i.e.,
$$F(z) = y. \qquad □$$
8.1.23. The implicit functions. The next goal is to employ the inverse mapping theorem for clarifying the properties of implicitly defined functions. To start, consider a differentiable function $F(x, y)$ defined in the plane $E_2$, and look for those points $(x, y)$ where $F(x,y) = 0$.
An example of this can be the usual (implicit) definition of straight lines and circles:
$$F(x, y) = ax + by + c = 0, \qquad a, b, c \in \mathbb{R},$$
$$F(x, y) = (x - s)^2 + (y - t)^2 - r^2 = 0, \qquad r > 0.$$
While in the first case, the relation between the quantities $x$ and $y$ can be expressed as the function (for $b \neq 0$)
$$y = f(x) = -\frac{a}{b}x - \frac{c}{b}$$
for all $x$; in the other case, for any point $(x_0, y_0)$ satisfying the equation of the circle and such that $y_0 \neq t$ (the points with $y_0 = t$ are the marginal points of the circle in the direction of the coordinate $x$), there is a neighbourhood of the point $x_0$ in which either
$$y = f(x) = t + \sqrt{r^2 - (x - s)^2},$$
or
$$y = f(x) = t - \sqrt{r^2 - (x - s)^2},$$
8.H.8. Find the maximal and minimal values of the polynomial
$$p(x, y) = 4x^3 - 3x - 4y^3 + 9y$$
on the set
$$M = \{[x,y] \in \mathbb{R}^2;\ x^2 + y^2 \le 1\}.$$
Solution. This is again the case of a polynomial on a compact set; therefore, we can restrict our attention to stationary points inside or on the boundary of M and the "marginal" points on the boundary of M. However, the only solutions of the equations
$$p_x(x,y) = 12x^2 - 3 = 0, \qquad p_y(x,y) = -12y^2 + 9 = 0$$
are the points
$$\Bigl[\tfrac{1}{2}, \tfrac{\sqrt{3}}{2}\Bigr], \quad \Bigl[\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2}\Bigr], \quad \Bigl[-\tfrac{1}{2}, \tfrac{\sqrt{3}}{2}\Bigr], \quad \Bigl[-\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2}\Bigr],$$
which are all on the boundary of $M$. This means that $p$ has no extremum inside $M$. Now, it suffices to find the maximum and minimum of $p$ on the unit circle $k : x^2 + y^2 = 1$. The circle $k$ can be expressed parametrically as
$$x = \cos t, \qquad y = \sin t, \qquad t \in [-\pi, \pi].$$
Thus, instead of looking for the extrema of $p$ on $M$, we are now seeking the extrema of the function
$$f(t) := p(\cos t, \sin t) = 4\cos^3 t - 3\cos t - 4\sin^3 t + 9\sin t$$
on the interval $[-\pi, \pi]$. For $t \in [-\pi, \pi]$, we have
$$f'(t) = -12\cos^2 t\,\sin t + 3\sin t - 12\sin^2 t\,\cos t + 9\cos t.$$
In order to determine the stationary points, we must express the function $f'$ in a form from which we will be able to calculate the intersection of its graph with the $x$-axis. To this purpose, we will use the identity
$$\frac{1}{\cos^2 t} = 1 + \operatorname{tg}^2 t,$$
which is valid provided both sides are well-defined. We get
$$f'(t) = \cos^3 t\,\bigl[-12\operatorname{tg} t + 3(\operatorname{tg} t + \operatorname{tg}^3 t) - 12\operatorname{tg}^2 t + 9(1 + \operatorname{tg}^2 t)\bigr]$$
for $t \in [-\pi, \pi]$ with $\cos t \neq 0$. However, this condition does not exclude any stationary points since $\sin t \neq 0$ if $\cos t = 0$. Therefore, the stationary points of $f$ are those points $t \in [-\pi, \pi]$ for which
$$-4\operatorname{tg} t + \operatorname{tg} t + \operatorname{tg}^3 t - 4\operatorname{tg}^2 t + 3 + 3\operatorname{tg}^2 t = 0.$$
The substitution $s = \operatorname{tg} t$ leads to
$$s^3 - s^2 - 3s + 3 = 0, \quad \text{i.e.} \quad (s - 1)(s - \sqrt{3})(s + \sqrt{3}) = 0.$$
Then, the values
according to whether $(x_0, y_0)$ belongs to the upper or lower semicircle.
If a diagram of the situation is drawn, the reason is clear: describing both the semicircles simultaneously by a single function $y = f(x)$ is not possible. The boundary points of the interval $[s - r, s + r]$ are even more interesting. They also satisfy the equation of the circle with $y = t$, yet $F_y(s \pm r, t) = 0$, which describes the position of the tangent line to the circle at these points, parallel to the $y$-axis. There are no neighbourhoods of these points in which the circle could be described as a function $y = f(x)$.
Moreover, the derivatives of the function $y = f(x) = t + \sqrt{r^2 - (x - s)^2}$ can be easily expressed in terms of partial derivatives of the function $F$:
$$f'(x) = \frac{1}{2}\,\frac{-2(x - s)}{\sqrt{r^2 - (x - s)^2}} = -\frac{x - s}{y - t} = -\frac{F_x}{F_y}.$$
If the roles of the variables $x$ and $y$ are interchanged and a relation $x = f(y)$ such that $F(f(y), y) = 0$ is sought, then neighbourhoods of the points $(s \pm r, t)$ are obtained with no problem. Notice that the partial derivative $F_x$ is non-zero at these points.
So it is observed (though for only two examples): for a function $F(x, y)$ and a point $(a, b) \in E_2$ such that $F(a, b) = 0$, there is a unique function $y = f(x)$ satisfying $F(x, f(x)) = 0$ on some neighbourhood of $a$ if $F_y(a, b) \neq 0$. In this case, $f'(a) = -F_x(a,b)/F_y(a,b)$ can even be computed. We prove that in fact, this proposition is always true.
The last statement about derivatives can be remembered (and is quite comprehensible if things are properly understood) from the expression for the differential of the (constant) function $g(x) = F(x, y(x))$ and the differential $dy = f'(x)\,dx$:
$$0 = dg = F_x\,dx + F_y\,dy = (F_x + F_y f'(x))\,dx.$$
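The formula $f'(a) = -F_x(a,b)/F_y(a,b)$ can be tested numerically on the circle. In this sketch, the centre `(S, T)`, the radius `R` and the point `a` are hypothetical values chosen for illustration, and the derivative of the explicit upper semicircle is approximated by a central difference.

```python
import math

S, T, R = 1.0, -2.0, 3.0      # hypothetical centre (S, T) and radius R

def F_x(x, y):
    """Partial derivative of F = (x - S)^2 + (y - T)^2 - R^2 by x."""
    return 2 * (x - S)

def F_y(x, y):
    """Partial derivative of F = (x - S)^2 + (y - T)^2 - R^2 by y."""
    return 2 * (y - T)

def upper(x):
    """Explicit description of the upper semicircle."""
    return T + math.sqrt(R * R - (x - S) ** 2)

def central_difference(func, x, step=1e-6):
    return (func(x + step) - func(x - step)) / (2 * step)

a = 2.5                        # a point of the interval (S - R, S + R)
b = upper(a)
numeric = central_difference(upper, a)
implicit = -F_x(a, b) / F_y(a, b)
print(numeric, implicit)       # the two values agree up to numerical error
```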
One can work analogously with the implicit expressions $F(x,y,z) = 0$, to look for a function $g(x,y)$ such that $F(x, y, g(x, y)) = 0$. As an example, consider the function $f(x, y) = x^2 + y^2$, whose graph is the rotational paraboloid centered at the point $(0, 0)$. This can be defined implicitly by the equation
$$0 = F(x, y, z) = z - x^2 - y^2.$$
Before formulating the result for the general situation, notice which dimensions could/should appear in the problem. If it is desired to find, for this function $F$, a curve $c(x) = (c_1(x), c_2(x))$ in the plane such that
$$F(x, c(x)) = F(x, c_1(x), c_2(x)) = 0,$$
then this can be done (even for all initial conditions x = a), yet the result is not unique for a given initial condition. It suffices to consider an arbitrary curve on the rotational paraboloid whose projection onto the first coordinate has a non-zero derivative. Then consider x to be the parameter of the curve, and c(x) to be its projection onto the plane yz.
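$$s = 1, \qquad s = \sqrt{3}, \qquad s = -\sqrt{3}$$
respectively correspond to
$$t \in \bigl\{-\tfrac{3}{4}\pi, \tfrac{1}{4}\pi\bigr\}, \qquad t \in \bigl\{-\tfrac{2}{3}\pi, \tfrac{1}{3}\pi\bigr\}, \qquad t \in \bigl\{-\tfrac{1}{3}\pi, \tfrac{2}{3}\pi\bigr\}.$$
Now, we evaluate the function $f$ at each of these points as well as at the marginal points $t = -\pi$, $t = \pi$. Sorting them, we get
$$f\bigl(-\tfrac{1}{3}\pi\bigr) = -1 - 3\sqrt{3} < f\bigl(-\tfrac{3}{4}\pi\bigr) = -3\sqrt{2} < f\bigl(-\tfrac{2}{3}\pi\bigr) = 1 - 3\sqrt{3} < -1,$$
$$f(-\pi) = f(\pi) = -1 < 0,$$
$$f\bigl(\tfrac{2}{3}\pi\bigr) = 1 + 3\sqrt{3} > f\bigl(\tfrac{1}{4}\pi\bigr) = 3\sqrt{2} > f\bigl(\tfrac{1}{3}\pi\bigr) = -1 + 3\sqrt{3} > 0.$$
Therefore, the global minimum of the function $f$ is at the point $t = -\pi/3$, while the global maximum is at $t = 2\pi/3$.
Now, let us get back to the original function $p$. Since we know the values $\cos\bigl(-\tfrac{1}{3}\pi\bigr) = \tfrac{1}{2}$, $\sin\bigl(-\tfrac{1}{3}\pi\bigr) = -\tfrac{\sqrt{3}}{2}$, $\cos\bigl(\tfrac{2}{3}\pi\bigr) = -\tfrac{1}{2}$, $\sin\bigl(\tfrac{2}{3}\pi\bigr) = \tfrac{\sqrt{3}}{2}$, we can deduce that the polynomial $p$ takes on the minimal value $-1 - 3\sqrt{3}$ (the same as $f$, of course) at the point $[1/2, -\sqrt{3}/2]$ and the maximal value $1 + 3\sqrt{3}$ at $[-1/2, \sqrt{3}/2]$. □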
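The ordering of the values of $f$ on the circle can be double-checked in floating point. The sketch below evaluates $f$ at the eight candidate parameters from the solution and, as an independent and purely illustrative sanity check, also on a coarse sample of the interval $[-\pi, \pi]$.

```python
import math

def p(x, y):
    return 4 * x ** 3 - 3 * x - 4 * y ** 3 + 9 * y

def f(t):
    """The polynomial p restricted to the unit circle."""
    return p(math.cos(t), math.sin(t))

candidates = [-math.pi, math.pi,
              -3 * math.pi / 4, math.pi / 4,
              -2 * math.pi / 3, math.pi / 3,
              -math.pi / 3, 2 * math.pi / 3]
t_min = min(candidates, key=f)
t_max = max(candidates, key=f)

# Coarse sampling of the circle; no sample should beat the candidates.
samples = [f(-math.pi + 2 * math.pi * i / 10000) for i in range(10001)]

print(t_min, f(t_min))   # parameter and value of the global minimum
print(t_max, f(t_max))   # parameter and value of the global maximum
```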
8.H.9. At which points does the function
$$f(x,y) = x^2 - 4x + y^2$$
take on global extrema on the set $M : |x| + |y| \le 1$?
Solution. Expressing $f$ in the form
$$f(x,y) = (x - 2)^2 - 4 + y^2,$$
we can see that the global maximum and minimum occur at the same points as for the function
$$g(x, y) := \sqrt{(x - 2)^2 + y^2}, \qquad [x, y] \in M,$$
since neither shifting the function nor applying the increasing function $v = \sqrt{u}$ for $u \ge 0$ changes the points of extrema (of course, they can change their values). However, we know that the function $g$ gives the distance of a point $[x, y]$ from the point $[2,0]$. Since the set $M$ is clearly a square with vertices $[1,0]$, $[0,1]$, $[-1,0]$, $[0,-1]$, the point of $M$ that is closest to $[2,0]$ is the vertex $[1,0]$, while the most distant one is $[-1,0]$. Altogether, we have obtained that the minimal value of $f$ occurs at the point $[1,0]$ and the maximal one at $[-1, 0]$. □
8.H.10. Compute the local extrema of the function $y = f(x)$ given implicitly by the equation
$$3x^2 + 2xy + x = y^2 + 3y + \tfrac{5}{4}, \qquad [x,y] \in \mathbb{R}^2 \setminus \bigl\{\bigl[x, x - \tfrac{3}{2}\bigr];\ x \in \mathbb{R}\bigr\}.$$
Therefore, it is expected that one function of $m + 1$ variables defines implicitly a hypersurface in $\mathbb{R}^{m+1}$ which is to be expressed (at least locally) as the graph of a function of $m$ variables. It can be anticipated that $n$ functions of $m + n$ variables define an intersection of $n$ hypersurfaces in $\mathbb{R}^{m+n}$, which is expected to be an "$m$-dimensional" object.
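8.1.24. The general theorem. Consider a differentiable mapping
$$F = (f_1, \dots, f_n) : \mathbb{R}^{m+n} \to \mathbb{R}^n.$$
The Jacobi matrix of this mapping has $n$ rows and $m + n$ columns. Write it symbolically as
$$D^1F = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_m} & \frac{\partial f_1}{\partial x_{m+1}} & \dots & \frac{\partial f_1}{\partial x_{m+n}} \\ \vdots & & \vdots & \vdots & & \vdots \\ \frac{\partial f_n}{\partial x_1} & \dots & \frac{\partial f_n}{\partial x_m} & \frac{\partial f_n}{\partial x_{m+1}} & \dots & \frac{\partial f_n}{\partial x_{m+n}} \end{pmatrix} = (D^1_xF, D^1_yF),$$
where $(x_1, \dots, x_{m+n}) \in \mathbb{R}^{m+n}$ is written as $(x, y) \in \mathbb{R}^m \times \mathbb{R}^n$, $D^1_xF$ is the matrix of $n$ rows and the first $m$ columns in the Jacobi matrix, while $D^1_yF$ is a square matrix of order $n$, with the remaining columns. The multidimensional analogy to the previous reasoning with the non-zero partial derivative with respect to $y$ is the condition that the matrix $D^1_yF$ is invertible.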
The implicit mapping theorem
Theorem. Let $F : \mathbb{R}^{m+n} \to \mathbb{R}^n$ be a differentiable mapping in an open neighbourhood of a point $(a,b) \in \mathbb{R}^m \times \mathbb{R}^n = \mathbb{R}^{m+n}$ at which $F(a,b) = 0$ and $\det D^1_yF \ne 0$.
Then there exists a differentiable mapping $G : \mathbb{R}^m \to \mathbb{R}^n$ defined on a neighbourhood $U$ of the point $a \in \mathbb{R}^m$ with image $G(U)$ which contains the point $b$ and such that $F(x, G(x)) = 0$ for all $x \in U$.
Moreover, the Jacobi matrix $D^1G$ of the mapping $G$ is, in the neighbourhood of the point $a$, given by the product of matrices
$$D^1G(x) = -(D^1_yF)^{-1}(x, G(x)) \cdot D^1_xF(x, G(x)).$$
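The derivative formula of the theorem can be illustrated in the simplest case $m = n = 1$. The sketch below assumes the unit circle $F(x,y) = x^2 + y^2 - 1$ near the point $(0.6, 0.8)$, where the explicit local solution $G(x) = \sqrt{1 - x^2}$ is known; this example and all function names are chosen here for illustration only. It compares $-F_x/F_y$ with a finite-difference derivative of $G$.

```python
import math

def F_x(x, y):
    return 2.0 * x            # partial derivative of x^2 + y^2 - 1 in x

def F_y(x, y):
    return 2.0 * y            # partial derivative of x^2 + y^2 - 1 in y

def G(x):
    # Explicit local solution of F(x, G(x)) = 0 on the upper half circle.
    return math.sqrt(1.0 - x * x)

def implicit_derivative(x):
    # The formula of the theorem for m = n = 1: G'(x) = -F_x / F_y at (x, G(x)).
    y = G(x)
    return -F_x(x, y) / F_y(x, y)

def numeric_derivative(g, x, h=1e-6):
    # Central finite difference, used only as an independent check.
    return (g(x + h) - g(x - h)) / (2.0 * h)

x0 = 0.6
print(implicit_derivative(x0), numeric_derivative(G, x0))
```

Both numbers should agree up to the finite-difference error; at $x_0 = 0.6$ the formula gives $-1.2/1.6 = -0.75$.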
Proof. For the sake of comprehensibility, first show the proof for the simplest case of the equation $F(x,y) = 0$ with a function $F$ of two variables. At first sight, it might look complicated, but this situation can be discussed in a way which extends to the general dimensions of the theorem almost without changes.
Extend the function $F$ to
$$\tilde F : \mathbb{R}^2 \to \mathbb{R}^2, \quad (x,y) \mapsto (x, F(x,y)).$$
The Jacobi matrix of the mapping $\tilde F$ is
$$D^1\tilde F(x,y) = \begin{pmatrix} 1 & 0 \\ F_x(x,y) & F_y(x,y) \end{pmatrix}.$$
CHAPTER 8. CALCULUS WITH MORE VARIABLES
Solution. In accordance with the theoretical part (see 8.1.24), let us denote
$$F(x,y) = 3x^2 + 2xy + x - y^2 - 3y - \tfrac54, \quad [x,y] \in \mathbb{R}^2 \setminus \{[x, x - \tfrac32];\ x \in \mathbb{R}\}$$
and calculate the derivative
$$y' = f'(x) = -\frac{F_x(x,y)}{F_y(x,y)} = -\frac{6x + 2y + 1}{2x - 2y - 3}.$$
We can see that this derivative is continuous on the whole set in question. In particular, the function $f$ is defined implicitly on this set (the denominator is non-zero).
A local extremum may occur only for those $x$, $y$ which satisfy $y' = 0$, i.e., $6x + 2y + 1 = 0$. Substituting $y = -3x - 1/2$ into the equation $F(x,y) = 0$, we obtain $-12x^2 + 6x = 0$, which leads to
$$[x,y] = [0, -\tfrac12], \quad [x,y] = [\tfrac12, -2].$$
We can also easily compute that
$$y'' = (y')' = -\frac{(6 + 2y')(2x - 2y - 3) - (6x + 2y + 1)(2 - 2y')}{(2x - 2y - 3)^2}.$$
Substituting $x = 0$, $y = -1/2$, $y' = 0$ and $x = 1/2$, $y = -2$, $y' = 0$, we obtain
$$y'' = -\frac{6\cdot(-2) - 0}{(-2)^2} > 0 \quad \text{for } [x,y] = [0, -\tfrac12]$$
and
$$y'' = -\frac{6\cdot(+2) - 0}{(+2)^2} < 0 \quad \text{for } [x,y] = [\tfrac12, -2].$$
We have thus proved that the implicitly given function has a strict local minimum at the point $x = 0$ and a strict local maximum at $x = 1/2$. □
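The computation of 8.H.10 can be checked by evaluating the formulas directly. This is a minimal sketch under the assumption that the constant in the equation is $5/4$ (the value consistent with the relation $-12x^2 + 6x = 0$ obtained in the solution); the function names are illustrative.

```python
def F(x, y):
    # Left-hand side of the implicit equation F(x, y) = 0.
    return 3*x**2 + 2*x*y + x - y**2 - 3*y - 5/4

def dy(x, y):
    # y' = -F_x / F_y
    return -(6*x + 2*y + 1) / (2*x - 2*y - 3)

def d2y(x, y):
    # y'' obtained by differentiating y' once more, with y' substituted.
    yp = dy(x, y)
    den = 2*x - 2*y - 3
    return -((6 + 2*yp) * den - (6*x + 2*y + 1) * (2 - 2*yp)) / den**2

for point in [(0.0, -0.5), (0.5, -2.0)]:
    print(point, F(*point), dy(*point), d2y(*point))
```

At both points $F$ and $y'$ vanish, while $y''$ is positive at the first point and negative at the second, in line with the minimum at $x = 0$ and the maximum at $x = 1/2$.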
8.H.11. Find the local extrema of the function $z = f(x,y)$ given on the maximum possible set by the equation
$$(1)\quad x^2 + y^2 + z^2 - xz - yz + 2x + 2y + 2z - 2 = 0.$$
Solution. Differentiating (1) with respect to $x$ and $y$ gives
$$2x + 2zz_x - z - xz_x - yz_x + 2 + 2z_x = 0,$$
$$2y + 2zz_y - xz_y - z - yz_y + 2 + 2z_y = 0.$$
Hence we get that
$$(2)\quad z_x = f_x(x,y) = \frac{z - 2x - 2}{2z - x - y + 2}, \quad z_y = f_y(x,y) = \frac{z - 2y - 2}{2z - x - y + 2}.$$
We can notice that the partial derivatives are continuous at all points where the function $f$ is defined. This implies that the local extrema can occur only at stationary points. These points satisfy
$$z_x = 0, \text{ i.e. } z - 2x - 2 = 0, \quad z_y = 0, \text{ i.e. } z - 2y - 2 = 0.$$
It follows from the assumption $F_y(a,b) \ne 0$ that the same also holds in a neighbourhood of the point $(a,b)$, so the mapping $\tilde F$ is invertible in this neighbourhood, by the inverse mapping theorem. Therefore, there is a uniquely defined differentiable inverse mapping $\tilde F^{-1}$ in a neighbourhood of the point $(a,0)$.
Denote by $\pi : \mathbb{R}^2 \to \mathbb{R}$ the projection onto the second coordinate, and consider the function $f(x) = \pi \circ \tilde F^{-1}(x,0)$. This function is well-defined and differentiable. It must be verified that the expression
$$F(x, f(x)) = F(x, \pi(\tilde F^{-1}(x,0)))$$
is zero in a neighbourhood of the point $x = a$. It follows directly from the definition $\tilde F(x,y) = (x, F(x,y))$ that its inverse is of the form $\tilde F^{-1}(x,y) = (x, \pi\tilde F^{-1}(x,y))$. Therefore, the previous calculation can be resumed:
$$F(x, f(x)) = \pi(\tilde F(x, \pi(\tilde F^{-1}(x,0)))) = \pi(\tilde F(\tilde F^{-1}(x,0))) = \pi(x,0) = 0.$$
This proves the first part of the theorem, and it remains to compute the derivative of the function $f(x)$. This derivative can, once again, be obtained by invoking the inverse mapping theorem, using the matrix $(D^1\tilde F)^{-1}$.
The following equality is easily verified by multiplying the matrices. It can also be computed directly using the explicit formula for the inverse matrix in terms of the determinant and the algebraically adjoint matrix, see paragraph 2.2.11:
$$\begin{pmatrix} 1 & 0 \\ F_x(x,y) & F_y(x,y) \end{pmatrix}^{-1} = (F_y(x,y))^{-1} \begin{pmatrix} F_y(x,y) & 0 \\ -F_x(x,y) & 1 \end{pmatrix}.$$
By the definition, $f(x) = \pi\tilde F^{-1}(x,0)$, and thus the first entry of the second row of this matrix is the derivative $f'(x)$ with $y = f(x)$, i.e. the Jacobi matrix $D^1f$. In this simple case, it is exactly the desired scalar $-F_x(x, f(x))/F_y(x, f(x))$.
The general proof is exactly the same; there is no need to change any of the formulae. We obtain the invertible mapping $\tilde F : \mathbb{R}^{m+n} \to \mathbb{R}^{m+n}$, $\tilde F(x,y) = (x, F(x,y))$, and define $G(x) = \pi\tilde F^{-1}(x,0)$, where $\pi : \mathbb{R}^{m+n} \to \mathbb{R}^n$, $\pi(x,y) = y$. The same check as above reveals that $F(x, G(x)) = 0$, as requested. Only in the last computation of the derivative of the function do the corresponding parts $D^1_xF$ and $D^1_yF$ of the Jacobi matrix appear, instead of the particular partial derivatives.
For the calculation of the Jacobi matrix of the mapping $G$, use the computation of the inverse matrix. This time the algebraic procedure from paragraph 2.2.11 is not very advantageous. It is better to be guided by the case in dimension $m + n = 2$ and to divide the matrix
$$(D^1\tilde F)^{-1} = \begin{pmatrix} \mathrm{id}_{\mathbb{R}^m} & 0 \\ D^1_xF(x,y) & D^1_yF(x,y) \end{pmatrix}^{-1} = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$$
into blocks of $m$ and $n$ rows and columns (for instance $A$ is of type $m \times m$, while $C$ is of type $n \times m$). Now, the matrices
We have thus two equations, which allow us to express the dependency of $x$ and $y$ on $z$. Substituting into (1), we obtain the points
$$[x,y,z] = [-3+\sqrt6, -3+\sqrt6, -4+2\sqrt6], \quad [x,y,z] = [-3-\sqrt6, -3-\sqrt6, -4-2\sqrt6].$$
Now, we need the second derivatives in order to decide whether the local extrema really occur at the corresponding points. Differentiating $z_x$ in (2), we obtain
$$z_{xx} = f_{xx}(x,y) = \frac{(z_x - 2)(2z - x - y + 2) - (z - 2x - 2)(2z_x - 1)}{(2z - x - y + 2)^2}$$
with respect to $x$, and
$$z_{xy} = f_{xy}(x,y) = \frac{z_y(2z - x - y + 2) - (z - 2x - 2)(2z_y - 1)}{(2z - x - y + 2)^2}$$
with respect to $y$. We need not calculate $z_{yy}$ since the variables $x$ and $y$ are interchangeable in (1) (if we swap $x$ and $y$, the equation is left unchanged). Moreover, the $x$- and $y$-coordinates of the considered points are the same; hence $z_{xx} = z_{yy}$. Now, we evaluate the derivatives at the stationary points:
$$f_{xx}(-3+\sqrt6, -3+\sqrt6) = f_{yy}(-3+\sqrt6, -3+\sqrt6) = -\tfrac{1}{\sqrt6},$$
$$f_{xy}(-3+\sqrt6, -3+\sqrt6) = f_{yx}(-3+\sqrt6, -3+\sqrt6) = 0,$$
$$f_{xx}(-3-\sqrt6, -3-\sqrt6) = f_{yy}(-3-\sqrt6, -3-\sqrt6) = \tfrac{1}{\sqrt6},$$
$$f_{xy}(-3-\sqrt6, -3-\sqrt6) = f_{yx}(-3-\sqrt6, -3-\sqrt6) = 0.$$
As for the Hessian, we have
$$Hf(-3+\sqrt6, -3+\sqrt6) = \begin{pmatrix} -\tfrac{1}{\sqrt6} & 0 \\ 0 & -\tfrac{1}{\sqrt6} \end{pmatrix}, \quad Hf(-3-\sqrt6, -3-\sqrt6) = \begin{pmatrix} \tfrac{1}{\sqrt6} & 0 \\ 0 & \tfrac{1}{\sqrt6} \end{pmatrix}.$$
Apparently, the first Hessian is negative definite, while the second one is positive definite. This means that there is a strict local maximum of the function $f$ at the point $[-3+\sqrt6, -3+\sqrt6]$, and there is a strict local minimum at the point $[-3-\sqrt6, -3-\sqrt6]$. □
8.H.12. Determine the strict local extrema of the function
$$f(x,y) = \frac1x + \frac1y, \quad x \ne 0,\ y \ne 0$$
on the set of points that satisfy the equation $\frac{1}{x^2} + \frac{1}{y^2} = 4$.
Solution. Since both the function $f$ and the function given implicitly by the equation $\frac{1}{x^2} + \frac{1}{y^2} - 4 = 0$ have continuous
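$A$, $B$, $C$, $D$ can be determined from the defining equality for the inverse:
$$\begin{pmatrix} \mathrm{id}_{\mathbb{R}^m} & 0 \\ D^1_xF(x,y) & D^1_yF(x,y) \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} \mathrm{id}_{\mathbb{R}^m} & 0 \\ 0 & \mathrm{id}_{\mathbb{R}^n} \end{pmatrix}.$$
Apparently, it follows that $A = \mathrm{id}_{\mathbb{R}^m}$, $B = 0$, $D = (D^1_yF)^{-1}$, and finally, $D^1_xF + D^1_yF \cdot C = 0$. The latter equality already implies the desired relation
$$D^1G = C = -(D^1_yF)^{-1} \cdot D^1_xF.$$
This concludes the proof of the theorem. □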
8.1.25. The gradient of a function. As seen in the previous paragraph, if $F$ is a continuously differentiable function of $n$ variables, the equation
$$F(x_1,\dots,x_n) = b$$
with a fixed value $b \in \mathbb{R}$ defines the subset $M_b \subset \mathbb{R}^n$ which mostly has the properties of an $(n-1)$-dimensional hypersurface. To be more precise, if the vector of the partial derivatives
$$D^1F = \left(\frac{\partial F}{\partial x_1},\dots,\frac{\partial F}{\partial x_n}\right)$$
is non-zero, the set $M_b$ can be described locally as the graph of a continuously differentiable function of $n - 1$ variables. In this connection, the sets $M_b$ are called level sets.
The vector $D^1F \in \mathbb{R}^n$ is called the gradient of the function $F$. In technical and physical literature, it is also often denoted as $\operatorname{grad} F$.
Since $M_b$ is given by a constant value of the function $F$, the derivatives of the curves lying in $M_b$ have the property that the differential $dF$ always evaluates to zero along them. For every such curve $c(t)$, $F(c(t)) = b$, hence
$$\frac{d}{dt}F(c(t)) = dF(c'(t)) = 0.$$
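On the other hand, we can consider a general vector $v = (v_1,\dots,v_n) \in \mathbb{R}^n$ and the magnitude of the corresponding directional derivative
$$|d_vF| = \left|\frac{\partial F}{\partial x_1}v_1 + \dots + \frac{\partial F}{\partial x_n}v_n\right| = |\cos\varphi|\,\|D^1F\|\,\|v\|,$$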
partial derivatives of all orders on the set $\mathbb{R}^2 \setminus \{[0,0]\}$, we
where $\varphi$ is the angle between the directions of the vector $v$ and the gradient $D^1F$, see the discussion about angles of vectors and straight lines in the fourth chapter (cf. definition 4.1.18).
Thus the following is observed:
The maximal growth of a function
Proposition. The gradient $D^1F = \left(\frac{\partial F}{\partial x_1},\dots,\frac{\partial F}{\partial x_n}\right)$ provides the direction of maximal growth of the function $F$ of $n$ variables.
Moreover, the vanishing directional derivatives are exactly those in directions perpendicular to the gradient.
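The proposition can be illustrated by a brute-force search over unit directions. The sketch below assumes the sample function $F(x,y) = x^2 + 3y$ and the point $(1,2)$, both chosen here only for illustration; it finds the unit vector with the largest directional derivative among finitely many sampled angles and compares it with the normalized gradient.

```python
import math

def grad_F(x, y):
    # Gradient of the illustrative function F(x, y) = x^2 + 3y.
    return (2.0 * x, 3.0)

def best_direction(p, steps=3600):
    """Unit vector (among `steps` sampled angles) maximizing the directional derivative."""
    gx, gy = grad_F(*p)

    def directional_derivative(k):
        t = 2.0 * math.pi * k / steps
        return gx * math.cos(t) + gy * math.sin(t)

    k_best = max(range(steps), key=directional_derivative)
    t = 2.0 * math.pi * k_best / steps
    return (math.cos(t), math.sin(t))

p = (1.0, 2.0)
gx, gy = grad_F(*p)
norm = math.hypot(gx, gy)
print(best_direction(p), (gx / norm, gy / norm))
```

Up to the angular resolution of the sampling, the two printed vectors coincide; rotating the best direction by a right angle gives a direction in which the directional derivative is (approximately) zero.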
Therefore, it is clear that the tangent plane to a non-empty level set Mb in a neighbourhood of its point with non-zero gradient D1F is determined by the orthogonal complement
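should look for stationary points, i.e., for the solutions of the equations $L_x = 0$, $L_y = 0$ for
$$L(x,y,\lambda) = \frac1x + \frac1y - \lambda\left(\frac{1}{x^2} + \frac{1}{y^2} - 4\right), \quad x \ne 0,\ y \ne 0.$$
We thus get the equations
$$-\frac{1}{x^2} + \frac{2\lambda}{x^3} = 0, \quad -\frac{1}{y^2} + \frac{2\lambda}{y^3} = 0,$$
which lead to $x = 2\lambda$, $y = 2\lambda$. Considering the set of points in question, the relation $x = y$ gives the stationary points
$$(1)\quad \left[\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2}\right], \quad \left[-\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2}\right].$$
Now, let us examine the second differential of the function $L$. We can easily compute that
$$L_{xx} = \frac{2}{x^3} - \frac{6\lambda}{x^4}, \quad L_{xy} = 0, \quad L_{yy} = \frac{2}{y^3} - \frac{6\lambda}{y^4},$$
whence it follows that
$$d^2L(x,y) = \left(\frac{2}{x^3} - \frac{6\lambda}{x^4}\right)dx^2 + \left(\frac{2}{y^3} - \frac{6\lambda}{y^4}\right)dy^2.$$
Differentiating the constraint $\frac{1}{x^2} + \frac{1}{y^2} = 4$, we get
$$-\frac{2}{x^3}\,dx - \frac{2}{y^3}\,dy = 0, \quad \text{i.e.} \quad dy^2 = \frac{y^6}{x^6}\,dx^2.$$
Therefore,
$$d^2L(x,y) = \left[\frac{2}{x^3} - \frac{6\lambda}{x^4} + \left(\frac{2}{y^3} - \frac{6\lambda}{y^4}\right)\frac{y^6}{x^6}\right]dx^2.$$
In fact, we are considering a one-dimensional quadratic form whose positive (negative) definiteness at a stationary point means that there is a minimum (maximum) at that point. Realizing that the stationary points had $x = 2\lambda$, $y = 2\lambda$, mere substitution yields
$$d^2L\left(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2}\right) = -4\sqrt2\,dx^2, \quad d^2L\left(-\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2}\right) = 4\sqrt2\,dx^2,$$
which means that there is a strict local maximum of the function $f$ at the point $[\sqrt2/2, \sqrt2/2]$, while at the point $[-\sqrt2/2, -\sqrt2/2]$, there is a strict local minimum. The corresponding values are:
$$(2)\quad f\left(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2}\right) = 2\sqrt2, \quad f\left(-\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2}\right) = -2\sqrt2.$$
Now, we will demonstrate a quicker way to obtain the result. We know (or we can easily calculate) the second partial derivatives of the function $L$, i.e., the Hessian with respect to the variables $x$ and $y$:
$$HL(x,y) = \begin{pmatrix} \frac{2}{x^3} - \frac{6\lambda}{x^4} & 0 \\ 0 & \frac{2}{y^3} - \frac{6\lambda}{y^4} \end{pmatrix}.$$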
to the gradient, and the gradient itself is the normal vector of the hypersurface $M_b$.
For instance, consider a sphere in $\mathbb{R}^3$ with radius $r > 0$, centered at $(a,b,c)$, i.e. given implicitly by the equation
$$F(x,y,z) = (x-a)^2 + (y-b)^2 + (z-c)^2 = r^2.$$
The normal vectors at a point $P = (x_0, y_0, z_0)$ are obtained as non-zero multiples of the gradient, i.e. multiples of
$$D^1F = (2(x_0 - a), 2(y_0 - b), 2(z_0 - c)),$$
and the tangent vectors are exactly the vectors perpendicular to the gradient. Therefore, the tangent plane to a sphere at the point $P$ can always be described implicitly in terms of the gradient by the equation
$$0 = (x_0 - a)(x - x_0) + (y_0 - b)(y - y_0) + (z_0 - c)(z - z_0).$$
This is a special case of the following general formula:
Tangent hyperplanes to level sets
Theorem. For a function $F(x_1,\dots,x_n)$ of $n$ variables and a point $P = (p_1,\dots,p_n)$ in a level set $M_b$ of the function $F$ such that the gradient $D^1F$ is non-vanishing at $P$, the implicit equation for the tangent hyperplane to $M_b$ is
$$0 = \frac{\partial F}{\partial x_1}(P)(x_1 - p_1) + \dots + \frac{\partial F}{\partial x_n}(P)(x_n - p_n).$$
Proof. The statement is clear from the previous discussions. The tangent hyperplane must be $(n-1)$-dimensional, so its direction space is given as the kernel of the linear form given by the gradient (zero values of the corresponding linear mapping $\mathbb{R}^n \to \mathbb{R}$ given by multiplying the column of coordinates by the row vector $\operatorname{grad} F$). Clearly, the selected point $P$ satisfies the equation. □
8.1.26. Illumination of 3D objects. Consider the illumination of a three-dimensional object where the direction $v$ of the light falling onto the two-dimensional surface $M$ of this object is known. Assume $M$ is given implicitly by an equation
$$F(x,y,z) = 0.$$
The light intensity at a point $P \in M$ is defined as $I\cos\varphi$, where $\varphi$ is the angle between the normal line to $M$ and the vector which is opposite to the flow of the light. As seen, the normal line is determined by the gradient of the function $F$. The sign of the expression then says which side of the surface is illuminated.
For example, consider an illumination with constant intensity $I_0$ in the direction of the vector $v = (1,1,-1)$ (i.e. "downward askew"), and let the ball given by the equation $F(x,y,z) = x^2 + y^2 + z^2 - 1 \le 0$ be the object of interest. Then, for a point $P = (x,y,z) \in M$ on the surface, the intensity
$$I(P) = \frac{\operatorname{grad}F \cdot (-v)}{\|\operatorname{grad}F\|\,\|v\|}\,I_0 = \frac{-2x - 2y + 2z}{2\sqrt3}\,I_0$$
is obtained. Notice that, as anticipated, the point which is illuminated with the (full) intensity $I_0$ is the point $P =$
The evaluation
$$HL\left(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2}\right) = \begin{pmatrix} -2\sqrt2 & 0 \\ 0 & -2\sqrt2 \end{pmatrix}, \quad HL\left(-\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2}\right) = \begin{pmatrix} 2\sqrt2 & 0 \\ 0 & 2\sqrt2 \end{pmatrix}$$
then tells us that the quadratic form is negative definite for the former stationary point (there is a strict local maximum) and positive definite for the latter one (there is a strict local minimum).
We should be aware of a potential trap in this "quicker" method in the case we obtain an indefinite form (matrix). Then, we cannot conclude that there is no extremum at that point: since we have not included the constraint (which we did when computing $d^2L$), we are considering a more general situation. The graph of the function $f$ on the given set is a curve which can be described as a univariate function. This must correspond to a one-dimensional quadratic form. □
8.H.13. Find the global extrema of the function
$$f(x,y) = \frac1x + \frac1y, \quad x \ne 0,\ y \ne 0$$
on the set of points that satisfy the equation $\frac{1}{x^2} + \frac{1}{y^2} = 4$.
Solution. This exercise is to illustrate that looking for global extrema may be much easier than for local ones (cf. the above exercise) even in the case when the function values are considered on an unbounded set. First, we would determine the stationary points (1) and the values (2) the same way as above. Let us emphasize that we are looking for the function's extrema on a set that is not compact, so it does not suffice to evaluate the function at the stationary points. The reason is that the function $f$ may not have an extremum on the considered set; its range might be an open interval. However, we will show that this is not the case here.
Let us thus consider $|x| > 10$. The equation $\frac{1}{x^2} + \frac{1}{y^2} = 4$ can then be satisfied only by those values $y$ for which $|y| > 1/2$. We have thus obtained the bounds
$$-2\sqrt2 < -\tfrac{1}{10} - 2 < f(x,y) < \tfrac{1}{10} + 2 < 2\sqrt2, \quad \text{if } |x| > 10.$$
At the same time, we have (interchanging $x$ and $y$ leads to the same task)
$$-2\sqrt2 < -\tfrac{1}{10} - 2 < f(x,y) < \tfrac{1}{10} + 2 < 2\sqrt2, \quad \text{if } |y| > 10.$$
Hence we can see that the function $f$ must have global extrema on the considered set, and this must happen inside
$\frac{1}{\sqrt3}(-1,-1,1)$ on the surface of the ball, while the antipodal point is fully illuminated with the minus sign (i.e. on the inside of the sphere).
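The illumination example can be evaluated directly. In the following sketch the function name `intensity` and the default arguments are illustrative assumptions; the computation is the cosine of the angle between the gradient of $F(x,y,z) = x^2 + y^2 + z^2 - 1$ and the vector $-v$ opposite to the light direction $v = (1,1,-1)$, scaled by $I_0$.

```python
import math

def intensity(p, v=(1.0, 1.0, -1.0), i0=1.0):
    """Light intensity at a point p of the unit sphere for light direction v."""
    grad = tuple(2.0 * c for c in p)                  # gradient of x^2 + y^2 + z^2 - 1
    dot = sum(g * (-w) for g, w in zip(grad, v))      # grad F . (-v)
    norms = math.sqrt(sum(g * g for g in grad)) * math.sqrt(sum(w * w for w in v))
    return i0 * dot / norms

s = 1.0 / math.sqrt(3.0)
print(intensity((-s, -s, s)))     # the fully illuminated point
print(intensity((s, s, -s)))      # the antipodal point, negative sign
print(intensity((0.0, 0.0, 1.0))) # the "north pole", partially illuminated
```

The sign convention follows the text: positive values correspond to the outer side of the sphere facing the light, negative values to the side facing away.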
8.1.27. Tangent and normal spaces. Ideas about tangent and normal lines can be extended to general dimensions. With a mapping $F : \mathbb{R}^{m+n} \to \mathbb{R}^n$ and coordinate functions $f_i$, one can also consider the $n$ equations for $n + m$ variables
$$f_i(x_1,\dots,x_{m+n}) = b_i, \quad i = 1,\dots,n,$$
expressing the equality $F(x) = b$ for a vector $b \in \mathbb{R}^n$.
Assuming that the conditions of the implicit function theorem hold, the set of all solutions $(x_1,\dots,x_{m+n}) \in \mathbb{R}^{m+n}$ is (at least locally) the graph of a mapping $G : \mathbb{R}^m \to \mathbb{R}^n$. Technically, it is necessary to have some submatrix in $D^1F$ of the maximal possible rank $n$.
For a fixed choice $b = (b_1,\dots,b_n)$, the set of all solutions is, of course, the intersection of all hypersurfaces $M(b_i, f_i)$ corresponding to the particular functions $f_i$. The same must hold for tangent directions, while normal directions are generated by the individual gradients. Therefore, if
$$D^1F = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_{m+n}} \\ \vdots & & \vdots \\ \frac{\partial f_n}{\partial x_1} & \dots & \frac{\partial f_n}{\partial x_{m+n}} \end{pmatrix}$$
is the Jacobi matrix of a mapping which implicitly defines a set $M$ and $P = (p_1,\dots,p_{m+n}) \in M$ is a point such that $M$ is the graph of a mapping in the neighbourhood of the point $P$, then the affine subspace in $\mathbb{R}^{m+n}$ which contains exactly all tangent lines going through the point $P$ is given implicitly by the following equations:
$$0 = \frac{\partial f_1}{\partial x_1}(P)(x_1 - p_1) + \dots + \frac{\partial f_1}{\partial x_{m+n}}(P)(x_{m+n} - p_{m+n}),$$
$$\vdots$$
$$0 = \frac{\partial f_n}{\partial x_1}(P)(x_1 - p_1) + \dots + \frac{\partial f_n}{\partial x_{m+n}}(P)(x_{m+n} - p_{m+n}).$$
This subspace is called the tangent space to the (implicitly given) $m$-dimensional surface $M$ at the point $P$.
The normal space at the point $P$ is the affine subspace generated by the point $P$ and the gradients of all the functions $f_1,\dots,f_n$ at the point $P$, i.e. the rows of the Jacobi matrix $D^1F$.
As an illustrative simple example, calculate the tangent and normal spaces to a conic section in $\mathbb{R}^3$. Consider the equation of a cone with vertex at the origin,
$$0 = f(x,y,z) = z - \sqrt{x^2 + y^2},$$
and a plane, given by
$$0 = g(x,y,z) = z - 2x + y + 1.$$
The point $P = (1,0,1)$ belongs to both the cone and the plane, so the intersection $M$ of these surfaces is a curve (draw
the square $ABCD$ with vertices $A = [-10,-10]$, $B = [10,-10]$, $C = [10,10]$, $D = [-10,10]$.
The intersection of the "hundred times reduced" square with vertices at $\tilde A = [-1/10,-1/10]$, $\tilde B = [1/10,-1/10]$, $\tilde C = [1/10,1/10]$, $\tilde D = [-1/10,1/10]$ and the given set is clearly the empty set. Therefore, the global extrema are at points inside the compact set bounded by these two squares. Since $f$ is continuously differentiable on this set, the global extrema can occur only at stationary points. We thus must have
$$f_{\max} = f\left(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2}\right) = 2\sqrt2, \quad f_{\min} = f\left(-\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2}\right) = -2\sqrt2.$$
□
8.H.14. Determine the maximal and minimal values of the function $f(x,y,z) = xyz$ on the set $M$ given by the conditions
$$x^2 + y^2 + z^2 = 1, \quad x + y + z = 0.$$
Solution. It is not hard to realize that $M$ is a circle. However, for our problem, it is sufficient to know that $M$ is compact, i.e. bounded (by the first condition, the equation of the unit sphere) and closed (the set of solutions of the given equations is closed since if the equations are satisfied by all terms of a converging sequence, then they are satisfied by its limit as well). The function $f$ as well as the constraint functions $F(x,y,z) = x^2 + y^2 + z^2 - 1$, $G(x,y,z) = x + y + z$ have continuous partial derivatives of all orders (since they are polynomials). The Jacobi constraint matrix is
$$\begin{pmatrix} F_x(x,y,z) & F_y(x,y,z) & F_z(x,y,z) \\ G_x(x,y,z) & G_y(x,y,z) & G_z(x,y,z) \end{pmatrix} = \begin{pmatrix} 2x & 2y & 2z \\ 1 & 1 & 1 \end{pmatrix}.$$
a diagram!). Its tangent line at the point $P$ is given by the following equations:
Its rank is reduced (less than 2) if and only if the vector $(2x, 2y, 2z)$ is a multiple of the vector $(1,1,1)$, which gives $x = y = z$, and thus $x = y = z = 0$ (by the second constraint). However, the set $M$ does not contain the origin. Therefore, we may look for stationary points using the method of Lagrange multipliers. For
$$L(x,y,z,\lambda_1,\lambda_2) = xyz - \lambda_1(x^2 + y^2 + z^2 - 1) - \lambda_2(x + y + z),$$
the equations $L_x = 0$, $L_y = 0$, $L_z = 0$ give
$$yz - 2\lambda_1x - \lambda_2 = 0, \quad xz - 2\lambda_1y - \lambda_2 = 0,$$
$$0 = -\left.\frac{1}{2\sqrt{x^2+y^2}}\,2x\right|_{x=1,y=0}(x - 1) - \left.\frac{1}{2\sqrt{x^2+y^2}}\,2y\right|_{x=1,y=0}\,y + 1\cdot(z - 1) = -x + z,$$
$$0 = -2(x - 1) + y + (z - 1) = -2x + y + z + 1,$$
while the plane perpendicular to the curve, containing the point $P$, is given parametrically by the expression
$$(1,0,1) + \tau(-1,0,1) + \sigma(-2,1,1)$$
with real parameters $\tau$ and $\sigma$.
8.1.28. Constrained extrema. Now we come to the first really serious application of the differential calculus of more variables. The typical task in optimization is to find the extrema of values depending on several (yet finitely many) parameters, under some further constraints on the parameters of the model.
The problem often has $m + n$ parameters constrained by $n$ conditions. In the language of differential calculus, it is desired to find the extrema of a differentiable function $h$ on the set $M$ of points given implicitly by a vector equation $F(x_1,\dots,x_{m+n}) = 0$. Of course, we might first locally parameterize the solution space of the latter equation by $m$ free parameters, express the function $h$ in terms of these parameters and look for the local extrema by inspecting the critical points. However, we have already prepared more efficient procedures for this effort.
If $h$ has an extremum on $M$ at a point $P$, then for every curve $c(t) \subset M$ going through $P = c(0)$, the univariate function $h(c(t))$ must have an extremum at $t = 0$. Therefore, the derivative must satisfy
$$\left.\frac{d}{dt}h(c(t))\right|_{t=0} = d_{c'(0)}h(P) = dh(P)(c'(0)) = 0.$$
This means that the differential of the function $h$ at the point $P$ is zero along all tangent increments to $M$ at $P$. This property is equivalent to stating that the gradient of $h$ lies in the normal subspace (more precisely, in the modelling vector space of the normal subspace). Such points $P \in M$ are called stationary points of the function $h$ with respect to the constraints given by $F$.
As seen in the previous paragraph, the normal space to the set $M$ is generated by the rows of the Jacobi matrix of the mapping $F$, so the stationary points are described equivalently by the following proposition:
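$$xy - 2\lambda_1z - \lambda_2 = 0,$$
respectively. Subtracting the first equation from the second one and from the third one leads to
$$xz - yz - 2\lambda_1y + 2\lambda_1x = 0, \quad xy - yz - 2\lambda_1z + 2\lambda_1x = 0,$$
i.e.,
$$(x - y)(z + 2\lambda_1) = 0, \quad (x - z)(y + 2\lambda_1) = 0.$$
The last equations are satisfied in these four cases:
$$x = y,\ x = z; \quad x = y,\ y = -2\lambda_1; \quad z = -2\lambda_1,\ x = z; \quad z = -2\lambda_1,\ y = -2\lambda_1,$$
thus (including the constraint $G = 0$)
$$x = y = z = 0; \quad x = y = -2\lambda_1,\ z = 4\lambda_1; \quad x = z = -2\lambda_1,\ y = 4\lambda_1; \quad x = 4\lambda_1,\ y = z = -2\lambda_1.$$
Except for the first case (which clearly cannot happen), including the constraint $F = 0$ yields
$$(4\lambda_1)^2 + (-2\lambda_1)^2 + (-2\lambda_1)^2 = 1, \quad \text{i.e.} \quad \lambda_1 = \pm\frac{1}{2\sqrt6}.$$
Altogether, we get the points
$$\left[\tfrac{1}{\sqrt6}, \tfrac{1}{\sqrt6}, -\tfrac{2}{\sqrt6}\right], \quad \left[-\tfrac{1}{\sqrt6}, -\tfrac{1}{\sqrt6}, \tfrac{2}{\sqrt6}\right], \quad \left[\tfrac{1}{\sqrt6}, -\tfrac{2}{\sqrt6}, \tfrac{1}{\sqrt6}\right],$$
$$\left[-\tfrac{1}{\sqrt6}, \tfrac{2}{\sqrt6}, -\tfrac{1}{\sqrt6}\right], \quad \left[-\tfrac{2}{\sqrt6}, \tfrac{1}{\sqrt6}, \tfrac{1}{\sqrt6}\right], \quad \left[\tfrac{2}{\sqrt6}, -\tfrac{1}{\sqrt6}, -\tfrac{1}{\sqrt6}\right].$$
We will not verify that these really are stationary points. The only important thing is that all stationary points are among these six.
We are looking for the global maximum and minimum of the continuous function $f$ on the compact set $M$. However, the global extrema (we know they exist) can occur only at points of local extrema with respect to $M$. And the local extrema can occur only at the aforementioned points. Therefore, it suffices to evaluate the function $f$ at these points. Thus we find out that the wanted maximum is
$$f\left(-\tfrac{1}{\sqrt6}, -\tfrac{1}{\sqrt6}, \tfrac{2}{\sqrt6}\right) = f\left(-\tfrac{1}{\sqrt6}, \tfrac{2}{\sqrt6}, -\tfrac{1}{\sqrt6}\right) = f\left(\tfrac{2}{\sqrt6}, -\tfrac{1}{\sqrt6}, -\tfrac{1}{\sqrt6}\right) = \frac{1}{3\sqrt6},$$
while the minimum is
$$f\left(\tfrac{1}{\sqrt6}, \tfrac{1}{\sqrt6}, -\tfrac{2}{\sqrt6}\right) = f\left(\tfrac{1}{\sqrt6}, -\tfrac{2}{\sqrt6}, \tfrac{1}{\sqrt6}\right) = f\left(-\tfrac{2}{\sqrt6}, \tfrac{1}{\sqrt6}, \tfrac{1}{\sqrt6}\right) = -\frac{1}{3\sqrt6}.$$
□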
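The last step of 8.H.14, evaluating $f(x,y,z) = xyz$ at the six candidate points, is easy to reproduce. The sketch below generates the candidates as permutations of $(1,1,-2)/\sqrt6$ and $(-1,-1,2)/\sqrt6$, checks that they lie on $M$, and prints the values; the helper name `on_M` and the tolerance are illustrative choices.

```python
import math
from itertools import permutations

s = 1.0 / math.sqrt(6.0)

# Six candidate points: permutations of (1, 1, -2)/sqrt(6) and of (-1, -1, 2)/sqrt(6).
candidates = sorted(set(permutations((s, s, -2 * s))) |
                    set(permutations((-s, -s, 2 * s))))

def f(x, y, z):
    return x * y * z

def on_M(p, tol=1e-12):
    # Both constraints: unit sphere and the plane x + y + z = 0.
    x, y, z = p
    return abs(x * x + y * y + z * z - 1.0) < tol and abs(x + y + z) < tol

for p in candidates:
    print(p, on_M(p), f(*p))

f_max = max(f(*p) for p in candidates)
f_min = min(f(*p) for p in candidates)
```

The largest and smallest values come out as $\pm 1/(3\sqrt6)$, each attained at three of the six points.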
Lagrange multipliers
Theorem. Let $F = (f_1,\dots,f_n) : \mathbb{R}^{m+n} \to \mathbb{R}^n$ be a differentiable function in a neighbourhood of a point $P$, $F(P) = 0$. Further, let $M$ be given implicitly by an equation $F(x,y) = 0$, and let the rank of the matrix $D^1F$ at the point $P$ be $n$. Then $P$ is a stationary point of a continuously differentiable function $h : \mathbb{R}^{m+n} \to \mathbb{R}$ with respect to the constraints $F$, if and only if there exist real parameters $\lambda_1,\dots,\lambda_n$ such that
$$\operatorname{grad} h = \lambda_1\operatorname{grad} f_1 + \dots + \lambda_n\operatorname{grad} f_n.$$
The procedure suggested by the theorem is called the method of Lagrange multipliers. It is of algorithmic character. Consider the numbers of unknowns and equations: the gradients are vectors of $m + n$ coordinates, so the statement of the theorem yields $m + n$ equations. The variables are, on one side, the coordinates $x_1,\dots,x_{m+n}$ of the stationary points $P$ with respect to the constraints, and, on the other hand, the $n$ parameters $\lambda_i$ in the linear combination. It remains to say that the point $P$ belongs to the implicitly given set $M$, which represents $n$ further equations. Altogether, there are $2n + m$ equations for $2n + m$ variables, so it can be expected that the solution is given by a discrete set of points $P$ (i.e., each one of them is an isolated point).
Very often, the system of equations is a seemingly simple system of algebraic equations, but in fact only rarely can it be solved explicitly. We return to special algebraic methods for systems of polynomial equations in chapter 12. There are also various numerical approaches to such systems. Theoretical details are not discussed here, but there are several solved examples in the other column, including also the illustration of how to use the second derivatives to decide about the local extrema under the constraints.
8.1.29. Arithmetic mean versus geometric mean. As an example of practical application of the Lagrange multipliers, we prove the inequality
$$\frac1n(x_1 + \dots + x_n) \ge \sqrt[n]{x_1 \cdots x_n}$$
for any $n$ positive real numbers $x_1,\dots,x_n$. Equality occurs if and only if all the $x_i$'s are equal.
Consider the sum $x_1 + \dots + x_n = c$ as the constraint for a (non-specified) positive constant $c$. We look for the maxima and minima of the function
$$f(x_1,\dots,x_n) = \sqrt[n]{x_1 \cdots x_n}$$
with respect to the constraint and the assumption $x_1 > 0, \dots, x_n > 0$.
The normal vector to the hyperplane $M$ defined by the constraint is $(1,\dots,1)$. Therefore, the function $f$ can have an extremum only at those points where its gradient is a multiple of this normal vector. Hence there is the following system of
8.H.15. Find the extrema of the function $f : \mathbb{R}^3 \to \mathbb{R}$,
$f(x,y,z) = x^2 + y^2 + z^2$, on the plane $x + y - z = 1$ and determine their types.
Solution. We can easily build the equations that describe the linear dependency between the normal to the constraint surface and the gradient of the examined function:
$$x = k, \quad y = k, \quad z = -k, \quad k \in \mathbb{R}.$$
The only solution is the point $[\tfrac13, \tfrac13, -\tfrac13]$. Further, we can notice that the function is increasing in the direction of $(1,-1,0)$, and this direction lies in the constraint plane. Therefore, the examined function has a minimum at this point. Another solution. We will reduce this problem to finding the extrema of a two-variable function on $\mathbb{R}^2$. Since the constraint is linear, we can express $z = x + y - 1$. Substituting this into the given function then yields a real-valued function of two variables: $f(x,y) = x^2 + y^2 + (x + y - 1)^2 = 2x^2 + 2xy + 2y^2 - 2x - 2y + 1$. Setting both partial derivatives equal to zero, we get the linear equations
$$4x + 2y - 2 = 0, \quad 4y + 2x - 2 = 0,$$
whose only solution is the point $[\tfrac13, \tfrac13]$. Since it is a quadratic function with positive coefficients at the squares of the unknowns, it is unbounded from above on $\mathbb{R}^2$. Therefore, there is a (global) minimum at the obtained point. Then, we can get the corresponding point $[\tfrac13, \tfrac13, -\tfrac13]$ in the constraint plane from the linear dependency of $z$. □
8.H.16. Find the extrema of the function $x + y : \mathbb{R}^3 \to \mathbb{R}$ on the circle given by the equations $x + y + z = 1$ and
$$x^2 + y^2 + z^2 = 4.$$
Solution. The "suspects" are those points which satisfy
$$(1,1,0) = k \cdot (1,1,1) + l \cdot (x,y,z), \quad k, l \in \mathbb{R}.$$
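Clearly, $x = y$ $(= (1-k)/l)$. Substituting this into the equations of the circle then leads to the two solutions
$$\left[\tfrac13 + \tfrac{\sqrt{22}}{6}, \tfrac13 + \tfrac{\sqrt{22}}{6}, \tfrac13 - \tfrac{\sqrt{22}}{3}\right], \quad \left[\tfrac13 - \tfrac{\sqrt{22}}{6}, \tfrac13 - \tfrac{\sqrt{22}}{6}, \tfrac13 + \tfrac{\sqrt{22}}{3}\right].$$
equations for the desired points:
$$\frac1n \cdot \frac{1}{x_i}\sqrt[n]{x_1 \cdots x_n} = \lambda$$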
Since every circle is compact, it suffices to examine the function values at these two points. We find out that there is a maximum of the considered function on the given circle at the former point and a minimum at the latter one. □
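The result of 8.H.16 can be cross-checked by sampling the circle. The sketch below assumes a parametrization of the circle by its centre $(\tfrac13,\tfrac13,\tfrac13)$, its radius $\sqrt{11/3}$ and an orthonormal basis of the plane $x + y + z = 0$; these auxiliary quantities and all names are introduced here for illustration and are not part of the exercise.

```python
import math

r22 = math.sqrt(22.0)
p_max = (1/3 + r22/6, 1/3 + r22/6, 1/3 - r22/3)
p_min = (1/3 - r22/6, 1/3 - r22/6, 1/3 + r22/3)

def on_circle(p, tol=1e-9):
    x, y, z = p
    return abs(x + y + z - 1.0) < tol and abs(x*x + y*y + z*z - 4.0) < tol

def h(p):
    # The examined function x + y.
    return p[0] + p[1]

def circle_point(t):
    # Centre (1/3, 1/3, 1/3), radius sqrt(4 - 1/3), orthonormal directions u, w in the plane.
    R = math.sqrt(4.0 - 1.0 / 3.0)
    u = (1/math.sqrt(2), -1/math.sqrt(2), 0.0)
    w = (1/math.sqrt(6), 1/math.sqrt(6), -2/math.sqrt(6))
    return tuple(1/3 + R * (math.cos(t) * a + math.sin(t) * b) for a, b in zip(u, w))

samples = [circle_point(2.0 * math.pi * k / 3600) for k in range(3600)]
print(max(map(h, samples)), h(p_max))
print(min(map(h, samples)), h(p_min))
```

The sampled maximum and minimum of $x + y$ agree with the values at the two points found by the Lagrange condition.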
8.H.17. Find the extrema of the function $f : \mathbb{R}^3 \to \mathbb{R}$,
$f(x,y,z) = x^2 + y^2 + z^2$, on the plane $2x + y - z = 1$ and determine their types. ○
for $i = 1,\dots,n$ and $\lambda \in \mathbb{R}$.
This system has the unique solution $x_1 = \dots = x_n$ in the set $M$. If the variables $x_i$ are allowed to be zero as well, then the set $M$ would be compact, so the function $f$ would have to have both a maximum and a minimum there. However, $f$ is minimal if and only if at least one of the values $x_i$ is zero; so the function necessarily has a strict maximum at the point with $x_i = \frac{c}{n}$, $i = 1,\dots,n$, and then $\lambda = \frac1n$.
By substituting, the geometric mean equals the arithmetic mean at this extremal point, but it is strictly smaller at all other points with the given sum $c$ of coordinates, which proves the inequality.
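The inequality can be probed on random data. This is only a numerical sanity check, not a proof; the sample size, the range of the random numbers, and the helper name `means` are arbitrary choices made here.

```python
import math
import random

def means(xs):
    """Return (arithmetic mean, geometric mean) of a list of positive numbers."""
    n = len(xs)
    return sum(xs) / n, math.prod(xs) ** (1.0 / n)

random.seed(0)
samples = [[random.uniform(0.1, 10.0) for _ in range(5)] for _ in range(1000)]
gaps = [am - gm for am, gm in map(means, samples)]

print(min(gaps))            # never negative (up to rounding)
print(means([2.0] * 4))     # equal entries: both means coincide
```

The gap between the two means stays non-negative on all samples and vanishes when all entries are equal, as the theorem predicts.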
2. Integration for the second time
We return to the process of integration, discussed in the second and third parts of chapter six. We saw that the integration with respect to the diverse coordinates can be iterated. Now we extend the concept of the Riemann integration and Jordan measure to general Euclidean spaces and, again, we shall see that the approaches coincide for many reasonable functions.
8.2.1. Integrals dependent on parameters. Recall that when integrating a function $f(x, y_1,\dots,y_n)$ of $n + 1$ variables with respect to the single variable $x$, the result is a function $F(y_1,\dots,y_n)$ of the remaining variables. Essentially, we proved the following theorem already in 6.3.12 and 6.3.14. This is an extremely useful technical tool, as we saw when handling the Fourier transforms and convolutions in the last chapter. Previous results about extrema of multivariate functions also have a direct application for minimization of areas or volumes of objects defined in terms of functions dependent on parameters, etc.
Continuity and differentiation
Theorem. Consider a continuous function $f(x, y_1,\dots,y_n)$ defined for all $x$ from a finite interval $[a,b]$ and for all $(y_1,\dots,y_n)$ lying in some neighbourhood $U$ of a point $c = (c_1,\dots,c_n) \in \mathbb{R}^n$, and its integral
$$F(y_1,\dots,y_n) = \int_a^b f(x, y_1,\dots,y_n)\,dx.$$
Then the function $F(y_1,\dots,y_n)$ is continuous on $U$.
Moreover, if there exists the continuous partial derivative $\frac{\partial f}{\partial y_j}$ on a neighbourhood of the point $c$, then $\frac{\partial F}{\partial y_j}(c)$ exists as well and
$$\frac{\partial F}{\partial y_j}(c) = \int_a^b \frac{\partial f}{\partial y_j}(x, c_1,\dots,c_n)\,dx.$$
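The differentiation formula can be tested numerically on a simple example. The sketch below assumes the integrand $f(x,y) = \sin(xy)$ on $[0,1]$ with one parameter $y$, a midpoint quadrature rule, and a central difference for the derivative of $F$; all of these are illustrative choices, not part of the theorem.

```python
import math

def midpoint(g, a, b, n=2000):
    # Composite midpoint rule for the integral of g over [a, b].
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def F(y):
    # F(y) = integral over [0, 1] of sin(x*y) dx
    return midpoint(lambda x: math.sin(x * y), 0.0, 1.0)

def dF_under_integral(y):
    # Integral of the partial derivative with respect to y: x*cos(x*y)
    return midpoint(lambda x: x * math.cos(x * y), 0.0, 1.0)

def dF_numeric(y, h=1e-4):
    # Central difference of F, used as an independent check.
    return (F(y + h) - F(y - h)) / (2.0 * h)

y0 = 1.0
print(dF_under_integral(y0), dF_numeric(y0), math.sin(1.0) + math.cos(1.0) - 1.0)
```

For this integrand $F(y) = (1 - \cos y)/y$, so $F'(1) = \sin 1 + \cos 1 - 1$, and the three printed numbers should agree to several digits.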
8.H.18. Find the maximum of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x,y) = xy$ on the circle with radius 1 which is centered at the point $[x_0,y_0] = [0,1]$. ○
8.H.19. Find the minimum of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f = xy$ on the circle with radius 1 which is centered at the point $[x_0,y_0] = [2,0]$. ○
8.H.20. Find the minimum of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f = xy$ on the circle with radius 1 which is centered at the point $[x_0,y_0] = [2,0]$. ○
8.H.21. Find the minimum of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f = xy$ on the ellipse $x^2 + 3y^2 = 1$. ○
8.H.22. Find the minimum of the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f = x^2y$ on the circle with radius 1 which is centered at the point $[x_0,y_0] = [0,0]$. ○
8.H.23. Find the maximum of the function $f : \mathbb{R}^2 \to \mathbb{R}$,
$f(x,y) = x^3y$ on the circle $x^2 + y^2 = 1$. ○
8.H.24. Find the maximum of the function $f : \mathbb{R}^2 \to \mathbb{R}$,
$f(x,y) = xy$ on the ellipse $2x^2 + 3y^2 = 1$. ○
8.H.25. Find the maximum of the function $f : \mathbb{R}^2 \to \mathbb{R}$,
$f(x,y) = xy$ on the ellipse $x^2 + 2y^2 = 1$. ○
I. Volumes, areas, centroids of solids
8.I.1. Find the volume of the solid which lies in the half-space $z \ge 0$, the cylinder $x^2 + y^2 \le 1$, and the half-space
a) $z \le x$,
b) $x + y + z \le 0$.
Proof. In Chapter 6, we dealt with two variables x, y only, but replacing the absolute value |y| with the norm ||y|| of the vector of parameters does not change the argumentation at all. Again, the main point is that the continuous real functions on compact sets are uniformly continuous.
Since the partial derivative concerns only one of the variables, the rest of the theorem was proved in 6.3.14, too. □
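The statement about differentiating under the integral sign can be checked numerically. The following Python sketch is only an illustration under assumptions that are not in the text: the sample function $f(x,y) = \sin(xy)$ on $[0,1]$, the point $c = 0.7$, and the helper name `midpoint` are all chosen freely for this example.

```python
import math

def midpoint(g, a, b, n=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def f(x, y):
    # sample integrand depending on the parameter y (an assumption of this sketch)
    return math.sin(x * y)

def df_dy(x, y):
    # partial derivative of f with respect to the parameter y
    return x * math.cos(x * y)

def F(y):
    # F(y) = integral of f(x, y) over x in [0, 1]
    return midpoint(lambda x: f(x, y), 0.0, 1.0)

c = 0.7
eps = 1e-5
# central difference approximating the derivative of the integral F at c
derivative_of_integral = (F(c + eps) - F(c - eps)) / (2 * eps)
# integral of the partial derivative, as on the right-hand side of the theorem
integral_of_derivative = midpoint(lambda x: df_dy(x, c), 0.0, 1.0)
print(derivative_of_integral, integral_of_derivative)
```

Both printed numbers should agree up to the accuracy of the difference quotient.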
8.2.2. Integration of multivariate functions. In the case of univariate functions, integration is motivated by the idea of the area under the graph of a given function of one variable. Consider now the volume of the part of the three-dimensional space which lies under the graph of a function z = f(x, y) of two variables, and the multidimensional analogues in general.
In chapter six, small intervals $[x_i, x_{i+1}]$ of length $\Delta x_i$ were chosen which divided the whole interval $[a,b]$. Then, their representatives $\xi_i$ were selected, and the corresponding part of the area was approximated by the area of the rectangle with height given by the value $f(\xi_i)$ at the representative, i.e. the expression $f(\xi_i)\,\Delta x_i$.
In the case of functions of two variables, work with divisions in both variables and the values representing the height of the graph above the particular little rectangles in the plane.
The first thing to deal with is to determine the integration domain, that is, the region the function $f$ is to be integrated over. As an example, consider the function $z = f(x,y) = \sqrt{1 - x^2 - y^2}$, whose graph is, inside the unit disc, half of the unit sphere. Integrating this function over the unit disc yields the volume of the unit semi-ball.
The simplest approach is to consider only those integration domains $M$ which are given by products of intervals, i.e. given by ranges $x \in [a,b]$ and $y \in [c,d]$. In this context, such a domain is called a multidimensional interval. If $M$ is a different bounded set in $\mathbb{R}^2$, work with a sufficiently large area $[a,b] \times [c,d]$, rather than with the set itself, and adjust the function so that $f(x,y) = 0$ for all points lying outside $M$. Considering the above case of the unit ball, integrate over the set $M = [-1,1] \times [-1,1]$ the function
$$f(x,y) = \begin{cases} \sqrt{1 - x^2 - y^2} & \text{for } x^2 + y^2 \le 1, \\ 0 & \text{otherwise.} \end{cases}$$
Solution. a) The volume can be calculated with ease using cylindric coordinates. There, the cylinder is determined by
The definition of the Riemann integral then faithfully follows the procedure from paragraph 6.2.8. This can be done for an arbitrary finite number of variables.
Given an $n$-dimensional interval $I$ and partitions into $k_i$ subintervals in each variable $x_i$, select the partition of $I$ into $k_1 \cdots k_n$ small $n$-dimensional intervals, and write $\Delta x_{i_1 \dots i_n}$ for their volumes. The maximum of the lengths of the sides of the multidimensional intervals in such a partition is called its norm.
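As a small numerical illustration, the following Python sketch evaluates one integral sum for the half-sphere function extended by zero to $[-1,1] \times [-1,1]$. The helper name `riemann_sum` and the choice of midpoints as representatives are assumptions made only for this example; the value should approach $2\pi/3$, the volume of the unit semi-ball.

```python
import math

def f(x, y):
    # half-sphere inside the unit disc, extended by zero outside of it
    s = 1.0 - x * x - y * y
    return math.sqrt(s) if s > 0.0 else 0.0

def riemann_sum(f, a, b, c, d, kx, ky):
    """Integral sum over [a, b] x [c, d] with kx * ky equal cells and midpoint representatives."""
    dx = (b - a) / kx
    dy = (d - c) / ky
    total = 0.0
    for i in range(kx):
        x = a + (i + 0.5) * dx
        for j in range(ky):
            y = c + (j + 0.5) * dy
            total += f(x, y) * dx * dy   # value at the representative times the cell volume
    return total

approx = riemann_sum(f, -1.0, 1.0, -1.0, 1.0, 400, 400)
exact = 2.0 * math.pi / 3.0
print(approx, exact)
```

Refining the partition (larger `kx`, `ky`) makes the norm of the partition smaller and the sum closer to the integral.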
the inequality $r \le 1$, and the half-space $z \le x$ then by $z \le r\cos\varphi$. Altogether, we get
$$V = \int_0^1 \int_{-\pi/2}^{\pi/2} \int_0^{r\cos\varphi} r\,dz\,d\varphi\,dr = \frac{2}{3}.$$
b) We will reduce this problem to one that is completely analogous to the above part by rotating the solid around the $z$-axis by the angle $\pi/4$ (be it in the positive or the negative direction). Applying the rotation matrix
$$\begin{pmatrix} \sqrt2/2 & -\sqrt2/2 & 0 \\ \sqrt2/2 & \sqrt2/2 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
the original inequality $x + y + z \le 0$ is transformed to $\sqrt2\,x' + z' \le 0$ in the new coordinates. Now, it is easy to express the integral that corresponds to the volume of the examined solid:
$$V = \int_0^1 \int_{\pi/2}^{3\pi/2} \int_0^{-\sqrt2\,r\cos\varphi} r\,dz\,d\varphi\,dr = \frac{2\sqrt2}{3}.$$
We need not have computed the result as we did; instead, we could notice that, up to a reflection, the solid differs from the one in part (a) only by a homothety with coefficient $\sqrt2$ in the direction of the $z$-axis. See also note 8.I.11. □
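Both volumes of exercise 8.I.1 can be cross-checked numerically. The sketch below is an illustration only: it sums the height of the solid over the unit disc in polar coordinates, with the factor $r$ playing the role of the Jacobian. The function name `volume_under` and the grid sizes are assumptions of this example.

```python
import math

def volume_under(height, nr=400, nphi=400):
    """Volume of {0 <= z <= height(x, y)} over the unit disc, summed in polar coordinates."""
    dr = 1.0 / nr
    dphi = 2.0 * math.pi / nphi
    total = 0.0
    for i in range(nr):
        r = (i + 0.5) * dr
        for j in range(nphi):
            phi = (j + 0.5) * dphi
            x, y = r * math.cos(phi), r * math.sin(phi)
            # only points with positive height contribute; r * dr * dphi is the area element
            total += max(0.0, height(x, y)) * r * dr * dphi
    return total

vol_a = volume_under(lambda x, y: x)          # part a): 0 <= z <= x
vol_b = volume_under(lambda x, y: -(x + y))   # part b): 0 <= z <= -(x + y)
print(vol_a, 2.0 / 3.0)
print(vol_b, 2.0 * math.sqrt(2.0) / 3.0)
```

The ratio of the two results is close to $\sqrt2$, in agreement with the homothety argument.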
8.I.2. Find the volume of the solid in $\mathbb{R}^3$ which is given by
$x^2 + y^2 + z^2 \le 1$, $3x^2 + 3y^2 \ge z^2$, $x \ge 0$.
Solution. First, we should realize what the examined solid looks like. It is a part of a ball which lies outside a given cone (see the picture).
The best way to determine the volume is probably to subtract half the volume of the sector given by the cone from half the ball's volume (note that the volume of the solid does not change if we replace the condition $x \ge 0$ with $z \ge 0$: the sector is cut either "horizontally" or "vertically", but always
Riemann integral
Definition. The Riemann integral of a real-valued function $f$ defined on a multidimensional interval $I = [a_1,b_1] \times [a_2,b_2] \times \dots \times [a_n,b_n]$ exists if for every choice of a sequence of divisions $\Xi$ (dividing the multidimensional interval in all variables simultaneously) and of the representatives $\xi_{i_1 \dots i_n}$ of the little multidimensional intervals in the partitions, with the norm of the partitions converging to zero, the integral sums
$$S_{\Xi,\xi} = \sum_{i_1,\dots,i_n} f(\xi_{i_1 \dots i_n})\,\Delta x_{i_1 \dots i_n}$$
always converge to the value
$$S = \int_I f(x_1,\dots,x_n)\,dx_1 \dots dx_n,$$
independent of the selected sequence of divisions and representatives.
The function $f$ is then said to be Riemann-integrable over $I$.
As a relatively simple exercise, prove in detail that every Riemann-integrable function over an interval $I$ must be bounded there. The reason is the same as in the case of univariate functions: control the norms of the divisions used in the definition somewhat roughly.
The situation gets worse when integrating in this way over unbounded intervals, see more remarks in 8.2.6 below. Therefore, consider integration of functions over $\mathbb{R}^n$ mainly for functions whose support is compact, that is, functions which vanish outside a bounded interval $I$.
A bounded set $M \subset \mathbb{R}^n$ is said to be Riemann measurable³ if and only if its indicator function, defined by
$$\chi_M(x_1,\dots,x_n) = \begin{cases} 1 & \text{for } (x_1,\dots,x_n) \in M, \\ 0 & \text{for all other points,} \end{cases}$$
is Riemann-integrable over $\mathbb{R}^n$.
For any Riemann-measurable set $M$ and a function $f$ defined at all points of $M$, consider the function $\tilde f = \chi_M f$ as a function defined on the whole $\mathbb{R}^n$. This function $\tilde f$ apparently has a compact support. The Riemann integral of the function $f$ over the set $M$ is defined by
$$\int_M f\,dx_1 \dots dx_n = \int_{\mathbb{R}^n} \tilde f\,dx_1 \dots dx_n,$$
supposing the integral on the right-hand side exists.
8.2.3. Properties of Riemann integral. This definition of the integral does not provide reasonable instructions for computing the values of Riemann integrals. However, it does lead to the following basic properties of the Riemann integral (cf. Theorem 6.2.8):
³ Better to say "measurable via Riemann integration"; the measure itself is commonly called the Peano-Jordan measure in the literature.
to halves). We will calculate in spherical coordinates
$$x = r\cos\varphi\sin\psi, \quad y = r\sin\varphi\sin\psi, \quad z = r\cos\psi,$$
$\varphi \in [0, 2\pi)$, $\psi \in [0, \pi)$, $r \in (0, \infty)$.
The Jacobian of this transformation $\mathbb{R}^3 \to \mathbb{R}^3$ is $r^2\sin\psi$.
First of all, let us determine the volume of the ball. As for the integration bounds, it is convenient to express the conditions that bind the solid in the coordinates we will work in. In the spherical coordinates, the ball is given by the inequality
$$x^2 + y^2 + z^2 = r^2 \le 1.$$
First, let us find the integration bounds for the variable $\varphi$. If we denote by $\pi_\varphi$ the projection onto the $\varphi$-coordinate in the spherical coordinates ($\pi_\varphi(\varphi, \psi, r) = \varphi$), then the image of the projection $\pi_\varphi$ of the solid in question gives the integration bounds for the variable $\varphi$. We know that $\pi_\varphi(\text{ball}) = [0, 2\pi)$ (the inequality $r^2 \le 1$ does not contain the variable $\varphi$, so there are no constraints on it, and it takes on all possible values; this can also easily be imagined in space).
Having the bounds of one of the variables determined, we can proceed with the bounds of the other variables. In general, those may depend on the variables whose bounds have already been determined (although this is not the case here). Thus, we choose arbitrarily a $\varphi_0 \in [0, 2\pi)$, and for this $\varphi_0$ (fixed from now on), we find the intersection of the solid (ball) and the surface $\varphi = \varphi_0$ and its projection $\pi_\psi$ on the variable $\psi$. Similarly as for $\varphi$, the variable $\psi$ is not bounded (either by the inequality $r^2 \le 1$ or the equality $\varphi = \varphi_0$), so it can take on all possible values, $\psi \in [0, \pi)$.
Finally, let us fix a $\varphi = \varphi_0$ and a $\psi = \psi_0$. Now, we are looking for the projection $\pi_r(U)$ of the object (line segment) $U$ given by the constraints $r^2 \le 1$, $\varphi = \varphi_0$, $\psi = \psi_0$ on the variable $r$. The only constraint for $r$ is the condition $r^2 \le 1$, so $r \in (0, 1]$.
Note that the integration bounds of the variables are independent of each other, so we can perform the integration in any order. Thus, we have
$$V_{\text{ball}} = \int_0^1 \int_0^{2\pi} \int_0^{\pi} r^2\sin\psi\,d\psi\,d\varphi\,dr = \frac{4}{3}\pi.$$
Now, let us compute the volume of the spherical sector given by $x^2 + y^2 + z^2 \le 1$, $3x^2 + 3y^2 \le z^2$ and $z \ge 0$. Again, we
Theorem. The set of Riemann-integrable real-valued functions over a Riemann measurable domain $M \subset \mathbb{R}^n$ is a vector space over the real scalars, and the Riemann integral is a linear form there.
If the integration domain $M$ is given as a disjoint union of finitely many Riemann-measurable domains $M_i$, then $f$ is integrable on $M$ if and only if it is integrable on all $M_i$, and the integral of $f$ over $M$ is given by the sum of the integrals over the individual subdomains $M_i$.
Proof. All the properties follow directly from the definition of the Riemann integral and the properties of convergent sequences of real numbers, just as in the case of univariate functions. Think out the details by yourselves. □
For practical use, rewrite the theorem into the usual equalities:
Finite additivity and linearity
Any linear combination of Riemann-integrable functions $f_i : I \to \mathbb{R}$, $i = 1, \dots, k$ (over scalars in $\mathbb{R}$) is again a Riemann-integrable function, and its integral can be computed as follows:
$$\int_I \big(a_1 f_1(x_1,\dots,x_n) + \dots + a_k f_k(x_1,\dots,x_n)\big)\,dx_1 \dots dx_n = a_1 \int_I f_1(x_1,\dots,x_n)\,dx_1 \dots dx_n + \dots + a_k \int_I f_k(x_1,\dots,x_n)\,dx_1 \dots dx_n.$$
Let $M_1$ and $M_2$ be disjoint Riemann-measurable sets, and consider a function $f : M_1 \cup M_2 \to \mathbb{R}$. Then $f$ is Riemann-integrable over both sets $M_i$ if and only if it is integrable over their union, and
$$\int_{M_1 \cup M_2} f(x_1,\dots,x_n)\,dx_1 \dots dx_n = \int_{M_1} f(x_1,\dots,x_n)\,dx_1 \dots dx_n + \int_{M_2} f(x_1,\dots,x_n)\,dx_1 \dots dx_n.$$
8.2.4. Multiple integrals. Riemann-integrable functions especially involve cases when the boundary of the integration domain $M$ can be expressed step by step via continuous dependencies between the coordinates in the following way. The first coordinate $x$ runs within an interval $[a,b]$. The interval range of the next coordinate can be defined by two functions, i.e. $y \in [\varphi(x), \psi(x)]$, then the range of the next coordinate is expressed as $z \in [\eta(x,y), \zeta(x,y)]$, and so on for all of the other coordinates.
For example, this is easy in the case of a ball from the introductory example: for $x \in [-1, 1]$, define the range for $y$ as $y \in [-\sqrt{1 - x^2}, \sqrt{1 - x^2}]$. The volume of the ball can then be computed by integration of the mentioned function
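express the conditions in the spherical coordinates: $r^2 \le 1$, $3\sin^2\psi \le \cos^2\psi$, i.e., $\tan\psi \le \frac{1}{\sqrt3}$. Just like in the case of the ball, we can see that the variables occur independently in the inequalities, so the integration bounds of the variables will be independent of each other as well. The condition $r^2 \le 1$ implies $r \in (0, 1]$; from $\tan\psi \le \frac{1}{\sqrt3}$ we have $\psi \in [0, \frac{\pi}{6}]$. The variable $\varphi$ is not restricted by any condition, so $\varphi \in [0, 2\pi]$.
$$V_{\text{sector}} = \int_0^{2\pi} \int_0^1 \int_0^{\pi/6} r^2\sin\psi\,d\psi\,dr\,d\varphi = \frac{2 - \sqrt3}{3}\pi;$$
altogether,
$$V = \tfrac12 V_{\text{ball}} - V_{\text{sector}} = \frac{2}{3}\pi - \frac{2 - \sqrt3}{3}\pi = \frac{\pi}{\sqrt3}.$$
We could also have computed the volume directly:
$$V = \int_0^{2\pi} \int_0^1 \int_{\pi/6}^{\pi/2} r^2\sin\psi\,d\psi\,dr\,d\varphi = \frac{\pi}{\sqrt3}.$$
In cylindric coordinates
$$x = r\cos\varphi, \quad y = r\sin\varphi, \quad z = z$$
with Jacobian $r$ of this transformation, the calculation of the volume as the difference of the two solids considered above looks as follows:
$$V = \frac{2}{3}\pi - \int_0^{2\pi} \int_0^{1/2} \int_{\sqrt3\,r}^{\sqrt{1 - r^2}} r\,dz\,dr\,d\varphi = \frac{\pi}{\sqrt3}.$$
Note that we cannot compute the volume of the solid directly by a single triple integral in the cylindric coordinates. Thus, we must split it into two solids defined by the conditions $r \le \frac12$ and $r \ge \frac12$, respectively:
$$V = V_1 + V_2 = \int_0^{2\pi} \int_0^{1/2} \int_0^{\sqrt3\,r} r\,dz\,dr\,d\varphi + \int_0^{2\pi} \int_{1/2}^{1} \int_0^{\sqrt{1 - r^2}} r\,dz\,dr\,d\varphi = \frac{\pi}{\sqrt3}.$$
□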
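The value $\pi/\sqrt3$ from exercise 8.I.2 can be cross-checked numerically. The sketch below is only an illustration; it assumes, as noted in the solution, that the condition $x \ge 0$ is replaced by $z \ge 0$, so that in cylindric coordinates the solid is $0 \le z \le \min(\sqrt3\,r, \sqrt{1 - r^2})$, $0 \le r \le 1$. The name `height` and the number of subintervals are choices of this example.

```python
import math

def height(r):
    # upper bound for z: under the cone for small r, under the sphere for large r
    return min(math.sqrt(3.0) * r, math.sqrt(1.0 - r * r))

n = 100000
dr = 1.0 / n
# the integrand does not depend on the angle, so the angular integration gives the factor 2*pi;
# the remaining factor r is the Jacobian of the cylindric coordinates
volume = 2.0 * math.pi * sum(height((k + 0.5) * dr) * (k + 0.5) * dr * dr for k in range(n))
print(volume, math.pi / math.sqrt(3.0))
```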
Another alternative is to compute it as the volume of a solid of revolution, again splitting the solid into the two parts as in the previous case (the part "under the cone" and the part "under the sphere"). However, these solids cannot be obtained by rotating around one of the axes. The volume of the former
$f$, or integrate the indicator function of the ball, i.e. the function which takes the value one on the subset $M \subset \mathbb{R}^3$ which is further defined by $z \in [-\sqrt{1 - x^2 - y^2}, \sqrt{1 - x^2 - y^2}]$.
The following fundamental theorem transforms the computation of a Riemann integral to a sequence of computations of univariate integrals (while the other variables are considered to be parameters, which can appear in the integration bounds as well). Notice that we could have defined the multiple integral directly via the one-dimensional integration, but we would face the trouble of ensuring the independence of the result of our way of describing $M$. The theorem reveals that the two approaches coincide and there are no unclear points left.
Multiple integrals
Theorem. Let $M \subset \mathbb{R}^n$ be a bounded set, expressed with the help of continuous functions $\psi_i$, $\eta_i$ as
$$M = \{(x_1,\dots,x_n);\ x_1 \in [a,b],\ x_2 \in [\psi_2(x_1), \eta_2(x_1)],\ \dots,\ x_n \in [\psi_n(x_1,\dots,x_{n-1}), \eta_n(x_1,\dots,x_{n-1})]\},$$
and let $f$ be a function which is continuous on $M$. Then the Riemann integral of the function $f$ over the set $M$ exists and is given by the formula
$$\int_M f(x_1,x_2,\dots,x_n)\,dx_1 \dots dx_n = \int_a^b \left( \int_{\psi_2(x_1)}^{\eta_2(x_1)} \dots \left( \int_{\psi_n(x_1,\dots,x_{n-1})}^{\eta_n(x_1,\dots,x_{n-1})} f(x_1,x_2,\dots,x_n)\,dx_n \right) \dots dx_2 \right) dx_1,$$
where the individual integrals are the one-variable Riemann integrals.
Proof. Consider first the proof for the case of two variables. It can then be seen that there is no need of further ideas in the general case.
Consider an interval $I = [a,b] \times [c,d]$ containing the set $M = \{(x,y);\ x \in [a,b],\ y \in [\psi(x), \eta(x)]\}$ and divisions $\Xi$ of the interval $I$ with representatives $\xi_{ij}$.
The corresponding integral sum is
$$S_{\Xi,\xi} = \sum_{i,j} f(\xi_{ij})\,\Delta x_{ij} = \sum_i \Big( \sum_j f(\xi_{ij})\,\Delta x_j \Big) \Delta x_i,$$
where $\Delta x_{ij}$ is written for the product of the sizes $\Delta x_i$ and $\Delta x_j$ of the intervals which correspond to the choice of the representative $\xi_{ij}$.
Assume that we work only with choices of representatives $\xi_{ij}$ which all share the same first coordinate $x_i$. If the partition of the interval $[a,b]$ is fixed, and only the partition of $[c,d]$ is refined, the values of the inner sum of the expression
approaches the value of the integral
$$\int_{\psi(x_i)}^{\eta(x_i)} f(x_i, y)\,dy,$$
which exists since the function $f(x_i, y)$ is continuous. In this way, a function is obtained which is continuous in the free parameter $x_i$, see 8.2.1. Therefore, further refinement of the partition of the interval $[a,b]$ leads, in the limit, to the desired formula
$$\int_a^b \left( \int_{\psi(x)}^{\eta(x)} f(x,y)\,dy \right) dx.$$
part can be calculated as the difference between the volumes of the cylinder $x^2 + y^2 \le \frac14$, $0 \le z \le \frac{\sqrt3}{2}$ and the cone's part $3x^2 + 3y^2 \le z^2$, $0 \le z \le \frac{\sqrt3}{2}$. The volume of the latter one is then the difference between the volumes of the solid that is created by rotating the part of the arc $y = \sqrt{1 - x^2}$, $\frac12 \le x \le 1$ around the $z$-axis and the cylinder $x^2 + y^2 \le \frac14$, $0 \le z \le \frac{\sqrt3}{2}$.
$$V = V_1 + V_2 = \left( \frac{\sqrt3\,\pi}{8} - \frac{\sqrt3\,\pi}{24} \right) + \left( \pi \int_0^{\sqrt3/2} (1 - r^2)\,dr - \frac{\sqrt3\,\pi}{8} \right) = \frac{\pi}{4\sqrt3} + \frac{\sqrt3\,\pi}{4} = \frac{\pi}{\sqrt3}.$$
8.I.3. Calculate the volume of the spherical segment of the ball $x^2 + y^2 + z^2 = 2$ cut by the plane $z = 1$.
Solution. We will compute the integral in spherical coordinates. The segment can be perceived as a spherical sector without the cone (with vertex at the point $[0,0,0]$ and the circular base $z = 1$, $x^2 + y^2 \le 1$). In these coordinates, the sector is the product of the intervals $[0, \sqrt2] \times [0, 2\pi) \times [0, \pi/4]$. We thus integrate in the given bounds, in any order:
$$V_{\text{sector}} = \int_0^{2\pi} \int_0^{\sqrt2} \int_0^{\pi/4} r^2\sin\theta\,d\theta\,dr\,d\varphi = \frac{4}{3}(\sqrt2 - 1)\pi.$$
It remains to deal with the case of general choices of representatives of general divisions $\Xi$. Since $f$ is a continuous function on a compact set, it is uniformly continuous there. Therefore, if a small real number $\varepsilon > 0$ is selected beforehand, there is always a bound $\delta > 0$ for the norm of the partitions, so that the values of the function $f$ for the general choices $\xi_{ij}$ differ by no more than $\varepsilon$ from the choices used above. The limit process thus results in the same value for general Riemann sums $S_{\Xi,\xi}$ as seen above.
Now, the general case can be proved easily by induction. In the case of n = 1, the result is trivial. The presented reasoning can easily be transformed for a general induction step, writing (x2,..., xn) instead of y, having x1 instead of x, and perceiving the particular little cubes of the divisions as (n — 1)-dimensional cubes Cartesian-multiplied by the last interval. In the last-but-one step of the proof, the induction hypothesis is used, rather than the simple one-dimensional integration. The final argument about uniform continuity remains the same. It is advised to write this proof in detail as an exercise. □
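The reduction to one-variable integrals translates directly into nested numerical integration. The following Python sketch is an illustration under assumptions of its own: the helper names `midpoint` and `iterated_integral` are hypothetical, the midpoint rule is just one possible one-dimensional rule, and the example computes the volume of the unit ball by integrating the height $2\sqrt{1 - x^2 - y^2}$ over $x \in [-1,1]$, $y \in [-\sqrt{1 - x^2}, \sqrt{1 - x^2}]$.

```python
import math

def midpoint(g, a, b, n=300):
    """Midpoint-rule approximation of the one-variable integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def iterated_integral(f, a, b, lower, upper, n=300):
    """Integrate f(x, y) over x in [a, b] and y in [lower(x), upper(x)]."""
    def inner(x):
        # x is treated as a parameter; it also enters the integration bounds
        return midpoint(lambda y: f(x, y), lower(x), upper(x), n)
    return midpoint(inner, a, b, n)

def height(x, y):
    # vertical extent of the unit ball above the point (x, y)
    return 2.0 * math.sqrt(max(0.0, 1.0 - x * x - y * y))

def half_width(x):
    return math.sqrt(max(0.0, 1.0 - x * x))

ball = iterated_integral(height, -1.0, 1.0, lambda x: -half_width(x), half_width)
print(ball, 4.0 * math.pi / 3.0)
```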
8.2.5. Fubini theorem. The latter theorem has a particularly simple shape in the case of a multidimensional interval $M$. Then all the functions in the bounds for integration are just the constant bounds from the definition of $M$. But this means that the integration process can be carried out coordinate by coordinate in any order. We have exploited this behavior already in Chapter 6, cf. 6.3.13. In this way, the following important corollary is proved:⁴
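In the end, we must subtract the volume of the cone. That is equal to $\frac13\pi R^2 H$ (where $R$ is the radius of the cone's base and $H$ is its height; both are equal to 1 in our case), so the total volume is
$$V_{\text{sector}} - V_{\text{cone}} = \frac{4}{3}(\sqrt2 - 1)\pi - \frac{1}{3}\pi = \frac{\pi}{3}(4\sqrt2 - 5).$$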
⁴ Guido Fubini (1879-1943) was an important Italian mathematician active also in applied areas of mathematics. The simple derivation of the Fubini theorem builds upon the simple properties of Riemann integration and the continuity of the integrated function. Fubini, in fact, proved this result in a much more general context of integration, while the theorem just introduced was used by mathematicians like Cauchy at least a century before Fubini.
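The volume of a general spherical segment with height $h$ in a ball with radius $R$ could be computed similarly:
$$V = V_{\text{sector}} - V_{\text{cone}} = \int_0^{2\pi} \int_0^{\arccos\left(\frac{R-h}{R}\right)} \int_0^R r^2\sin\theta\,dr\,d\theta\,d\varphi - \frac13\pi(2Rh - h^2)(R - h) = \frac13\pi h^2(3R - h).$$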
Fubini theorem
Theorem. Every continuous function $f(x_1,\dots,x_n)$ on a multidimensional interval $M = [a_1,b_1] \times [a_2,b_2] \times \dots \times [a_n,b_n]$ is Riemann integrable on $M$, and its integral
$$\int_M f(x_1,\dots,x_n)\,dx_1 \dots dx_n = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \dots \int_{a_n}^{b_n} f(x_1,\dots,x_n)\,dx_1 \dots dx_n$$
is independent of the order in which the multiple integration is performed.
□
8.I.4. Find the volume of the part of the cylinder $x^2 + z^2 = 16$ which lies inside the cylinder $x^2 + y^2 = 16$.
Solution. We will compute the integral in Cartesian coordinates. Since the solid is symmetric, it suffices to integrate over the first octant (interchanging $x$ and $-x$ does not change the equation of the solid; the same holds for $y$ and for $z$). The part of the solid that lies in the first octant is given by the space under the graph of the function $z(x,y) = \sqrt{16 - x^2}$ and over the quarter-disc $x^2 + y^2 \le 16$, $x \ge 0$, $y \ge 0$. Therefore, the volume of the whole solid is equal to
$$V = 8 \int_0^4 \int_0^{\sqrt{16 - x^2}} \sqrt{16 - x^2}\,dy\,dx = 8 \int_0^4 (16 - x^2)\,dx = \frac{1024}{3}.$$
□
Remark. Note that the projection of the considered solid onto both the plane y = 0 and the plane z = 0 is a circle with radius 4, yet the solid is not a ball.
The possibility of changing the order of integration in multiple integrals is extremely useful. We have already taken advantage of this result, namely when studying the relation of Fourier transforms and convolutions, see paragraph 7.1.9.
8.2.6. Unbounded regions and functions. There is no simple concept of an improper integral for unbounded multivariate functions. The following example of multiple integration of an unbounded function is illustrative in this direction:
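$$\int_0^1 \left( \int_0^1 \frac{x - y}{(x + y)^3}\,dy \right) dx = \frac12, \qquad \int_0^1 \left( \int_0^1 \frac{x - y}{(x + y)^3}\,dx \right) dy = -\frac12.$$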
The reason can be understood by looking at the properties of non-absolutely converging series. There, rearranging the summands can lead to an arbitrary result.
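The two iterated integrals of $(x - y)/(x + y)^3$ over the unit square can be examined with a few lines of Python. This is a sketch under stated assumptions: the closed form $1/(1 + x)^2$ of the inner integral over $y$ is obtained by the substitution $u = x + y$ and is only spot-checked numerically at one sample point, and the helper name `midpoint` is an assumption of this example.

```python
def midpoint(g, a, b, n=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def integrand(x, y):
    return (x - y) / (x + y) ** 3

def inner_dy(x):
    # closed form of the integral of integrand(x, y) over y in [0, 1], for fixed x > 0
    return 1.0 / (1.0 + x) ** 2

# spot check of the closed form at one sample value of x
x0 = 0.5
numeric_inner = midpoint(lambda y: integrand(x0, y), 0.0, 1.0)

# first integrate over y, then over x
dy_then_dx = midpoint(inner_dy, 0.0, 1.0)
# the integrand changes sign when x and y are swapped, so the other order gives the opposite value
dx_then_dy = -dy_then_dx
print(numeric_inner, inner_dy(x0))
print(dy_then_dx, dx_then_dy)
```

The inner integrals are proper for every fixed positive value of the outer variable; the disagreement of the two orders comes entirely from the singularity at the origin.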
The situation is better if the Riemann integral of a bounded non-negative function $f(x) \ge 0$ with non-compact support over the whole $\mathbb{R}^n$ is calculated. Of course, some extra information is needed on the decay of the function $f$ for large arguments. For example, if $f$ is Riemann integrable over each $n$-dimensional interval $I$ and there is a universal bound
$$\int_I |f(x)|\,dx \le C$$
with a constant $C$ independent of the choice of the $n$-dimensional interval $I \subset \mathbb{R}^n$, then we may define
$$\int_{\mathbb{R}^n} f(x)\,dx = \lim_{r \to \infty} \int_{I_r} f(x)\,dx,$$
where $I_r = \{(x_1,\dots,x_n);\ |x_j| \le r,\ j = 1,\dots,n\}$. The resulting limit, if it exists, is bounded by the same constant $C$. In this case, the Fubini theorem is true in the form
$$\int_{\mathbb{R}^n} f(x)\,dx = \int_{-\infty}^{\infty} \left( \dots \left( \int_{-\infty}^{\infty} f(x)\,dx_1 \right) \dots \right) dx_n.$$
8.2.7. Further remarks on integration. The Riemann integral of multivariate functions behaves even worse than in the case of functions of one variable in the sixth chapter. Therefore, more sophisticated approaches to integration have been developed. They are mainly based on the concept of the measure of a set. We consider this problem briefly now.
8.I.5. Find the volume of the part of the cylinder $x^2 + y^2 = 4$ bounded by the planes $z = 0$ and $z = x + y + 2$.
Solution. We will work in cylindric coordinates given by the equations $x = r\cos\varphi$, $y = r\sin\varphi$, $z = z$. The Jacobian of this transformation is $J = r$. The solid can be divided into two parts: above and below the plane $z = 0$, whose volumes will be denoted by $V_1$ and $V_2$, respectively. Further, we can notice that one part of the solid with volume $V_1$ is a pyramid with vertices $[0,0,0]$, $[0,0,2]$, $[-2,0,0]$, $[0,-2,0]$. Thus, we will further split this solid (above $z = 0$) into two parts, whose volumes we will calculate separately.
on Lebesgue integration
As we shall see in 8.2.10, the Riemann integration of the indicator functions $\chi_M$ of sets $M \subset \mathbb{R}^n$ leads to a finitely additive measure. In probability theory in chapter 10, even elementary problems require a concept of measure which is additive over countable systems of disjoint sets. Having such a measure, measurable functions $f$ can be defined by the condition that their preimages of bounded intervals, $f^{-1}([a,b])$, are measurable sets, and the integral is built by approximation via such "horizontal strips", see the illustration. This is the starting point of Lebesgue integration.
We omit further details here, but note the Riesz representation theorem⁵ saying that for each linear functional $I$ (i.e. a linear mapping valued in $\mathbb{R}$) on continuous functions with compact support on a metric space $X$, there is a unique measure (with certain regularity properties) such that the integral associated to this measure extends $I$. In the case of the Riemann integral $I$ on functions on $\mathbb{R}^n$, this provides the Lebesgue measure and the Lebesgue integral.
Another point of view is the completion procedure for metric spaces. Consider the vector space $X = S_c(\mathbb{R}^n)$ of all continuous functions with compact support. It can be equipped with the $L_p$ norms, similar to the univariate case from the seventh chapter, i.e.
$$\|f\|_p = \left( \int_{\mathbb{R}^n} |f(x_1,\dots,x_n)|^p\,dx_1 \dots dx_n \right)^{1/p}$$
for any $1 \le p < \infty$. Since the Riemann integral is defined again in terms of partitions and the representative values, the properties of the norm can be verified in the same way as for univariate functions, using Hölder's and Minkowski's inequalities.
There are the metrics given by the norms $\|\ \|_p$ on $X$. The general theory provides the completion $\tilde X$ of $X$, unique up to isometry, and it can be shown that it is again a space of functions. The Lebesgue integral mentioned above defines exactly these norms. Hence the spaces of functions with Lebesgue integrable powers $|f|^p$ are obtained.
⁵ Frigyes Riesz (1880-1956) was a famous Hungarian mathematician active in particular in functional analysis. He introduced this theorem in the special case of $X$ being an interval in $\mathbb{R}^n$ in 1909.
$$V_1 = V_{\text{pyramid}} + \int_{-\pi/2}^{\pi} \int_0^2 [r\sin\varphi + r\cos\varphi + 2]\,r\,dr\,d\varphi = \frac{4}{3} + \frac{16}{3} + 6\pi = \frac{20}{3} + 6\pi.$$
Further,
$$V_1 - V_2 = \int_0^{2\pi} \int_0^2 \big( r^2(\sin\varphi + \cos\varphi) + 2r \big)\,dr\,d\varphi = 8\pi,$$
so $V_1 + V_2 = 4\pi + \frac{40}{3}$.
□
Remark. During the calculation, we made use of the fact that integrating a function of two variables over an area in $\mathbb{R}^2$ yields the difference of the volume of the solid in $\mathbb{R}^3$ determined by the graph of the integrated function and lying above $z = 0$ and the one lying below $z = 0$.
8.I.6. Find the volume of the solid in $\mathbb{R}^3$ which is given by the intersection of the sphere $x^2 + y^2 + z^2 = 4$ and the cylinder $x^2 + y^2 = 1$.
Solution. Thanks to symmetry, it suffices to compute the volume of the part that lies in the first octant. We will integrate in cylindric coordinates given by the equations $x = r\cos\varphi$, $y = r\sin\varphi$, $z = z$ with Jacobian $J = r$, and the part in question is the space between the plane $z = 0$ and the graph of the function $z = \sqrt{4 - x^2 - y^2} = \sqrt{4 - r^2}$. Therefore, we can directly write it as the double integral
$$V = 8 \int_0^{\pi/2} \int_0^1 r\sqrt{4 - r^2}\,dr\,d\varphi = \frac{4}{3}(8 - 3\sqrt3)\pi.$$
8.2.8. Change of coordinates. When calculating integrals of univariate functions, the "substitution method" is used as one of the powerful tools, cf. 6.2.5. The method works similarly in the case of functions of more variables, once its geometric meaning is understood.
Recall and reinterpret the univariate case. There, the integrated expression $f(x)\,dx$ infinitesimally describes the two-dimensional area of the rectangle whose sides are the (linearized) increment $\Delta x$ of the variable $x$, i.e. the one-dimensional rectangle, and the value $f(x)$. If the variable $x$ is transformed by the relation $x = u(t)$, then the linearized increment can be expressed with the help of the differential as
$$dx = \frac{du}{dt}\,dt,$$
and so the corresponding contribution for the integral is given by
$$f(u(t))\,\frac{du}{dt}\,dt.$$
Here one either supposes that the sign of the derivative $u'(t)$ is positive, or one interchanges the bounds of the integral, so that the sign does not affect the result.
Intuitively, the procedure for $n$ variables should be similar. It is only necessary to recall the formula for (change of) the volume of parallelepipeds. The Riemann integrals are approximated by Riemann sums, which are based on the $n$-dimensional volume (area) of small multidimensional intervals $\Delta x_{i_1 \dots i_n}$ in the variables, multiplied by the values of the function at the representative points $\xi_{i_1 \dots i_n}$. If the coordinates are transformed by means of a mapping $x = G(y)$, not only are the function values $f(G(\xi_{i_1 \dots i_n}))$ obtained at the representative points $\xi_{i_1 \dots i_n}$ in the new coordinate expression, but the change of the volume of the corresponding small multidimensional intervals also needs care.
Once again, this is the case of a linear approximation of a change, which is well known: the best linear approximation of $G(y)$ is its derivative $D^1G(y)$, which is given by the Jacobi matrix of $G$, see 8.1.18. The change of the volume is then given (in absolute value) by the determinant of this matrix (see a discussion of this topic in chapter 4 devoted to analytic geometry and linear algebra, especially 4.1.22).
Summarizing, the formulation of the next theorem should not be surprising and its proof consists in formalization of the latter ideas. However, this needs some effort and so the proof is split into several steps.
Transformation of coordinates
Theorem. Let $G(t) : \mathbb{R}^n \to \mathbb{R}^n$ be a continuously differentiable and invertible mapping, and write
$$t = (t_1,\dots,t_n), \quad x = (x_1,\dots,x_n) = G(t_1,\dots,t_n).$$
Further, let $M = G(N)$ be a Riemann measurable set, and $f : M \to \mathbb{R}$ a continuous function. Then, $N$ is also Riemann measurable and
$$\int_M f(x)\,dx_1 \dots dx_n = \int_N f(G(t))\,|\det(D^1G(t))|\,dt_1 \dots dt_n.$$
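The transformation formula can be tried out numerically on polar coordinates. In the following Python sketch, everything beyond the formula itself is an assumption of the example: the sample function $f(x,y) = x^2 + y^2$, the domain $N = [0,1] \times [0,2\pi]$ mapped onto the unit disc $M$, the finite-difference approximation of the Jacobi matrix, and the helper names. The exact value of both sides is $\pi/2$. Polar coordinates fail to be invertible on the boundary of $N$, which is a set of measure zero and does not influence the sums.

```python
import math

def G(r, phi):
    """Polar coordinates, the transformation x = G(t) with t = (r, phi)."""
    return r * math.cos(phi), r * math.sin(phi)

def abs_det_jacobian(G, s, t, eps=1e-6):
    """|det D^1 G(s, t)| with the partial derivatives approximated by central differences."""
    x_s = (G(s + eps, t)[0] - G(s - eps, t)[0]) / (2 * eps)
    y_s = (G(s + eps, t)[1] - G(s - eps, t)[1]) / (2 * eps)
    x_t = (G(s, t + eps)[0] - G(s, t - eps)[0]) / (2 * eps)
    y_t = (G(s, t + eps)[1] - G(s, t - eps)[1]) / (2 * eps)
    return abs(x_s * y_t - x_t * y_s)

def f(x, y):
    return x * x + y * y

def integral_over_N(n=200):
    """Right-hand side: f(G(t)) |det D^1 G(t)| summed over N = [0, 1] x [0, 2*pi]."""
    dr, dphi = 1.0 / n, 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * dr
        for j in range(n):
            phi = (j + 0.5) * dphi
            x, y = G(r, phi)
            total += f(x, y) * abs_det_jacobian(G, r, phi) * dr * dphi
    return total

def integral_over_M(n=400):
    """Left-hand side: f times the indicator function of the unit disc, summed over [-1, 1]^2."""
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        x = -1.0 + (i + 0.5) * h
        for j in range(n):
            y = -1.0 + (j + 0.5) * h
            if x * x + y * y <= 1.0:
                total += f(x, y) * h * h
    return total

rhs = integral_over_N()
lhs = integral_over_M()
print(lhs, rhs, math.pi / 2.0)
```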
□
8.I.7. Find the volume of the solid in $\mathbb{R}^3$ which is given by the intersection of the sphere $x^2 + y^2 + z^2 = 2$ and the paraboloid $z = x^2 + y^2$.
Solution. Once again, we will work in cylindric coordinates:
$$V = \int_0^{2\pi} \int_0^1 \int_{r^2}^{\sqrt{2 - r^2}} r\,dz\,dr\,d\varphi = \frac{8\sqrt2 - 7}{6}\pi.$$
□
8.2.9. The invariance of the integral. The first thing to be verified is the coincidence of two definitions of volume of parallelepipeds (taken for granted in the above intuitive explanation of the latter theorem). Volumes and similar concepts were dealt with in chapter 4, and a crucial property was the invariance of the concepts with respect to the choice of Euclidean frames of $\mathbb{R}^n$, cf. 4.1.22 on page 248, which followed directly from the expression of the volumes in terms of determinants. It is needed
to show that the same result holds in terms of the Riemann integration as defined above. It turns out that it is easier to deal with invariance with respect to general invertible linear mappings $\Phi : \mathbb{R}^n \to \mathbb{R}^n$.
Proposition. Let $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ be an invertible linear mapping and $I \subset \mathbb{R}^n$ a multidimensional interval. Consider a function $f$ such that $f \circ \Phi$ is integrable on $I$. Then $M = \Phi(I)$ is Riemann measurable, $f$ is Riemann integrable on $M$ and
$$\int_M f(x_1,\dots,x_n)\,dx_1 \dots dx_n = |\det\Phi| \int_I (f \circ \Phi)(y_1,\dots,y_n)\,dy_1 \dots dy_n.$$
Proof. Each invertible linear mapping is a composition of the elementary transformations of three types (see the discussion in chapter 2, in particular paragraphs 2.1.7 and 2.1.9).
The first one is a multiplication of one of the coordinates by a constant: $\Phi(y_1,\dots,y_n) = (y_1,\dots,\alpha y_i,\dots,y_n)$. In this case $|\det\Phi| = |\alpha|$. The second one consists of an exchange of two coordinates, i.e. for given $1 \le i < j \le n$, $\Phi(y_1,\dots,y_n) = (y_1,\dots,y_j,\dots,y_i,\dots,y_n)$. The determinant of $\Phi$ is $-1$ in this case. The third type of transformations is of the form $\Phi(y_1,\dots,y_n) = (y_1,\dots,y_i + y_j,\dots,y_j,\dots,y_n)$, with determinant one. Without loss of generality, $i = 1$ can be chosen in the first case and $i = 1$, $j = 2$ in the second and third cases. Since the determinant of the composition of the mappings (i.e. the determinant of the product of the matrices) is the product of the individual determinants, it is enough to prove the proposition for all three special types of $\Phi$.
Express the right hand integrals for these three types of $\Phi$ by means of the multiple integrals and the Fubini theorem. Write $I = [a_1,b_1] \times \dots \times [a_n,b_n]$ and $x = \Phi(y)$ for the transformation. In the first case (notice we can deal with the first variable and $\alpha > 0$ without loss of generality),
$$|\det\Phi| \int_I f(\alpha y_1, y_2,\dots,y_n)\,dy_1 \dots dy_n = \alpha \int_{a_n}^{b_n} \dots \left( \int_{a_1}^{b_1} f(\alpha y_1, y_2,\dots,y_n)\,dy_1 \right) \dots dy_n = \int_{a_n}^{b_n} \dots \left( \int_{\alpha a_1}^{\alpha b_1} f(x_1,x_2,\dots,x_n)\,dx_1 \right) \dots dx_n = \int_M f(x_1,x_2,\dots,x_n)\,dx_1 \dots dx_n.$$
The second case is even easier, since the order of integration does not matter due to the Fubini theorem. The third case is similar to the first one:
$$|\det\Phi| \int_I f(y_1 + y_2, y_2,\dots,y_n)\,dy_1 \dots dy_n = \int_{a_n}^{b_n} \dots \left( \int_{a_1}^{b_1} f(y_1 + y_2, y_2,\dots,y_n)\,dy_1 \right) \dots dy_n = \int_{a_n}^{b_n} \dots \left( \int_{a_1 + x_2}^{b_1 + x_2} f(x_1,x_2,\dots,x_n)\,dx_1 \right) \dots dx_n = \int_M f(x_1,x_2,\dots,x_n)\,dx_1 \dots dx_n.$$
8.I.8. Find the volume of the solid in $\mathbb{R}^3$ which is bounded by the elliptic cylinder $4x^2 + y^2 = 1$ and the planes $z = 2y$ and $z = 0$, lying above the plane $z = 0$.
Solution. Thanks to symmetry, it is advantageous to work in the coordinates $x = \frac12 r\cos\varphi$, $y = r\sin\varphi$, $z = z$ with Jacobian $J = \frac12 r$. The equation of the elliptic cylinder in these coordinates is $r^2 = 1$. Thus, the wanted volume is
$$V = \int_0^{\pi} \int_0^1 2r\sin\varphi \cdot \frac12 r\,dr\,d\varphi = \int_0^{\pi} \int_0^1 r^2\sin\varphi\,dr\,d\varphi = \int_0^{\pi} \frac13\sin\varphi\,d\varphi = \frac23.$$
□
8.I.9. Find the volume of the solid in $\mathbb{R}^3$ which is bounded by the paraboloid $2x^2 + y^2 = z$ and the plane $z = 2$.
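Solution. Similarly to the above problem, we choose "special" coordinates which respect the symmetry of the solid: $x = \frac{1}{\sqrt2} r\cos\varphi$, $y = r\sin\varphi$, $z = z$ with Jacobian $J = \frac{1}{\sqrt2} r$. The equation of the paraboloid in these coordinates is $z = r^2$, so the volume of the solid is equal to
$$V = 4 \int_0^{\pi/2} \int_0^{\sqrt2} \int_{r^2}^{2} \frac{1}{\sqrt2}\,r\,dz\,dr\,d\varphi = 2\sqrt2 \int_0^{\pi/2} \int_0^{\sqrt2} (2r - r^3)\,dr\,d\varphi = 2\sqrt2 \int_0^{\pi/2} d\varphi = \sqrt2\,\pi.$$
□
8.I.10. Calculate the volume of the ellipsoid $x^2 + 2y^2 + 3z^2 = 1$.
Solution. We will consider the coordinates
$$x = r\cos\varphi\sin\theta, \quad y = \frac{1}{\sqrt2}\,r\sin\varphi\sin\theta, \quad z = \frac{1}{\sqrt3}\,r\cos\theta.$$
The corresponding Jacobian is $\frac{1}{\sqrt6}\,r^2\sin\theta$, so the volume is
$$V = \int_0^{2\pi} \int_0^{\pi} \int_0^1 \frac{1}{\sqrt6}\,r^2\sin\theta\,dr\,d\theta\,d\varphi = \frac{4\pi}{3\sqrt6}.$$
□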
8.I.11. Remark. Note that if the transformation of the coordinates is linear (or affine), then the space is deformed "uniformly". This means that the volume of an arbitrary solid is
The reader should check the details that the last multiple integral describes the image $\Phi(I)$. □
As a direct corollary of the proposition, the Riemann integral is invariant with respect to the Euclidean affine mappings. That is, the integral cannot depend on the choice of the orthogonal frame in the Euclidean $\mathbb{R}^n$.
8.2.10. Riemann measurable sets. It is necessary to understand how to recognize Riemann measurable domains M.
When denning the Riemann integral, a strict analogy of the lower and upper Riemann integrals for univariate functions can be considered. This means taking infima or suprema of the integrated function over the corresponding multidimensional intervals instead of the function values at the representatives in the Riemann sums. For bounded functions, there are well-defined values of the upper and lower integrals found in this way. If this is done for the indicator function xm of a fixed set M, the inner and outer Riemann measure of the set M is obtained. Evidently, the inner measure is the supremum of the areas given by the the (finite) sums of the volumes of all multi-dimensional intervals from the partitions which are inside M, and on the other hand, the outer measure is the inn-mum of the (finite) sums of the volumes of intervals covering M. It follows directly from the definition that a set M is Riemann measurable if and only if its inner and outer measures are equal.
The sets whose outer measure is zero are, of course, Riemann measurable. They are called measure zero sets or null sets. The finite additivity of the Riemann integral makes the measure finitely additive. Hence, a disjoint union of finitely many measurable sets is again a measurable set, and its measure is given by the sum of the measures of the individual sets in the union.
Consider the measurability of any given set M C I C R™ inside a sufficiently large multidimensional interval I. Consider the boundary DM, i.e. the set of all boundary points of M. For any partition E of I from the definition of the Riemann integral of xm, each of the intervals with non-trivial intersection with dM contributes to the upper integral but might not contribute to the lower integral. On the contrary, for every point in the interior MQ c M its interval contributes to both the same way as soon as the norm of the partition is small enough. This observation leads to the first part of the following claim:
Proposition. a bounded set M C Rn is Riemann measurable if and only if its boundary is of Riemann measure zero.
If M is a Riemann measurable set and G : M C Rn —> Rn is a continuously differentiable and invertible mapping, then G(M) is again Riemann measurable.
changed proportionally to the change of the volume of an infinitesimal volume element, which is the Jacobian. Therefore, if we consider the volume of the ball with a given radius $r$ to be known (in this case, $r = 1$), we can infer directly that the volume of the ellipsoid is $V = \frac{1}{\sqrt6}\cdot\frac{4}{3}\pi = \frac{4\pi}{3\sqrt6}$.
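The remark translates into a one-line computation. The sketch below assumes an ellipsoid of the form $ax^2 + by^2 + cz^2 = 1$ with positive coefficients; the function name `ellipsoid_volume` is a hypothetical helper introduced only for illustration.

```python
import math

def ellipsoid_volume(a, b, c):
    """Volume of the ellipsoid a*x^2 + b*y^2 + c*z^2 = 1 for positive a, b, c."""
    # The linear map x = u/sqrt(a), y = v/sqrt(b), z = w/sqrt(c) sends the unit ball
    # u^2 + v^2 + w^2 <= 1 onto the ellipsoid; its Jacobian is the constant below.
    jacobian = 1.0 / math.sqrt(a * b * c)
    unit_ball_volume = 4.0 * math.pi / 3.0
    return jacobian * unit_ball_volume

# The ellipsoid x^2 + 2y^2 + 3z^2 = 1 from exercise 8.1.10.
print(ellipsoid_volume(1.0, 2.0, 3.0), 4.0 * math.pi / (3.0 * math.sqrt(6.0)))
```

Because the Jacobian of a linear map is constant, no integration is needed once the volume of the unit ball is known.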
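8.1.12. Find the volume of the solid which is bounded by the paraboloid $2x^2 + 5y^2 = z$ and the plane $z = 1$.
Solution. We choose the coordinates
\[
x = \tfrac{1}{\sqrt2}\, r\cos(\varphi), \qquad
y = \tfrac{1}{\sqrt5}\, r\sin(\varphi), \qquad
z = z.
\]
The determinant of the Jacobian is $\frac{r}{\sqrt{10}}$, so the volume is
\[
V = \int_0^{2\pi}\int_0^{1}\int_{r^2}^{1} \frac{r}{\sqrt{10}}\,dz\,dr\,d\varphi = \frac{\pi}{2\sqrt{10}}.
\]
□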
8.1.13. Find the volume of the solid which lies in the first octant and is bounded by the surfaces $y^2 + z^2 = 9$ and $y^2 = 3x$.
Solution. In cylindric coordinates $y = r\cos(\varphi)$, $z = r\sin(\varphi)$, $x = x$,
\[
V = \int_0^{\pi/2}\int_0^{3}\int_0^{\frac{r^2}{3}\cos^2(\varphi)} r\,dx\,dr\,d\varphi = \frac{27}{16}\pi.
\]
□
8.1.14. Find the volume of the solid in $\mathbb{R}^3$ which is bounded by the cone part $2x^2 + y^2 = (z-2)^2$, $z > 2$ and the paraboloid $2x^2 + y^2 = 8 - z$.
Proof. The first claim is already verified. Since both $G$ and $G^{-1}$ are continuous, $G$ maps interior points of $M$ to interior points of $G(M)$. To finish the proof, it must be verified that $G$ maps the boundary $\partial M$, which is a set of measure zero, again to a set of measure zero.
Since every Riemann measurable set $M$ is bounded, its closure $\overline{M}$ must be compact. It follows that $G$ and all partial derivatives of its components are uniformly continuous on $\overline{M}$, and in particular on the boundary $\partial M$.
Next, consider a partition $\Xi$ of an interval $I$ containing $\partial M$ and a fixed tiny interval $J$ in the partition containing a point $t \in \partial M$. Write $R = G(t) + D^1G(t)(J - t)$. That is, $J$ is first shifted to the origin by translation, then the derivative of $G$ is applied, obtaining a parallelepiped, and this is shifted back to be around $G(t)$. By the uniform continuity of $G$ and $D^1G$, for each $\varepsilon > 0$ there is a bound $\delta$ for the norm of a partition for which
\[
G(J) \subset G(t) + (1 + \varepsilon)D^1G(t)(J - t)
\]
can be guaranteed. The entire image of $J$ lies inside a slightly enlarged linear image of $J$ by the derivative. Now, the outer measure $\alpha$ of the image $G(J)$ satisfies:
\[
\alpha \le (1 + \varepsilon)^n \operatorname{vol}_n R = (1 + \varepsilon)^n\, |\det D^1G(t)|\, \operatorname{vol}_n J.
\]
If $\mu$ is the upper Riemann sum for the measure of $\partial M$ corresponding to the chosen partition, the outer measure of $G(\partial M)$ must be bounded by $(1 + \varepsilon)^n \max_{t\in\partial M} |\det D^1G(t)|\,\mu$. Finally, for the same $\varepsilon$, the norm of the partition can be bounded so that $\mu < \varepsilon$, too. But then the outer measure is bounded by a constant multiple of $(1 + \varepsilon)^n\varepsilon$, with the universal constant $\max_{t\in\partial M} |\det D^1G(t)|$. So the outer measure is zero, as required. □
A slight extension of the argument in the proof above shows that the Riemann integrable functions are exactly those bounded functions with compact support whose set of discontinuity points has (Riemann) measure zero.
8.2.11. Proof of Theorem 8.2.8. A continuous function $f$ and a differentiable change of coordinates $G$ are under consideration. So the inverse $G^{-1}$ is continuously differentiable, and the image $G^{-1}(M) = N$ is Riemann measurable. Hence the integrals on both sides of the equality exist and it remains to prove that their values are equal.
Denote the composite continuous function by
\[
g(t_1, \dots, t_n) = f(G(t_1, \dots, t_n)),
\]
and choose a sufficiently large $n$-dimensional interval $I$ containing $N$ and its partition $\Xi$. The entire proof is nothing more than a more precise writing of the discussion presented before the formulation of the theorem.
Repeat the estimates on the volumes of images from the previous paragraph on Riemann measurability. It is already known that the images $G(I_{i_1\dots i_n})$ of the intervals from the
Solution. First of all, we find the intersection of the given surfaces:
\[
(z - 2)^2 = -z + 8, \qquad z > 2;
\]
therefore, $z = 4$, and the equation of the intersection is $2x^2 + y^2 = 4$. The substitution $x = \frac{1}{\sqrt2} r\cos(\varphi)$, $y = r\sin(\varphi)$, $z = z$ transforms the given surfaces to the form $r^2 = (z - 2)^2$, $z > 2$, and $r^2 = 8 - z$, i.e., $z = r + 2$ for the former surface and $z = 8 - r^2$ for the latter. Altogether, the projection of the given solid onto the coordinate $\varphi$ is equal to the interval $[0, 2\pi]$. Having fixed a $\varphi_0 \in [0, 2\pi]$, the projection of the intersection of the solid and the plane $\varphi = \varphi_0$ onto the coordinate $r$ equals (independently of $\varphi_0$) the interval $[0, 2]$. Having fixed both $r_0$ and $\varphi_0$, the projection of the intersection of the solid and the line $r = r_0$, $\varphi = \varphi_0$ onto the coordinate $z$ is equal to the interval $[r_0 + 2, 8 - r_0^2]$. The Jacobian of the considered transformation is $J = \frac{1}{\sqrt2} r$, so we can write
\[
V = \int_0^{2\pi}\int_0^{2}\int_{r+2}^{8-r^2} \frac{r}{\sqrt2}\,dz\,dr\,d\varphi = \frac{16\sqrt2}{3}\pi.
\]
□
8.1.15. Find the volume of the solid which lies inside the cylinder $y^2 + z^2 = 4$ and the half-space $x > 0$ and is bounded by the surface $y^2 + z^2 + 2x = 16$.
Solution. In cylindric coordinates,
\[
V = \int_0^{2\pi}\int_0^{2}\int_0^{8 - \frac{r^2}{2}} r\,dx\,dr\,d\varphi = 28\pi.
\]
□
8.1.16. The centroid of a solid. The coordinates $(x_T, y_T, z_T)$ of the centroid of a (homogeneous) solid $T$ with volume $V$ in $\mathbb{R}^3$ are given by the following integrals:
\[
x_T = \frac{1}{V}\iiint_T x\,dx\,dy\,dz, \qquad
y_T = \frac{1}{V}\iiint_T y\,dx\,dy\,dz, \qquad
z_T = \frac{1}{V}\iiint_T z\,dx\,dy\,dz.
\]
The centroid of a figure in $\mathbb{R}^2$ or other dimensions can be computed analogously.
8.1.17. Find the centroid of the part of the ellipse $3x^2 + 2y^2 = 1$ which lies in the first quadrant of the plane $\mathbb{R}^2$.
partition are again Riemann measurable sets. For each small part $I_{i_1\dots i_n}$ of the partition $\Xi$, the integral of $f$ over $J_{i_1\dots i_n} = G(I_{i_1\dots i_n})$ certainly exists, too.
Further, if the center $t_{i_1\dots i_n}$ of the interval $I_{i_1\dots i_n}$ is fixed, then the linear image of this interval
\[
R_{i_1\dots i_n} = G(t_{i_1\dots i_n}) + D^1G(t_{i_1\dots i_n})(I_{i_1\dots i_n} - t_{i_1\dots i_n})
\]
is obtained. This is an $n$-dimensional parallelepiped (note that the interval is shifted to the origin, transformed by the linear mapping given by the Jacobi matrix, and the result is then added to the image of the center).
If the partition is very fine, this parallelepiped differs only a little from the image $J_{i_1\dots i_n}$. By the uniform continuity of the mapping $G$, there is, for an arbitrarily small $\varepsilon > 0$, a norm of the partition such that for all finer partitions
\[
G(t_{i_1\dots i_n}) + (1 + \varepsilon)D^1G(t_{i_1\dots i_n})(I_{i_1\dots i_n} - t_{i_1\dots i_n}) \supset J_{i_1\dots i_n}.
\]
However, then the $n$-dimensional volumes also satisfy
\[
\operatorname{vol}_n(J_{i_1\dots i_n}) \le (1 + \varepsilon)^n \operatorname{vol}_n(R_{i_1\dots i_n})
= (1 + \varepsilon)^n\, |\det D^1G(t_{i_1\dots i_n})|\, \operatorname{vol}_n(I_{i_1\dots i_n}).
\]
Now, it is possible to estimate the entire integral:
\[
\int_M f(x_1, \dots, x_n)\,dx_1 \dots dx_n
= \sum_{i_1\dots i_n} \int_{J_{i_1\dots i_n}} f(x_1, \dots, x_n)\,dx_1 \dots dx_n
\le \sum_{i_1\dots i_n} \Bigl(\sup_{t \in I_{i_1\dots i_n}} g(t)\Bigr) \operatorname{vol}_n(J_{i_1\dots i_n})
\]
\[
\le (1 + \varepsilon)^n \sum_{i_1\dots i_n} \Bigl(\sup_{t \in I_{i_1\dots i_n}} g(t)\Bigr)\, |\det D^1G(t_{i_1\dots i_n})|\, \operatorname{vol}_n(I_{i_1\dots i_n}).
\]
If $\varepsilon$ approaches zero, then the norms of the partitions approach zero too, the left-hand value of the integral remains the same, while on the right-hand side, the Riemann integral of $g(t)\,|\det D^1G(t)|$ is obtained. Instead of the desired equality, the inequality
\[
\int_M f(x)\,dx_1 \dots dx_n \le \int_N f(G(t))\,|\det(D^1G(t))|\,dt_1 \dots dt_n
\]
is obtained.
The same reasoning can be repeated after interchanging $G$ and $G^{-1}$, the integration domains $M$ and $N$, and the functions $f$ and $g$. The reverse inequality is immediately obtained:
\[
\int_N g(t)\,|\det(D^1G(t))|\,dt_1 \dots dt_n
\le \int_M f(x)\,|\det(D^1G(G^{-1}(x)))|\,|\det(D^1G^{-1}(x))|\,dx_1 \dots dx_n
= \int_M f(x)\,dx_1 \dots dx_n.
\]
The proof is complete.
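Solution. First, let us calculate the area $V$ of the given part of the ellipse. The transformation $x = \frac{1}{\sqrt3}x'$, $y = \frac{1}{\sqrt2}y'$ with Jacobian $\frac{1}{\sqrt6}$ leads to
\[
V = \int_0^{1/\sqrt3}\int_0^{\sqrt{(1-3x^2)/2}} dy\,dx
= \frac{1}{\sqrt6}\int_0^{1}\int_0^{\sqrt{1-x'^2}} dy'\,dx'
= \frac{\pi}{4\sqrt6}.
\]
The other integrals we need can be computed directly in Cartesian coordinates $x$ and $y$:
\[
T_x = \int_0^{1/\sqrt3}\int_0^{\sqrt{(1-3x^2)/2}} x\,dy\,dx = \int_0^{1/\sqrt3} x\sqrt{\frac{1-3x^2}{2}}\,dx = \frac{\sqrt2}{18},
\]
\[
T_y = \int_0^{1/\sqrt3}\int_0^{\sqrt{(1-3x^2)/2}} y\,dy\,dx = \frac12\int_0^{1/\sqrt3} \frac{1-3x^2}{2}\,dx = \frac{\sqrt3}{18}.
\]
Therefore, the coordinates of the centroid are $\bigl[\frac{4\sqrt3}{9\pi}, \frac{2\sqrt2}{3\pi}\bigr]$. □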
8.1.18. Find the volume and the centroid of a homogeneous cone of height $h$ and circular base with radius $r$.
Solution. Positioning the cone so that the vertex is at the origin and points downwards, we have in cylindric coordinates that
\[
V = 4\int_0^{\pi/2}\int_0^{r}\int_{\frac{h}{r}\rho}^{h} \rho\,dz\,d\rho\,d\varphi = \frac13\pi h r^2.
\]
Apparently, the centroid lies on the $z$-axis. For the $z$-coordinate, we get
\[
z_T = \frac1V\iiint_{\text{cone}} z\,dV = \frac4V\int_0^{\pi/2}\int_0^{r}\int_{\frac{h}{r}\rho}^{h} z\rho\,dz\,d\rho\,d\varphi = \frac34 h.
\]
Thus, the centroid lies $\frac14 h$ over the center of the cone's base. □
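The result can be cross-checked by slicing the cone horizontally instead of using cylindric coordinates. This is a sketch under that assumption: the name `cone_volume_and_centroid` and the sample values $h = 3$, $r = 2$ are illustrative and do not come from the exercise.

```python
import math

def cone_volume_and_centroid(h, r, n=10000):
    """Volume and z-coordinate of the centroid of a cone with its vertex at the origin."""
    # At height z the horizontal slice is a disc of radius r*z/h.
    dz = h / n
    volume = 0.0
    moment = 0.0
    for i in range(n):
        z = (i + 0.5) * dz                  # midpoint of the i-th slab
        area = math.pi * (r * z / h) ** 2   # area of the slice at height z
        volume += area * dz
        moment += z * area * dz
    return volume, moment / volume

volume, z_centroid = cone_volume_and_centroid(3.0, 2.0)
# Compare with pi*h*r^2/3 and 3h/4 from the solution above.
print(volume, math.pi * 3.0 * 2.0 ** 2 / 3.0)
print(z_centroid, 0.75 * 3.0)
```

The centroid height $\frac34 h$ is measured from the vertex, which is the same point as $\frac14 h$ above the base.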
8.1.19. Find the centroid of the solid which is bounded by the paraboloid $2x^2 + 2y^2 = z$, the cylinder $(x + 1)^2 + y^2 = 1$, and the plane $z = 0$.
Solution. First, we will compute the volume of the given solid. Again, we use the cylindric coordinates ($x = r\cos\varphi$, $y = r\sin\varphi$, $z = z$), where the equation of the paraboloid is $z = 2r^2$ and the equation of the cylinder reads $r = -2\cos(\varphi)$. Moreover, taking into account the fact that the plane $x = 0$ is tangent to the given cylinder, we can easily
8.2.12. An example in two dimensions. The coordinate transformations are quite transparent for the integral of a continuous function $f(x, y)$ of two variables. Consider the differentiable transformation $G(s, t) = (x(s, t), y(s, t))$. Denoting $g(s, t) = f(x(s, t), y(s, t))$,
\[
\int_{G(N)} f(x, y)\,dx\,dy = \int_N g(s, t)\,\Bigl|\frac{\partial x}{\partial s}\frac{\partial y}{\partial t} - \frac{\partial x}{\partial t}\frac{\partial y}{\partial s}\Bigr|\,ds\,dt
\]
is obtained.
As a truly simple example, calculate the integral of the indicator function of a disc with radius $R$ (i.e. its area) and the integral of the function $f(r, \theta) = \cos(r)$ defined in polar coordinates inside a circle with radius $\frac12\pi$ (i.e. the volume hidden under such a "cap placed above the origin", see the illustration).
First, determine the Jacobi matrix of the transformation $x = r\cos\theta$, $y = r\sin\theta$:
\[
D^1G = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}.
\]
Hence, the determinant of this matrix is equal to
\[
\det D^1G(r, \theta) = r(\sin^2\theta + \cos^2\theta) = r.
\]
Therefore, the calculation can be done directly for the disc $S$ which is the image of the rectangle $(r, \theta) \in [0, R] \times [0, 2\pi] = T$. In this way the area of the disc is obtained:
\[
\int_S dx\,dy = \int_0^{2\pi}\int_0^{R} r\,dr\,d\theta = \int_0^{R} 2\pi r\,dr = \pi R^2.
\]
The integration of the function $f$ is very similar, using multiple integration and integration by parts:
\[
\int_S f(x, y)\,dx\,dy = \int_0^{2\pi}\int_0^{\pi/2} r\cos r\,dr\,d\theta = \pi^2 - 2\pi.
\]
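Both values of this example can be reproduced numerically. The sketch below assumes a rotationally symmetric integrand, so that the angular integral contributes the factor $2\pi$; the helper `polar_integral` and the sample radius $R = 3$ are illustrative choices.

```python
import math

def polar_integral(f, r_max, n_r=2000):
    """Integrate a rotationally symmetric function f(r) over the disc of radius r_max."""
    # In polar coordinates dx dy becomes r dr dtheta (the Jacobian is r),
    # and the integral over theta contributes the factor 2*pi.
    dr = r_max / n_r
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        total += f(r) * r * dr
    return 2.0 * math.pi * total

area = polar_integral(lambda r: 1.0, 3.0)      # area of the disc with R = 3
cap = polar_integral(math.cos, math.pi / 2.0)  # volume under the cap z = cos(r)
print(area, math.pi * 3.0 ** 2)
print(cap, math.pi ** 2 - 2.0 * math.pi)
```

Dropping the factor `r` in the sum gives visibly wrong values, which is a quick way to see why the Jacobian cannot be omitted.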
In many real-life applications, a much more general approach to integration is needed, one which allows for dealing with objects over curves, surfaces, and their higher dimensional analogues. For many simple cases, such tools can be built now with the help of a parametrization of such $k$-dimensional surfaces, employing the latter theorem to show the independence of the result of the chosen parametrization.
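determine the bounds of the integral that corresponds to the volume of the examined solid:
\[
V = \int_{\pi/2}^{3\pi/2}\int_0^{-2\cos\varphi}\int_0^{2r^2} r\,dz\,dr\,d\varphi
= \int_{\pi/2}^{3\pi/2}\int_0^{-2\cos\varphi} 2r^3\,dr\,d\varphi
= \int_{\pi/2}^{3\pi/2} 8\cos^4\varphi\,d\varphi = 3\pi,
\]
where the last integral can be computed using the method of recurrence from 6.2.6.
Now, let us find the centroid. Since the solid is symmetric with respect to the plane $y = 0$, the $y$-coordinate of the centroid must be zero. Then, the remaining coordinates $x_T$ and $z_T$ of the centroid can be computed by the following integrals:
\[
x_T = \frac1V\iiint_B x\,dx\,dy\,dz
= \frac1V\int_{\pi/2}^{3\pi/2}\int_0^{-2\cos\varphi}\int_0^{2r^2} r^2\cos\varphi\,dz\,dr\,d\varphi
= \frac1V\int_{\pi/2}^{3\pi/2}\int_0^{-2\cos\varphi} 2r^4\cos\varphi\,dr\,d\varphi
\]
\[
= \frac1V\int_{\pi/2}^{3\pi/2} -\frac{64}{5}\cos^6\varphi\,d\varphi = -\frac43,
\]
where the last integral was computed by 6.2.6 again. Analogously for the $z$-coordinate of the centroid:
\[
z_T = \frac1V\int_{\pi/2}^{3\pi/2}\int_0^{-2\cos\varphi}\int_0^{2r^2} z r\,dz\,dr\,d\varphi = \frac{20}{9}.
\]
The coordinates of the centroid are thus $\bigl[-\frac43, 0, \frac{20}{9}\bigr]$.
□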
8.1.20. Find the centroid of the homogeneous solid in $\mathbb{R}^3$ which lies between the planes $z = 0$ and $z = 2$, bounded by the cones $x^2 + y^2 = z^2$ and $x^2 + y^2 = 2z^2$.
Solution. The problem can be solved in the same way as the previous ones. It would be advantageous to work in cylindric coordinates.
However, we can notice that the solid in question is an "annular cone": it is formed by cutting a cone $K_1$ with base radius $2$ out of a cone $K_2$ with base radius $2\sqrt2$, of common height 2.
The centroid of the examined solid can be determined by the "rule of lever": the centroid of a system of two solids is the weighted arithmetic mean of the particular solids' centroids, weighted by the masses of the solids. We found out
These topics are postponed to the beginning of the next chapter where a more general and geometric approach is discussed.
3. Differential equations
In this section, we return to (vector) functions of one variable, defined and examined in terms of their instantaneous changes.
8.3.1. Linear and non-linear difference models. The concept of derivative was introduced in order to work with instantaneous changes of the examined quantities. In the introductory chapter, difference equations based on similar concepts in relation to sequences of scalars were discussed. As a motivating introduction to equations containing derivatives of unknown functions, recall first the difference equations.
The simplest difference equations are formulated as $y_{n+1} = F(y_n, n)$, with a function $F$ of two variables. For example, the model describing interest on deposits or loans (this includes the Malthusian model of populations) was considered. The increment was proportional to the value, $y_{n+1} = ay_n$, see 1.2.2. Growth by 5% is represented by $a = 1.05$. Considering continuous modeling, the same request leads to an equation connecting the derivative $y'(t)$ of a function with its value
(1) $y'(t) = r\,y(t)$
with the proportionality constant $r$. Here, the instantaneous growth by 5% corresponds to $r = 0.05$.
It is easy to guess the solution of the latter equation, i.e. a function $y(t)$ which satisfies the equality identically,
\[
y(t) = Ce^{rt}
\]
with an arbitrary constant $C$. This constant can be determined uniquely by choosing the initial value $y_0 = y(t_0)$ at some point $t_0$. If a part of the increment in a model should be given as a constant independent of the value $y$ or $t$ (like bank charges or the natural decrease of stock population as a result of sending some part of it to slaughterhouses), an equation can be used with a constant $s$ on the right-hand side:
(2) $y'(t) = r\cdot y(t) + s$.
The solution of this equation is the function
\[
y(t) = Ce^{rt} - \frac{s}{r}.
\]
It is a straightforward matter to produce this solution when it is realized that the set of all solutions of the equation (1) is a one-dimensional vector space, while the solutions of the equation (2) are obtained by adding any one of its solutions to the solutions of the previous equation. The constant solution $y(t) = k$ for $k = -\frac{s}{r}$ is easily found.
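The link between the difference model and the differential model can be illustrated numerically. In the sketch below, the difference equation $y_{n+1} = (1 + rh)y_n + sh$ with a small step $h$ is an assumption made for the comparison (it is the Euler discretization of equation (2), not a formula from the text), and the values $r = 0.05$, $s = -2$, $y_0 = 100$ are arbitrary sample data.

```python
import math

def continuous_model(t, r, s, y0):
    """Closed-form solution of y' = r*y + s with y(0) = y0 (assumes r != 0)."""
    C = y0 + s / r  # the constant C is fixed by the initial condition y(0) = y0
    return C * math.exp(r * t) - s / r

def difference_model(t, r, s, y0, steps):
    """Iterate y_{n+1} = (1 + r*h)*y_n + s*h with step h = t/steps."""
    h = t / steps
    y = y0
    for _ in range(steps):
        y = (1.0 + r * h) * y + s * h
    return y

exact = continuous_model(10.0, 0.05, -2.0, 100.0)
for steps in (10, 1000, 100000):
    # The discrete values approach the continuous one as the step shrinks.
    print(steps, difference_model(10.0, 0.05, -2.0, 100.0, steps), exact)
```

With a coarse step the two models differ noticeably, which mirrors the difference between growth by 5% per period ($a = 1.05$) and instantaneous growth with $r = 0.05$.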
Similarly, in paragraph 1.4.1, the logistic model of population growth was created, based on the assumption that the ratio of the change of the population size $p(n + 1) - p(n)$ and its size $p(n)$ is affine with respect to the population size
in exercise 8.1.18 that the centroid of a homogeneous cone is situated at a quarter of its height. Therefore, the centroids of both cones lie at the same point, and this point thus must be the centroid of the examined solid as well. Hence, the coordinates of the wanted centroid are $[0, 0, \frac32]$. □
8.1.21. Find the volume of the solid in $\mathbb{R}^3$ which is bounded by the cone part $x^2 + y^2 = (z - 2)^2$ and the paraboloid $x^2 + y^2 = 4 - z$.
Solution. We build the corresponding integral in cylindric coordinates, which evaluates as follows:
\[
V = \int_0^{2\pi}\int_0^{1}\int_{r+2}^{4-r^2} r\,dz\,dr\,d\varphi = \frac{5}{6}\pi.
\]
□
8.1.22. Find the volume of the solid in $\mathbb{R}^3$ which lies under the cone $x^2 + y^2 = (z - 2)^2$, $z < 2$ and over the paraboloid $x^2 + y^2 = z$.
Solution.
\[
V = \int_0^{2\pi}\int_0^{1}\int_{r^2}^{2-r} r\,dz\,dr\,d\varphi = \frac{5}{6}\pi.
\]
Note that the considered solid is symmetric with the solid from the previous exercise 8.1.21 (the center of the symmetry is the point $[0, 0, 2]$). Therefore, it must have the same volume. □
8.1.23. Find the centroid of the surface bounded by the parabola $y = 4 - x^2$ and the line $y = 0$. ○
8.1.24. Find the centroid of the circular sector corresponding to the angle of 60° that was cut out of a disc with radius 1. ○
8.1.25. Find the centroid of the semidisc $x^2 + y^2 = 1$, $y > 0$. ○
8.1.26. Find the centroid of the circular sector corresponding to the angle of 120° that was cut out of a disc with radius 1. ○
8.1.27. Find the volume of the solid in $\mathbb{R}^3$ which is given by the inequalities $z > 0$, $z - x < 0$, and $(x - 1)^2 + y^2 < 1$. ○
8.1.28. Find the volume of the solid in $\mathbb{R}^3$ which is given by the inequalities $z > 0$, $z - y < 0$. ○
8.1.29. Find the volume of the solid bounded by the surface
\[
3x^2 + 2y^2 + 3z^2 + 2xy - 2yz - 4xz = 1.
\]
itself. The model behaves similarly to the Malthusian one for small values of the population size, and the population ceases to grow when reaching a limit value $K$. Now, the same relation for the continuous model can be formulated for a population $p(t)$ dependent on time $t$ by the equality
(3) $p'(t) = p(t)\bigl(-\frac{r}{K}p(t) + r\bigr)$.
At the value $p(t) = K$ for a (large) constant $K$, the instantaneous increment of the function $p$ is zero, while for $p(t) > 0$ near zero, the ratio of the rate of increment of the population and its size is close to $r$, which is the (small) number expressing the rate of increment of the population in good conditions (e.g. 0.05 would again mean instantaneous growth by 5%).
It is not easy to solve such an equation without knowing any theory (although this type of equations will be dealt with in a moment). However, as an exercise on differentiation, it is easily verified that the following function is a solution for every constant $C$:
\[
p(t) = \frac{K}{1 + CKe^{-rt}}.
\]
For the continuous and discrete versions of the logistic models, the values $K = 100$, $r = 0.05$, and $C = 1$ are chosen in the left-hand illustration. The result from 1.4.1 is shown in the right-hand illustration (i.e. with $a = 1.05$ and $p_1 = 1$, as expected). The choice $C = 1$ yields $p(0) = K/(1 + K)$, which is very close to 1 if $K$ is large enough.
O
In particular, both versions of this logistic model yield quite similar results. For example, the left-hand illustration also contains the dashed line of the graph of the solution of the equation (1) with the same constant $r$ and initial condition (i.e. the Malthusian model of growth).
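The claim that $p(t) = K/(1 + CKe^{-rt})$ solves equation (3) can also be checked numerically. The sketch below uses the values $K = 100$, $r = 0.05$, $C = 1$ from the text; the central-difference derivative and the helper names `derivative` and `max_residual` are illustrative assumptions.

```python
import math

K, r, C = 100.0, 0.05, 1.0

def p(t):
    """Candidate solution of the logistic equation (3)."""
    return K / (1.0 + C * K * math.exp(-r * t))

def rhs(value):
    """Right-hand side of (3): p * (-(r/K) * p + r)."""
    return value * (-(r / K) * value + r)

def derivative(g, t, h=1e-5):
    """Central-difference approximation of g'(t)."""
    return (g(t + h) - g(t - h)) / (2.0 * h)

def max_residual(times):
    """Largest deviation between p'(t) and the right-hand side at the given times."""
    return max(abs(derivative(p, t) - rhs(p(t))) for t in times)

print(p(0.0), K / (1.0 + K))                           # initial value, close to 1
print(max_residual([0.0, 10.0, 50.0, 100.0, 200.0]))   # should be close to zero
```

A residual near zero at every sample time is consistent with the differentiation exercise suggested in the text.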
8.3.2. First-order differential equations. By an (ordinary) first-order differential equation is usually meant the relation between the derivative $y'(t)$ of a function with respect to the variable $t$, its value $y(t)$, and the variable itself, which can be written in terms of some real-valued function $F : \mathbb{R}^3 \to \mathbb{R}$ as the equality
F(y'(t),y(t),t) = 0.
This equation resembles the implicitly defined functions y(t); however, this time there is a dependency on the derivative of the function y(t). We also often suppress the dependence of y = y(t) on the other variable t and write F(y', y, t) = 0 instead.
8.1.30. Find the volume of the part of $\mathbb{R}^3$ lying inside the ellipsoid $2x^2 + y^2 + z^2 = 6$ and in the half-space $x > 1$. ○
8.1.31. The area of the graph of a real-valued function $f(x, y)$ in variables $x$ and $y$. The area of the graph of a function of two variables over a region $S$ in the plane $xy$ is given by the integral
\[
\iint_S \sqrt{1 + f_x^2 + f_y^2}\,dx\,dy.
\]
Considering the cone $x^2 + y^2 = z^2$, find the area of the part of its lateral surface which lies above the plane $z = 0$ and inside the cylinder $x^2 + y^2 = y$.
Solution. The wanted area can be calculated as the area of the graph of the function $z = \sqrt{x^2 + y^2}$ over the disc $K$: $x^2 + (y - \frac12)^2 \le \frac14$. We can easily see that
\[
f_x = \frac{x}{\sqrt{x^2 + y^2}}, \qquad f_y = \frac{y}{\sqrt{x^2 + y^2}},
\]
so the area is expressed by the integral
\[
\iint_K \sqrt{1 + f_x^2 + f_y^2}\,dx\,dy = \iint_K \sqrt2\,dx\,dy
= \sqrt2\int_0^{\pi}\int_0^{\sin\varphi} r\,dr\,d\varphi
= \frac{\sqrt2}{2}\int_0^{\pi} \sin^2\varphi\,d\varphi = \frac{\sqrt2}{4}\pi.
\]
□
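The surface-area formula can be tried out numerically on this very example. The sketch below is only an illustration: `graph_area` is a hypothetical helper that approximates the partial derivatives by central differences and sums over the grid cells whose midpoints lie in the region, so it is accurate only up to the grid resolution.

```python
import math

def graph_area(f, inside, x_range, y_range, n=300, eps=1e-6):
    """Approximate the area of the graph of f over the region where inside(x, y) holds."""
    (x0, x1), (y0, y1) = x_range, y_range
    hx, hy = (x1 - x0) / n, (y1 - y0) / n
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * hx
        for j in range(n):
            y = y0 + (j + 0.5) * hy
            if not inside(x, y):
                continue
            # Central differences stand in for the partial derivatives f_x and f_y.
            fx = (f(x + eps, y) - f(x - eps, y)) / (2.0 * eps)
            fy = (f(x, y + eps) - f(x, y - eps)) / (2.0 * eps)
            total += math.sqrt(1.0 + fx * fx + fy * fy) * hx * hy
    return total

cone = lambda x, y: math.sqrt(x * x + y * y)
disc = lambda x, y: x * x + (y - 0.5) ** 2 <= 0.25   # the disc K of the exercise
area = graph_area(cone, disc, (-0.5, 0.5), (0.0, 1.0))
print(area, math.sqrt(2.0) * math.pi / 4.0)
```

For the cone the integrand is the constant $\sqrt2$, so the result is just $\sqrt2$ times the area of the disc $K$; the same helper can be pointed at other graphs where the integrand is not constant.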
8.1.32. Find the area of the paraboloid $z = x^2 + y^2$ over the disc $x^2 + y^2 < 4$. ○
8.1.33. Find the area of the part of the plane $x + 2y + z = 10$ that lies over the figure given by $(x - 1)^2 + y^2 < 1$ and $y > x$. ○
In the following exercise, we will also apply our knowledge of the theory of Fourier transforms from the previous chapter.
8.1.34. Fourier transform and diffraction. Light intensity is a physical quantity which expresses the transmission of energy by waves. The intensity of a general light wave is defined as the time-averaged magnitude of the Poynting vector, which is the vector product of mutually orthogonal vectors of electric and magnetic fields. A monochromatic plane wave spreading in the direction of the y-axis satisfies
\[
I = c\varepsilon_0\,\overline{E_y^2},
\]
where $c$ is the speed of light, $\varepsilon_0$ is the vacuum permittivity, and the bar denotes the time average. The monochromatic wave is described by the harmonic function $E_y = \psi(x, t) = A\cos(\omega t - kx)$. The number $A$ is the
If the implicit equation can at least be solved explicitly with regard to the derivative, i.e.
\[
y' = f(t, y)
\]
for some function $f : \mathbb{R}^2 \to \mathbb{R}$, it is clear graphically what this equation defines. For every point $(t, y)$ in the plane, the arrow corresponding to the vector $(1, f(t, y))$ can be considered. That is the velocity with which the point of the graph of the solution moves through the plane, depending on the free parameter $t$.
For instance, the equation (3) in the previous subsection determines the following (illustrating the solution for the initial condition as above):
Such illustrations should invoke the idea that differential equations define a "flow" in the plane, and each choice of the initial value $(t_0, y(t_0))$ should correspond to a unique flow-line expressing the movement of the initial point in the time $t$. It can be anticipated intuitively that for reasonably behaved functions $f(t, y)$ in the equations $y' = f(t, y)$, there is a unique solution for all initial conditions.
8.3.3. Integration of differential equations. Before examining the conditions for existence and uniqueness of the solutions, we present a truly elementary method for finding the solutions. The idea, mentioned briefly already in 6.2.14 on page 407, is to transform the problem to ordinary integration, which usually leads to an implicit description of the solution.
Equations with separated variables
Consider a differential equation in the form
(1) $y' = f(t)\cdot g(y)$
for two continuous functions of a real variable, $f$ and $g$.
The solution of this equation can be obtained by integration, finding the antiderivatives
\[
G(y) = \int \frac{dy}{g(y)}, \qquad F(t) = \int f(t)\,dt.
\]
This procedure reliably finds solutions $y(t)$ which satisfy $g(y(t)) \ne 0$, given implicitly by the formula
(2) $F(t) + C = G(y)$
with an arbitrary constant $C$.
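maximal amplitude of the wave, $\omega$ is the angular frequency, and for any fixed $t$, the so-called wavelength $\lambda$ is the prime period. The number $k = \frac{2\pi}{\lambda}$ is then the wave number of the propagating wave. We have
\[
I = c\varepsilon_0\frac{1}{\tau}\int_0^{\tau} E_y^2\,dt
= c\varepsilon_0\frac{1}{\tau}\int_0^{\tau} A^2\cos^2(\omega t - kx)\,dt
= c\varepsilon_0A^2\frac{1}{\tau}\int_0^{\tau} \frac{1 + \cos(2(\omega t - kx))}{2}\,dt
\]
\[
= \frac12c\varepsilon_0A^2\frac{1}{\tau}\Bigl[t + \frac{\sin(2(\omega t - kx))}{2\omega}\Bigr]_0^{\tau}
= \frac12c\varepsilon_0A^2\frac{1}{\tau}\Bigl(\tau + \frac{\sin(2(\omega\tau - kx)) - \sin(2(-kx))}{2\omega}\Bigr)
\]
\[
= \frac12c\varepsilon_0A^2\Bigl(1 + \frac{\sin(2(\omega\tau - kx)) - \sin(2(-kx))}{2\omega\tau}\Bigr)
\approx \frac12c\varepsilon_0A^2.
\]
The second term in the parentheses can be neglected since it is always less than $\frac{2}{2\omega\tau} = \frac{1}{\omega\tau} < 10^{-6}$ for real detectors of light, so it is much inferior to 1. The light intensity is directly proportional to the squared amplitude.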
A diffraction is such a deviation from straight-line propagation of light which cannot be explained as the result of a refraction or reflection (or the change of the ray's direction in a medium with continuously varying refractive index). The diffraction can be observed when a lightbeam propagates through a bounded space. The diffraction phenomena are strongest and easiest to see if the light goes through openings or obstacles whose size is roughly the wavelength of the light. In the case of the Fraunhofer diffraction, with which we will deal in the following example, a monochromatic plane wave goes through a very thin rectangular opening and projects on a distant surface. For instance, we can highlight a spot on the wall with a laser pointer. The image we get is the Fourier transform of the function describing the permeability of the shade - opening.
Let us choose the plane of the diffraction shade as the coordinate plane $z = 0$. Let a plane wave $A\exp(ikz)$ (independent of the point $(x, y)$ of landing on the shade) hit this plane perpendicularly. Let $s(x, y)$ denote the function of the permeability of the shade; then the resulting wave falling onto the projection surface at a point $(\xi, \eta)$ can be described as the integral sum of the waves (Huygens-Fresnel principle) which have gone through the shade and propagate through the medium from all points $(x, y, 0)$ (as spherical waves) into the point $(\xi, \eta, z)$:
Differentiating the latter equation (2) using the chain rule for the composite function $G(y(t))$ leads to $\frac{1}{g(y(t))}y'(t) = f(t)$, as required.
As an example, find the solution of the equation
\[
y' = ty.
\]
Direct calculation gives $\ln|y(t)| = \frac12t^2 + C$ with an arbitrary constant $C$. Hence it looks (at least for positive values of $y$) as
\[
y(t) = e^{\frac12t^2 + C} = D\,e^{\frac12t^2},
\]
where $D$ is an arbitrary positive constant. It is helpful to examine the resulting formula and signs thoroughly. The constant solution $y(t) = 0$ also satisfies the equation. For negative values of $y$, the same solution can be used with negative constants $D$. In fact, the constant $D$ can be arbitrary, and a solution is found satisfying any initial value.
[Figure: direction field of the equation y' = ty with two solutions.]
The illustration shows two solutions which demonstrate the instability of the equation with regard to the initial values: for every $t_0$, if we change a small $y_0$ from a negative value to a positive one, then the behaviour of the resulting solution changes dramatically. Notice the constant solution $y(t) = 0$, which satisfies the initial condition $y(t_0) = 0$.
Using separation of variables, the non-linear equation from the previous paragraph, which describes the logistic population model, is easily solved. Try this as an exercise.
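The implicit description $F(t) + C = G(y)$ says that $G(y) - F(t)$ stays constant along every solution. For the example $y' = ty$ this quantity is $\ln|y| - \frac12t^2$, which the following sketch monitors along a numerically computed solution. Heun's method, the step count, and the initial value $y(0) = 0.5$ are assumptions chosen for illustration; they are not part of the text.

```python
import math

def heun(f, t0, y0, t1, steps=20000):
    """Heun's (improved Euler) method for y' = f(t, y); returns the list of (t, y) points."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    points = [(t, y)]
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h, y + h * k1)
        y += h * (k1 + k2) / 2.0
        t += h
        points.append((t, y))
    return points

def invariant(t, y):
    """G(y) - F(t) with G(y) = ln|y| and F(t) = t^2/2; constant along solutions of y' = t*y."""
    return math.log(abs(y)) - t * t / 2.0

points = heun(lambda t, y: t * y, 0.0, 0.5, 2.0)
values = [invariant(t, y) for t, y in points]
drift = max(values) - min(values)
print(drift)                                 # close to zero
print(points[-1][1], 0.5 * math.exp(2.0))    # numerical value vs. D*exp(t^2/2) with D = 0.5
```

Starting instead from a small negative value such as $y(0) = -0.5$ gives the mirrored solution, which is the sign change in the constant $D$ discussed above.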
First order linear equations. In the first chapter, we paid much attention to linear difference equations. Their general solution was determined in paragraph 1.2.2 on page 11. Although it is clear beforehand that it is a one-dimensional affine space of sequences, it is a hardly transparent sum, because all the changing coefficients need to be taken into account.
Consequently this can be used as a source of inspiration for the following construction of the solution of a general first-order linear equation
(1) y' = a(t)y + b(t)
with continuous coefficients a(t) and b(t).
First, find the solution of the homogeneous equation $y' = a(t)y$. This can be computed easily by separation of variables, obtaining
\[
y(t) = y_0F(t, t_0), \qquad F(t, s) = e^{\int_s^t a(x)\,dx}.
\]
In the case of difference equations, the solution of the general non-homogeneous equation was "guessed". Then it was
proved by induction that it was correct. It is even simpler now, as it suffices to differentiate the correct solution to verify the statement, once we are told what the right result is:
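\[
\psi(\xi, \eta) = A\iint_{\mathbb{R}^2} s(x, y)\,e^{-ik(\xi x + \eta y)}\,dx\,dy.
\]
For a rectangular opening with sides $p$ and $q$, this gives
\[
\psi(\xi, \eta) = A\int_{-p/2}^{p/2}\int_{-q/2}^{q/2} e^{-ik(\xi x + \eta y)}\,dy\,dx
= A\int_{-p/2}^{p/2} e^{-ik\xi x}\,dx\int_{-q/2}^{q/2} e^{-ik\eta y}\,dy
\]
\[
= A\Bigl[\frac{e^{-ik\xi x}}{-ik\xi}\Bigr]_{-p/2}^{p/2}\Bigl[\frac{e^{-ik\eta y}}{-ik\eta}\Bigr]_{-q/2}^{q/2}
= A\,\frac{2\sin(k\xi p/2)}{k\xi}\cdot\frac{2\sin(k\eta q/2)}{k\eta}
= Apq\,\frac{\sin(k\xi p/2)}{k\xi p/2}\cdot\frac{\sin(k\eta q/2)}{k\eta q/2}.
\]
The graph of the function $f(x) = \frac{\sin x}{x}$ looks as follows:
[Figure: graph of the function sin(x)/x.]
The graph of the function $\psi(\xi, \eta) = \frac{\sin\xi}{\xi}\cdot\frac{\sin\eta}{\eta}$ then does: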
The solution of first-order linear equations
The solution of the equation (1) with initial values $y(t_0) = y_0$ is (locally in a neighbourhood of $t_0$) given by the formula
\[
y(t) = y_0F(t, t_0) + \int_{t_0}^{t} F(t, s)b(s)\,ds,
\]
where $F(t, s) = e^{\int_s^t a(x)\,dx}$.
Verify the correctness of the solution by yourselves (pay proper attention to the differentiation of the integral where $t$ is both in the upper bound and a free parameter in the integrand, cf. ??).
In fact, there is a general method called variation of constants which directly yields this solution, see e.g. the problem 8.J.9. It consists in taking the solution of the homogeneous equation in the form $y(t) = cF(t, t_0)$ and considering instead an ansatz for a solution of the non-homogeneous equation in the form $y(t) = c(t)F(t, t_0)$ with an unknown function $c(t)$. Differentiating yields the equation $c'(t) = e^{-\int_{t_0}^{t} a(x)\,dx}\,b(t)$, and integrating this with $c(t_0) = y_0$ leads to $c(t) = y_0 + \int_{t_0}^{t} e^{-\int_{t_0}^{s} a(x)\,dx}\,b(s)\,ds$, i.e. $y(t) = c(t)\,e^{\int_{t_0}^{t} a(x)\,dx}$ as in the above formula. Check the details!
Notice also the similarity to the solution for the equations with constant coefficients explicitly computed in the form of convolution in ?? on the page ??, which could serve as inspiration, too.
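The formula for the solution of a first-order linear equation can be tested against a direct numerical integration of the equation. The sketch below makes several assumptions: the coefficients $a(t) = -t$, $b(t) = 1$ and the initial condition $y(0) = 2$ are sample data, the antiderivative of $a$ is supplied in closed form, and the helper names `linear_ode_solution` and `rk4` are hypothetical.

```python
import math

def linear_ode_solution(A, b, t0, y0, t, n=4000):
    """Evaluate y(t) = y0*F(t,t0) + integral from t0 to t of F(t,s)*b(s) ds.

    A is an antiderivative of the coefficient a, so F(t,s) = exp(A(t) - A(s)).
    The integral is approximated by the midpoint rule with n subintervals.
    """
    def F(upper, lower):
        return math.exp(A(upper) - A(lower))
    h = (t - t0) / n
    integral = 0.0
    for j in range(n):
        s = t0 + (j + 0.5) * h
        integral += F(t, s) * b(s) * h
    return y0 * F(t, t0) + integral

def rk4(f, t0, y0, t1, steps=2000):
    """Classical fourth-order Runge-Kutta integration of y' = f(t, y) from t0 to t1."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2.0, y + h * k1 / 2.0)
        k3 = f(t + h / 2.0, y + h * k2 / 2.0)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return y

# Sample equation y' = -t*y + 1 with y(0) = 2, evaluated at t = 1.5.
by_formula = linear_ode_solution(lambda t: -t * t / 2.0, lambda s: 1.0, 0.0, 2.0, 1.5)
by_rk4 = rk4(lambda t, y: -t * y + 1.0, 0.0, 2.0, 1.5)
print(by_formula, by_rk4)
```

The two values should agree to many decimal places; a disagreement would typically point to a sign error in the exponent of $F(t, s)$.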
As an example, the equation
\[
y' = 1 - xy
\]
can be solved directly, this time encountering stable behaviour, visible in the following illustration.
[Figure: direction field of the equation y' = 1 - xy with several solutions.]
And the diffraction we are describing:
8.3.5. Transformation of coordinates. The illustrations suggest that differential equations can be perceived as geometric objects (the "directional field of the arrows"), so the solution can be found by conveniently chosen coordinates. We return to this point of view later. Here are three simple examples of typical tricks as seen from the explicit form of the equations in coordinates.
Since $\lim_{x\to 0}\frac{\sin x}{x} = 1$, the intensity at the middle of the image is directly proportional to $I_0 = A^2p^2q^2$. The Fourier transform can be easily scrutinized if we aim a laser pointer through a subtle opening between the thumb and the index finger; it will be the image of the function of its permeability. The image of the last picture can be seen if we create a good rectangular opening by, for instance, gluing together some stickers with sharp edges.
J. First-order differential equations
8.J.1. Find all solutions of the differential equation
$$y' = \frac{\sqrt{1-y^2}\,\left(1+\cos^2 x\right)}{\cos^2 x}.$$
Solution. We are given an ordinary first-order differential equation in the form $y' = f(x,y)$, which is called an explicit form of the equation. Moreover, we can write it as $y' = f_1(x)\cdot f_2(y)$ for continuous univariate functions $f_1$ and $f_2$ (on certain open intervals), i.e., it is a differential equation with separated variables.
First, we replace $y'$ with $dy/dx$ and rewrite the differential equation in the form
$$\frac{dy}{\sqrt{1-y^2}} = \frac{1+\cos^2 x}{\cos^2 x}\,dx.$$
Since
$$\int \frac{1+\cos^2 x}{\cos^2 x}\,dx = \int \left(\frac{1}{\cos^2 x} + 1\right) dx,$$
we can integrate using the basic formulae, thereby obtaining
$$\arcsin y = \operatorname{tg} x + x + C, \quad C \in \mathbb{R}. \tag{1}$$
However, we must keep in mind that the division by the expression $\sqrt{1-y^2}$ is valid only if it is non-zero, i.e., only for $y \ne \pm 1$. Substituting the constant functions $y = 1$, $y = -1$ into the given differential equation, we can immediately see that they satisfy it. We have thus obtained two more solutions, $y \equiv 1$ and $y \equiv -1$,
We begin with homogeneous equations of the form
$$y' = f\left(\frac{y}{t}\right).$$
Considering the transformation $z = \frac{y}{t}$ and assuming that $t \ne 0$, then by the chain rule,
$$z'(t) = \frac{1}{t^2}\bigl(t\,y'(t) - y(t)\bigr) = \frac{1}{t}\bigl(f(z) - z\bigr),$$
which is an equation with separated variables.
Other examples are the Bernoulli differential equations, which are of the form
$$y'(t) = f(t)\,y(t) + g(t)\,y(t)^n,$$
where $n \ne 0, 1$. The choice of the transformation $z = y^{1-n}$ leads to the equation
$$z'(t) = (1-n)\,y(t)^{-n}\bigl(f(t)\,y(t) + g(t)\,y(t)^n\bigr) = (1-n)f(t)\,z(t) + (1-n)g(t),$$
which is a linear equation, easily integrated.
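As a small worked instance of this substitution (the particular equation is chosen here only for illustration), take $f(t) = 1$, $g(t) = -1$ and $n = 2$:

```latex
% Illustrative Bernoulli equation (an assumed example): y' = y - y^2, n = 2.
\begin{align*}
  y' &= y - y^2, \qquad z = y^{1-2} = \frac{1}{y},\\
  z' &= (1-2)\cdot 1\cdot z + (1-2)\cdot(-1) = -z + 1,\\
  z(t) &= 1 + C e^{-t}, \qquad C \in \mathbb{R},\\
  y(t) &= \frac{1}{1 + C e^{-t}} .
\end{align*}
% Check: y' = \frac{C e^{-t}}{(1 + C e^{-t})^2} = y\,(1 - y) = y - y^2 .
```

The constant solution $y = 0$ is lost by the substitution $z = 1/y$ and has to be added separately.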
We conclude with the extraordinarily important Riccati equation. It is a form of the Bernoulli equation with $n = 2$, extended by an absolute term
$$y'(t) = f(t)\,y(t) + g(t)\,y(t)^2 + h(t).$$
This equation can also be transformed to a linear equation provided that a particular solution $x(t)$ can be guessed. Then, use the transformation
$$z(t) = \frac{1}{y(t) - x(t)}.$$
Verify by yourselves that this transformation leads to the equation
$$z'(t) = -\bigl(f(t) + 2x(t)g(t)\bigr)z(t) - g(t).$$
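One possible way to carry out the verification is sketched below; it only uses the relation $y = x + 1/z$ and the fact that $x(t)$ itself solves the Riccati equation.

```latex
% Sketch of the verification, writing y = x + 1/z and omitting the argument t.
\begin{align*}
  y' &= x' - \frac{z'}{z^2},\\
  f y + g y^2 + h
     &= \bigl(f x + g x^2 + h\bigr) + \frac{f}{z} + \frac{2 g x}{z} + \frac{g}{z^2}
      = x' + \frac{f + 2 x g}{z} + \frac{g}{z^2},\\
  -\frac{z'}{z^2} &= \frac{f + 2 x g}{z} + \frac{g}{z^2}
  \quad\Longrightarrow\quad z' = -(f + 2 x g)\,z - g .
\end{align*}
```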
As seen in the case of integration of functions (the simplest type of equations with separated variables), the equations usually do not have a solution expressible explicitly in terms of elementary functions.
As with standard engineering tables of values of special functions, books listing the solutions of basic equations are compiled as well.⁶ Today, the wisdom concealed in them is essentially transferred to software systems like Maple or Mathematica. Here, any task about ordinary differential equations can be assigned, with results obtained in surprisingly many cases. Yet, explicit solutions are not possible for most problems.
⁶ For example, the famous book Differentialgleichungen reeller Funktionen, Akademische Verlagsgesellschaft, Leipzig 1930, by E. Kamke, a German mathematician, contains many hundreds of solved equations. It appeared in many editions in the last century.
which are called singular. We do not have to pay attention to the case cos x = 0 since this only loses points of the domains (but not any solutions).
Now, we will comment on several parts of the computation. The expression $y' = dy/dx$ allows us to make many symbolic manipulations. For instance, we have
$$\frac{dz}{dy}\cdot\frac{dy}{dx} = \frac{dz}{dx}, \qquad \frac{1}{\frac{dy}{dx}} = \frac{dx}{dy}.$$
The validity of these two formulae is actually guaranteed by the chain rule theorem and the theorem for differentiating an inverse function, respectively. It was precisely the ease of these manipulations that inspired G. W. Leibniz to introduce this notation, which is still in use. Further, we should realize why we have not written the general solution (1) in the suggestive form
$$y = \sin\left(\operatorname{tg} x + x + C\right), \quad C \in \mathbb{R}. \tag{2}$$
As we will not mention the domains of differential equations (i.e., for which values of $x$ the expressions are well-defined), we will not change them by "redundant" simplifications, either. It is apparent that the function $y$ from (2) is defined for all $x \in (0,\pi) \setminus \{\pi/2\}$. However, for the values of $x$ which are close to $\pi/2$ (having fixed $C$), there is no $y$ satisfying (1). In general, the solutions of differential equations are curves which may not be expressible as graphs of elementary functions (on the whole intervals where we consider them). Therefore, we will not even try to do that. □
8.3.6. Existence and uniqueness. The way out of this is numerical methods, which try only to approximate the solutions. However, to be able to use them, good theoretical starting points are still needed regarding existence, uniqueness, and stability of the solutions.
We begin with the Picard-Lindelöf theorem:
Existence and uniqueness of the solutions of ODEs
Theorem. Consider a function $f(t,y) : \mathbb{R}^2 \to \mathbb{R}$ with continuous partial derivatives on an open set $U$. Then for every point $(t_0,y_0) \in U \subset \mathbb{R}^2$, there exists the maximal interval $I = [t_0 - a, t_0 + b]$, with positive $a, b \in \mathbb{R}$, and the unique function $y(t) : I \to \mathbb{R}$ which is the solution of the equation $y' = f(t,y)$ on the interval $I$ satisfying the initial condition $y(t_0) = y_0$.
Proof. If a differentiable function $y(t)$ is a solution of the equation satisfying the initial condition $y(t_0) = y_0$, then it also satisfies the equation
$$y(t) = y_0 + \int_{t_0}^{t} y'(s)\,ds = y_0 + \int_{t_0}^{t} f(s,y(s))\,ds,$$
where the Riemann integrals exist due to the continuity of $f$ and hence also $y'$. However, the right-hand side of this expression is the integral operator
$$L(y)(t) = y_0 + \int_{t_0}^{t} f(s,y(s))\,ds$$
8.J.2. Find the general solution of the equation
$$y' = (2 - y)\operatorname{tg} x.$$
Solution. Again, we are given a differential equation with separated variables. We have
$$\frac{dy}{dx} = (2-y)\operatorname{tg} x,$$
$$-\frac{dy}{y-2} = \frac{\sin x}{\cos x}\,dx,$$
$$-\ln|y-2| = -\ln|\cos x| - \ln|C|, \quad C \ne 0.$$
Here, the shift obtained from the integration has been expressed by $\ln|C|$, which is very advantageous (bearing in mind what we want to do next) especially in those cases when we obtain a logarithm on both sides of the equation. Further, we have
acting on functions $y$. Solving a first-order differential equation is equivalent to finding a fixed point of this operator $L$, that is, a function $y = y(t)$ satisfying $L(y) = y$.
On the other hand, if a Riemann-integrable function y(t) is a fixed point of the operator L, then it immediately follows from the fundamental theorem of calculus that y(t) satisfies the given differential equation, including the initial conditions.
It is easy to estimate how much the values $L(y)$ and $L(z)$ differ for various functions $y(t)$ and $z(t)$. Since both partial derivatives of $f$ are continuous, $f$ is itself locally Lipschitz. This means that restricting the values $(t,y)$ to a neighbourhood $U$ of the point $(t_0,y_0)$ with compact closure, there is the estimate
$$|f(t,y) - f(t,z)| \le C\,|y - z|,$$
with some constant $C$ depending only on $U$. This immediately leads to the following bound (for the sake of simplicity, $t > t_0$, but the final conclusion works for $t < t_0$ the same way)
$$\ln|y-2| = \ln|C\cos x|, \quad C \ne 0,$$
$$|y-2| = |C\cos x|, \quad C \ne 0,$$
$$y - 2 = C\cos x, \quad C \ne 0,$$
where we should write $\pm C$ (after removing the absolute value). However, since we consider all non-zero values of $C$, it makes no difference whether we write $+C$ or $-C$. We should pay attention to the fact that we have made a division by the expression $y - 2$. Therefore, we must examine the case $y = 2$ separately. The derivative of a constant function is zero, so we have found another solution, $y = 2$. However, this solution is not singular since it is contained in the general solution as the case $C = 0$. Thus, the correct result is
$$y = 2 + C\cos x, \quad C \in \mathbb{R}.$$ □
8.J.3. Find the solution of the differential equation
$$(1 + e^x)\,y\,y' = e^x$$
which satisfies the initial condition $y(0) = 1$.
Solution. If the functions $f : (a,b) \to \mathbb{R}$ and $g : (c,d) \to \mathbb{R}$ are continuous and $g(y) \ne 0$, $y \in (c,d)$, then the initial problem
$$y' = f(x)\,g(y), \quad y(x_0) = y_0$$
has a unique solution for any $x_0 \in (a,b)$, $y_0 \in (c,d)$. This solution is determined implicitly as
$$\int_{y_0}^{y(x)} \frac{dt}{g(t)} = \int_{x_0}^{x} f(t)\,dt.$$
In practical problems, we first find all solutions of the equation and then select the one which satisfies the initial condition.
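Let us compute:
$$(1+e^x)\,y\,\frac{dy}{dx} = e^x,$$
$$y\,dy = \frac{e^x}{1+e^x}\,dx,$$
$$\frac{y^2}{2} = \ln(1+e^x) + \ln|C|, \quad C \ne 0,$$
$$\frac{y^2}{2} = \ln\bigl(C\,[1+e^x]\bigr), \quad C > 0.$$
The substitution $y = 1$, $x = 0$ then gives
$$\frac{1}{2} = \ln(C\cdot 2), \quad\text{i.e.}\quad C = \frac{\sqrt{e}}{2}.$$
We have thus found the solution
$$\frac{y^2}{2} = \ln\left(\frac{\sqrt{e}}{2}\,[1+e^x]\right),$$
i.e.,
$$y = \sqrt{2\ln\left(\frac{\sqrt{e}}{2}\,[1+e^x]\right)}$$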
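$$\begin{aligned}
|(L(y)-L(z))(t)| &= \left|\int_{t_0}^{t} \bigl(f(s,y(s)) - f(s,z(s))\bigr)\,ds\right| \\
&\le \int_{t_0}^{t} |f(s,y(s)) - f(s,z(s))|\,ds \\
&\le C\int_{t_0}^{t} |y(s) - z(s)|\,ds \\
&\le C\Bigl(\max_{t_0\le s\le t} |y(s)-z(s)|\Bigr)\,|t - t_0| \\
&\le D\Bigl(\max_{t_0\le s\le t} |y(s)-z(s)|\Bigr),
\end{aligned}$$
where the constant $D$ comes from substituting the maximum of $|t - t_0|$ on $U$.
If the operator $L$ is viewed as an operator on a metric space of continuous functions on a compact interval with the max norm, this yields
$$\|L(y) - L(z)\| \le D\,\|y - z\|.$$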
Some further restrictions on the choice of U and the considered functions y and z are required, in order to make the constant D smaller than one. Then the Banach fixed point theorem, based on the notion of a contraction, can be applied. See 7.3.9 on the page 493. At the same time, the operator must leave the chosen space of functions y invariant, i.e. the images L(y) are also there.
To begin, choose $\varepsilon > 0$ and $\delta > 0$, both small enough so that $[t_0-\delta, t_0+\delta]\times[y_0-\varepsilon, y_0+\varepsilon] = V \subset U$, and consider only those functions $y(t)$ which satisfy for $J = [t_0-\delta, t_0+\delta]$ the estimate $\max_{t\in J}|y(t) - y_0| \le \varepsilon$. The uniform continuity of $f(t,y)$ on $V$ ensures that fixing $\varepsilon$ and further shrinking $\delta$ implies
$$\max_{t\in J}|L(y)(t) - y_0| \le \varepsilon.$$
Finally, the above estimate for $\|L(y) - L(z)\|$ shows that if $\delta$ is decreased sufficiently further, then the latter constant $D$ becomes smaller than one, as required for a contraction. At the same time, $L$ maps the above space of functions into itself.
However, for the assumptions of the Banach contraction theorem, which guarantees the uniquely determined fixed point, completeness of the space X of functions on which the operator L works is needed.
Since the mapping f(t, y) is continuous, there follows a uniform bound for all of the functions y(t) considered above and the values t > s in their domain:
$$|L(y)(t) - L(y)(s)| \le \int_{s}^{t} |f(\tau,y(\tau))|\,d\tau \le A\,|t - s|$$
with a universal constant $A > 0$. Besides the conditions mentioned above, there is a restriction to the subset of all equicontinuous functions in the sense of the Definition 7.3.15. According to the Arzelà-Ascoli Theorem proved in the same paragraph at the page ??, this set of continuous functions is already compact, hence it is a complete set of continuous functions on the interval.
on a neighborhood of the point $[0,1]$ where $y > 0$. □
8.J.4. Find the solution of the differential equation
$$y' = \frac{y^2+1}{x+1}$$
which satisfies $y(0) = 1$.
Solution. Similarly to the previous example, we get
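$$\frac{dy}{y^2+1} = \frac{dx}{x+1},$$
$$\operatorname{arctan} y = \ln|x+1| + C, \quad C \in \mathbb{R}.$$
The initial condition (i.e., the substitution $x = 0$ and $y = 1$) gives
$$\operatorname{arctan} 1 = \ln|1| + C, \quad\text{i.e.,}\quad C = \frac{\pi}{4}.$$
Therefore, the solution of the given initial problem is the function
$$y(x) = \operatorname{tg}\left(\ln|x+1| + \frac{\pi}{4}\right)$$
on a neighborhood of the point $[0,1]$. □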
8.J.5. Solve
$$y' = \frac{x+y+1}{2x+2y-1}. \tag{1}$$
Solution. Let a function $f : (a,b)\times(c,d) \to \mathbb{R}$ have continuous second-order partial derivatives and $f(x,y) \ne 0$, $x \in (a,b)$, $y \in (c,d)$. Then, the differential equation $y' = f(x,y)$ can be transformed to an equation with separated variables if and only if
$$\begin{vmatrix} f(x,y) & f'_y(x,y) \\ f'_x(x,y) & f''_{xy}(x,y) \end{vmatrix} = 0, \quad x \in (a,b),\ y \in (c,d).$$
With a bit of effort, it can be shown that a differential equation of the form $y' = f(ax+by+c)$ can be transformed to an equation with separated variables, and this can be done by the substitution $z = ax+by+c$. Let us emphasize that the variable $z$ replaces $y$.
We thus set $z = x + y$, which gives $z' = 1 + y'$. Substitution into (1) yields
$$z' - 1 = \frac{z+1}{2z-1},$$
$$\frac{dz}{dx} = \frac{z+1}{2z-1} + 1,$$
$$\frac{dz}{dx} = \frac{3z}{2z-1},$$
$$\frac{2z-1}{3z}\,dz = 1\,dx,$$
$$\frac{2}{3}z - \frac{1}{3}\ln|z| = x + C, \quad C \in \mathbb{R},$$
$$\frac{2}{3}z - \frac{1}{3}\ln|Cz| = x, \quad C \ne 0.$$
Therefore, there exists a unique fixed point $y(t)$ of this contraction $L$ by the Theorem 7.3.9. This is the solution of the equation.
It remains to show the existence of a maximal interval $I = (t_0 - a, t_0 + b)$. Suppose that a solution $y(t)$ is found on an interval $(t_0, t_1)$, and, at the same time, the one-sided limit $y_1 = \lim_{t\to t_1^-} y(t)$ exists and is finite.
It follows from the already proven result that there exists a solution with this initial condition $(t_1, y_1)$, in some neighbourhood of the point $t_1$. Clearly, it must coincide with the discussed solution $y(t)$ on the left-hand side of $t_1$. Therefore, the solution $y(t)$ can be extended on the right-hand side of $t_1$.
There are only two possibilities when the extension of the solution beyond $t_1$ does not exist: either there is no finite left limit of $y(t)$ at $t_1$, or the limit $y_1$ exists, yet the point $(t_1, y_1)$ is on the boundary of the domain of the function $f$. In both cases, the maximal extension of the solution to the right of $t_0$ is found.
The argumentation for the maximal solution left of $t_0$ is analogous. □
8.3.7. Iterative approximations of solutions. The proof of the previous theorem can be reformulated as an iterative procedure which provides approximate solutions using step-by-step integration. Moreover, an explicit estimate for the constant $C$ from the proof yields bounds for the errors.
Think this out as an exercise (see the proof of Banach fixed-point theorem in paragraph 7.3.9). It can then be shown easily and directly that it is a uniformly convergent sequence of continuous functions, so the limit is again a continuous function (without invoking the complicated theorems from the seventh chapter).
Picard's approximations
Theorem. The unique solution of the equation
$$y' = f(t,y)$$
whose right-hand side $f$ has continuous partial derivatives can be expressed, on a sufficiently small interval, as the limit of step-by-step iterations beginning with the constant function (Picard's approximation):
$$y_0(t) = y_0, \quad y_{n+1}(t) = L(y_n)(t), \quad n = 0, 1, \dots$$
It is a uniformly converging sequence of differentiable functions with differentiable limit $y(t)$.
Only the Lipschitz condition is needed for the function $f$, so the latter two theorems are true with this weaker assumption as well. It is seen in the next paragraph that continuity of the function $f$ guarantees the existence of the solution. Yet it is insufficient for the uniqueness.
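To see the iteration at work, here is a minimal Python sketch for the illustrative equation $y' = y$, $y(0) = 1$ (this choice of equation and the helper names `integrate_from_zero`, `picard_step` and `picard_iterates` are assumptions made for the example). Polynomials are stored as coefficient lists with exact rational arithmetic, so each step $y_{n+1} = L(y_n)$ is computed without rounding.

```python
from fractions import Fraction


def integrate_from_zero(coeffs):
    """Antiderivative vanishing at t = 0 of the polynomial c0 + c1*t + c2*t^2 + ..."""
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(coeffs)]


def picard_step(coeffs, y0):
    """One application of L for f(t, y) = y and t0 = 0: L(y)(t) = y0 + int_0^t y(s) ds."""
    result = integrate_from_zero(coeffs)
    result[0] += y0
    return result


def picard_iterates(y0, n):
    """Return the Picard approximations y_0, y_1, ..., y_n as coefficient lists."""
    current = [Fraction(y0)]          # the constant function y_0(t) = y0
    iterates = [current]
    for _ in range(n):
        current = picard_step(current, Fraction(y0))
        iterates.append(current)
    return iterates


if __name__ == "__main__":
    for index, poly in enumerate(picard_iterates(1, 4)):
        print(index, [str(c) for c in poly])
```

For this equation the $n$-th iterate is the Taylor polynomial $\sum_{k\le n} t^k/k!$ of the exact solution $e^t$, which illustrates the uniform convergence on compact intervals.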
8.3.8. Ambiguity of solutions. We begin with a simple example. Consider the equation
$$y' = \sqrt{|y|}.$$
Now, we must get back to the original variable $y$ in one of these forms. The general solution can be written as
$$\frac{2}{3}x + \frac{2}{3}y - \frac{1}{3}\ln|x+y| = x + C, \quad C \in \mathbb{R},$$
i.e.,
$$x - 2y + \ln|x+y| = C, \quad C \in \mathbb{R}.$$
At the same time, we have the singular solution $y = -x$, which follows from the constraint $z \ne 0$ of the operations we have made (we have divided by the value $3z$). □
8.J.6.   Solve the differential equation
xy' + y In x = y In y.
Solution. Using the substitution $u = y/x$, every homogeneous differential equation $y' = f(y/x)$ can be transformed to an equation (with separated variables)
$$u' = \frac{1}{x}\bigl(f(u) - u\bigr), \quad\text{i.e.}\quad u'x + u = f(u).$$
The name of this differential equation comes from the following definition. A function $f$ of two variables is called homogeneous of degree $k$ iff $f(tx,ty) = t^k f(x,y)$. Then, a differential equation of the form
P(x, y) dx + Q(x, y) dy = 0
is a homogeneous differential equation iff the functions P and Q are homogeneous of the same degree k.
For instance, we can discover that the given equation
$$x\,dy + (y\ln x - y\ln y)\,dx = 0$$
is homogeneous. Of course, it is not difficult to write it explicitly in the form
$$y' = \frac{y}{x}\ln\frac{y}{x}.$$
The substitution $u = y/x$ then leads to
$$u'x + u = u\ln u,$$
$$\frac{du}{dx}\,x = u\,(\ln u - 1),$$
Separating the variables, the solution is
$$y(t) = \frac{1}{4}(t + C)^2$$
for positive values $y$, with an arbitrary constant $C$ and $t + C > 0$. For the initial values $(t_0,y_0)$ with $y_0 \ne 0$, this is an assignment matching the previous theorem, so there is locally exactly one solution. The solution must apparently remain non-decreasing, hence for negative values $y_0$, the solution is the same, only with the opposite sign and $t + C < 0$.
However, for the initial condition $(t_0,y_0) = (t_0, 0)$, there is not only the already discussed solution continuing to the left of $t_0$ and to the right, but also the identically zero solution $y(t) = 0$. Therefore, these two branches can be glued arbitrarily (see the diagram, where the thick solution can be continued along the $t$ axis and branch along the parabola at any value $t$).
[Figure: direction field of the equation $y' = \sqrt{|y|}$; the thick solution runs along the $t$ axis and can branch off along a parabola at any value $t$.]
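The gluing of branches can be made concrete with a short Python sketch. It assumes the equation $y' = \sqrt{|y|}$ with solutions $y = \frac14(t+C)^2$ as described above; the helper names `glued_solution` and `max_residual` are hypothetical. For every branching time $c \ge 0$, the function that is zero up to $t = c$ and follows the parabola $\frac14(t-c)^2$ afterwards satisfies $y(0) = 0$, and a finite-difference check confirms that it solves the equation.

```python
import math


def glued_solution(branch_time):
    """Zero up to branch_time, then the parabola (t - branch_time)^2 / 4."""
    def y(t):
        return 0.0 if t <= branch_time else 0.25 * (t - branch_time) ** 2
    return y


def max_residual(y, ts, h=1e-6):
    """Largest value of |y'(t) - sqrt(|y(t)|)|, with y' taken as a central difference."""
    return max(
        abs((y(t + h) - y(t - h)) / (2 * h) - math.sqrt(abs(y(t))))
        for t in ts
    )


if __name__ == "__main__":
    sample_times = [0.1 * i for i in range(-20, 41)]
    for c in (0.0, 1.0, 2.0):
        y = glued_solution(c)
        print(c, y(0.0), y(3.0), max_residual(y, sample_times))
```

All three functions start from the same initial condition $y(0) = 0$ but take different values at $t = 3$, so the initial value problem has more than one solution.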
$$\frac{du}{u\,(\ln u - 1)} = \frac{dx}{x},$$
Nevertheless, the existence of a solution is guaranteed by the following theorem, known as the Peano existence theorem:
Theorem. Consider a function $f(t,y) : \mathbb{R}^2 \to \mathbb{R}$ which is continuous on an open set $U$. Then for every point $(t_0,y_0) \in U \subset \mathbb{R}^2$, there exists a solution of the equation
$$y' = f(t,y)$$
locally in some neighbourhood of $t_0$.
Proof. The proof is presented only roughly, with the details left to the reader.
We construct a solution to the right of the initial point $t_0$. For this purpose, select a small step $h > 0$ and label the points
$$t_k = t_0 + kh, \quad k = 1, 2, \dots$$
The value of the derivative $f(t_0,y_0)$ of the corresponding curve of the solution $(t,y(t))$ is defined at the initial point $(t_0,y_0)$, so a parametrized line with the same derivative can be substituted:
$$y^{(0)}(t) = y_0 + f(t_0,y_0)(t - t_0).$$
Label $y_1 = y^{(0)}(t_1)$. Construct inductively the functions and points
$$y^{(k)}(t) = y_k + f(t_k,y_k)(t - t_k), \quad y_{k+1} = y^{(k)}(t_{k+1}).$$
Now, define $y_h(t)$ by gluing the particular linear parts, i.e.,
$$y_h(t) = y^{(k)}(t) \quad\text{if } t \in [t_k, t_{k+1}].$$
where $u\,(\ln u - 1) \ne 0$. Using another substitution, namely $t = \ln u - 1$, we can integrate
$$\int\frac{du}{u\,(\ln u - 1)} = \int\frac{dx}{x},$$
$$\int\frac{dt}{t} = \int\frac{dx}{x},$$
$$\ln|t| = \ln|x| + \ln|C|, \quad C \ne 0,$$
$$\ln|\ln u - 1| = \ln|Cx|, \quad C \ne 0,$$
$$\ln u - 1 = Cx, \quad C \ne 0,$$
$$\ln\frac{y}{x} = Cx + 1, \quad C \ne 0,$$
$$y = x\,e^{Cx+1}, \quad C \ne 0.$$
The excluded cases $u = 0$ and $\ln u = 1$ do not lead to two more solutions since $u = 0$ implies $y = 0$, which cannot be put into the original equation. On the other hand, $\ln u = 1$ gives $y/x = e$, and the function $y = e\,x$ is clearly a solution. Therefore, the general solution is
$$y = x\,e^{Cx+1}, \quad C \in \mathbb{R}.$$
□
8.J.7. Compute
$$y' = -\frac{4x+3y+1}{3x+2y+1}.$$
Solution. In general, we are able to solve every equation of the form
$$y' = f\left(\frac{ax+by+c}{Ax+By+C}\right). \tag{1}$$
If the system of linear equations
$$ax + by + c = 0, \quad Ax + By + C = 0 \tag{2}$$
has a unique solution $x_0$, $y_0$, then the substitution $u = x - x_0$, $v = y - y_0$ transforms the equation (1) to a homogeneous equation
$$\frac{dv}{du} = f\left(\frac{au+bv}{Au+Bv}\right).$$
If the system (2) has no solution or has infinitely many solutions, the substitution $z = ax + by$ transforms the equation (1) to an equation with separated variables (often, the original equation is already such).
In this problem, the corresponding system of equations
$$4x + 3y + 1 = 0, \quad 3x + 2y + 1 = 0$$
has a unique solution $x_0 = -1$, $y_0 = 1$. The substitution $u = x + 1$, $v = y - 1$ then leads to the homogeneous equation
This is a continuous function, called the Euler's approximation of the solution.
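The construction of $y_h$ translates directly into code. The following Python sketch is an illustration under stated assumptions: the helper names `euler_polygon`, `evaluate_polygon` and `error_at_one` are hypothetical, and the test equation $y' = y$, $y(0) = 1$ is chosen only because its exact solution $e^t$ is known.

```python
import math


def euler_polygon(f, t0, y0, h, steps):
    """Return the nodes t_k and values y_k of Euler's approximation y_h."""
    ts, ys = [t0], [y0]
    for k in range(steps):
        # y_{k+1} = y^{(k)}(t_{k+1}) = y_k + f(t_k, y_k) * h
        ys.append(ys[k] + f(ts[k], ys[k]) * h)
        ts.append(t0 + (k + 1) * h)
    return ts, ys


def evaluate_polygon(ts, ys, t):
    """Evaluate the glued piecewise linear function y_h at the point t."""
    for k in range(len(ts) - 1):
        if ts[k] <= t <= ts[k + 1]:
            slope = (ys[k + 1] - ys[k]) / (ts[k + 1] - ts[k])
            return ys[k] + slope * (t - ts[k])
    raise ValueError("t lies outside the constructed polygon")


def error_at_one(steps):
    """Error of y_h(1) against e for y' = y, y(0) = 1 and h = 1/steps."""
    ts, ys = euler_polygon(lambda t, y: y, 0.0, 1.0, 1.0 / steps, steps)
    return abs(ys[-1] - math.e)


if __name__ == "__main__":
    for steps in (10, 100, 1000):
        print(steps, error_at_one(steps))
```

The printed errors shrink roughly in proportion to $h$, in line with the claim that the functions $y_h$ approach a solution as $h \to 0$.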
It "only" remains to prove that the limit of the functions $y_h$ for $h$ approaching zero exists and is a solution. For this, one must observe (as done already in the proof of the theorem on uniqueness and existence of the solution) that $f(t,y)$ is uniformly continuous on a sufficiently small neighbourhood $U$ where the solution is sought. For any selected $\varepsilon > 0$, there exists a sufficiently small $\delta > 0$ such that $|f(t,y) - f(s,z)| < \varepsilon$ whenever $\|(t-s, y-z)\| < \delta$.
In particular, all functions $y_h$ belong to an equicontinuous set of functions on a sufficiently small interval (their slopes are bounded by the bound on $|f|$). By the Arzelà-Ascoli theorem (see paragraph 7.3.15 on page 500), the constructed continuous functions $y_h$ are all in a compact set of functions. So there exists a sequence of values $h_n \to 0$ such that the corresponding sequence of functions $y_{h_n}$ converges uniformly to a continuous function $y(t)$. Write $y_n(t) = y_{h_n}(t)$, i.e. $y_n \to y$ uniformly.
For each of the continuous functions $y_n$, there are only finitely many points in the interval $[t_0, t]$ where it is not differentiable, so
$$y_n(t) = y_0 + \int_{t_0}^{t} y_n'(s)\,ds.$$
On the other hand, the derivatives on the particular intervals are constant, so (here, $k$ is the largest such that $t_0 + kh_n \le t$, while $y_j$ and $t_j$ are the points from the definition of the function $y_{h_n}$)
$$y_n(t) = y_0 + \sum_{j=0}^{k-1}\int_{t_j}^{t_{j+1}} f(t_j,y_j)\,ds + \int_{t_k}^{t} f(t_k,y_k)\,ds.$$
Instead, the equation
$$y_n(t) = y_0 + \int_{t_0}^{t} f(s,y_n(s))\,ds$$
is wanted, but the difference between this integral and the last two terms in the previous expression is bounded by the possible variation of the function values $f(t,y)$ and the lengths of the intervals. By the universal bound for $f(t,y)$ above, the last integral can be used instead of the actual values in the limit process $\lim_{n\to\infty} y_n(t)$, thereby obtaining
$$\begin{aligned}
y(t) &= \lim_{n\to\infty}\Bigl(y_0 + \int_{t_0}^{t} f(s,y_n(s))\,ds\Bigr) \\
&= y_0 + \int_{t_0}^{t}\Bigl(\lim_{n\to\infty} f(s,y_n(s))\Bigr)\,ds \\
&= y_0 + \int_{t_0}^{t} f(s,y(s))\,ds,
\end{aligned}$$
where the uniform convergence $y_n(t) \to y(t)$ is employed. This proves the theorem. □
$$\frac{dv}{du} = -\frac{4u+3v}{3u+2v},$$
which can be solved by further substitution $z = v/u$. We thus obtain
$$z'u + z = -\frac{4+3z}{3+2z},$$
$$\frac{dz}{du}\,u = -\frac{2z^2+6z+4}{2z+3},$$
$$\frac{3+2z}{2z^2+6z+4}\,dz = -\frac{du}{u},$$
provided $z^2 + 3z + 2 \ne 0$. Integrating, we get
$$\frac{1}{2}\ln|z^2+3z+2| = -\ln|u| + \ln|C|, \quad C \ne 0,$$
$$\frac{1}{2}\ln\left|(z^2+3z+2)\,u^2\right| = \ln|C|, \quad C \ne 0,$$
$$\ln\left|(z^2+3z+2)\,u^2\right| = \ln C^2, \quad C \ne 0,$$
$$(z^2+3z+2)\,u^2 = \pm C^2, \quad C \ne 0.$$
We thus have
$$(z^2+3z+2)\,u^2 = D, \quad D \ne 0,$$
and returning to the original variables,
$$\left(\frac{v^2}{u^2} + 3\frac{v}{u} + 2\right)u^2 = D, \quad D \ne 0,$$
$$v^2 + 3vu + 2u^2 = D, \quad D \ne 0,$$
$$(y-1)^2 + 3(y-1)(x+1) + 2(x+1)^2 = D, \quad D \ne 0.$$
Making simple rearrangements, the general solution can be expressed as
$$(x+y)(2x+y+1) = D, \quad D \ne 0.$$
Now, let us return to the condition $z^2 + 3z + 2 \ne 0$. It follows from $z^2 + 3z + 2 = 0$ that $z = -1$ or $z = -2$, i.e., $v = -u$ or $v = -2u$. For $v = -u$, we have $x = u - 1$ and $y = v + 1 = -u + 1$, which means that $y = -x$. Similarly, for $v = -2u$, we have $y = -2u + 1$, hence $y = -2x - 1$. However, both functions $y = -x$, $y = -2x - 1$ satisfy the original differential equation and are included in the general solution for the choice $D = 0$. Therefore, every solution is known from the implicit form
$$(x+y)(2x+y+1) = D, \quad D \in \mathbb{R}.$$
□
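The implicit solution can be cross-checked numerically. The Python sketch below is an illustration only: it assumes the equation in the form $y' = -\frac{4x+3y+1}{3x+2y+1}$, integrates it by the classical Runge-Kutta method from the hypothetical starting point $(x, y) = (0, 1)$, and verifies that the expression $(x+y)(2x+y+1)$ keeps the value $D = 2$ along the computed solution. The names `rhs`, `rk4` and `invariant` are made up for the example.

```python
def rhs(x, y):
    """Right-hand side of y' = -(4x + 3y + 1) / (3x + 2y + 1)."""
    return -(4 * x + 3 * y + 1) / (3 * x + 2 * y + 1)


def rk4(f, x0, y0, x_end, steps):
    """Classical Runge-Kutta integration of y' = f(x, y) from x0 to x_end."""
    h = (x_end - x0) / steps
    x, y = x0, y0
    for _ in range(steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h * k1 / 2)
        k3 = f(x + h / 2, y + h * k2 / 2)
        k4 = f(x + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        x += h
    return y


def invariant(x, y):
    """The expression that is constant along solutions: (x + y)(2x + y + 1)."""
    return (x + y) * (2 * x + y + 1)


if __name__ == "__main__":
    y_end = rk4(rhs, 0.0, 1.0, 0.5, 500)
    print(invariant(0.0, 1.0), invariant(0.5, y_end))
```

Both printed numbers should agree up to the integration error, matching the implicit form of the solution with $D = 2$.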
8.J.8. Find the general solution of the differential equation
$$(x^2 + y^2)\,dx - 2xy\,dy = 0.$$
Solution. For $y \ne 0$, simple rearrangements lead to
$$y' = \frac{x^2+y^2}{2xy} = \frac{1 + \left(\frac{y}{x}\right)^2}{2\,\frac{y}{x}}.$$
Using the substitution u = y/x, we get to the equation
8.3.9. Coupled first-order equations. The problem of finding the solution of the equation $y' = f(x,y)$ can also be viewed as looking for a (parametrized) curve $(x(t),y(t))$ in the plane where the parametrization of the variable $x(t) = t$ is fixed beforehand. If this point of view is accepted, then this fixed choice for the variable $x$ can be forgotten, and the work can be carried out with an arbitrary (finite) number of variables.
In the plane, for instance, such a system can be written in the form
$$x' = f(t,x,y), \quad y' = g(t,x,y)$$
with two functions $f, g : \mathbb{R}^3 \to \mathbb{R}$.
A simple example in the plane might be the system of equations
x' = -y,   y' = x.
It is easily guessed (or at least verified) that there is a solution of this system,
$$x(t) = R\cos t, \quad y(t) = R\sin t,$$
with an arbitrary non-negative constant R, and the curves of the solution are exactly the parametrized circles with radius R.
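A quick numerical confirmation is sketched below in Python. The helper names `rk4_system` and `rotation_field`, the radius $R = 2$ and the step count are assumptions made for the illustration; the integrator is the classical Runge-Kutta method applied componentwise.

```python
import math


def rk4_system(field, t0, state, t_end, steps):
    """Classical Runge-Kutta integration of the vector equation x' = field(t, x)."""
    h = (t_end - t0) / steps
    t, x = t0, list(state)
    for _ in range(steps):
        k1 = field(t, x)
        k2 = field(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k1)])
        k3 = field(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k2)])
        k4 = field(t + h, [xi + h * ki for xi, ki in zip(x, k3)])
        x = [
            xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
        ]
        t += h
    return x


def rotation_field(t, state):
    """Right-hand side of the system x' = -y, y' = x."""
    x, y = state
    return [-y, x]


if __name__ == "__main__":
    radius = 2.0
    x, y = rk4_system(rotation_field, 0.0, [radius, 0.0], 1.0, 1000)
    print(x, radius * math.cos(1.0))
    print(y, radius * math.sin(1.0))
    print(math.hypot(x, y))  # stays close to the radius
```

The computed point stays on the circle of radius $R$ and agrees with $(R\cos t, R\sin t)$ up to the integration error.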
In the general case, the vector notation of the system can be used in the form
x' = f(t,x)
for a vector function $x : \mathbb{R} \to \mathbb{R}^n$ and a mapping $f : \mathbb{R}^{n+1} \to \mathbb{R}^n$. The validity of the theorem on uniqueness and existence of the solution to such systems can be extended:
Existence and uniqueness for systems of ODEs
$$u'x + u = \frac{1+u^2}{2u}.$$
Theorem. Consider functions $f_i(t,x_1,\dots,x_n) : \mathbb{R}^{n+1} \to \mathbb{R}$, $i = 1,\dots,n$, with continuous partial derivatives. Then, for every point $(t_0, y) \in \mathbb{R}^{n+1}$, $y = (c_1,\dots,c_n)$, there exists a maximal interval $[t_0 - a, t_0 + b]$, with positive numbers $a, b$, and a unique function $x(t)$ on this interval with values in $\mathbb{R}^n$ which is the solution of the system of equations
$$\begin{aligned}
x_1' &= f_1(t,x_1,\dots,x_n), \\
&\ \,\vdots \\
x_n' &= f_n(t,x_1,\dots,x_n)
\end{aligned}$$
with the initial condition $x(t_0) = y$, i.e.
$$x_1(t_0) = c_1, \dots, x_n(t_0) = c_n.$$
Proof. The proof is almost identical to the one of the existence and uniqueness of the solution for a single equation with a single unknown function as shown in Theorem 8.3.6. The unknown function $x(t) = (x_1(t),\dots,x_n(t))$ is a curve in $\mathbb{R}^n$ satisfying the given equation, so its components $x_i(t)$ are again expressed in terms of integrals
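For $u \ne \pm 1$ and $D = -1/C$, we have
$$\frac{du}{dx}\,x = \frac{1 + u^2 - 2u^2}{2u},$$
$$\frac{2u}{1-u^2}\,du = \frac{dx}{x},$$
$$-\ln|1-u^2| = \ln|x| + \ln|C|, \quad C \ne 0,$$
$$\ln\frac{1}{|1-u^2|} = \ln|Cx|, \quad C \ne 0,$$
$$\frac{1}{1-u^2} = Cx, \quad C \ne 0,$$
$$1 = Cx\left(1 - \frac{y^2}{x^2}\right), \quad C \ne 0,$$
$$-\frac{D}{x} = 1 - \frac{y^2}{x^2}, \quad D \ne 0,$$
$$-Dx = x^2 - y^2, \quad D \ne 0.$$
The condition $u = \pm 1$ corresponds to $y = \pm x$. While $y = 0$ is not a solution, both the functions $y = x$ and $y = -x$ are solutions and can be obtained by the choice $D = 0$. The general solution is thus
$$y^2 = x^2 + Dx, \quad D \in \mathbb{R}.$$ □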
8.J.9. Solve
$$y' = x - \frac{2y}{x^2-1}.$$
Solution. The given equation is of the form $y' = a(x)y + b(x)$, i.e., a non-homogeneous linear differential equation (the function $b$ is not identically equal to zero). The general solution of such an equation can be obtained using the method of integration factor (the non-homogeneous equation is multiplied by the expression $e^{-\int a(x)\,dx}$) or the method of variation of constants (the integration constant that arises in the solution of the corresponding homogeneous equation is considered to be a function in the variable $x$). We will illustrate both of these methods on this problem.
As for the former method, we multiply the original equation by the expression
$$e^{\int \frac{2}{x^2-1}\,dx} = e^{\ln\left|\frac{x-1}{x+1}\right|} = \frac{x-1}{x+1},$$
where the corresponding integral is understood to stand for any antiderivative and where any non-zero multiple of the obtained function can be considered (that is why we could remove the absolute value). Thus, consider the equation
$$y'\,\frac{x-1}{x+1} + \frac{2y}{(x+1)^2} = \frac{x(x-1)}{x+1}.$$
The core of the method of integration factor is the fact that the expression on the left-hand side is the derivative of $y\,\frac{x-1}{x+1}$. Integrating this leads to
$$x_i(t) = x_i(t_0) + \int_{t_0}^{t} x_i'(s)\,ds = c_i + \int_{t_0}^{t} f_i(s,x(s))\,ds.$$
We work with the integral operator $y \mapsto L(y)$, this time mapping curves in $\mathbb{R}^n$ to curves in $\mathbb{R}^n$. It is desired to find its fixed point. The proof proceeds in much the same way as in the case 8.3.6. It is only necessary to observe that the size of the vector
$$\|f(t,z_1,\dots,z_n) - f(t,y_1,\dots,y_n)\|$$
is bounded from above by the sum
$$\|f(t,z_1,\dots,z_n) - f(t,y_1,z_2,\dots,z_n)\| + \dots + \|f(t,y_1,\dots,y_{n-1},z_n) - f(t,y_1,\dots,y_n)\|.$$
It is recommended to go through the proof of Theorem 8.3.6 from this point of view and to think out the details. □
8.3.10. Example. When dealing with models in practice, it is of interest to consider the qualitative behaviour of the solution in dependence on the initial conditions and free parameters of the system.
We consider a simple example of a system of first-order equations from this point of view. The standard population model "predator - prey" was introduced in the 1920s by Lotka and Volterra.
Let $x(t)$ denote the evolution of the number of individuals in the prey population and $y(t)$ that of the predators. Assume that the increment of the prey would correspond to the Malthusian model (i.e. exponential growth with coefficient $\alpha$) if they were not hunted. On the other hand, assume that the predator would only naturally die out if there were no prey (i.e. exponential decrease with coefficient $\gamma$). Further, consider an interaction of the predator and the prey which is expected to be proportional to the number of both with a certain coefficient $\beta$, which is, in the case of the predator, supplemented by a multiplicative coefficient $\delta$ expressing the hunting efficiency. This is a system of two equations:
Lotka-Volterra model
$$x' = \alpha x - \beta yx, \quad y' = -\gamma y + \delta\beta xy.$$
The diagram illustrates one of the typical behaviours of such dynamical systems: the existence of closed orbits on which the system moves in time. These are the thick black ovals, while the "comets" indicate the field at the individual points (i.e. their expected movement). The left illustration corresponds to $\alpha = 1$, $\beta = 1$, $\gamma = 0.3$, $\delta = 0.3$ and the initial condition $(x_0,y_0) = (1, 0.5)$ at $t_0 = 0$ for the solution, while the other illustration comes with $\alpha = 1$, $\beta = 2$, $\gamma = 2$, $\delta = 1$ and $(x_0,y_0) = (1, 1.5)$.
$$y\,\frac{x-1}{x+1} = \int\frac{x(x-1)}{x+1}\,dx = \frac{x^2}{2} - 2x + 2\ln|x+1| + C, \quad C \in \mathbb{R}.$$
Therefore, the solutions are the functions
$$y = \frac{x+1}{x-1}\left(\frac{x^2}{2} - 2x + 2\ln|x+1| + C\right), \quad C \in \mathbb{R}.$$
As for the latter method, we first solve the corresponding homogeneous equation
$$y' = -\frac{2y}{x^2-1},$$
which is an equation with separated variables. We have
$$\frac{dy}{dx} = -\frac{2y}{x^2-1},$$
$$\frac{dy}{y} = -\frac{2}{x^2-1}\,dx,$$
$$\ln|y| = -\ln|x-1| + \ln|x+1| + \ln|C|, \quad C \ne 0,$$
$$\ln|y| = \ln\left|C\,\frac{x+1}{x-1}\right|, \quad C \ne 0,$$
$$y = C\,\frac{x+1}{x-1}, \quad C \ne 0,$$
where we had to exclude the case $y = 0$. However, the function $y = 0$ is always a solution of a homogeneous linear differential equation, and it can be included in the general solution. Therefore, the general solution of the corresponding homogeneous equation is
$$y = C\,\frac{x+1}{x-1}, \quad C \in \mathbb{R}.$$
Now, we will consider the constant $C$ to be a function $C(x)$. Differentiating leads to
$$y' = \frac{C'(x)\,(x+1)(x-1) + C(x)\,(x-1) - C(x)\,(x+1)}{(x-1)^2}.$$
Substituting this into the original equation, we get
$$\frac{C'(x)\,(x+1)(x-1) + C(x)\,(x-1) - C(x)\,(x+1)}{(x-1)^2} = x - \frac{2C(x)\,(x+1)}{(x-1)(x^2-1)}.$$
It follows that
$$C'(x) = \frac{x(x-1)}{x+1},$$
$$C(x) = \int\frac{x(x-1)}{x+1}\,dx,$$
$$C(x) = \frac{x^2}{2} - 2x + 2\ln|x+1| + C, \quad C \in \mathbb{R}.$$
Now, it suffices to substitute:
$$y = C(x)\,\frac{x+1}{x-1} = \frac{x+1}{x-1}\left(\frac{x^2}{2} - 2x + 2\ln|x+1| + C\right), \quad C \in \mathbb{R}.$$
We can see that the result we have obtained here is of the same form as in the former case. This should not be surprising as the differences between the two methods are insignificant and the computed integrals are the same.
In both cases, the system is quite stable in the vicinity of the initial condition, and it would be very stable for $(x_0,y_0) = (1,1)$ or $(1, 0.5)$, respectively. But their development differs in speed: the depicted solution cycles close at times about $t = 12$ in the first case and $t = 5$ in the other one.
It is interesting that the same model captures quite well the development of the unemployment rate in population, considering the employees to be the predators, while the employers play the role of the prey.
Much information about this and other models can be found in the literature.
8.3.11. Stability of systems of equations. In order to illustrate the stability questions, we discuss just one basic theorem.
The assumption that the partial derivatives of the functions defining the system are continuous (in fact, it suffices to have them Lipschitz) guarantees the continuity of the solutions in dependence on the initial conditions as well as on the defining equations themselves. Note however, that as the distance of $t$ from the initial value $t_0$ grows, the estimates grow exponentially!
Therefore, this result is of a strictly local character. It is not in contradiction with the example of the unstably behaving equation $y' = ty$ illustrated in paragraph 8.3.3.
Consider two systems of equations written in the vector form
(1) x' = f(t,x),   y' = g(t,y)
and assume that the mappings $f, g : U \subset \mathbb{R}^{n+1} \to \mathbb{R}^n$ have continuous partial derivatives on an open set $U$ with compact closure. Such functions must be uniformly continuous and uniformly Lipschitz on $U$, so there are the finite values
$$C = \sup_{x \ne y;\ (t,x),\,(t,y) \in U} \frac{|f(t,x) - f(t,y)|}{|x - y|}, \qquad B = \sup_{(t,x) \in U} |f(t,x) - g(t,x)|.$$
With this notation, the fundamental theorem can be formulated:
Theorem. Let $x(t)$ and $y(t)$ be two fixed solutions
$$x'(t) = f(t, x(t)), \qquad y'(t) = g(t, y(t))$$
of the systems (1) considered above, given by the initial conditions $x(t_0) = x_0$ and $y(t_0) = y_0$. Then
$$|x(t) - y(t)| \le |x_0 - y_0|\, e^{C|t - t_0|} + \frac{B}{C}\left(e^{C|t - t_0|} - 1\right).$$
Much more information can be found for example in Gerald Teschl's book Ordinary Differential Equations and Dynamical Systems, Graduate Studies in Mathematics, Volume 140, Amer. Math. Soc., Providence, 2012.
Proof. Without loss of generality, $t_0 = 0$. From the expression of the solutions $x(t)$ and $y(t)$ as fixed points of the corresponding integral operators follows the estimate
$$|x(t) - y(t)| \le |x_0 - y_0| + \int_0^t |f(s, x(s)) - g(s, y(s))|\,ds.$$
The integrand can be further estimated as follows:
$$|f(s,x(s)) - g(s,y(s))| \le |f(s,x(s)) - f(s,y(s))| + |f(s,y(s)) - g(s,y(s))| \le C\,|x(s) - y(s)| + B.$$
If $F(t) = |x(t) - y(t)|$ and $\alpha = |x_0 - y_0|$, then
$$F(t) \le \alpha + \int_0^t \big(C\,F(s) + B\big)\,ds.$$
Such an estimate can be exploited further by the following general result, known as Gronwall's inequality. Note the similarity with the general solution of linear equations.
Lemma. Assume a real-valued function $F(t)$ satisfies for all $t \in [0, t_{\max}]$
$$F(t) \le \alpha(t) + \int_0^t \beta(s) F(s)\,ds$$
for some real-valued functions $\alpha(t)$, $\beta(t)$, with $\beta(t) \ge 0$. Then
$$F(t) \le \alpha(t) + \int_0^t \alpha(s)\beta(s)\, e^{\int_s^t \beta(r)\,dr}\,ds$$
for all $t \in [0, t_{\max}]$. Moreover, if additionally $\alpha(t)$ is non-decreasing, then
$$F(t) \le \alpha(t)\, e^{\int_0^t \beta(s)\,ds}.$$
Proof of the lemma. Write
$$G(t) = e^{-\int_0^t \beta(r)\,dr}.$$
By the first assumption of the lemma,
$$\frac{d}{dt}\left(G(t)\int_0^t \beta(s)F(s)\,ds\right) = \beta(t)\,G(t)\left(F(t) - \int_0^t \beta(s)F(s)\,ds\right) \le \alpha(t)\beta(t)G(t).$$
Integrating with respect to $t$ and dividing by the non-zero function $G(t)$ gives
$$\int_0^t \beta(s)F(s)\,ds \le \int_0^t \alpha(s)\beta(s)\,\frac{G(s)}{G(t)}\,ds,$$
which, having added $\alpha(t)$ to both sides of the inequality, gives the first proposition of the lemma.

Finally, we can notice that the solution $y$ of an equation $y' = a(x)y$ can be found in the same way for any continuous function $a$. We thus always have
$$y = C\,e^{\int a(x)\,dx}, \qquad C \in \mathbb{R}.$$
Similarly, the solution of an equation $y' = a(x)y + b(x)$ with an initial condition $y(x_0) = y_0$ can be determined explicitly as (provided the coefficients, i.e. the functions $a$ and $b$, are continuous)
$$y = e^{\int_{x_0}^x a(t)\,dt}\left(y_0 + \int_{x_0}^x b(t)\, e^{-\int_{x_0}^t a(s)\,ds}\,dt\right).$$
Let us remark that the linear equation has no singular solution, and the general solution contains a constant $C \in \mathbb{R}$. □
8.J.10. Solve the linear equation
$$(y' + 2xy)\,e^{x^2} = \cos x.$$
Solution. If we used the method of integration factor, we would only rewrite the equation trivially since it is already of the desired form: the expression on the left-hand side is the derivative of $y\,e^{x^2}$. Thus, we can immediately calculate
$$y\,e^{x^2} = \int \cos x\,dx, \qquad y\,e^{x^2} = \sin x + C, \qquad y = e^{-x^2}(\sin x + C), \qquad C \in \mathbb{R}.$$
□
8.J.11. Find all non-zero solutions of the Bernoulli equation
$$y' - \frac{y}{x} = 3xy^2.$$
Solution. The Bernoulli equation
$$y' = a(x)y + b(x)y^r, \qquad r \ne 0,\ r \ne 1,\ r \in \mathbb{R},$$
can be solved by first dividing by the term $y^r$ and then using the substitution $u = y^{1-r}$, which leads to the linear differential equation
$$u' = (1 - r)\,[a(x)u + b(x)].$$
In this very problem, the substitution $u = y^{1-2} = 1/y$ gives
$$u' + \frac{u}{x} = -3x.$$
Similarly to the previous exercise, we have
$$u = e^{-\ln|x|}\left[\int -3x\,e^{\ln|x|}\,dx\right],$$
where $\ln|x|$ was obtained as an (arbitrary) antiderivative to $1/x$. Further,
$$u = \frac{1}{|x|}\left[\int -3x\,|x|\,dx\right].$$
The absolute value can be replaced with a sign that can be canceled, i.e., it suffices to consider
$$u = \frac{1}{x}\left[\int -3x^2\,dx\right] = \frac{1}{x}\left[-x^3 + C\right], \qquad C \in \mathbb{R}.$$
Returning to the original variable, we get
$$y = \frac{1}{u} = \frac{x}{C - x^3}, \qquad C \in \mathbb{R}.$$
The excluded case $y = 0$ is a singular solution (which, of course, is true for every Bernoulli equation with $r$ positive).
□
8.J.12. Interchanging the variables, solve the equation
$$y\,dx - (x + y^2 \sin y)\,dy = 0.$$
Solution. When the variable $x$ occurs only in the first power in the differential equation and $y$ occurs in the arguments of elementary functions, we can apply the so-called method of variable interchange, when we look for the solution as for a function $x$ of the independent variable $y$. First, we write the equation explicitly:
$$y' = \frac{y}{x + y^2 \sin y}.$$
This equation is not of any of the previous types, so we rewrite it as follows:
$$\frac{dy}{dx} = \frac{y}{x + y^2 \sin y},$$
$$\frac{dx}{dy} = \left(\frac{y}{x + y^2 \sin y}\right)^{-1} = \frac{x}{y} + y \sin y,$$
$$x' = \frac{1}{y}\,x + y \sin y.$$
We have thus obtained a linear differential equation. Now, we can easily compute its general solution
$$x = -y \cos y + Cy, \qquad C \in \mathbb{R}.$$
□
Further problems concerning first-order differential equations can be found on page 595.

Assuming that $\alpha(t)$ is non-decreasing, there follows:
$$F(t) \le \alpha(t)\left(1 + \int_0^t \beta(s)\, e^{\int_s^t \beta(r)\,dr}\,ds\right).$$
The integrand is a derivative:
$$\beta(s)\, e^{\int_s^t \beta(r)\,dr} = -\frac{d}{ds}\, e^{\int_s^t \beta(r)\,dr},$$
so
$$F(t) \le \alpha(t)\left(1 - \int_0^t \frac{d}{ds}\, e^{\int_s^t \beta(r)\,dr}\,ds\right) = \alpha(t)\left(1 + e^{\int_0^t \beta(r)\,dr} - 1\right),$$
and the second proposition of the lemma is also proved. □
Now, the proof of the theorem about continuous dependency on the parameters is easily finished. The bound $F(t) \le \alpha + \int_0^t (C\,F(s) + B)\,ds$ is already obtained, and using a slightly modified function $\tilde F(t) = F(t) + \frac{B}{C}$, this yields
$$\tilde F(t) \le \frac{B}{C} + \alpha + \int_0^t C\,\tilde F(s)\,ds.$$
This is the assumption of Gronwall's inequality with even constant parameters, so by the second claim of the lemma,
$$\tilde F(t) \le \left(\alpha + \frac{B}{C}\right) e^{Ct},$$
which is the statement
$$F(t) \le \alpha\, e^{Ct} + \frac{B}{C}\left(e^{Ct} - 1\right)$$
as desired.
□
The continuous dependency on both the initial conditions and the potential further parameters in which the function $f$ would be Lipschitz-continuous follows immediately from the statement of the theorem.
The extremely simple equations in one variable $x' = ax$, where $a$ are small constants, with their exponential solutions $x(t) = e^{at}$ show that better general results cannot be expected.
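The estimate of the theorem can be illustrated numerically. The sketch below is not part of the original argument: it assumes the two scalar equations $x' = \sin x$ and $y' = \sin y + 0.1$ (chosen here only for illustration, so that $C = 1$ is a Lipschitz constant and $B = 0.1$ bounds the difference of the right-hand sides), integrates both with a classical Runge-Kutta scheme, and compares the gap $|x(t) - y(t)|$ with the bound $|x_0 - y_0|\,e^{Ct} + \frac{B}{C}(e^{Ct} - 1)$.

```python
import math

def rk4(rhs, y0, t_end, steps=1000):
    """Integrate the scalar equation y' = rhs(t, y) from t = 0 to t_end."""
    h = t_end / steps
    t, y = 0.0, y0
    for _ in range(steps):
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2, y + h / 2 * k1)
        k3 = rhs(t + h / 2, y + h / 2 * k2)
        k4 = rhs(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# Illustrative right-hand sides (an assumption, not taken from the text).
f = lambda t, x: math.sin(x)
g = lambda t, y: math.sin(y) + 0.1
C, B = 1.0, 0.1          # Lipschitz constant of f and sup |f - g|
x0, y0 = 1.0, 1.2        # two nearby initial conditions at t0 = 0

def gap_and_bound(t):
    """Return (|x(t) - y(t)|, right-hand side of the estimate)."""
    gap = abs(rk4(f, x0, t) - rk4(g, y0, t))
    bound = abs(x0 - y0) * math.exp(C * t) + B / C * (math.exp(C * t) - 1)
    return gap, bound

if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0, 4.0):
        print(t, *gap_and_bound(t))
```

In such experiments the bound is typically far from sharp for large $t$, in line with the remark that the estimate is of a local character.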
8.3.12. Differentiable dependence. In practical problems, the differentiability of the obtained solutions is often of interest, especially with regard to the initial conditions or other parameters of the system. In the general vector notation of the system of ordinary equations
$$y' = f(t, y),$$
it can always be supposed that the vector function $f$ does not depend explicitly on $t$. If it does, then another variable $y_0$ can be added to the other variables $y_1, \dots, y_n$. Then there is the same system of equations for the curve $\tilde y(t) =$
K. Practical problems leading to differential equations
8.K.1. A water purification plant with volume 2000 m³ was contaminated with lead which is spread in the water with density 10 g/m³. Water is flowing in and out of the basin at 2 m³/s. In what time does the amount of lead in the basin decrease below 10 μg/m³ (which is the hygienic norm for the amount of lead in drinkable water by a regulation of the European Community) provided the water keeps being mixed uniformly?
Solution. Let us denote the water's volume in the basin by $V$ (m³), the speed of the water's flow by $v$ (m³/s) and the mass of the lead in the basin by $m$. In an infinitesimal (infinitely small) time unit $dt$, $\frac{m}{V} \cdot v\,dt$ grams of lead runs out of the basin, so we can construct the differential equation
$$dm = -\frac{m}{V} \cdot v\,dt$$
for the change of the lead's mass in the basin. Separating the variables, we get the equation
$$\frac{dm}{m} = -\frac{v}{V}\,dt.$$
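Integrating both sides of the equation and getting rid of the logarithms, we get the solution in the form $m(t) = m_0\,e^{-\frac{v}{V}t}$, where $m_0$ is the lead's mass at time $t = 0$. The concentration must drop by the factor $10^{-6}$, so $t = \frac{V}{v}\ln 10^6$. Substituting the concrete values as stated above, we find $t \approx 13\,816$ s, i.e. approximately 3 h 50 min. □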
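The formula $m(t) = m_0 e^{-\frac{v}{V}t}$ can be evaluated directly. The following sketch solves $m(t)/m_0 = c_{\text{limit}}/c_0$ for $t$ with the data of the problem as stated (2000 m³, 2 m³/s, 10 g/m³, 10 μg/m³); the function name `time_to_reach` is illustrative.

```python
import math

def time_to_reach(volume, flow, c_start, c_limit):
    """Time (in seconds) for the concentration c_start * exp(-flow/volume * t)
    to fall to c_limit, for a perfectly mixed basin with equal in/out flow."""
    return volume / flow * math.log(c_start / c_limit)

if __name__ == "__main__":
    t = time_to_reach(volume=2000.0, flow=2.0, c_start=10.0, c_limit=10e-6)
    hours, rest = divmod(t, 3600)
    print(f"{t:.0f} s = {int(hours)} h {rest / 60:.0f} min")
```

Since the concentration is proportional to the mass in a basin of fixed volume, the same exponential law applies to both quantities.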
8.K.2. The speed of transmission of a message in a population consisting of $P$ people is directly proportional to the number of people who have not heard the message yet. Determine the function $f$ which describes the dependency of the number of people who have heard the message on time. Is it appropriate to use this model of message transmission for small or large values of $P$?
Solution. We construct a differential equation for $f$. The speed of the transmission $\frac{df}{dt} = f'(t)$ should be directly proportional to the number of people who have not heard of it, i.e. the value $P - f(t)$. Altogether,
$$\frac{df}{dt} = k\,(P - f(t)).$$
Separating the variables and introducing a constant $K$ (the number of people who know the message at time $t = 0$ must be $P - K$), we get the solution
$$f(t) = P - K e^{-kt},$$
where k is a positive real constant.
Apparently, this model makes sense for large values of P only. □
$(y_0(t), y_1(t), \dots, y_n(t))$ as
$$y_0' = 1, \qquad y_1' = f_1(y_0, y_1, \dots, y_n), \qquad \dots, \qquad y_n' = f_n(y_0, y_1, \dots, y_n)$$
with the initial conditions
$$y_0(t_0) = t_0, \quad y_1(t_0) = x_1, \quad \dots, \quad y_n(t_0) = x_n.$$
Such systems, which do not explicitly depend on time, are called autonomous systems of ordinary differential equations.
Without loss of generality, we deal with autonomous systems in finite dimension $n$, dependent on parameters $\lambda$ and with initial conditions
(1) $y' = f(y, \lambda), \quad y(t_0) = x.$
Without loss of generality, consider the initial value $t_0 = 0$, and write the solution with $y(0) = x$ in the form $y(t, x, \lambda)$ to emphasize the dependency on the parameters.
For fixed values of the initial conditions (and the potential parameters $\lambda$), the solution is always once more differentiable than the function $f$. This can be derived inductively by applying the chain rule. If $f$ is continuously differentiable and $y(t)$ is a solution, then (use the matrix notation where the Jacobi matrix $D^1 f(y)$ of the mapping $f : \mathbb{R}^n \to \mathbb{R}^n$ is multiplied with the column vector $y'$)
$$y'' = D^1 f(y) \cdot y' = D^1 f(y) \cdot f(y)$$
exists and is continuous.
With all the derivatives up to order two continuous, there is an expression for the third derivative:
$$y^{(3)} = D^2 f(y)\big(f(y), f(y)\big) + \big(D^1 f(y)\big)^2 \cdot f(y).$$
Here, the chain rule is used again, starting with the differential of the bilinear mapping of matrix multiplication and viewing the second derivative as a bilinear object evaluated on $y'$ in both arguments. Think out the argumentation for this and higher orders in detail.
Assume for a while that there is a solution $y(t, x)$ of the system (1) which is continuously differentiable in the parameters $x \in \mathbb{R}^n$, i.e. the initial condition as well, and forget about the further parameters $\lambda$ for now. Write
$$\Phi(t, x) = D^1_x\big(y(t, x)\big)$$
for the Jacobi matrix of all partial derivatives with respect to the coordinates $x_i$, which depends on the time $t$ as well as the initial condition $x$. Its derivative $\Phi'(t, x)$ with respect to $t$ can be computed using the symmetry of partial derivatives and the chain rule:
$$\Phi'(t, x) = \frac{d}{dt}\big(D^1_x y(t, x)\big) = D^1_x\big(y'(t, x)\big) = D^1 f(y(t, x)) \cdot D^1_x y(t, x) = D^1 f(y(t, x)) \cdot \Phi(t, x).$$
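The equation $\Phi' = D^1 f(y)\cdot\Phi$ can be tried out on a one-dimensional example. The sketch below assumes the logistic right-hand side $f(y) = y(1 - y)$ (an illustration chosen here, not taken from the text), integrates $y$ together with $\Phi$ from $\Phi(0) = 1$, and compares the result with the derivative of the known closed-form solution $y(t, x) = \frac{x e^t}{1 - x + x e^t}$ with respect to $x$.

```python
import math

def f(y):
    return y * (1.0 - y)      # illustrative right-hand side (assumption)

def df(y):
    return 1.0 - 2.0 * y      # its derivative D^1 f

def flow_with_sensitivity(x, t_end, steps=2000):
    """RK4 integration of y' = f(y), phi' = df(y) * phi, y(0) = x, phi(0) = 1."""
    h = t_end / steps
    y, phi = x, 1.0

    def rhs(y, phi):
        return f(y), df(y) * phi

    for _ in range(steps):
        k1 = rhs(y, phi)
        k2 = rhs(y + h / 2 * k1[0], phi + h / 2 * k1[1])
        k3 = rhs(y + h / 2 * k2[0], phi + h / 2 * k2[1])
        k4 = rhs(y + h * k3[0], phi + h * k3[1])
        y += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        phi += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return y, phi

def exact_sensitivity(x, t):
    """d/dx of the closed-form logistic solution x e^t / (1 - x + x e^t)."""
    return math.exp(t) / (1.0 - x + x * math.exp(t)) ** 2

if __name__ == "__main__":
    y, phi = flow_with_sensitivity(0.2, 2.0)
    print(y, phi, exact_sensitivity(0.2, 2.0))
```

In higher dimensions the same idea applies with $\Phi$ an $n \times n$ matrix, which gives the $n^2$ scalar equations mentioned in the text.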
8.K.3. The speed at which an epidemic spreads in a given closed population consisting of $P$ people is directly proportional to the product of the number of people who have been infected and the number of people who have not. Determine the function $f(t)$ describing the number of infected people in time.
Solution. Just like in the previous problem, we construct a differential equation:
$$\frac{df}{dt} = k \cdot f(t)\,(P - f(t)).$$
Again, separating the variables and introducing suitable constants $K$ and $L$, we obtain
$$f(t) = \frac{K}{1 + L e^{-Kkt}},$$
where substituting back into the equation shows that $K = P$.
□
8.K.4. The speed at which a given isotope of a given chemical element decays is directly proportional to the amount of the given isotope. The half-life of the isotope of plutonium $^{239}$Pu is 24,100 years. In what time does a hundredth of a nuclear bomb whose active component is the mentioned isotope disappear?
Solution. Denoting the amount of plutonium by $m$, we can build a differential equation for the rate of the decay:
$$\frac{dm}{dt} = -k \cdot m,$$
where $k$ is an unknown positive constant. The solution is thus the function $m(t) = m_0 e^{-kt}$. Substituting into the equation for the half-life ($e^{-kt} = \frac{1}{2}$ for $t = 24\,100$), we get the constant $k \approx 2.88 \cdot 10^{-5}$ per year. The wanted time, determined by $e^{-kt} = 0.99$, is then approximately 349 years. □
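The two numbers in this solution follow from the exponential law $m(t) = m_0 e^{-kt}$ alone. A short sketch of the arithmetic (the function names are illustrative):

```python
import math

def decay_constant(half_life):
    """Constant k in m(t) = m0 * exp(-k t), from exp(-k * half_life) = 1/2."""
    return math.log(2.0) / half_life

def time_until_fraction_left(k, fraction):
    """Time t with exp(-k t) = fraction (fraction of the material remaining)."""
    return -math.log(fraction) / k

if __name__ == "__main__":
    k = decay_constant(24100.0)               # per year
    t = time_until_fraction_left(k, 0.99)     # one hundredth has decayed
    print(k, t)
```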
8.K.5. The acceleration of an object falling in a constant gravitational field with a certain resistance of the environment is given by the formula
$$\frac{dv}{dt} = g - kv,$$
where $k$ is a constant which expresses the resistance of the environment. An object was dropped in a gravitational field with $g = 10$ m s$^{-2}$ at the initial speed of 5 m s$^{-1}$; the resistance constant is $k = 0.5$ s$^{-1}$. What will the speed of the object be in three seconds?
Solution. Solving the linear equation with the initial condition $v(0) = v_0$ gives
$$v(t) = \frac{g}{k} - \left(\frac{g}{k} - v_0\right) e^{-kt},$$
so $v(3) = 20 - 15\,e^{-3/2}$ m s$^{-1}$ after substitution.
□

So the derivatives with respect to the initial conditions along the solution $y(t, x)$ of the system (1) are given as the solutions of a system of $n^2$ first-order equations with initial condition
(2) $\Phi'(t, x) = F(t, x) \cdot \Phi(t, x), \quad \Phi(0, x) = E,$
where $F(t, x) = D^1 f(y(t, x))$, and the initial condition comes out from the identity $y(0, x) = x$. The unique existence of the solution of this (matrix) system and its continuous dependence on the parameters have already been proved.
The following theorem says that for systems (1) with continuously differentiable right-hand sides $f$, the derivatives with respect to the initial condition can be obtained in this way.
Differentiability of the solutions
Theorem. Consider an open subset $U \subset \mathbb{R}^{n+k}$ and a mapping $f : U \to \mathbb{R}^n$ with continuous first derivatives. Then, a system of differential equations dependent on a parameter $\lambda \in \mathbb{R}^k$ with initial condition at a point $x \in U$
$$y' = f(y, \lambda), \qquad y(0) = x$$
has a unique solution $y(t, x, \lambda)$, which is a mapping with continuous first derivatives with respect to each variable.
Proof. Consider a general system dependent on parameters, but viewed as an ordinary autonomous system with no parameters. More explicitly, consider the $k$ parameters to be additional space variables and add the (vector) conditions $\lambda'(t) = 0$ and $\lambda(0) = \lambda$. Therefore, it suffices to prove the theorem for autonomous systems with no further parameters, where only the dependency on the initial conditions remains.
Just as in the proof of the fundamental existence theorem 8.3.6, build on the expression of the solutions as fixed points of the integral operators and prove that the expected derivative, as discussed above, enjoys the properties of the differential. Fix a point $x_0$ as the initial condition, together with a small neighbourhood $V \ni x_0$, which if necessary can be further decreased during the following estimates, so that
$$|f(y) - f(z)| \le C\,|y - z|$$
on this neighbourhood by the Lipschitz property. It is already deduced that if the derivative
$$\Phi(t, x) = D^1_x y(t, x)$$
of the solution $y(t, x)$ exists, then it must be uniquely given by the equation (2) with the proper initial conditions. Therefore, define $\Phi(t, x)$ by this equation and examine the expression
$$G(t, h) = \|y(t, x_0 + h) - y(t, x_0) - \Phi(t, x_0)(h)\|$$
with small increments $h \in \mathbb{R}^n$. In order to prove that the continuous derivative exists, it is necessary to show that
$$\lim_{h \to 0} \frac{1}{\|h\|}\,G(t, h) = 0.$$
Several estimates are needed for this purpose.
8.K.6. The rate of increase of a population of a certain type of bug is inversely proportional to its size. At time $t = 0$, the population had 100 bugs. In a month, the population doubled. What will the size of the population be in two months?
Solution. Let us consider a continuous approximation of the number of bugs, and let their amount be denoted by $P$. Then, we can build the following equation:
$$\frac{dP}{dt} = \frac{k}{P},$$
with the solution $P = \sqrt{Kt + c}$. Substituting the given values, we get $P(2) = \sqrt{7} \cdot 100$, which is an estimate of the actual number of bugs.
□
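The constants in $P(t) = \sqrt{Kt + c}$ are fixed by the two given data points. A small sketch of that computation (the helper name `fit_sqrt_growth` is illustrative):

```python
import math

def fit_sqrt_growth(p0, p1):
    """Constants (K, c) of P(t) = sqrt(K t + c) with P(0) = p0 and P(1) = p1."""
    c = p0 ** 2
    K = p1 ** 2 - c
    return K, c

def population(t, K, c):
    return math.sqrt(K * t + c)

if __name__ == "__main__":
    K, c = fit_sqrt_growth(100.0, 200.0)   # 100 bugs at t = 0, doubled after a month
    print(K, c, population(2.0, K, c))
```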
8.K.7. Find the equation of the curve with the following properties: It lies in the first quadrant, goes through the point $[1, 3/4]$, and its tangent at any point marks on the positive half-axis $y$ a segment whose length is the same as the distance of that point from the origin. ○
8.K.8. Consider a chemical compound $C$ isolated in a container. $C$ is unstable, with the half-life of a molecule equal to $q$ time units. If there were $M$ moles of the compound $C$ in the container at the beginning (i.e., at time $t = 0$), how many moles of it will be there at time $t > 0$? ○
8.K.9. A 100-gram body lengthens a spring by 5 cm if hung on it. Express the dependency of its position on time $t$ provided the speed of the body is 10 cm/s when going through the equilibrium point. ○
Further practical problems that lead to differential equations can be found on page 595.
L. Higher-order differential equations
8.L.1. Underdamped oscillation. Now, we will describe a simple model for the movement of a solid object attached to a point with a strong spring. If $y(t)$ is the deviation of our object from the point $y_0 = y(0) = 0$, then we can assume that the acceleration $y''(t)$ in time $t$ is proportional to the magnitude of the deviation, yet with the other sign. The proportionality constant $k$ is called the spring constant. Considering the case $k = 1$, we get the so-called oscillation equation
$$y''(t) = -y(t).$$
This equation corresponds to the system of equations
$$x'(t) = -y(t), \qquad y'(t) = x(t)$$
First, from the latter theorem about continuous dependence on initial conditions, the estimate
$$\|y(t, x_0 + h) - y(t, x_0)\| \le \|h\|\, e^{C|t|}$$
follows immediately. In the next step, use Taylor's expansion of $f$ with remainder:
$$f(y) - f(z) = D^1 f(z) \cdot (y - z) + R(y, z),$$
where $R(y, z)$ satisfies
$$\frac{R(y, z)}{\|y - z\|} \to 0 \quad \text{for } \|y - z\| \to 0.$$
This implies the crucial estimate. In the first equality, substitute the expression of solutions in terms of fixed points of the integral operators. Next, exploit the definition of the mapping $\Phi(t, x_0)$ in terms of its derivative (write $F(t, x) = D^1 f(y(t, x))$ again and notice that its initial condition $\Phi(0, x)(h) = h$ implies the vanishing of the $h$ summand).
$$G(t, h) = \left\| x_0 + h + \int_0^t f(y(s, x_0 + h))\,ds - x_0 - \int_0^t f(y(s, x_0))\,ds - \Phi(t, x_0)(h) \right\|$$
$$= \left\| \int_0^t \Big( f(y(s, x_0 + h)) - f(y(s, x_0)) - F(s, x_0)\Phi(s, x_0)(h) \Big)\,ds \right\|$$
$$\le \int_0^t \big\| f(y(s, x_0 + h)) - f(y(s, x_0)) - F(s, x_0)\Phi(s, x_0)(h) \big\|\,ds$$
$$\le \int_0^t \|F(s, x_0)\|\,\|y(s, x_0 + h) - y(s, x_0) - \Phi(s, x_0)(h)\|\,ds + \int_0^t \|R(y(s, x_0 + h), y(s, x_0))\|\,ds,$$
where the norm on the matrices is taken as the maximum of the absolute values of their entries.
Since $F(t, x)$ is continuous, there is a uniform bound of its norm in the neighbourhood $V$ given by
$$\|F(t, x_0)\| \le B,$$
for all $|t| < T$ with a sufficiently small $T$ to ensure the solutions remain in the neighbourhood $V$. At the same time, for any fixed constant $\varepsilon > 0$, there is a bound $\|h\| < \delta$ for which the remainder $R$ satisfies
$$\|R(y(t, x_0 + h), y(t, x_0))\| \le \varepsilon\,\|y(t, x_0 + h) - y(t, x_0)\| \le \varepsilon\,\|h\|\, e^{CT}.$$
Therefore, the estimate on $G(t, h)$ can be improved as follows:
$$G(t, h) \le B \int_0^t G(s, h)\,ds + \varepsilon\,\|h\|\, e^{CT}.$$
from 1. The solution of this system is given by
$$x(t) = R\cos(t - \tau), \qquad y(t) = R\sin(t - \tau)$$
with an arbitrary non-negative constant $R$, which determines the maximum amplitude, and a constant $\tau$, which determines the initial phase.
Therefore, in order to determine a unique solution, we need to know not only the initial position $y_0$, but also the speed of the motion at that moment. These two pieces of information uniquely determine both the amplitude and the initial phase.
Moreover, let us imagine that as a result of the properties of the spring material, there is another force which is directly proportional to the instantaneous speed of our object, with the other sign than the amplitude again. This is expressed by one more term with the first derivative, so our equation is now
y"(t) = -y(t)-ay'(t),
where a is a constant which expresses the magnitude of the damping. In the following picture, there are the so-called phase diagrams for solutions with two distinct initial conditions, namely with zero damping on the left, and for the value of the coefficient a = 0.3 on the right.
The oscillations are expressed by the y-axis values; the x-axis values describe the speed of the motion.
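The two phase diagrams can be reproduced numerically. The following sketch rewrites $y'' = -y - a y'$ as the first-order system $y' = v$, $v' = -y - a v$ (with $v$ the speed) and integrates it by a classical Runge-Kutta scheme; the initial condition, the time span and the function names are illustrative assumptions. The quantity $\frac{1}{2}(v^2 + y^2)$ is constant along the undamped orbit (a circle in the phase plane) and decreases for $a > 0$ (an inward spiral).

```python
def simulate(a, y0=0.0, v0=1.0, t_end=20.0, steps=4000):
    """Return the phase-plane orbit [(v, y), ...] of y'' = -y - a*y'."""
    h = t_end / steps
    y, v = y0, v0
    orbit = [(v, y)]

    def rhs(y, v):
        return v, -y - a * v

    for _ in range(steps):
        k1 = rhs(y, v)
        k2 = rhs(y + h / 2 * k1[0], v + h / 2 * k1[1])
        k3 = rhs(y + h / 2 * k2[0], v + h / 2 * k2[1])
        k4 = rhs(y + h * k3[0], v + h * k3[1])
        y += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        v += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        orbit.append((v, y))
    return orbit

def energy(point):
    """Half of the squared distance from the origin in the phase plane."""
    v, y = point
    return 0.5 * (v * v + y * y)

if __name__ == "__main__":
    for a in (0.0, 0.3):                     # zero damping and a = 0.3
        orbit = simulate(a)
        print(a, energy(orbit[0]), energy(orbit[-1]))
```

Plotting the pairs in `orbit` (speed on the horizontal axis, deviation on the vertical axis) gives pictures of the kind described above.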
Gronwall's lemma now gives
$$G(t, h) \le \varepsilon\,\|h\|\, e^{(C + B)T}.$$
This implies that $\lim_{h \to 0} \frac{1}{\|h\|} G(t, h) = 0$ as requested. □
In the same way, it can be proved that continuous differentiability of the right-hand side up to order $k$ (inclusive) guarantees the same order of differentiability of solutions in all input parameters.
8.3.13. The analytic case. Let us pay additional attention to the case when the right-hand side $f$ of the system of equations
(1) $y' = f(y), \quad y(t_0) = y_0$
is analytic in all arguments (i.e. a convergent multidimensional power series $f(y) = \sum_{|\alpha| = 0}^{\infty} \frac{1}{\alpha!}\,\partial^\alpha f(y_0)\,(y - y_0)^\alpha$, see 8.1.15). Exactly as in the previous discussion, we may hide the time variable $t$ as well as further parameters in the variables.
The famous theorem below says that the solution of the most general system with analytic right-hand side is analytic in all the parameters as well (including the initial conditions).
ODE version of Cauchy-Kovalevskaya Theorem
Theorem. Assume $f(y)$ is a real analytic vector-valued function on a domain in $\mathbb{R}^n$ and consider the differential equation (1). Then the unique solution of this initial problem is real analytic, including the dependency on the initial condition.
Proof. The idea of the proof is identical to the simple one-dimensional case in 6.2.15. As we saw in the beginning of the previous paragraph, there are universal (multidimensional) polynomial expressions for all derivatives of the vector function $y(t)$ in terms of the partial derivatives of the vector function $f$. If we expand them in terms of the individual partial derivatives of the mapping $f$, all of their coefficients are obviously non-negative. Let us write again
$$y^{(k)}(0) = P_k\big(f(y(0)), \dots, \partial^\beta f(y(0)), \dots\big)$$
for these multivariate vector-valued polynomials (the multi-indices $\beta$ in the arguments are all of size up to $k - 1$).
Without loss of generality we may consider the initial condition $t_0 = 0$, $y(0) = 0$. Indeed, constant shifts of the variables (say $z = y - y_0$, $x = t - t_0$) transform the general case to this one. Once we know that the components of the solution are power series, the transformed quantities will be analytic too, including the dependency on the values of the initial conditions.
In order to prove that the solution to the problem $y' = f(y)$, $y(0) = 0$ is analytic on a neighbourhood of the origin, we shall again look for a majorant $g$ for the vector equation $y' = f(y)$, i.e. we want an analytic function on a neighbourhood of the origin $0 \in \mathbb{R}^n$ with $\partial^\alpha g(0) \ge |\partial^\alpha f(0)|$ for all multi-indices $\alpha$. Then, by the universal computations of all the coefficients of the power series $y(t) = \sum_{k=0}^{\infty} \frac{1}{k!}\, y^{(k)}(0)\, t^k$

8.L.2. Undamped oscillation. Find the function $y(t)$ which satisfies the following differential equation and initial conditions:
$$y''(t) + 4y(t) = f(t), \qquad y(0) = 0, \quad y'(0) = -1,$$
where the function $f(t)$ is piecewise continuous:
$$f(t) = \begin{cases} \cos(2t) & \text{for } 0 \le t < \pi, \\ 0 & \text{for } t \ge \pi. \end{cases}$$
Solution. This problem is a model of undamped oscillation of a spring (omitting friction, non-linearities in the toughness
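of the spring, and other factors) which is initiated by an outer force.
The function $f(t)$ can be written as a linear combination of Heaviside's function $u(t)$ and its shift $u_\pi(t)$, i.e.,
$$f(t) = \cos(2t)\,\big(u(t) - u_\pi(t)\big).$$
Since
$$\mathcal{L}(y'')(s) = s^2 \mathcal{L}(y) - s\,y(0) - y'(0) = s^2 \mathcal{L}(y) + 1,$$
we get, applying the results of the above exercises 7 and 8 to the Laplace transform of the right-hand side,
$$s^2 \mathcal{L}(y) + 1 + 4\mathcal{L}(y) = \mathcal{L}\big(\cos(2t)(u(t) - u_\pi(t))\big)$$
$$= \mathcal{L}\big(\cos(2t) \cdot u(t)\big) - \mathcal{L}\big(\cos(2t) \cdot u_\pi(t)\big)$$
$$= \mathcal{L}(\cos(2t)) - e^{-\pi s}\,\mathcal{L}\big(\cos(2(t + \pi))\big)$$
$$= \big(1 - e^{-\pi s}\big)\,\frac{s}{s^2 + 4}.$$
Hence,
$$\mathcal{L}(y) = -\frac{1}{s^2 + 4} + \big(1 - e^{-\pi s}\big)\,\frac{s}{(s^2 + 4)^2}.$$
Performing the inverse transform, we obtain the solution in the form
$$y(t) = -\tfrac{1}{2}\sin(2t) + \tfrac{1}{4}\,t\sin(2t) - \mathcal{L}^{-1}\left(e^{-\pi s}\,\frac{s}{(s^2 + 4)^2}\right).$$
However, by formula (1), we have
$$\mathcal{L}^{-1}\left(e^{-\pi s}\,\frac{s}{(s^2 + 4)^2}\right) = \mathcal{L}^{-1}\Big(e^{-\pi s}\,\mathcal{L}\big(\tfrac{1}{4}\,t\sin(2t)\big)\Big) = \tfrac{1}{4}\,(t - \pi)\sin(2(t - \pi)) \cdot u_\pi(t).$$
Since the shifted Heaviside function $u_\pi$ is zero for $t < \pi$ and equal to 1 for $t > \pi$, we get the solution in the form
$$y(t) = \begin{cases} -\tfrac{1}{2}\sin(2t) + \tfrac{1}{4}\,t\sin(2t) & \text{for } 0 \le t < \pi, \\ \tfrac{\pi - 2}{4}\,\sin(2t) & \text{for } t \ge \pi. \end{cases}$$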
□
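A numerical cross-check of this solution, as a sketch: it assumes the piecewise formula $y(t) = -\frac{1}{2}\sin(2t) + \frac{1}{4}t\sin(2t)$ for $0 \le t < \pi$ and $y(t) = \frac{\pi - 2}{4}\sin(2t)$ for $t \ge \pi$, and evaluates $y'' + 4y - f$ by a second-order central difference at points away from the switching time $t = \pi$. The function names are illustrative.

```python
import math

def forcing(t):
    """Right-hand side f(t): cos(2t) on [0, pi), zero afterwards."""
    return math.cos(2 * t) if 0 <= t < math.pi else 0.0

def y(t):
    """Piecewise solution assumed above (the first branch is also used for t < 0)."""
    if t < math.pi:
        return -0.5 * math.sin(2 * t) + 0.25 * t * math.sin(2 * t)
    return (math.pi - 2) / 4 * math.sin(2 * t)

def residual(t, h=1e-4):
    """y''(t) + 4 y(t) - f(t), with y'' from a central second difference."""
    second = (y(t + h) - 2 * y(t) + y(t - h)) / (h * h)
    return second + 4 * y(t) - forcing(t)

if __name__ == "__main__":
    for t in (0.5, 2.0, 4.0, 6.0):
        print(t, residual(t))
    h = 1e-6
    print("y(0) =", y(0.0), " y'(0) ~", (y(h) - y(-h)) / (2 * h))
```

The solution and its first derivative are continuous at $t = \pi$; only the second derivative jumps there, together with the forcing term.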
8.L.3. Find the general solution of the equation
$$y''' - 5y'' - 8y' + 48y = 0.$$
Solution. This is a third-order linear differential equation with constant coefficients since it is of the form
$$y^{(n)} + a_1 y^{(n-1)} + a_2 y^{(n-2)} + \dots + a_{n-1} y' + a_n y = f(x)$$
for certain constants $a_1, \dots, a_n \in \mathbb{R}$. Moreover, we have $f(x) = 0$, i.e., the equation is homogeneous.
First of all, we will find the roots of the so-called characteristic polynomial
$$\lambda^n + a_1 \lambda^{n-1} + a_2 \lambda^{n-2} + \dots + a_{n-1}\lambda + a_n.$$
Each real root $\lambda$ with multiplicity $k$ corresponds to the $k$ solutions
$$e^{\lambda x},\ x\,e^{\lambda x},\ \dots,\ x^{k-1} e^{\lambda x},$$
solving potentially our problem, and similarly for $z' = g(z)$, the convergence of the series for $z$ implies the same for $y$:
$$z^{(k)}(0) = P_k\big(g(0), \dots, \partial^\beta g(0), \dots\big) \ge P_k\big(|f(0)|, \dots, |\partial^\beta f(0)|, \dots\big) \ge |y^{(k)}(0)|.$$
As usual, knowing already how to find a majorant in a simpler case, we try to apply a straightforward modification.
By the analyticity of $f$, for $r > 0$ small enough there is a constant $C$ such that $\left|\frac{1}{\alpha!}\,\partial^\alpha f_i(0)\, r^{|\alpha|}\right| \le C$ for all $i = 1, \dots, n$ and multi-indices $\alpha$. This means $|\partial^\alpha f_i(0)| \le C\,\frac{\alpha!}{r^{|\alpha|}}$. In the 1-dimensional case, we considered the multiple of a geometric series $g(z) = C\,\frac{r}{r - z}$ with the right derivatives $g^{(k)}(0) = C\,\frac{k!}{r^k}$. Now the most similar mapping is $g(z_1, \dots, z_n) = (g_1(z_1, \dots, z_n), \dots, g_n(z_1, \dots, z_n))$ with all the components $g_i$ equal to $h$,
$$h(z_1, \dots, z_n) = C\,\frac{r}{r - z_1 - \dots - z_n}.$$
Then the values of all the partial derivatives with $|\alpha| = k$ at $z = 0$ are
$$\partial^\alpha h(0) = C\,r\,k!\,(r - z_1 - \dots - z_n)^{-k-1}\big|_{z=0} = C\,\frac{k!}{r^k},$$
exactly as suitable. (Check the latter simple computation yourself!)
So it remains to prove that the majorant system $z' = g(z)$ has a converging power series solution $z$. Obviously, by the symmetry of $g$ (all components are equal to the same $h$, and $h$ is symmetric in the variables $z_i$), also the solution $z$ with $z(0) = 0$ must have all the components equal (the system does not see any permutation of the variables $z_i$ at all). Let us write $z_i(t) = u(t)$ for the common solution components. With this ansatz,
$$u'(t) = h(u(t), \dots, u(t)) = C\,\frac{r}{r - n\,u(t)}.$$
This is nearly exactly the same equation as the one in 6.2.15 and we can easily see its solution with $u(0) = 0$:
$$u(t) = \frac{r}{n}\left(1 - \sqrt{1 - \frac{2nCt}{r}}\right).$$
Clearly, this is an analytic solution and the proof is finished.
□
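The closed form of the majorant solution can be checked numerically. The sketch below assumes the formula $u(t) = \frac{r}{n}\big(1 - \sqrt{1 - 2nCt/r}\big)$ and sample values of $r$, $n$, $C$ chosen only for illustration; it compares a central-difference derivative of $u$ with the right-hand side $\frac{Cr}{r - n u}$ for $t$ inside the interval $|t| < \frac{r}{2nC}$ where the square root is analytic.

```python
import math

def u(t, r, n, C):
    """Assumed closed-form solution of u' = C r / (r - n u), u(0) = 0."""
    return r / n * (1.0 - math.sqrt(1.0 - 2.0 * n * C * t / r))

def residual(t, r, n, C, h=1e-6):
    """u'(t) - C r / (r - n u(t)), with u' from a central difference."""
    du = (u(t + h, r, n, C) - u(t - h, r, n, C)) / (2 * h)
    return du - C * r / (r - n * u(t, r, n, C))

if __name__ == "__main__":
    r, n, C = 1.0, 3, 2.0              # illustrative values; radius r/(2nC) = 1/12
    for t in (0.0, 0.02, 0.05):
        print(t, u(t, r, n, C), residual(t, r, n, C))
```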
8.3.14. Flows of vector fields. Before going to higher-order equations, pause to consider systems of first-order equations from the geometrical point of view. When drawing illustrations of solutions earlier, we already viewed the right-hand side of an autonomous system as a field of vectors $f(x) \in \mathbb{R}^n$. This shows how fast and in which direction the solution should move in time.
This can be formalized. Consider the vector field $X(x) = f(x)$ defined on an open set $U \subset \mathbb{R}^n$. Define the derivative in the direction of the vector field $X$ for all differentiable functions $g$ on $U$ by
$$X(g) : U \to \mathbb{R}, \qquad X(g)(x) = d_{X(x)}\,g.$$
and every pair of complex roots $\lambda = \alpha \pm i\beta$ with multiplicity $k$ corresponds to the $k$ pairs of solutions
$$e^{\alpha x}\cos(\beta x),\ x\,e^{\alpha x}\cos(\beta x),\ \dots,\ x^{k-1} e^{\alpha x}\cos(\beta x),$$
$$e^{\alpha x}\sin(\beta x),\ x\,e^{\alpha x}\sin(\beta x),\ \dots,\ x^{k-1} e^{\alpha x}\sin(\beta x).$$
Then, the general solution corresponds to all linear combinations of the above solutions.
Therefore, let us consider the polynomial
$$\lambda^3 - 5\lambda^2 - 8\lambda + 48$$
with roots $\lambda_1 = \lambda_2 = 4$, $\lambda_3 = -3$. Since we know the roots, we can deduce the general solution as well:
$$y = C_1 e^{4x} + C_2\,x\,e^{4x} + C_3 e^{-3x}, \qquad C_1, C_2, C_3 \in \mathbb{R}.$$
□
8.L.4. Solve the equation
$$y''' + y'' + 9y' + 9y = e^x + 10\cos(3x).$$
Solution. First, we will solve the corresponding homogeneous equation. The characteristic polynomial is equal to
$$\lambda^3 + \lambda^2 + 9\lambda + 9,$$
with roots $\lambda_1 = -1$, $\lambda_2 = 3i$, $\lambda_3 = -3i$. The general solution of the corresponding homogeneous equation is thus
$$y = C_1 e^{-x} + C_2\cos(3x) + C_3\sin(3x), \qquad C_1, C_2, C_3 \in \mathbb{R}.$$
The solution of the non-homogeneous equation is of the form
$$y = C_1 e^{-x} + C_2\cos(3x) + C_3\sin(3x) + y_p, \qquad C_1, C_2, C_3 \in \mathbb{R},$$
for a particular solution $y_p$ of the non-homogeneous equation.
The right-hand side of the given equation is of a special form. In general, if the non-homogeneous part is given by a function
$$P_n(x)\,e^{\alpha x},$$
where $P_n$ is a polynomial of degree $n$, then there is a particular solution of the form
$$y_p = x^k R_n(x)\,e^{\alpha x},$$
where $k$ is the multiplicity of $\alpha$ as a root of the characteristic polynomial and $R_n$ is a polynomial of degree at most $n$. More generally, if the non-homogeneous part is of the form
$$e^{\alpha x}\left[P_m(x)\cos(\beta x) + S_n(x)\sin(\beta x)\right],$$
where $P_m$ is a polynomial of degree $m$ and $S_n$ is a polynomial of degree $n$, there exists a particular solution of the form
$$y_p = x^k e^{\alpha x}\left[R_l(x)\cos(\beta x) + T_l(x)\sin(\beta x)\right],$$
So the vector field $X$ maps functions into functions. Apply the chain rule for the differentials to obtain the derivative rule for products of functions:
$$X(gh) = h\,X(g) + g\,X(h).$$
Consider the composition of the product of real numbers with the couple of functions $(g, h)$.
In coordinates, $X(x) = (X_1(x), \dots, X_n(x))$ and
$$X(g)(x) = X_1(x)\,\frac{\partial g}{\partial x_1}(x) + \dots + X_n(x)\,\frac{\partial g}{\partial x_n}(x).$$
Fixing the coordinates, there are the special vector fields with coordinate functions equal to zero except for one function $X_i$ which is identically one. Such a field then corresponds to the partial derivatives with respect to the variable $x_i$. This is also matched by the common notation $\frac{\partial}{\partial x_i}$ for such vector fields and in general,
$$X(x) = X_1(x)\,\frac{\partial}{\partial x_1} + \dots + X_n(x)\,\frac{\partial}{\partial x_n}.$$
The set of all possible tangent vectors at the points of an open subset $U \subset \mathbb{R}^n$ is called the tangent space $TU$. The vector space of all vectors at a point $x$ is denoted by $T_x U$. Use the notation $\mathfrak{X}(U)$ for the set of all smooth vector fields on $U$. Vector fields can be perceived as generators of $\mathfrak{X}(U)$, admitting smooth functions as the coefficients in linear combinations.
We return to the problem of finding the solution of a system of equations. Rephrase it equivalently as finding a curve which satisfies
$$x'(t) = X(x(t))$$
for each value $x(t)$ in the domain of the vector field $X$. In words: the tangent vector of the curve is given, at each of its points, by the vector field $X$. Such a curve is called an integral curve of the vector field $X$, and the mapping
$$\mathrm{Fl}^X_t : \mathbb{R}^n \to \mathbb{R}^n,$$
defined at a point $x_0$ as the value at time $t$ of the integral curve $x(t)$ satisfying $x(0) = x_0$, is called the flow of the vector field $X$. The theorem about existence and uniqueness of the solution of the systems of equations says (cf. 8.3.6) that for every continuously differentiable vector field $X$, its flow exists at every point $x_0$ of the domain for sufficiently small values of $t$. The uniqueness guarantees that
$$\mathrm{Fl}^X_{t+s}(x) = \mathrm{Fl}^X_t \circ \mathrm{Fl}^X_s(x),$$
whenever both sides exist. In particular, the mappings $\mathrm{Fl}^X_t$ and $\mathrm{Fl}^X_s$ always commute.
Moreover, the mapping $\mathrm{Fl}^X_t(x)$ with a fixed parameter $t$ is differentiable at all points $x$ where it is defined, cf. 8.3.12.
If a vector field $X$ is defined on all of $\mathbb{R}^n$, and if its support is compact, then its flow clearly exists at all points and for all values of $t$. Vector fields with flows existing for all $t \in \mathbb{R}$ are called complete. The flow of a complete vector field consists of (mutually commuting) diffeomorphisms $\mathrm{Fl}^X_t : \mathbb{R}^n \to \mathbb{R}^n$ with inverse diffeomorphisms $\mathrm{Fl}^X_{-t}$.
where $k$ is the multiplicity of $\alpha + i\beta$ as a root of the characteristic polynomial and $R_l$, $T_l$ are polynomials of degree at most $l = \max\{m, n\}$.
In our problem, the non-homogeneous part is a sum of two functions in the special form (see above). Therefore, we will look for (two) corresponding particular solutions using the method of undetermined coefficients, and then we will add up these solutions. This will give us a particular solution of the original equation (and hence the general solution as well). Let us begin with the function $y = e^x$, which has a particular solution $y_{p_1}(x) = A e^x$ for some $A \in \mathbb{R}$. Since
$$y_{p_1}(x) = y'_{p_1}(x) = y''_{p_1}(x) = y'''_{p_1}(x) = A e^x,$$
substitution into the original equation, whose right-hand side contains only the function $y = e^x$, leads to
$$20 A e^x = e^x, \quad \text{i.e.} \quad A = \tfrac{1}{20}.$$
For the right-hand side with the function $y = 10\cos(3x)$, we are looking for a particular solution in the form
$$y_{p_2}(x) = x\left[B\cos(3x) + C\sin(3x)\right].$$
Recall that the number $\lambda = 3i$ was obtained as a root of the characteristic polynomial. We can easily compute the derivatives
$$y'_{p_2}(x) = \left[B\cos(3x) + C\sin(3x)\right] + x\left[-3B\sin(3x) + 3C\cos(3x)\right],$$
$$y''_{p_2}(x) = 2\left[-3B\sin(3x) + 3C\cos(3x)\right] + x\left[-9B\cos(3x) - 9C\sin(3x)\right],$$
$$y'''_{p_2}(x) = 3\left[-9B\cos(3x) - 9C\sin(3x)\right] + x\left[27B\sin(3x) - 27C\cos(3x)\right].$$
Substituting them into the equation, whose right-hand side contains the function $y = 10\cos(3x)$, we get
$$-18B\cos(3x) - 18C\sin(3x) - 6B\sin(3x) + 6C\cos(3x) = 10\cos(3x).$$
Comparing the coefficients leads to the system of linear equations
$$-18B + 6C = 10, \qquad -18C - 6B = 0$$
with the only solution $B = -1/2$ and $C = 1/6$, i.e.,
$$y_{p_2}(x) = x\left[-\tfrac{1}{2}\cos(3x) + \tfrac{1}{6}\sin(3x)\right].$$
Altogether, the general solution is
$$y = C_1 e^{-x} + C_2\cos(3x) + C_3\sin(3x) + \tfrac{1}{20}e^x - \tfrac{1}{2}x\cos(3x) + \tfrac{1}{6}x\sin(3x), \qquad C_1, C_2, C_3 \in \mathbb{R}.$$
A simple example of a complete vector field is the field $X(x) = \frac{\partial}{\partial x_1}$. Its flow is given by
$\mathrm{Fl}^X_t(x_1, \dots, x_n) = (x_1 + t, x_2, \dots, x_n).$
On the other hand, the vector field $X(t) = t^2\frac{d}{dt}$ on the one-dimensional space $\mathbb{R}$ is not complete as its solutions are of the form
$t \mapsto \frac{1}{C - t},$
except for the initial condition $t = 0$, so they "run away" towards infinite values in a finite time.
The points $x_0$ in the domain of a vector field $X : U \subset \mathbb{R}^n \to \mathbb{R}^n$ where $X(x_0) = 0$ are called singular points of the vector field $X$. Clearly $\mathrm{Fl}^X_t(x_0) = x_0$ for all $t$ at all singular points.
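The incomplete field above makes a convenient test case for the composition rule of flows. The following is a minimal sketch, assuming the coordinate is called `x` and the field is $x^2\frac{d}{dx}$, so that the integral curve through `a` is `a / (1 - a*t)`; the function name `flow` is only illustrative.

```python
def flow(t, a):
    """Flow of x' = x**2: value at time t of the integral curve with x(0) = a.

    The curve a / (1 - a*t) exists only while 1 - a*t stays positive,
    which is exactly why this vector field is not complete.
    """
    denom = 1.0 - a * t
    if denom <= 0.0:
        raise ValueError("the integral curve has escaped to infinity before time t")
    return a / denom


a, s, t = 0.5, 0.3, 0.4
lhs = flow(t + s, a)           # Fl_{t+s}(a)
rhs = flow(t, flow(s, a))      # Fl_t(Fl_s(a))
print(lhs, rhs)                # the two values agree up to rounding

# The singular point x = 0 is fixed by the flow for every t.
print(flow(10.0, 0.0))
```

Calling `flow(2.0, 0.5)` raises the error, since the curve starting at 0.5 reaches infinity at time 2.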
8.3.15. Local qualitative description. The description of vector fields as assigning the tangent vector in the modelling space to each point of the Euclidean space is independent of the coordinates. It follows that the flows exhibit a geometric concept which must be coordinate-free.
It is necessary to know what happens to the fields and their flows when coordinates are transformed. Suppose $y = F(x)$ is such a transformation with $F : \mathbb{R}^n \to \mathbb{R}^n$ (or on some smaller domain there). Then the solutions $x(t)$ to a system $x' = X(x)$ satisfy $x'(t) = X(x(t))$, and in the transformed coordinates this reads
$y'(t) = (F(x(t)))'(t) = D^1F(x(t)) \cdot x'(t) = D^1F(x(t)) \cdot X(x(t)).$
This means that the "transformed field" $Y$ in the new coordinates is $Y(F(x)) = D^1F(x) \cdot X(x)$. At the same time, the flows of these vector fields are related as follows:
$\mathrm{Fl}^Y_t \circ F(x) = F \circ \mathrm{Fl}^X_t(x).$
Indeed, fixing $x = x_0$ and writing $x(t) = \mathrm{Fl}^X_t(x_0)$, the curve $F(x(t))$ is the unique solution of the system of equations $y' = Y(y)$ with initial condition $y_0 = F(x_0)$, and this curve is exactly the right-hand side.
The following theorem offers a geometric local qualitative description of all solutions of systems of first order ordinary differential equations in a neighbourhood of each point x which is not singular.
The flowbox theorem
Theorem. If $X$ is a differentiable vector field defined on a neighbourhood of a point $x_0 \in \mathbb{R}^n$ and $X(x_0) \neq 0$, then there exists a transformation of coordinates $F$ such that in the new coordinates $y = F(x)$, the vector field $X$ is given as the field $\frac{\partial}{\partial y_1}$.
□
8.L.5.   Determine the general solution of the equation
$y'' + 3y' + 2y = e^{-2x}.$
Solution. The given equation is a second-order (the highest derivative of the wanted function is of order two) linear (all derivatives are in the first power) differential equation with constant coefficients. First, we solve the homogenized equation
$y'' + 3y' + 2y = 0.$
Its characteristic polynomial is
$x^2 + 3x + 2 = (x + 1)(x + 2),$
with roots $x_1 = -1$ and $x_2 = -2$. Hence, the general solution of the homogenized equation is
$c_1 e^{-x} + c_2 e^{-2x},$
where $c_1, c_2$ are arbitrary real constants.
Now, using the method of undetermined coefficients, we will find a particular solution of the original non-homogeneous equation. According to the form of the non-homogeneity and since $-2$ is a root of the characteristic polynomial of the given equation, we are looking for the solution in the form $y_0 = a x e^{-2x}$ for $a \in \mathbb{R}$.
Substituting into the original equation, we obtain
$a\,[-4e^{-2x} + 4xe^{-2x} + 3(e^{-2x} - 2xe^{-2x}) + 2xe^{-2x}] = e^{-2x},$
hence $a = -1$. We have thus found the function $-xe^{-2x}$ as a particular solution of the given equation. Hence, the general solution is the function space
$c_1 e^{-x} + c_2 e^{-2x} - xe^{-2x}, \quad c_1, c_2 \in \mathbb{R}.$
□
8.L.6.   Determine the general solution of the equation
$y'' + y' = 1.$
Solution. The characteristic polynomial of the given equation is $x^2 + x$, with roots $0$ and $-1$. Therefore, the general solution of the homogenized equation is $c_1 + c_2 e^{-x}$, where $c_1, c_2 \in \mathbb{R}$.
We are looking for a particular solution in the form $ax$, $a \in \mathbb{R}$ (since zero is a root of the characteristic polynomial). Substituting into the original equation, we get $a = 1$. The general solution of the given non-homogeneous equation is
$c_1 + c_2 e^{-x} + x, \quad c_1, c_2 \in \mathbb{R}.$ □
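The general solution found in 8.L.5 can be sanity-checked numerically. The sketch below approximates the derivatives by central differences and evaluates the residual of $y'' + 3y' + 2y = e^{-2x}$; the constants `c1`, `c2`, the step `h` and the function names are arbitrary choices made only for this illustration.

```python
import math


def y(x, c1=1.0, c2=-2.0):
    """General solution of 8.L.5 for one arbitrary choice of the constants."""
    return c1 * math.exp(-x) + c2 * math.exp(-2 * x) - x * math.exp(-2 * x)


def residual(x, h=1e-4):
    """y'' + 3y' + 2y - exp(-2x), with derivatives replaced by central differences."""
    d1 = (y(x + h) - y(x - h)) / (2 * h)
    d2 = (y(x + h) - 2 * y(x) + y(x - h)) / h ** 2
    return d2 + 3 * d1 + 2 * y(x) - math.exp(-2 * x)


for x in (0.0, 0.5, 1.0, 2.0):
    # Each residual should vanish up to discretisation and rounding error.
    print(x, residual(x))
```

The same check works for the other exercises of this type after changing `y` and the left-hand side in `residual`.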
Proof. Construct a diffeomorphism $F$ with the required properties, step by step. Geometrically, the essence of the proof can be summarized as follows: first select a hypersurface which goes through the point $x_0$ and is complementary to the directions $X(x)$ near to $x_0$. Then fix the coordinates on it, and finally, extend them to some neighbourhood of the point $x_0$ using the flow of the field $X$.
Without loss of generality, move the point $x_0$ to the origin by a translation. Then by a suitable linear transformation on $\mathbb{R}^n$, set $X(0) = \frac{\partial}{\partial x_1}(0)$.
With such coordinates $(x_1, \dots, x_n)$, write the flow of the field $X$ going through the point $(x_1, \dots, x_n)$ at time $t = 0$ as $x_i(t) = \varphi_i(t, x_1, \dots, x_n)$. Next, define the components of $F = (f_1, \dots, f_n)$ as
$f_i(x_1, \dots, x_n) = \varphi_i(x_1, 0, x_2, \dots, x_n).$
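This follows the strategy. Since $X(0, \dots, 0) = \frac{\partial}{\partial x_1}$,
$\frac{\partial F}{\partial x_1}(0, \dots, 0) = \frac{d}{dt}\Big|_0 \big(\varphi_1(t, 0, \dots, 0), \dots, \varphi_n(t, 0, \dots, 0)\big) = (1, 0, \dots, 0),$
while the flow $\mathrm{Fl}^X_0$ at the time $t = 0$ yields
$\big(\varphi_1(0, 0, x_2, \dots, x_n), \dots, \varphi_n(0, 0, x_2, \dots, x_n)\big) = (0, x_2, \dots, x_n),$
and in particular
$\frac{\partial F}{\partial x_i}(0, \dots, 0) = (0, \dots, 1, \dots, 0), \quad i = 2, \dots, n,$
with the entry $1$ in the $i$-th position.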
Therefore, the Jacobi matrix of the mapping F at the origin is the identity matrix E, so it is a transformation of coordinates on some neighbourhood (see the inverse mapping theorem in paragraph 8.1.22).
Directly from the definition of the mapping $F$ in terms of the flow of the vector field $X$, the flow of the transformed field $Y$ is expressed in the new coordinates $(y_1, \dots, y_n)$ as
$\mathrm{Fl}^Y_t(y_1, \dots, y_n) = (y_1 + t, y_2, \dots, y_n).$
Verify this by yourselves in detail!
□
8.3.16. Higher-order equations. An ordinary differential equation of order $k$ (solved with respect to the highest derivative) is an equation
(1) $y^{(k)} = f(t, y, y', \dots, y^{(k-1)}),$
where $f$ is a known function of $k + 1$ variables, $t$ is the independent variable, and $y(t)$ is an unknown function of one variable. This type of equation is always equivalent to a system of $k$ first-order equations.
Introduce new unknown functions in a variable $t$ as follows:
$y_0(t) = y(t), \quad y_1(t) = y_0'(t), \quad \dots, \quad y_{k-1}(t) = y_{k-2}'(t).$
Now, the function y(t) is a solution of the original equation (1) if and only if it is the first component of the solution of
8.L.7.   Determine the general solution of the equation
$y'' + 5y' + 6y = e^{-2x}.$
Solution. The characteristic polynomial of the equation is
$x^2 + 5x + 6 = (x + 2)(x + 3),$
its roots are $-2$ and $-3$. The general solution of the homogenized equation is thus $c_1 e^{-2x} + c_2 e^{-3x}$, $c_1, c_2 \in \mathbb{R}$. Using the method of undetermined coefficients, we are looking for a particular solution in the form $axe^{-2x}$, $a \in \mathbb{R}$ (since $-2$ is a root of the characteristic polynomial). Substitution into the original equation yields $a = 1$. Hence, the general solution of the given equation is
$c_1 e^{-2x} + c_2 e^{-3x} + xe^{-2x}.$
□
8.L.8.   Determine the general solution of the equation
y"-y' = 5.
Solution. The characteristic polynomial of the equation is $x^2 - x$, with roots $1$, $0$. Therefore, the general solution of the homogenized equation is $c_1 + c_2 e^x$, where $c_1, c_2 \in \mathbb{R}$. We are looking for a particular solution in the form $ax$, $a \in \mathbb{R}$, using the method of undetermined coefficients. The result is $a = -5$, and the general solution is of the form
$c_1 + c_2 e^x - 5x.$
□
8.L.9.   Solve the equation
$y'' - 2y' + y = \frac{e^x}{x^2 + 1}.$
Solution. We will solve this non-homogeneous equation using the method of variation of constants. We will thus obtain the solution in the form
$y = C_1(x)y_1(x) + C_2(x)y_2(x) + \dots + C_n(x)y_n(x),$
where $y_1, \dots, y_n$ give the general solution of the corresponding homogeneous equation and the functions $C_1(x), \dots, C_n(x)$ can be obtained from the system
$C_1'(x)\,y_1(x) + \dots + C_n'(x)\,y_n(x) = 0,$
$C_1'(x)\,y_1'(x) + \dots + C_n'(x)\,y_n'(x) = 0,$
$\vdots$
$C_1'(x)\,y_1^{(n-2)}(x) + \dots + C_n'(x)\,y_n^{(n-2)}(x) = 0,$
$C_1'(x)\,y_1^{(n-1)}(x) + \dots + C_n'(x)\,y_n^{(n-1)}(x) = f(x).$
the system of equations
$y_0' = y_1,$
$y_1' = y_2,$
$\vdots$
$y_{k-2}' = y_{k-1},$
$y_{k-1}' = f(t, y_0, y_1, \dots, y_{k-1}).$
Hence the following direct corollary of the theorems 8.3.9-8.3.12:
Solutions of higher-order ODEs
Theorem. Consider a function $f(t, y_0, \dots, y_{k-1}) : U \subset \mathbb{R}^{k+1} \to \mathbb{R}$ with continuous partial derivatives on an open set $U$. Then for every point $(t_0, z_0, \dots, z_{k-1}) \in U$, there exists a maximal interval $I_{max} = [t_0 - a, t_0 + b]$, with positive numbers $a, b \in \mathbb{R}$, and a unique function $y(t) : I_{max} \to \mathbb{R}$ which is a solution of the $k$-th order equation
$y^{(k)} = f(t, y, y', \dots, y^{(k-1)})$
with the initial condition
$y(t_0) = z_0, \quad y'(t_0) = z_1, \quad \dots, \quad y^{(k-1)}(t_0) = z_{k-1}.$
Moreover, this solution depends differentiably on the initial conditions and on potential further parameters differentiably entering the function $f$.
In particular, the theorem shows that in order to determine unambiguously the solution of an ordinary $k$-th order differential equation, the values of the solution and its first $k - 1$ derivatives must be determined at one point.
With a system of $\ell$ equations of order $k$, the same procedure transforms this system to a system of $k\ell$ first-order equations. Therefore, an analogous statement about existence, uniqueness, continuity, and differentiability is also true.
If the right-hand side $f$ of the equation is differentiable up to order $r$ or analytic, including the parameters, then the same property is enjoyed by the solutions as well.
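The reduction of a $k$-th order equation to a first-order system from 8.3.16 is also how such equations are usually handed to a numerical solver. The following is a minimal sketch: `as_first_order_system` builds the right-hand side of the system $y_0' = y_1, \dots, y_{k-1}' = f(t, y_0, \dots, y_{k-1})$, and `rk4` is a plain classical Runge-Kutta integrator chosen here only for illustration (the text itself does not prescribe any solver). The test equation is $y'' + y' = 1$ from 8.L.6 with the assumed initial conditions $y(0) = 0$, $y'(0) = 0$, whose exact solution is $-1 + e^{-t} + t$.

```python
import math


def as_first_order_system(f):
    """Turn y^(k) = f(t, y, y', ..., y^(k-1)) into Y' = G(t, Y), Y = [y_0, ..., y_{k-1}]."""
    def G(t, Y):
        # y_0' = y_1, ..., y_{k-2}' = y_{k-1}, and y_{k-1}' = f(t, y_0, ..., y_{k-1})
        return Y[1:] + [f(t, *Y)]
    return G


def rk4(G, t0, Y0, t1, steps):
    """Classical fourth-order Runge-Kutta method with a fixed step."""
    h = (t1 - t0) / steps
    t, Y = t0, list(Y0)
    for _ in range(steps):
        k1 = G(t, Y)
        k2 = G(t + h / 2, [y + h / 2 * k for y, k in zip(Y, k1)])
        k3 = G(t + h / 2, [y + h / 2 * k for y, k in zip(Y, k2)])
        k4 = G(t + h, [y + h * k for y, k in zip(Y, k3)])
        Y = [y + h / 6 * (a + 2 * b + 2 * c + d)
             for y, a, b, c, d in zip(Y, k1, k2, k3, k4)]
        t += h
    return Y


# y'' = 1 - y'  (equation 8.L.6 solved for the highest derivative)
G = as_first_order_system(lambda t, y, dy: 1.0 - dy)
y_end, dy_end = rk4(G, 0.0, [0.0, 0.0], 2.0, 200)
exact = -1.0 + math.exp(-2.0) + 2.0
print(y_end, exact)
```

The two initial values passed to `rk4` are exactly the data the theorem requires: the value of the solution and of its first $k - 1$ derivatives at one point.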
8.3.17. Linear differential equations. The operation of differentiation can be viewed as a linear mapping from (sufficiently) smooth functions to functions. Multiplying the derivatives of the particular orders $j$ by fixed functions $a_j(t)$ and adding these expressions gives the linear differential operators $y(t) \mapsto D(y)(t)$:
$D(y)(t) = a_k(t)\,y^{(k)}(t) + \dots + a_1(t)\,y'(t) + a_0(t)\,y(t).$
To solve the corresponding homogeneous linear differential equation of order $k$ then means finding a function $y$ satisfying
$D(y) = 0.$
The sum of two solutions is again a solution, since for any functions $y_1$ and $y_2$,
$D(y_1 + y_2)(t) = D(y_1)(t) + D(y_2)(t).$
The roots of the characteristic polynomial $\lambda^2 - 2\lambda + 1$ are $\lambda_1 = \lambda_2 = 1$. Therefore, we are looking for the solution in the form
$C_1(x)\,e^x + C_2(x)\,xe^x,$
considering the system
$C_1'(x)\,e^x + C_2'(x)\,xe^x = 0,$
$C_1'(x)\,e^x + C_2'(x)\,[e^x + xe^x] = \frac{e^x}{x^2 + 1}.$
We can compute the unknowns $C_1'(x)$ and $C_2'(x)$ using Cramer's rule. It follows from
$\begin{vmatrix} e^x & xe^x \\ e^x & e^x + xe^x \end{vmatrix} = e^{2x}, \quad \begin{vmatrix} 0 & xe^x \\ \frac{e^x}{x^2+1} & e^x + xe^x \end{vmatrix} = -\frac{xe^{2x}}{x^2+1}, \quad \begin{vmatrix} e^x & 0 \\ e^x & \frac{e^x}{x^2+1} \end{vmatrix} = \frac{e^{2x}}{x^2+1}$
that
$C_1(x) = -\int \frac{x}{x^2 + 1}\,dx = -\tfrac{1}{2}\ln(x^2 + 1) + C_1, \quad C_1 \in \mathbb{R},$
$C_2(x) = \int \frac{dx}{x^2 + 1} = \arctan x + C_2, \quad C_2 \in \mathbb{R}.$
Hence, the general solution is
$y = C_1 e^x + C_2\,xe^x - \tfrac{1}{2}e^x\ln(x^2 + 1) + xe^x\arctan x, \quad C_1, C_2 \in \mathbb{R}.$
□
8.L.10. Find the only function $y$ which satisfies the linear differential equation
$y''' - 3y' - 2y = 2e^x,$
with initial conditions $y(0) = 0$, $y'(0) = 0$, $y''(0) = 0$.
Solution. The characteristic polynomial is $x^3 - 3x - 2$, with roots $2$ and $-1$ (double). We are looking for a particular solution in the form $ae^x$, $a \in \mathbb{R}$, easily finding out that it is the function $-\tfrac{1}{2}e^x$. The general solution of the given equation is thus
$c_1 e^{2x} + c_2 e^{-x} + c_3\,xe^{-x} - \tfrac{1}{2}e^x.$
Substituting into the initial conditions, we get the only satisfactory function,
$\tfrac{2}{9}e^{2x} + \tfrac{5}{18}e^{-x} + \tfrac{1}{3}xe^{-x} - \tfrac{1}{2}e^x.$
□
Further problems concerning higher-order differential equations can be found on page 599.
A constant multiple of a solution is again a solution. So the set of all solutions of a $k$-th order linear differential equation is a vector space. Apply the previous theorem about existence and uniqueness to obtain the following:
The space of solutions of linear equations
Theorem. The set of all solutions of a homogeneous linear differential equation of order $k$ with continuously differentiable coefficients is a vector space of dimension $k$. Therefore, the solutions can be described as linear combinations of any set of $k$ linearly independent solutions. Such solutions are determined uniquely by linearly independent initial conditions on the value of the function $y(t)$ and its first $k - 1$ derivatives at a fixed point $t_0$.
Proof. Choose $k$ linearly independent initial conditions at a fixed point. For each of them, there is a unique solution. A linear combination of these initial conditions then leads to the same linear combination of the corresponding solutions. All of the possible initial conditions are exhausted, so the entire space of solutions of the equation is obtained in this way. □
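The correspondence between initial conditions and coefficients of a basis of solutions is just a linear system. As an illustration, the sketch below recomputes the constants of exercise 8.L.10 in exact rational arithmetic. The helper `solve_linear` is a small Gauss-Jordan elimination written for this example only; the rows of `M` hold the values of $e^{2x}$, $e^{-x}$, $xe^{-x}$ and of their first two derivatives at $x = 0$, and the right-hand side compensates for the particular solution $-\tfrac{1}{2}e^x$.

```python
from fractions import Fraction


def solve_linear(M, b):
    """Gauss-Jordan elimination with exact rational arithmetic (M assumed invertible)."""
    n = len(M)
    A = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [row[-1] for row in A]


# Columns: e^{2x}, e^{-x}, x e^{-x}; rows: value, first and second derivative at 0.
M = [[1, 1, 0],
     [2, -1, 1],
     [4, 1, -2]]
# y(0) = y'(0) = y''(0) = 0, and -e^x/2 contributes -1/2 to each of them.
b = [Fraction(1, 2)] * 3
coeffs = solve_linear(M, b)
print(coeffs)  # [Fraction(2, 9), Fraction(5, 18), Fraction(1, 3)]
```

The matrix `M` is invertible precisely because the three basis solutions are linearly independent, which is the content of the theorem.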
The same arguments as with the first order linear differential equations in the paragraph 8.3.4 reveal that all solutions of the non-homogeneous $k$-th order equation $D(y) = b(t)$ with a fixed continuous function $b(t)$ are the sums of one fixed solution of this problem and all solutions $y$ of the corresponding homogeneous equation. Thus the entire space of solutions is an affine $k$-dimensional space of functions. The method of variation of constants exploited in 8.3.4 is one of the possible approaches to finding one solution of the non-homogeneous problem if we know the complete solution to the homogeneous problem.
We shall illustrate the latter results on the simplest case:
8.3.18. Linear equations with constant coefficients. The previous discussion recalls the situation with homogeneous linear difference equations dealt with in paragraph 3.2.1 of the third chapter. The analogy goes further when all of the coefficients $a_j$ of the differential operator $D$ are constant. Such first-order equations (1) have solutions in the form of an exponential with an appropriate constant in the argument. Just as in the case of difference equations, it suggests trying whether such a form of the solution $y(t) = e^{\lambda t}$ with an unknown parameter $\lambda$ can satisfy an equation of order $k$. Substitution yields
$D(e^{\lambda t}) = (a_k\lambda^k + a_{k-1}\lambda^{k-1} + \dots + a_1\lambda + a_0)\,e^{\lambda t}.$
The parameter $\lambda$ leads to a solution of a linear differential equation with constant coefficients if and only if $\lambda$ is a root of the characteristic polynomial $a_k\lambda^k + \dots + a_1\lambda + a_0$.
If this polynomial has $k$ distinct roots, then we have the basis of the whole vector space of solutions. Otherwise, if $\lambda$ is a multiple root, then direct calculation, making use of the fact that $\lambda$ is then a root of the derivative of the characteristic polynomial as well, yields that the function $y(t) = te^{\lambda t}$ is also a solution. Similarly, for higher multiplicities $\ell$, there are $\ell$ distinct solutions $e^{\lambda t}, te^{\lambda t}, \dots, t^{\ell-1}e^{\lambda t}$.
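The "direct calculation" for a double root can be written out as follows. This is a short worked derivation, with the shorthand $P(\lambda)$ for the characteristic polynomial introduced here only for convenience.

```latex
% P(\lambda) = a_k\lambda^k + \dots + a_1\lambda + a_0 is the characteristic polynomial.
% Differentiating t e^{\lambda t} repeatedly gives, for every j >= 0,
\frac{d^j}{dt^j}\bigl(t e^{\lambda t}\bigr) = \bigl(t\lambda^j + j\lambda^{j-1}\bigr)e^{\lambda t}.
% Multiplying by a_j and summing over j = 0, ..., k yields
D\bigl(t e^{\lambda t}\bigr)
  = \Bigl(t\sum_{j=0}^{k} a_j\lambda^j + \sum_{j=1}^{k} j a_j\lambda^{j-1}\Bigr)e^{\lambda t}
  = \bigl(t\,P(\lambda) + P'(\lambda)\bigr)e^{\lambda t}.
% Hence D(t e^{\lambda t}) = 0 for all t exactly when P(\lambda) = 0 and P'(\lambda) = 0,
% i.e. when \lambda is a root of multiplicity at least two.
```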
M. Applications of the Laplace transform
Differential equations with constant coefficients can also be solved using the Laplace transform.
8.M.1.   Let $\mathcal{L}(y)(s)$ denote the Laplace transform of a function $y(t)$. Integrating by parts, prove that
(1) $\mathcal{L}(y')(s) = s\mathcal{L}(y)(s) - y(0),$
$\mathcal{L}(y'')(s) = s^2\mathcal{L}(y)(s) - sy(0) - y'(0),$
and, by induction:
$\mathcal{L}(y^{(n)})(s) = s^n\mathcal{L}(y)(s) - \sum_{i=1}^{n} s^{n-i}\,y^{(i-1)}(0).$ □
8.M.2.   Find the function $y(t)$ which satisfies the differential equation
$y''(t) + 4y(t) = \sin 2t$
as well as the initial conditions $y(0) = 0$, $y'(0) = 0$.
Solution. It follows from the above exercise 7.H.8 that
$s^2\mathcal{L}(y)(s) + 4\mathcal{L}(y)(s) = \mathcal{L}(\sin 2t)(s).$
We also have
$\mathcal{L}(\sin 2t)(s) = \frac{2}{s^2 + 4},$
i.e.,
$\mathcal{L}(y)(s) = \frac{2}{(s^2 + 4)^2}.$
The inverse transform leads to
$y(t) = \tfrac{1}{8}\sin 2t - \tfrac{1}{4}t\cos 2t.$ □
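The result of 8.M.2 can be cross-checked by approximating the Laplace integral $\int_0^\infty e^{-st}y(t)\,dt$ numerically and comparing it with $2/(s^2+4)^2$. The sketch below uses composite Simpson's rule on a truncated interval; the truncation point `T`, the number of subintervals `n`, the sample value `s = 1.5` and the function names are assumptions of this illustration only.

```python
import math


def y(t):
    """Solution of y'' + 4y = sin 2t, y(0) = y'(0) = 0, found in 8.M.2."""
    return math.sin(2 * t) / 8 - t * math.cos(2 * t) / 4


def laplace_numeric(f, s, T=40.0, n=40000):
    """Simpson approximation of the integral of exp(-s*t) * f(t) over [0, T]; n must be even."""
    h = T / n
    total = f(0.0) + math.exp(-s * T) * f(T)
    for i in range(1, n):
        t = i * h
        total += (4 if i % 2 else 2) * math.exp(-s * t) * f(t)
    return total * h / 3


s = 1.5
approx = laplace_numeric(y, s)
expected = 2.0 / (s ** 2 + 4.0) ** 2
print(approx, expected)
```

For $s = 1.5$ the neglected tail beyond $T = 40$ is of the order $e^{-60}$, so truncation does not affect the comparison.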
8.M.3.   Find the function $y(t)$ which satisfies the differential equation
$y''(t) + 6y'(t) + 9y(t) = 50\sin t$
and the initial conditions $y(0) = 1$, $y'(0) = 4$.
Solution. The Laplace transform yields
$s^2\mathcal{L}(y)(s) - s - 4 + 6\,(s\mathcal{L}(y)(s) - 1) + 9\mathcal{L}(y)(s) = 50\,\mathcal{L}(\sin t)(s),$
i.e.,
$(s^2 + 6s + 9)\,\mathcal{L}(y)(s) = \frac{50}{s^2 + 1} + s + 10,$
$\mathcal{L}(y)(s) = \frac{50}{(s^2 + 1)(s + 3)^2} + \frac{s + 10}{(s + 3)^2}.$
Decomposing the first term to partial fractions, we obtain
$\frac{50}{(s^2 + 1)(s + 3)^2} = \frac{As + B}{s^2 + 1} + \frac{C}{s + 3} + \frac{D}{(s + 3)^2},$
$50 = (As + B)(s + 3)^2 + C(s^2 + 1)(s + 3) + D(s^2 + 1).$
Substituting $s = -3$, we get
$50 = 10D, \quad \text{hence} \quad D = 5,$
and comparing the coefficients at $s^3$, we have
$0 = A + C, \quad \text{hence} \quad A = -C.$
Comparing the coefficients at $s$, we obtain
$0 = 9A + 6B + C = 8A + 6B, \quad \text{hence} \quad B = \tfrac{4}{3}C.$
Finally, comparing the absolute term, we infer
$50 = 9B + 3C + D = 12C + 3C + 5, \quad \text{hence} \quad C = 3,\ B = 4,\ A = -3.$
Since
$\frac{s + 10}{(s + 3)^2} = \frac{s + 3 + 7}{(s + 3)^2} = \frac{1}{s + 3} + \frac{7}{(s + 3)^2},$
we have
$\mathcal{L}(y)(s) = \frac{-3s + 4}{s^2 + 1} + \frac{3}{s + 3} + \frac{5}{(s + 3)^2} + \frac{1}{s + 3} + \frac{7}{(s + 3)^2} = \frac{-3s}{s^2 + 1} + \frac{4}{s^2 + 1} + \frac{4}{s + 3} + \frac{12}{(s + 3)^2}.$
Now, the inverse Laplace transform yields the solution in the form
$y(t) = -3\cos t + 4\sin t + 4e^{-3t} + 12te^{-3t}.$
□
In the case of a general linear differential equation, a nonzero value of the differential operator $D$ is wanted. Again, as for systems of linear equations or linear difference equations, the general solution of this type of (non-homogeneous) equations
$D(y) = b(t),$
for a fixed function $b(t)$, is the sum of an arbitrary solution of this equation and the set of all solutions of the corresponding homogeneous equation $D(y)(t) = 0$. The entire space of solutions is a finite-dimensional affine space, hidden in the huge space of functions.
The methods for finding a particular solution are introduced in concrete examples in the other column. In principle, they are based on looking for the solution in a form similar to that of the right-hand side, or on the method of variation of the constants.
8.3.19. Matrix systems with constant coefficients. Before leaving the area of differential equations, consider a very special case of first-order systems, whose right-hand side is given by multiplication of a matrix $A \in \mathrm{Mat}_n(\mathbb{R})$ of constant coefficients and an $n^2$-dimensional unknown matrix function $Y(t)$:
(1) $Y'(t) = A \cdot Y(t).$
Clearly this is a strict analogy to the iterative models in chapter 3.
Combine knowledge from linear algebra and univariate function analysis to guess the solution. Define the exponential of a matrix by the formula
$B(t) = e^{tA} = \sum_{k=0}^{\infty} \frac{t^k}{k!}A^k.$
The right-hand expression can be formally viewed as a matrix whose entries $b_{ij}$ are infinite series created from the mentioned products. If all entries of $A$ are estimated by the maximum of their absolute values $\|A\| = C$, then the $k$-th summand in $b_{ij}(t)$ is at most $\frac{t^k}{k!}n^kC^k$ in absolute value. Hence, every series $b_{ij}(t)$ is necessarily absolutely and uniformly convergent, and it is bounded above by the value $e^{tnC}$. Differentiate the terms of the series one by one, to get a uniformly convergent series with limit $Ae^{tA}$. Therefore, by the general properties of uniformly convergent series, the derivative
$\frac{d}{dt}e^{tA} = Ae^{tA}$
also equals this expression. The general solution of the system (1) is obtained in the form
$Y(t) = e^{tA} \cdot Y_0,$
where $Y_0 \in \mathrm{Mat}_n(\mathbb{R})$ is the arbitrary initial condition $Y(0) = Y_0$. The exponential $e^{tA}$ is a well defined invertible matrix for all $t$. So we have a vector space of the proper dimension, and hence all solutions to the system (1). Notice that in order to get a solution, it is necessary to multiply by $Y_0$ from the right.
It is remarkable that when dealing with a vector equation with a constant matrix $A \in \mathrm{Mat}_n(\mathbb{R})$,
(2) $y'(t) = A \cdot y(t),$
for an unknown function $y : \mathbb{R} \to \mathbb{R}^n$, the columns of the matrix exponential $e^{tA}$ provide $n$ linearly independent solutions. The general solution is then given by linear combinations of them.
The general solutions of the system (2) may be understood better by invoking some linear algebra, namely the Jordan canonical form of linear mappings, see e.g. 3.4.10. In terms of vector fields $X$, the system has the linear expression $X(y) = \Phi(y)$ where $\Phi$ is the linear mapping with the matrix $A$ in coordinates. Clearly linear transformations of the system lead to another vector field with such linear description, since the differential of a linear mapping is the mapping itself.
Any linear transformation of coordinates $\tilde{y} = Ty$ with the (constant) matrix $T$ transforms the system into
$\tilde{y}' = (Ty)' = (TAT^{-1}) \cdot (Ty) = \tilde{A} \cdot \tilde{y}, \quad \tilde{A} = TAT^{-1}.$
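The series defining the matrix exponential in 8.3.19 can be summed directly for small matrices. The following is a minimal pure-Python sketch, with the helper names `mat_mul` and `mat_exp` and the number of terms chosen only for illustration. It checks two closed forms: the generator of rotations, whose exponential is the rotation matrix, and a $2 \times 2$ matrix with the double eigenvalue $-3$ (the characteristic root met in 8.M.3), whose exponential is $e^{-3t}\begin{pmatrix}1 & t\\ 0 & 1\end{pmatrix}$.

```python
import math


def mat_mul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def mat_exp(A, t, terms=40):
    """Partial sum of the series sum_k (t^k / k!) A^k."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current summand (tA)^k / k!
    for k in range(1, terms):
        term = [[t / k * v for v in row] for row in mat_mul(term, A)]
        result = [[r + v for r, v in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


# Rotation generator: exp(tA) = [[cos t, -sin t], [sin t, cos t]].
R = mat_exp([[0.0, -1.0], [1.0, 0.0]], 1.0)

# Double eigenvalue -3 with a nontrivial nilpotent part.
t = 0.5
J = mat_exp([[-3.0, 1.0], [0.0, -3.0]], t)
print(R)
print(J, math.exp(-3 * t), t * math.exp(-3 * t))
```

The columns of `J` are the two independent solutions $e^{-3t}(1, 0)$ and $e^{-3t}(t, 1)$ of the corresponding vector system.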
8.M.4. Find the function $y(t)$ which satisfies the differential equation
$y''(t) = \cos(\pi t) - y(t), \quad t \in (0, +\infty),$
and the initial conditions $y(0) = c_1$, $y'(0) = c_2$.
Solution. First, we should emphasize that it follows from the theory of ordinary differential equations that this equation has a unique solution. Further, we should recall that
$\mathcal{L}(f'')(s) = s^2\mathcal{L}(f)(s) - s\lim_{t\to 0+} f(t) - \lim_{t\to 0+} f'(t)$
and
$\mathcal{L}(\cos(bt))(s) = \frac{s}{s^2 + b^2}, \quad b \in \mathbb{R}.$
Applying the Laplace transform to the given differential equation then gives
$s^2\mathcal{L}(y)(s) - sc_1 - c_2 = \frac{s}{s^2 + \pi^2} - \mathcal{L}(y)(s),$
i.e.,
(1) $\mathcal{L}(y)(s) = \frac{s}{(s^2 + 1)(s^2 + \pi^2)} + \frac{c_1 s}{s^2 + 1} + \frac{c_2}{s^2 + 1}.$
Therefore, it suffices to find a function $y$ which satisfies (1). Performing partial fraction decomposition, we obtain
$\frac{s}{(s^2 + 1)(s^2 + \pi^2)} = \frac{1}{\pi^2 - 1}\Big(\frac{s}{s^2 + 1} - \frac{s}{s^2 + \pi^2}\Big).$
The above expression of $\mathcal{L}(\cos(bt))(s)$ and the already proved formula
$\mathcal{L}(\sin t)(s) = \frac{1}{s^2 + 1}$
then yield the wanted solution
$y(t) = \frac{1}{\pi^2 - 1}\,(\cos t - \cos(\pi t)) + c_1\cos t + c_2\sin t.$
□
8.M.5.   Solve the system of differential equations
$x''(t) + x'(t) = y(t) - y''(t) + e^t, \quad x'(t) + 2x(t) = -y(t) + y'(t) + e^{-t}$
with the initial conditions $x(0) = 0$, $y(0) = 0$, $x'(0) = 1$, $y'(0) = 0$.
Solution. Again, we apply the Laplace transform. This, using
$\mathcal{L}(e^{\pm t})(s) = \frac{1}{s \mp 1},$
transforms the first equation to
$s^2\mathcal{L}(x)(s) - s\lim_{t\to 0+}x(t) - \lim_{t\to 0+}x'(t) + s\mathcal{L}(x)(s) - \lim_{t\to 0+}x(t) = \mathcal{L}(y)(s) - \Big(s^2\mathcal{L}(y)(s) - s\lim_{t\to 0+}y(t) - \lim_{t\to 0+}y'(t)\Big) + \frac{1}{s - 1}$
and the second one to
$s\mathcal{L}(x)(s) - \lim_{t\to 0+}x(t) + 2\mathcal{L}(x)(s) = -\mathcal{L}(y)(s) + s\mathcal{L}(y)(s) - \lim_{t\to 0+}y(t) + \frac{1}{s + 1}.$
Evaluating the limits (according to the initial conditions), we obtain the linear equations
$s^2\mathcal{L}(x)(s) - 1 + s\mathcal{L}(x)(s) = \mathcal{L}(y)(s) - s^2\mathcal{L}(y)(s) + \frac{1}{s - 1}$
and
$s\mathcal{L}(x)(s) + 2\mathcal{L}(x)(s) = -\mathcal{L}(y)(s) + s\mathcal{L}(y)(s) + \frac{1}{s + 1}$
with the only solution
$\mathcal{L}(x)(s) = \frac{2s - 1}{2(s - 1)(s + 1)^2}, \quad \mathcal{L}(y)(s) = \frac{3s}{2(s^2 - 1)^2}.$
Once again, we perform partial fraction decomposition, getting
$\mathcal{L}(x)(s) = \frac{1}{8}\,\frac{1}{s - 1} + \frac{3}{4}\,\frac{1}{(s + 1)^2} - \frac{1}{8}\,\frac{1}{s + 1} = \frac{3}{4}\,\frac{1}{(s + 1)^2} + \frac{1}{4}\,\frac{1}{s^2 - 1}.$
Since we have already computed that
$\mathcal{L}(te^{-t})(s) = \frac{1}{(s + 1)^2}, \quad \mathcal{L}(\sinh t)(s) = \frac{1}{s^2 - 1}, \quad \mathcal{L}(t\sinh t)(s) = \frac{2s}{(s^2 - 1)^2},$
we get
$x(t) = \tfrac{3}{4}te^{-t} + \tfrac{1}{4}\sinh t, \quad y(t) = \tfrac{3}{4}t\sinh t.$
We definitely advise the reader to verify that these functions $x$ and $y$ are indeed the wanted solution. The reason is that the Laplace transforms of the functions $y = e^t$, $y = \sinh t$ and $y = t\sinh t$ were obtained only for $s > 1$. □
In particular, a suitable change of coordinates $T$ provides the matrix $A$ in the Jordan canonical form expressing $\Phi$ as a sum of two commuting linear mappings $\Phi = \Phi_d + \Phi_n$ with $\Phi_d$ diagonalisable and $\Phi_n$ nilpotent. Moreover, the decomposition of the nilpotent part into the sum of cyclic nilpotent mappings provides the Jordan blocks
$J_\lambda = (J_\lambda)_d + (J_\lambda)_n, \quad \begin{pmatrix} \lambda & 1 & \dots & 0 \\ 0 & \lambda & \dots & 0 \\ \vdots & & \ddots & 1 \\ 0 & 0 & \dots & \lambda \end{pmatrix} = \begin{pmatrix} \lambda & 0 & \dots & 0 \\ 0 & \lambda & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & \lambda \end{pmatrix} + \begin{pmatrix} 0 & 1 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \vdots & & \ddots & 1 \\ 0 & 0 & \dots & 0 \end{pmatrix}.$
Splitting the system (2) into block-wise diagonal form splits also the space of the solutions generated by the exponential $e^{tA}$ into the corresponding blocks (all the powers $A^k$ enjoy the same block structure). So we can work with the matrix $A$ already in the form of one such block $J_\lambda = (J_\lambda)_d + (J_\lambda)_n$. But for any two commuting matrices $C$ and $D$, the exponentials $e^{tC}$ and $e^{tD}$ commute and $e^{t(C+D)} = e^{tC}e^{tD}$. The exponential $e^{tD}$ of the nilpotent $D = (J_\lambda)_n$ can be computed as the finite sum
$e^{tD} = E + tD + \dots + \frac{t^{k-1}}{(k-1)!}D^{k-1},$
where $k$ is the size of the block and $E$ is the identity matrix. The solution of the corresponding matrix system is
$Y(t) = e^{\lambda tE} \cdot e^{tD} = e^{\lambda t}e^{tD} = \begin{pmatrix} e^{\lambda t} & te^{\lambda t} & \dots & \frac{t^{k-1}}{(k-1)!}e^{\lambda t} \\ 0 & e^{\lambda t} & \dots & \frac{t^{k-2}}{(k-2)!}e^{\lambda t} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & e^{\lambda t} \end{pmatrix}.$
Finally, $k$ independent solutions can be written down by inspecting the individual columns in $Y(t)$. Notice that the canonical basis $(e_1, \dots, e_k)$ provides just the chain of vectors with $D(e_j) = e_{j-1}$, $j = 2, \dots, k$, while $D(e_1) = 0$. Now the $k$ independent solutions are:
$y_1(t) = e^{\lambda t}e_1,$
$y_2(t) = e^{\lambda t}(te_1 + e_2),$
$\vdots$
$y_k(t) = e^{\lambda t}\Big(\frac{t^{k-1}}{(k-1)!}e_1 + \frac{t^{k-2}}{(k-2)!}e_2 + \dots + te_{k-1} + e_k\Big).$
The result can be easily transferred back to the original coordinates in which the system (2) was given. Finding the decomposition of the space into Jordan blocks and finding the chains of basis vectors $v_i$ realizing the cyclic nilpotent components, we arrive at the independent solutions by replacing the $e_i$ by $v_i$.
The findings are summarised in the following theorem, one of many attributed to Euler.
Theorem. All solutions of the system (2) are linear combinations of those in the form combining exponential and polynomial expressions
$y(t) = e^{\lambda t}\sum_{j=0}^{k} p_j t^j,$
where $k$ is the order of nilpotency of the Jordan block corresponding to the eigenvalue $\lambda$ of the matrix $A$, and $p_j$ are suitable constant vectors. In particular, if the nilpotent part of $A$ is trivial, then $k = 0$.
This important result allows many generalizations. For example, the Floquet-Lyapunov theory generalizes this behaviour of solutions to systems with periodically time-dependent matrices $A(t)$.
8.3.20. Return to singular points. Finally, recall the first-order matrix system in paragraph 8.3.12 when the derivative of the solutions of vector equations with respect to the initial conditions was discussed. Consider a differentiable vector field $X(x)$ defined on a neighbourhood of its singular point $x_0 \in \mathbb{R}^n$, i.e. $X(x_0) = 0$. Then, the point $x_0$ is a fixed point of its flow $\mathrm{Fl}^X_t(x)$.
The differential $\Phi(t) = D^1_x\mathrm{Fl}^X_t(x_0)$ satisfies the matrix system with initial condition (see (2) on page 579)
$\Phi'(t) = D^1X(x_0) \cdot \Phi(t), \quad \Phi(0) = E.$
The important point is that the differential $D^1X$ is evaluated along the constant flow line $x_0$, since this is a singular point of the system.
The solution is known explicitly, and this describes the evolution of the differential $\Phi(t)$ of the vector field's flow at the singular point $x_0$,
$\Phi(t) = e^{tA}, \quad A = D^1X(x_0).$
This is a useful step in analysing the qualitative behaviour in a neighbourhood of the stationary point $x_0$.
8.M.6. Find the solution of the following system of differential equations:
x'(ť) = -2x(t) + 3y(t) + 3t2,
y'(t) = -Ax(t) + 5y(t) + e\   x(0) = 1, y(0) = -1 Solution.
C(x')(s) = C(-2x + 3y + 3i2)(s), C{y'){s) = C{-Ax + 5y + é){s).
The left-hand sides can be written using (1), while the right-hand sides can be rewritten thanks to linearity of the £ operator. Since £(3i2)(s) = Jr and£(e*)(s) = we get the system of linear equations
sC(x)(s) - 1 = -2C(x)(s) + 3£(y)(s) + sLZ(y)(s) + 1 = -ALZ(x)(s) + 5LZ(y)(s) + ^.
(x0,yo) =
In matrices, this is A(s)x(s) = b(s), where A(s)
Cramer's rule says that
= W' £(y)(s) = W' where
|A| = |Ai| = |A2| =
s + 2 -3 A     s - 5
W
= sz - 3s + 2 -3
+
s + 2     1 + Jr 4 +
= (s-5)(l + ^) + 3(-l + 3é = (s + 2)(-l + ^)-4-f.
Hence,
C(x)(s) =
£(y){s)
(s-l)(s-2) 1
(*-!)(*-2)
(s-5)(s3 + 6) gs-2 s3 s — 1
(s + 2)(2-s)    4s3+ 24 s — 1 s3 Decomposing to partial fractions, the Laplace images of the solutions can be expressed as follows:
15 _ 87 s3       4s '
Consider the Lotka-Volterra system from this point of view. Use the coordinates (x, y) and parameters a, j3,7, S exactly as in 8.3.10. In particular all these quantities are assumed to be positive.
The vector field in question is
X{x, y) = (x(a - I3y, y(-j + /35x)). So there is a single singular point
J_ «y
5/3' (3>
and the differential of X at this point is
-/fco wo -r
5f3x0 -7 J     \aS 0
The determinant of the characteristic polynomial of the latter matrix is A2 + 7a and so there are two complex conjugated roots A = ±2^/07. As is known from linear algebra, such a matrix describes a rotation in suitable coordinates. Compute the real and imaginary components of the eigenvectors corresponding to A (as developed in linear algebra), to obtain the
matrix solution in the form
1 + — \
s\   ].    /   cos y'cry t — sin y/ajt
The columns are the two independent solutions y(t) (they differ just by the phase of the linearly distorted rotation).
This might be useful information for further analysis of the model around its singular point. For example, the parameter (3 does not appear explicitly here, while S influences the distortion of the flow lines from being circles. Compare this -jlvith the illustrations on page 574. For a first approximation, guess that the sizes of the populations of both the prey and the predator oscillate regularly around the values of the singular point if the initial conditions are near this point.
8.3.21. A note about Markov chains. In the third chapter, we dealt with iterative processes, where the stochastic matrices and Markov processes determined by them played an interesting role. Recall that a matrix A is stochastic if the sum of each of its columns is one. In other words,
(1 ... 1)-^=(1 Take the exponential etA to obtain
1)
LZ(x)(s) —     2s2       (s_l)2 + s_l 4(s_
C(x)(s) = —-If — 7vrrj2 + ^rr —       —     — T"-Now, the inverse transform yields the solution of this Cauchy problem:
x(i) = -ft - 3tet + 28e* - f e2* - ft2 - f, y(t) = -18t - 3tef + 27e* - 7e2* - 6t2 - 21 . □
(1
1)
00 j-k
Eh'1
k=0
1) -Ak = e*(l ... 1).
N. Numerical solution of differential equations
Now, we present two simple exercises on applying the Euler method for solving differential equations.
Therefore, for every t, the invertible matrix
B(t) = e"* etA
is stochastic, if A is stochastic. Add stochastic initial conditions B0, to get the flow B(t) = e~* etA -B0, which is a continuous version of the Markov process (infinitesimally) generated by the stochastic matrix A.
Differentiating with respect to t, yields
i-B{t) = -e~l etA -Bo+e-1 AetA -B0 = (—E + A)B(t),
591
CHAPTER 8. CALCULUS WITH MORE VARIABLES
8.N.I. Use the Euler method to solve the equation y' = —y2 with the initial condition y(l) = 1. Determine the approximate solution on the interval [1,3]. Try to estimate for which value h of the step is the error less than one tenth. Solution. The Euler method for the considered equation is given by
yk+i =Vk~h-yl
for
xo = 1,   yo = 1,   xk = x0 + k-h,   yk = y{xk)-
We begin the procedure with step value h = 1 and halve it in each iteration. The estimate for the "sufficiency" of h will be made somewhat imprecisely by comparing two adjacent approximate values of the function y at common points, terminating the procedure if the maximum of the absolute difference of these values is not greater than the tolerated error (0.1).
The results
h0 = l	
y-°- = (10 0)	
hx = 0.5	
y(i) = (l   0.5   0.375 0.3047	0.2583)
Maximal difference: 0.375.	
h2 = 0.25	
y(2) = (1.0000   0.7500 0.6094	0.5165 0.4498
0.3992   0.3594 0.3271	0.3004)
Maximal difference: 0.1094.	
h3 = 0.125
y(3) = (1.0000   0.8750   0.7793   0.7034 0.6415 0.5901   0.5466   0.5092   0.4768 0.4484 0.4233   0.4009   0.3808 0.3627 0.3462   0.3312 0.3175) Maximal difference: 0.0322. Using suitable software, the following graphical representation of the results can be obtained, where the dashed curve corresponds to the exact solution, which is the function y = 1 jx.
so the matrix B(t) is the solution of the matrix system of equations with constant coefficients
Y'(t) = (A-E)-Y(t)
with the stochastic matrix A.
This can be explained intuitively. If the matrix $A$ is stochastic, then the instantaneous increment of the stochastic vector $y(t)$ in the vector system with the matrix $A$, $y'(t) = A \cdot y(t)$, is again a stochastic vector. However, it is desired that the Markov process keeps the vector $y(t)$ stochastic for all $t$. Hence, the sum of increments of the particular components of the vector $y(t)$ must be zero, which is guaranteed by subtracting the identity matrix.
As seen above, the columns of the matrix solution $Y(t)$ create a basis of all solutions $y(t)$ of the vector system.
Much information can be obtained about the solutions by using some linear algebra. For example, suppose that the matrix $A$ is primitive, that is, suppose one of its powers has only positive entries, see 3.3.3 on page 163. Then its powers converge to a matrix $A_\infty$, all of whose columns are eigenvectors corresponding to the eigenvalue 1.
Next, estimate the difference between the solution $Y(t)$ for large $t$ and the constant matrix $A_\infty$. There are two consequences of the latter convergence. First, there exists a universal constant bound for all powers, $\|A^k - A_\infty\| < C$. Second, for every small positive $\varepsilon$, there is an $N \in \mathbb{N}$ such that for all $k > N$, $\|A^k - A_\infty\| < \varepsilon$. Hence,
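$$\|Y(t) - A_\infty\| = \Big\| e^{-t}\sum_{k\ge 0}\frac{t^k}{k!}\,(A^k - A_\infty)\Big\| \le e^{-t}\sum_{k<N}\frac{t^k}{k!}\,C\,\|A_\infty\| + \varepsilon\,\|A_\infty\|.$$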
Let $t \to \infty$. The limit of the expression $j(t) = e^{-t}\sum_{k<N}\frac{t^k}{k!}$ can easily be computed by iterative application of l'Hospital's rule. Differentiation of the sum yields the same, only for $N$ smaller by one, and the derivative in the denominator is not changed, so the limit is zero. Therefore, for fixed $\varepsilon$, there is a time $T$ such that $j(t)$ is less than $\varepsilon$ for $t > T$. The whole expression has been estimated (for $n > N$ and $t > T > 0$) by the value
$$\varepsilon\,(C+1)\,\|A_\infty\|.$$
Summarizing, a very interesting statement is proved, which resembles the discrete version of Markov processes:
CHAPTER 8. CALCULUS WITH MORE VARIABLES
□
8.N.2. Using the Euler method, solve the equation $y' = -2y$ with the initial condition $y(0) = 1$ and step value $h = 1$. Explain the phenomenon which occurs here and suggest another procedure.
Solution. In this case, the Euler method is given by
$$y_{k+1} = y_k - h\cdot 2y_k = -y_k.$$
For the initial condition $y_0 = 1$, we get the alternating values $\pm 1$ as the result. This is a typical manifestation of the instability of this method for large step values $h$. If the step cannot be reduced for some reason (for instance, when processing digital data, the step value is fixed), better results can be achieved by the so-called implicit Euler method. For a general equation $y' = f(x, y)$, it is given by the formula
$$y_{k+1} = y_k + h\cdot f(x_{k+1}, y_{k+1}).$$
In general, we thus have to solve a non-linear equation in each step. However, in our problem, we get
$$y_{k+1} = y_k - 2h\cdot y_{k+1},$$
so we have $y_{k+1} = \frac{1}{3}\,y_k$ for $h = 1$. Again, the obtained results can be represented graphically, including the exact solution of the equation.
□
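The contrast between the two schemes in 8.N.2 can be reproduced with the following minimal sketch. It is restricted to the linear test equation $y' = \lambda y$, where the implicit step can be solved in closed form; the function names are illustrative assumptions.

```python
def explicit_euler(lam, y0, h, steps):
    """Explicit Euler for y' = lam * y: y_{k+1} = y_k + h * lam * y_k."""
    ys = [y0]
    for _ in range(steps):
        ys.append(ys[-1] + h * lam * ys[-1])
    return ys


def implicit_euler(lam, y0, h, steps):
    """Implicit Euler for y' = lam * y: solve y_{k+1} = y_k + h * lam * y_{k+1}."""
    ys = [y0]
    for _ in range(steps):
        ys.append(ys[-1] / (1 - h * lam))
    return ys


# y' = -2y, y(0) = 1, step h = 1 as in the exercise.
explicit = explicit_euler(-2.0, 1.0, 1.0, 4)   # alternates between 1 and -1
implicit = implicit_euler(-2.0, 1.0, 1.0, 4)   # decays like (1/3)**k
```

For a non-linear right-hand side, the division in `implicit_euler` would have to be replaced by a numerical solver for the implicit equation in each step.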
Continuous processes with a stochastic matrix
Theorem. Every primitive stochastic matrix $A$ determines a vector system of equations
$$y'(t) = (A - E)\cdot y(t)$$
with the following properties:
• the basis of the vector space of all solutions is given by the columns of the stochastic matrix
$$Y(t) = e^{-t}\,e^{tA},$$
• if the initial condition $y_0 = y(t_0)$ is a stochastic vector, then the solution $y(t)$ is also a stochastic vector for all values of $t$,
• every stochastic solution converges for $t \to \infty$ to the stochastic eigenvector $y_\infty$ of the matrix $A$ corresponding to the eigenvalue 1.
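The statements of the theorem can be checked numerically on a small example. The sketch below approximates $Y(t) = e^{-t}e^{tA}$ by the truncated series $\sum_k e^{-t}\frac{t^k}{k!}A^k$; the $2\times 2$ column-stochastic matrix and the helper names are assumptions chosen only for illustration.

```python
import math


def mat_mul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_flow(A, t, terms=200):
    """Approximate Y(t) = e^{-t} e^{tA} by sum_{k < terms} e^{-t} t^k / k! * A^k."""
    n = len(A)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0
    weight = math.exp(-t)             # e^{-t} t^k / k! for k = 0
    Y = [[0.0] * n for _ in range(n)]
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                Y[i][j] += weight * power[i][j]
        power = mat_mul(power, A)
        weight *= t / (k + 1)
    return Y


# Hypothetical primitive stochastic matrix (columns sum to one).
A = [[0.9, 0.5],
     [0.1, 0.5]]
Y1 = markov_flow(A, 1.0)
Y30 = markov_flow(A, 30.0)
column_sums = [Y1[0][j] + Y1[1][j] for j in range(2)]
```

For this matrix the eigenvector for the eigenvalue 1 is $(5/6, 1/6)$, and both columns of `Y30` are already very close to it, while the columns of `Y1` still sum to one.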
8.3.22. Remarks on numerical methods. Except for the exceptionally simple equations, for example, linear equations with constant coefficients, analytically solvable equations are seldom encountered in practice. Therefore, some techniques are required to approximate the solutions of the equations.
Approximations have already been considered in many other situations. (Recall the interpolation polynomials and splines, exploitation of Taylor polynomials in methods for numerical differentiation and integration, Fourier series etc.). With a little courage, consider difference and differential equations to be mutual approximations. In one direction, replace differences with differentials (for example, in economical or population models). For other situations the differences may imitate well continuous changes in models.
Use the terminology for asymptotic estimates, as introduced in 6.1.16. In particular, an expression $G(h)$ is asymptotically equal to $F(h)$ for $h$ approaching zero or infinity, and we write $G(h) = O(F(h))$, if the finite limit of $G(h)/F(h)$ exists.
A good example is the approximation of a multivariate function $f(x)$ by its Taylor polynomial of order $k$ at a point $x_0$. Taylor's theorem says that the error of this approximation is $O(\|h\|^{k+1})$, where $h$ is the increment of the argument, $h = x - x_0$.
In the case of ordinary differential equations, the simplest scheme is approximation using Euler polygons. We present this method for a single ordinary equation with two quantities: one independent and one dependent. It works analogously for systems of equations where scalar quantities and their derivatives in time $t$ are replaced with vectors dependent on time and their derivatives. This procedure was used before in the proof of Peano's existence theorem, see 8.3.8.
Consider an equation
y' = f(t,y)
with continuous right-hand side $f$. Denote the discrete increment of time by $h$, i.e. set $t_n = t_0 + nh$. It is desired to approximate $y(t)$. It follows from Taylor's theorem (with remainder of order two) and the equation that
$$y(t_{n+1}) = y(t_n) + y'(t_n)\,h + O(h^2) = y(t_n) + f(t_n, y(t_n))\,h + O(h^2).$$
Define recurrently the values $y_j$ by the first order formula
$$y_{j+1} = y_j + f(t_j, y_j)\,h.$$
This leads to the local approximation error $O(h^2)$, occurring in one step of the recurrence.
If $n$ such steps are needed with increment $h$ from $t_0$ to $t = t_n$, the error could be up to $n\,O(h^2) = \frac{1}{h}(t - t_0)\,O(h^2) = O(h)$. More care is needed, since the function $f(t, y)$ is evaluated at the points $(t_j, y_j)$, which use the already approximate previous values $y_j$. In order to keep control, $f(t, y)$ must be Lipschitz in $y$. Assuming inductively that the estimate is true for all $i < j$,
$$|f(t_j, y(t_j)) - f(t_j, y_j)| \le C\,|y(t_j) - y_j| \le C\,|t - t_0|\,O(h),$$
where $C$ is the Lipschitz constant, assuming that the error does not exceed $O(h)$ with a globally valid constant for $y_j$. Inductively, the expected bound $O(h)$ for the global error estimate is obtained. Think about the details.
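The first order behaviour of the global error can be observed experimentally: halving $h$ should roughly halve the error at a fixed time. A minimal sketch follows, using the test problem $y' = y$, $y(0) = 1$ (an assumption made only because its exact solution $e^t$ is known) and a hypothetical helper `euler_final`.

```python
import math


def euler_final(f, t0, y0, t_end, n):
    """Value at t_end after n Euler steps y_{j+1} = y_j + f(t_j, y_j) * h."""
    h = (t_end - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y += f(t, y) * h
        t += h
    return y


# Global error at t = 1 for n = 100, 200, 400 steps (h = 1/n).
errors = {n: abs(math.e - euler_final(lambda t, y: y, 0.0, 1.0, 1.0, n))
          for n in (100, 200, 400)}
ratios = (errors[100] / errors[200], errors[200] / errors[400])
```

Both entries of `ratios` come out close to 2, in agreement with the $O(h)$ bound.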
The Euler procedure is the simplest method within the class of the Runge-Kutta methods.
When dealing with higher order equations, we may view them as vector valued first order systems (as in the theoretical column), and then even the Euler method provides results for initial conditions prescribing the necessary number of derivatives at one point. But in practical problems, it is often necessary to find solutions passing through more than one prescribed point. For example, with second order equations, we may prescribe two values $y(t_1)$ and $y(t_2)$ of the solution. This would need completely different methods.
O. Additional exercises to the whole chapter
8.O.1. A basin with volume 300 hl contains 100 hl of water in which 50 kg of salt is dissolved. Water with 2 kg of salt per 1 hl starts flowing into the basin at 6 hl/min. The mixture, being kept homogeneous by permanent stirring, leaves the basin at 4 hl/min. Express the amount of salt (in kg) in the basin after $t$ minutes have expired as a function of the variable $t \in [0, 100]$.
○
8.O.2. During a controlled experiment, a small smelting furnace is slowly cooling down while the outer temperature keeps at 300 K. The experiment began at noon. At 1 pm, the temperature in the furnace was estimated at 1300 K. At 3 pm, it was only 550 K. Supposing the measurements were accurate, compute what the temperature in the furnace was at 2 pm. O
8.O.3. The half-life of the radioactive sulfur isotope 35S is 87.5 days. After what period are there only 900 grams left of the original amount of 1 kilogram of this isotope? (You may express the result in terms of the natural logarithm.) O
8.O.4. The half-time of a radioactive element A is 5 years; for an element B, it is 1 year. If we have 5 kg of element B and 1 kg of element A, after what period will we have the same amount of both? (You may express the result in terms of the natural logarithm.) O
8.O.5. The half-time of a radioactive element A is 8 years; for an element B, it is 2 years. If we have 3 kg of element B and 1 kg of element A, after what period will we have the same amount of both? (You may express the result in terms of the natural logarithm.) O
8.O.6. The half-life of the radioactive cobalt isotope 60 Co is 5.27 years. Having 4 kg of this isotope, after what period does
1 kg of it decay? (You may express the result in terms of the natural logarithm.) O
8.O.7. Solve the following differential equation for the function $y = y(x)$:
$$y' = \frac{1 + y^2}{1 + x^2}.$$
○
8.O.8. Determine all solutions of the following equation with separated variables:
$$y - y^2 + x y' = 0.$$
○
8.O.9. Solve the equation
$$1 + y' = e^y.$$
○
8.O.10. Solve the equation $2y = x^3 y'$. ○
8.O.11. Determine all solutions of the equation
$$\sqrt{4 - y^2}\,dx + y\,dy = 0.$$
○
8.O.12. Solve
$$y' \tan x = y^2 + 1 - 2y.$$
o
8.O.13. Determine the general solution of the differential equation
$$\frac{x^2 + 1}{x} = \frac{y}{1 - y^2}\,y'.$$
○
8.O.14. Find the general solution of the differential equation
$$(x + 1)\,dy + xy\,dx = 0.$$
○
8.O.15. Find the solution of the differential equation
$$\sin y \cos x\,dy = \cos y \sin x\,dx$$
which satisfies $4y(0) = \pi$. ○
8.O.16. Solve the initial problem
$$(x^2 + 1)(y^2 - 1) + x y y' = 0, \qquad y(1) = \sqrt{2}.$$
○
8.O.17. Determine the particular solution of the equation
$$y' \sin x = y \ln y$$
which goes through the point $[\pi/2, e]$. ○
8.O.18. Find all solutions of the differential equation
$$2\,(1 + e^x)\,y y' = e^x$$
which satisfy $y(0) = 0$. ○
8.O.19. Solve the homogeneous equation
$$(x y' - y)\cos\frac{y}{x} = x.$$
○
8.O.20. Determine the general solution of the homogeneous differential equation $y^3 = x^3 y'$. ○
8.O.21. Find all solutions of the equation
$$x y' = \sqrt{x^2 - y^2} + y.$$
○
8.O.22. Determine the general solution if we are given
$$x y' = y \cos\left(\ln\frac{y}{x}\right).$$
○
8.O.23. Solve the equation $(x + y)\,dx - (x - y)\,dy = 0$ as homogeneous. ○
8.O.24. Calculate $y' = (x + y)^2$. ○
8.O.25. Find the general solution for
$$y' = \frac{x - y + 3}{x + y - 5}.$$
○
8.O.26. Calculate
$$y' = \frac{x - y + 1}{x - y}.$$
○
8.O.27. Determine all solutions of the differential equation
$$y' = \frac{5y - 5x - 1}{2y - 2x - 1}.$$
○
8.O.28. Find the general solution of the equation
$$y' = \frac{x - y - 1}{x + y + 3}.$$
○
8.O.29. Determine the general solution for the equation
$$y' = \frac{2x - y - 5}{x - 3y - 5}.$$
○
8.O.30. Express the solutions of the equation
$$y' = \frac{x + 2y - 7}{x - 3}$$
as explicitly given functions. ○
8.O.31. Using the method of constant variation, calculate $y' + 2y = x$. ○
8.O.32. Determine the general solution of the equation $y' = 6x + 2y + 3$. ○
8.O.33. Solve the linear equation
$$y' = 4xy + (2x + 1)\,e^{2x^2}.$$
○
8.O.34. Solve the equation $y'x + y = x \ln x$. ○
8.O.35. Calculate the linear differential equation
$$y'x = y + x^2 \ln x.$$
○
8.O.36. Find all solutions of the equation
$$y' \cos x = (y + 2\cos x)\sin x.$$
o
8.O.37. Find the solution of the equation $y' = 6x - 2y$ which satisfies the initial condition $y(0) = 0$. ○
8.O.38. Solve the initial problem
$$y' + y \sin x = \sin x, \qquad y\left(\tfrac{\pi}{2}\right) = 2.$$
○
8.O.39. Find the solution of the equation $y' = 4y + \cos x$ which goes through the point $[0, 1]$. ○
8.O.40. Solve the following equation for any $a, b \in \mathbb{R}$:
$$x y' + y = e^x, \qquad y(a) = b.$$
8.O.41. Determine the general solution of the equation
$$3x^2 y' + x y = \frac{1}{y^2}.$$
8.O.42. Solve the Bernoulli equation
$$y' = x y - y^3 e^{-x^2}.$$
8.O.43. Calculate the Bernoulli equation
$$y' - \frac{y}{x} = y^2 \sin x.$$
8.O.44. Find all solutions of the equation
$$y' = \frac{4y}{x} + x\sqrt{y}.$$
8.O.45. Solve the equation
$$x y' + 2y + x^5 y^3 e^x = 0.$$
8.O.46. Find the general solution of the following equation provided $a, b > 0$:
$$y\,dy = \left(a\,\frac{y^2}{x^2} + b\,\frac{1}{x^2}\right)dx.$$
8.O.47. Interchanging the variables, solve
$$2y + (y^2 - 6x)\,y' = 0.$$
o
o
o
o
o
o
o
8.O.48. Solve the equation
$$y' = \frac{y}{2y \ln y + y - x}.$$
○
8.O.49. Calculate the general solution of the following equation:
$$x\,dx = \left(\frac{x^2}{y} - y^3\right)dy.$$
○
8.O.50. Interchanging the variables, calculate
$$(x + y)\,dy = y\,dx + y \ln y\,dy.$$
○
8.O.51. Solve
$$y'\,(e^{-y} - x) = 1.$$
○
8.O.52. Calculate
$$y' = \frac{1}{2x - y^2}.$$
○
8.O.53. Solve the equation
$$2y\,dx + x\,dy = 2y^3\,dy.$$
○
8.O.54. Calculate
$$y'' + 3y' + 2y = (20x + 29)\,e^{3x}.$$
○
8.O.55. Find any solution of the non-homogeneous linear equation
$$y'' + y' + \tfrac{5}{2}\,y = 25\cos(2x).$$
o
8.O.56. Determine the solution of the equation
$$y'' + 2y' + 2y = 3e^{-x}\cos x.$$
o
8.O.57. Find the solution of the equation
$$y'' = 2y' + y + 1$$
which satisfies $y(0) = 0$ and $y'(0) = 1$. ○
8.O.58. Find the solution of the equation
$$y'' = 4y - 3y' + 1$$
which satisfies $y(0) = 0$ and $y'(0) = 2$. ○
8.O.59. Determine the general solution of the linear equation
$$y'' - 2y' + 5y = 5e^{2x}\sin x.$$
○
8.O.60. Taking advantage of the special form of the right-hand side, find all solutions of the equation
$$y'' + y' = x^2 - x + 6e^{2x}.$$
○
8.O.61. Solve
$$y^{(4)} - 2y'' + y = 8\,(e^x + e^{-x}) + 4\,(\sin x + \cos x).$$
○
8.O.62. Using the method of constant variation, calculate
$$y'' - 2y' + y = \frac{e^x}{x}.$$
○
8.O.63. Solve
$$y'' + 4y' + 4y = e^{-2x}\ln x.$$
○
8.O.64. Using the method of constant variation, find the general solution of the equation
$$y'' + 4y = \frac{1}{\sin(2x)}.$$
○
8.O.65. Solve the equation $y'' + y = \tan^2 x$. ○
8.O.66. Find the solution of the differential equation
$$y''' = -2y'' - 2y' - y + \sin x$$
which satisfies $y(0) = -\frac{1}{2}$, $y'(0) = \frac{\sqrt{3}}{2}$, and $y''(0) = -1 - \frac{\sqrt{3}}{2}$. ○
8.O.67. Calculate the equation $y''' - 2y'' - y' + 2y = 0$. ○
8.O.68. Find the general solution of the equation
$$y^{(4)} + 2y'' + y = 0.$$
○
8.O.69. Solve
$$y^{(6)} + 2y^{(5)} + 4y^{(4)} + 4y''' + 5y'' + 2y' + 2y = 0.$$
○
8.O.70. Find the general solution of the linear equation
$$y^{(5)} - 3y^{(4)} + 2y''' = 8x - 12.$$
o
Key to the exercises
8.C1. 2.
8.C.2. Factorize the denominator to get \. 8.C.3. 0.
8.C.4. Expand the fraction with ^ and use the substitution t = xy ( (x, y) —> (0, 2), means t —> 0). 2. 8.C.5. Use the polar coordinates x = r cos p,y = r sin ip, (x, y) —> (oo, oo), means r —> oo, ip e (0, f). 0.
8.C.6. Use two different parametrizations, y = 1 − ex and y = x, to get two different values, −4 and 0; thus the limit does not exist. 8.C.7. Try polar coordinates, where $r \to 0$, $\varphi \in (0, 2\pi)$. The value is $\cos 2\varphi$, that is, it depends on the direction of approach to [0,0]; thus the limit does not exist.
8.C.8. Try polar coordinates, where $r \to \infty$, φ ∈ (0, §). The value is 1 for φ = f, but 0 for φ ≠ f, so the limit does not exist.
8.C.9. V2.
8.C.10. 2.
8.C.11. 0.
8.C.12. 0.
8.C.13. 0.
8.C.14. 0.
8.C.15. e.
8.C.16. Does not exist.
8.C.17. Does not exist.
8.C.18. Choose y = kx2.
8.C.20. The circle fc([0,0];l).
8.C.21. The set {[x, x + (2k + l)f ]; x e R, k e Z}.
8.C.22. The function is continuous everywhere including the point [0,0].
8D.3. p = {[s, f + f, 1 - tts]; s e R}.
8D.14. a) Compute the differential of the function /(x, y) = arctan ^ in [1,1] and get f + 0,035.
b) f(x, y) = ln(x2 + y2), P = [xo, yo] = [1,0], the result is-0.06.
c) f(x, y) = arcsin |, xo = 0.5, yo = 1, f - ^4f.
d) f(x, y) = xy, xo = 1,2/0 = 2, 1, 08.
8.D.18. [4/3,1/6] and [-1/3, -2/3], equations are x + j/ = 3/2 and x + j/ = -1. 8.D.19. [-7,1] and [9, -3], equations are x = -7 and x = 9. 8D.20. [V2,1/V2,-1/V2, V2] and [-V2,-1/V2, l/\/2,-V2\.
8D.25. z'x(l, V2) = 2z'_-^2v = 0. 4(1, v^) = 2fllZ% = 0. z^(l, v^) = 4'„(1, v^) = -2, 4y(l, v^) = 0.
SJÍ.26. 4(-2,0) = -jjfg^ =0,4 (-2,0) = -j-^L- =0,4',(-2,0) = <y(-2,0) = A z»y(_2,0) = 0.
8.E.3. As we can easily see, the Taylor polynomial of a function which is a polynomial is the function itself, or the function cut of the
higher powers. Therefore, in this case, we have T(x,y,z) = xz2 + xy + 1.
8.E.4. T(x, y) = y2. The tangent plane is given by the linear part of the Taylor polynomial, i. e., z = 0. Therefore, the given point does not lie in it.
8.E.5. T2n(xy+1)(1,1) = ln(2) + \(x2 + y2 + xy - x - y - 1). 8.E.6. y + xy.
8.F.8. Stationary points: (±|, =Fl, ±5). The Hessians are indefinite in both cases, no extrema. 8.F.9. Stationary points: [=p2, ±1, ±2]. The Hessians are indefinite in both cases, no extrema. 8.F.10. Stationary points: [±|, ±1, =f|]. The Hessians are indefinite in both cases, no extrema. 8.F.11. Stationary points: [±2, =p2, ±1], no extrema.
8.F.12. Stationary points: $(0, -1/4)$, $(\pm\sqrt{3}, -1)$; a minimum occurs at $(0, -1/4)$. 8.F.13. Stationary point: $(0, -1/2)$. The Hessian is indefinite, no extremum.
8.F.14. A global minimum at (1/7,-2/7).
8.F.15. Stationary point: (—1/9, 2/9). The Hessian is indefinite, no extremum.
8.H.17. At the point [ |, \, — | ], minimum.
8.H.18. At the point [-^2, §].
8.H.19. [§ + ^,-■$§].
8.H.20. [f + $fl-
8.H.21. At the points [± -75, =f r/g ] •
8.H.22. At the points [± ^, - ^ ].
8.H.23. 3^3/16.
8.H.24. l/(2v/6).
8.H.25. (l/sqrtl, 1/2), (-l/sqrt2, -1/2). S./.23. [0, f]. 8.1.24. l^,1}. S./.25. [0,£].
S.Z.26. [*f,£].
S./.27. F = tt. S./.2S. 8tt.
8.1.29. ±fi.
S./.30.   4^377 - ^7T.
8.1.32. f(17vT7-l). S./.33. ^(tt/4-1/2). S.K7.j/ = l-4.-re (0,2). 8.K.8.Me-'/q. 8.K.9.     sin (2t).
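8.O.1. $2\,(100 + 2t) - \dfrac{1.5\cdot 10^6}{(100 + 2t)^2}$. 8.O.2. 800 K.
8.O.3. $-87.5\,\dfrac{\ln 0.9}{\ln 2} \approx 13.3$ days.
8.O.5. $\dfrac{8\ln 3}{3\ln 2}$ years. 8.O.6. $5.27\,\dfrac{\ln(4/3)}{\ln 2}$ years.
8.O.7. $y = \dfrac{x + C}{1 - Cx}$ (use the sum formula for the tangent function). 8.O.8. $y = 0$, $y = (1 - Cx)^{-1}$, $C \in \mathbb{R}$. 8.O.9. $y = -\ln(1 - Ce^x)$, $C \in \mathbb{R}$.
8.O.10. $y = Ce^{-1/x^2}$, $C \in \mathbb{R}$.
8.O.11. $y = 2$, $y = -2$, $(x - C)^2 + y^2 = 2^2$, $C \in \mathbb{R}$. 8.O.12. $y = 1$, $y = 1 - \dfrac{1}{\ln|\sin x| + C}$, $C \in \mathbb{R}$. 8.O.13. $x^2 + 2\ln|x| + \ln|y^2 - 1| = C$, $C \in \mathbb{R}$. 8.O.14. $y = C\,(x + 1)\,e^{-x}$, $C \in \mathbb{R}$. 8.O.15. $\sqrt{2}\cos y = \cos x$.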
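8.O.16. $y = \sqrt{\dfrac{e^{1 - x^2}}{x^2} + 1}$.
8.O.17. $y = e^{\tan(x/2)}$.
8.O.18. $y = \pm\sqrt{\ln(e^x + 1) - \ln 2}$.
8.O.19. $x = Ce^{\sin(y/x)}$, $C \in \mathbb{R}$. 8.O.20. $y^2 = x^2 + Cx^2y^2$, $C \in \mathbb{R}$.
8.O.21. $y = x$, $y = -x$, $y = x\sin(\ln|Cx|)$, $C \in \mathbb{R}\setminus\{0\}$. 8.O.22. $\cot\left(\frac{1}{2}\ln\frac{y}{x}\right) = \ln|Cx|$, $C \in \mathbb{R}\setminus\{0\}$. 8.O.23. $\arctan\frac{y}{x} = \ln\sqrt{x^2 + y^2} + C$, $C \in \mathbb{R}$. 8.O.24. $y = \tan(x + C) - x$, $C \in \mathbb{R}$.
8.O.25. $C = (x - 1)^2 - 2(y - 4)(x - 1) - (y - 4)^2$, $C \in \mathbb{R}\setminus\{0\}$.
8.O.26. $(x - y)^2 + 2x + C = 0$, $C \in \mathbb{R}$.
8.O.27. $y = x$, $C = 5x - 2y + \ln|y - x|$, $C \in \mathbb{R}$.
8.O.28. $(x + 1)^2 - 2(x + 1)(y + 2) - (y + 2)^2 = C$, $C \in \mathbb{R}\setminus\{0\}$.
8.O.29. $3(y + 1)^2 - 2(y + 1)(x - 2) + 2(x - 2)^2 = C$, $C \in \mathbb{R}\setminus\{0\}$.
8.O.30. $y = 5 - x + C\,(x - 3)^2$, $C \in \mathbb{R}$.
8.O.31. $y = Ce^{-2x} + \frac{1}{2}x - \frac{1}{4}$, $C \in \mathbb{R}$.
8.O.32. $y = Ce^{2x} - 3(x + 1)$, $C \in \mathbb{R}$.
8.O.33. $y = (x^2 + x + C)\,e^{2x^2}$, $C \in \mathbb{R}$.
8.O.34. $y = \frac{x}{2}\ln x - \frac{x}{4} + \frac{C}{x}$, $C \in \mathbb{R}$.
8.O.35. $y = Cx + x^2\ln x - x^2$, $C \in \mathbb{R}$.
8.O.36. $y = \dfrac{\sin^2 x + C}{\cos x}$, $C \in \mathbb{R}$.
8.O.37. $y = 3x + \frac{3}{2}e^{-2x} - \frac{3}{2}$.
8.O.38. $y = e^{\cos x} + 1$.
8.O.39. $y = \frac{1}{17}\sin x - \frac{4}{17}\cos x + \frac{21}{17}e^{4x}$.
8.O.40. $y = \dfrac{e^x + ab - e^a}{x}$.
8.O.41. $y^3 = \dfrac{\ln|x| + C}{x}$, $C \in \mathbb{R}$.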
8.O.42. $y = 0$, $y^2 = \dfrac{e^{x^2}}{2x + C}$, $C \in \mathbb{R}$.
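8.O.43. $y = 0$, $\dfrac{1}{y} = \dfrac{C}{x} + \cos x - \dfrac{\sin x}{x}$, $C \in \mathbb{R}$.
8.O.44. $y = 0$, $y = x^4\left(\frac{1}{2}\ln|x| + C\right)^2$, $C \in \mathbb{R}$.
8.O.45. $y = 0$, $y^{-2} = x^4\,(2e^x + C)$, $C \in \mathbb{R}$.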
8.O.46. $y^2 + \dfrac{b}{a} = Ce^{-2a/x}$, $C \in \mathbb{R}$.
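8.O.47. $x = \frac{y^2}{2} + Cy^3$, $C \in \mathbb{R}$.
8.O.48. $x = y\ln y + \frac{C}{y}$, $C \in \mathbb{R}$.
8.O.49. $x^2 + y^2\,(y^2 - C) = 0$, $C \in \mathbb{R}$.
8.O.50. $x = y\ln y - \frac{y}{2}\ln^2 y + Cy$, $C \in \mathbb{R}$.
8.O.51. $x = (C + y)\,e^{-y}$, $C \in \mathbb{R}$.
8.O.52. $x = \frac{y^2}{2} + \frac{y}{2} + \frac{1}{4} + Ce^{2y}$, $C \in \mathbb{R}$.
8.O.53. $x = \frac{2}{7}y^3 + \frac{C}{\sqrt{y}}$, $C \in \mathbb{R}$.
8.O.54. $y = C_1e^{-2x} + C_2e^{-x} + (x + 1)\,e^{3x}$, $C_1, C_2 \in \mathbb{R}$. 8.O.55. E.g., $y = 8\sin(2x) - 6\cos(2x)$.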
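8.O.56. $y = e^{-x}\,(C_1\cos x + C_2\sin x) + \frac{3}{2}\,x\,e^{-x}\sin x$, $C_1, C_2 \in \mathbb{R}$.
8.O.57. $y = \frac{1}{2}e^{(1 + \sqrt{2})x} + \frac{1}{2}e^{(1 - \sqrt{2})x} - 1$.
8.O.58. $y = \frac{3}{5}e^{x} - \frac{7}{20}e^{-4x} - \frac{1}{4}$.
8.O.59. $y = C_1e^{x}\cos(2x) + C_2e^{x}\sin(2x) + e^{2x}\left(\sin x - \frac{1}{2}\cos x\right)$, where $C_1, C_2 \in \mathbb{R}$.
8.O.60. $y = C_1 + C_2e^{-x} + \frac{1}{3}x^3 - \frac{3}{2}x^2 + 3x + e^{2x}$, $C_1, C_2 \in \mathbb{R}$.
8.O.61. $y = (C_1 + C_2x)\,e^x + (C_3 + C_4x)\,e^{-x} + x^2\,(e^x + e^{-x}) + \cos x + \sin x$, $C_1, C_2, C_3, C_4 \in \mathbb{R}$.
8.O.62. $y = C_1e^x + C_2xe^x + xe^x\,(\ln|x| - 1)$, $C_1, C_2 \in \mathbb{R}$.
8.O.63. $y = C_1e^{-2x} + C_2xe^{-2x} + \frac{1}{2}x^2e^{-2x}\ln x - \frac{3}{4}x^2e^{-2x}$, $C_1, C_2 \in \mathbb{R}$.
8.O.64. For $C_1, C_2 \in \mathbb{R}$, we have $y = -\frac{x}{2}\cos(2x) + \frac{1}{4}\sin(2x)\ln|\sin(2x)| + C_1\cos(2x) + C_2\sin(2x)$.
8.O.65. $y = C_1\cos x + C_2\sin x - 2 + \sin x\,\ln\left|\dfrac{1 + \sin x}{\cos x}\right|$, $C_1, C_2 \in \mathbb{R}$.
8.O.66. $y(x) = -e^{-x} + e^{-x/2}\sin\left(\frac{\sqrt{3}}{2}x\right) + e^{-x/2}\cos\left(\frac{\sqrt{3}}{2}x\right) - \frac{1}{2}\sin x - \frac{1}{2}\cos x$.
8.O.67. $y = C_1e^{x} + C_2e^{-x} + C_3e^{2x}$, $C_1, C_2, C_3 \in \mathbb{R}$.
8.O.68. $y = C_1\cos x + C_2\sin x + C_3x\cos x + C_4x\sin x$, where the constants $C_1, C_2, C_3, C_4 \in \mathbb{R}$.
8.O.69. $y = (C_1 + C_3x + C_5e^{-x})\cos x + (C_2 + C_4x + C_6e^{-x})\sin x$, $C_1, \dots, C_6 \in \mathbb{R}$.
8.O.70. $y = C_1 + C_2x + C_3x^2 + C_4e^{2x} + C_5e^{x} + \frac{1}{6}x^4$, where the constants $C_1, \dots, C_5 \in \mathbb{R}$.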
CHAPTER 9
Continuous models - further selected topics
The World might be discrete in reality.
But the continuous models are useful anyhow.
A. Exterior differential calculus
B. Applications of Stokes' theorem
9.B.1. Compute
$$\int_c (x - y)\,dx + x\,dy,$$
where $c$ is the positively oriented curve represented by the perimeter of the square $ABCD$ with vertices $A = [2, 2]$; $B = [-2, 2]$; $C = [-2, -2]$; $D = [2, -2]$.
Solution. Using Green's theorem (see ??), we reduce the given curve integral to an area (multiple) integral. The integral is of the form $\int_c f(x, y)\,dx + g(x, y)\,dy$, where $f(x, y) = x - y$ and $g(x, y) = x$. The needed partial derivatives of the functions $f(x, y)$ and $g(x, y)$ are thus $f_y(x, y) = -1$ and $g_x(x, y) = 1$. All of the functions $f(x, y)$, $g(x, y)$, $f_y(x, y)$, and $g_x(x, y)$ are continuous on $\mathbb{R}^2$, so we can use Green's theorem:
This chapter presents several glimpses towards more serious applications of the differential and integral calculus. We cannot be ambitious in covering the displayed topics extensively. Thus, after reasonably detailed introduction to more geometric approach to the differential and integral calculus, we present rather quick surveys and comments on partial differential equations, variational calculus, and complex analysis. We hope the readers will get excited at least by some of them and look for further resources themselves.
1. Exterior differential calculus and integration
We have already seen how to optimize functions on subsets in $\mathbb{R}^n$, but how to integrate quantities over such domains? For example, if we have a 2-dimensional membrane in $\mathbb{R}^3$ and we know the infinitesimal flow of some liquid through it, how to compute how much went through within a time interval?
In order to understand such questions properly, we formalize the concept of the level sets $M_f$ from the paragraph 8.1.23 devoted to the implicit functions and provide a geometric explanation of the integration process. Then we quite easily arrive at several powerful tools, including the Stokes theorem, which is a higher dimensional extension of the fundamental theorem of (univariate) calculus, and the Frobenius theorem, which generalizes the integration of prescribed line elements into a solution of an ODE to higher dimensions.
9.1.1. Vector fields and differential forms. Let us come back to the concept of tangent vectors and vector fields, cf. 8.3.14, where we introduced the tangent space $TU = U \times \mathbb{R}^n$ as the set of all possible tangent vectors at the points of an open subset $U \subset \mathbb{R}^n$. There is the projection $p : TU \to U$ assigning the foot points to the tangent vectors. We write $T_xU$ for the vector space of all vectors $X$ with $p(X) = x$ at a point $x \in U$, and we use the notation $\mathcal{X}(U)$ for the set of all smooth vector fields on the open subset $U$.
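The linear combinations of the special vector fields $\frac{\partial}{\partial x_i}$, admitting smooth functions as the coefficients, generate the entire $\mathcal{X}(U)$. Thus we write general vector fields as
$$X(x) = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_n(x)\frac{\partial}{\partial x_n}.$$
(Solution of 9.B.1, continued.)
$$\int_c (x - y)\,dx + x\,dy = \iint_D (1 + 1)\,dx\,dy = 2\int_{-2}^{2}\int_{-2}^{2} dx\,dy = 2\,[x]_{-2}^{2}\cdot[y]_{-2}^{2} = 32.$$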
□
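The value obtained from Green's theorem in 9.B.1 can be cross-checked by integrating directly along the four edges of the square. The following is a minimal numerical sketch; the helper `line_integral` and its midpoint rule are illustrative assumptions, not a method from the text.

```python
def line_integral(P, Q, vertices, n=1000):
    """Midpoint-rule approximation of the integral of P dx + Q dy along a closed polygon."""
    total = 0.0
    m = len(vertices)
    for i in range(m):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % m]
        dx, dy = (x1 - x0) / n, (y1 - y0) / n
        for k in range(n):
            s = (k + 0.5) / n                       # midpoint of the k-th subsegment
            x, y = x0 + s * (x1 - x0), y0 + s * (y1 - y0)
            total += P(x, y) * dx + Q(x, y) * dy
    return total


square = [(2, 2), (-2, 2), (-2, -2), (2, -2)]       # A, B, C, D, counterclockwise
circulation = line_integral(lambda x, y: x - y, lambda x, y: x, square)
area_integral = 2 * 4 * 4                           # integral of g_x - f_y = 2 over the square
```

Both `circulation` and `area_integral` agree with the value 32, up to rounding in the first case.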
9.B.2. Compute
$$\int_c x^4\,dx + xy\,dy,$$
where $c$ is the positively oriented curve going through the vertices $A = [0, 0]$; $B = [1, 0]$; $C = [0, 1]$.
Solution. The curve $c$ is the boundary of the triangle $ABC$. The integrated functions are continuously differentiable on the whole $\mathbb{R}^2$, so we can use Green's theorem:
$$\int_c x^4\,dx + xy\,dy = \iint_D y\,dx\,dy = \int_0^1\int_0^{-x+1} y\,dy\,dx = \int_0^1\left[\frac{y^2}{2}\right]_0^{-x+1} dx = \frac{1}{2}\int_0^1 (x^2 - 2x + 1)\,dx = \frac{1}{2}\left[\frac{x^3}{3} - x^2 + x\right]_0^1 = \frac{1}{6}.$$
9.B.3. Calculate
$$\int_c (xy + x + y)\,dx + (xy + x - y)\,dy,$$
where $c$ is the circle with radius 1 centered at the origin.
Solution. Again, the prerequisites of Green's theorem are satisfied, so we can use it, which now gives
$$\int_c (xy + x + y)\,dx + (xy + x - y)\,dy$$
As we know, every differentiable mapping $f : U \to V$ between two open sets $U \subset \mathbb{R}^n$, $V \subset \mathbb{R}^m$ defines the mapping $f_* : TU \to TV$ by applying the differential $D^1f$ to the individual tangent vectors. Thus if $y = f(x) = (f_1(x), \dots, f_m(x))$, then
$$f_*(X)(x) = \sum_{j=1}^{m}\left(\sum_{i=1}^{n}\frac{\partial f_j}{\partial x_i}(x)\,X_i(x)\right)\frac{\partial}{\partial y_j}.$$
When we studied vector spaces in chapter two, we came across the useful linear forms. They were defined in paragraph 2.3.17 on page 109. This idea extends naturally now. A scalar valued linear mapping defined on the tangent space $T_xU$ is such a linear form at the foot point $x$. The vector space of all such forms $T_x^*U = (T_xU)^*$ is thus naturally isomorphic to $\mathbb{R}^{n*}$, and the collection $T^*U$ of these spaces comes equipped with the projection to the foot points; let us denote it again by $p$. Having a mapping $\eta : U \subset \mathbb{R}^n \to T^*U$ on an open subset $U$, $p \circ \eta = \mathrm{id}_U$, we talk about a differential form $\eta$ on $U$, or a linear form.
Every differentiable function $f$ on an open subset $U \subset \mathbb{R}^n$ defines the differential form $df$ on $U$. We use the notation $\Omega^1(U)$ for the set of all smooth linear differential forms on the open set $U$.
In the chosen coordinates $(x_1, \dots, x_n)$ we can use the differentials of the particular coordinate functions to express every linear form $\eta$ as
$$\eta(x) = \eta_1(x)\,dx_1 + \dots + \eta_n(x)\,dx_n,$$
where $\eta_i(x)$ are uniquely determined functions. Such a form $\eta$ evaluates on a vector field $X(x) = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_n(x)\frac{\partial}{\partial x_n}$ as
$$\eta(X)(x) = \eta(x)(X(x)) = \eta_1(x)X_1(x) + \dots + \eta_n(x)X_n(x).$$
If the form $\eta$ is the differential of a function $f$, we get back just the expression
$$X(f)(x) = df(X(x)) = \frac{\partial f}{\partial x_1}X_1(x) + \dots + \frac{\partial f}{\partial x_n}X_n(x)$$
for the derivative of $f$ in the direction of the vector field $X$.
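The evaluation formula $\eta(X)(x) = \sum_i \eta_i(x)X_i(x)$ and its special case $df(X) = X(f)$ are easy to test numerically. In the sketch below, the function $f(x, y) = x^2y$, the vector field, and the helper `form_on_field` are all hypothetical choices made for illustration.

```python
def form_on_field(eta, X, point):
    """Evaluate eta(X)(x) = eta_1(x) X_1(x) + ... + eta_n(x) X_n(x)."""
    return sum(e * v for e, v in zip(eta(point), X(point)))


# Hypothetical example on R^2: f(x, y) = x^2 * y, so df = 2xy dx + x^2 dy.
f = lambda p: p[0] ** 2 * p[1]
df = lambda p: (2 * p[0] * p[1], p[0] ** 2)
X = lambda p: (-p[1], p[0])            # the field -y d/dx + x d/dy

point = (1.0, 2.0)
exact = form_on_field(df, X, point)    # 2*1*2*(-2) + 1*1 = -7

# Compare with a central difference of f along X at the same point.
h = 1e-6
vx, vy = X(point)
numeric = (f((point[0] + h * vx, point[1] + h * vy))
           - f((point[0] - h * vx, point[1] - h * vy))) / (2 * h)
```

The two numbers agree up to the discretization error of the difference quotient, which illustrates that $df(X)$ is the directional derivative of $f$ along $X$.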
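$$= \iint_D (y + 1 - x - 1)\,dx\,dy = \int_0^1\int_0^{2\pi} r^2\,(\sin\varphi - \cos\varphi)\,d\varphi\,dr = \int_0^1 r^2\,dr\int_0^{2\pi}(\sin\varphi - \cos\varphi)\,d\varphi = \frac{1}{3}\,[-\cos\varphi - \sin\varphi]_0^{2\pi} = 0.$$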
□
9.1.2. Exterior differential forms. As we discussed already in chapters 1 and 4, the volume of $k$-dimensional parallelepipeds $S$, as a quantity depending on the $k$ vectors spanning $S$, is an antisymmetric $k$-linear form on the vectors, see 2.3.22 on page 114. Remember also the computation of the volume of parallelepipeds in terms of determinants in 4.1.22 on page 248.
Thus, if we want to talk about the (linearized) volume on $k$-dimensional objects, we need a concept which will be linear in $k$ distinct tangent vector arguments and will assign a scalar quantity to them. Moreover, we will require that interchanging any pair of arguments swaps the sign, in accordance with the orientations.
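9.B.4. Compute $\int_c (2e^{2x}\sin y - 3y^3)\,dx + \left(e^{2x}\cos y + \frac{4}{3}x^3\right)dy$, where $c$ is the positively oriented ellipse $4x^2 + 9y^2 = 36$.
Solution. We will use Green's theorem, choosing the linear deformation of polar coordinates
$$x = 3r\cos\varphi, \quad \varphi \in [0, 2\pi],$$
$$y = 2r\sin\varphi, \quad r \in [0, 1],$$
leading to (the Jacobian of the transformation is $6r$):
$$\int_c (2e^{2x}\sin y - 3y^3)\,dx + \left(e^{2x}\cos y + \tfrac{4}{3}x^3\right)dy = \iint_D \left[2e^{2x}\cos y + 4x^2 - (2e^{2x}\cos y - 9y^2)\right]dx\,dy$$
$$= \int_0^1\int_0^{2\pi} 6r\left[4\,(3r\cos\varphi)^2 + 9\,(2r\sin\varphi)^2\right]d\varphi\,dr = 216\int_0^1 r^3\,dr\int_0^{2\pi} d\varphi = 216\cdot\left[\frac{r^4}{4}\right]_0^1\cdot 2\pi = 108\pi.$$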
□
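9.B.5. Compute
$$\int_c (e^x\ln y - y^2x)\,dx + \left(\frac{e^x}{y} - \frac{1}{2}x^2y\right)dy,$$
where $c$ is the positively oriented circle $(x - 2)^2 + (y - 2)^2 = 1$.
Solution.
$$\int_c (e^x\ln y - y^2x)\,dx + \left(\frac{e^x}{y} - \frac{1}{2}x^2y\right)dy = \iint_D \left(\frac{e^x}{y} - xy - \frac{e^x}{y} + 2xy\right)dx\,dy$$
$$= \int_0^1\int_0^{2\pi} r\,(r\cos\varphi + 2)(r\sin\varphi + 2)\,d\varphi\,dr = \int_0^1\int_0^{2\pi}\left(r^3\sin\varphi\cos\varphi + 2r^2(\sin\varphi + \cos\varphi) + 4r\right)d\varphi\,dr$$
$$= \frac{1}{4}\int_0^{2\pi}\sin\varphi\cos\varphi\,d\varphi + \frac{2}{3}\int_0^{2\pi}(\sin\varphi + \cos\varphi)\,d\varphi + 4\pi = \frac{1}{4}\left[\frac{\sin^2\varphi}{2}\right]_0^{2\pi} + \frac{2}{3}\,[-\cos\varphi + \sin\varphi]_0^{2\pi} + 4\pi = 4\pi.$$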
Exterior differential forms
Definition. The vector space of all $k$-linear antisymmetric forms on a tangent space $T_xU$, $U \subset \mathbb{R}^n$, will be denoted by $\Lambda^k(T_xU)^*$. We talk about exterior $k$-forms at the point $x \in U$.
The assignment of a $k$-form $\eta(x) \in \Lambda^k(T_xU)^*$ to every point $x \in U$ from an open subset in $\mathbb{R}^n$ defines an exterior differential $k$-form on $U$. The set of smooth exterior $k$-forms on $U$ is denoted $\Omega^k(U)$.
Next, let us consider a smooth mapping $G : V \to U$ between two open sets $V \subset \mathbb{R}^m$ and $U \subset \mathbb{R}^n$, an exterior $k$-form $\eta(G(x)) \in \Lambda^k(T_{G(x)}\mathbb{R}^n)^*$, and choose arbitrarily $k$ vectors $X_1(x), \dots, X_k(x)$ in the tangent space $T_xV$. Just like in the case of linear forms, we can evaluate the form $\eta$ at the images of the vectors $X_i$ using the mapping $y = G(x) = (g_1(x), \dots, g_n(x))$. This operation is called the pullback of the form $\eta$ by $G$:
$$G^*(\eta(G(x)))(X_1(x), \dots, X_k(x)) = \eta(G(x))(G_*(X_1(x)), \dots, G_*(X_k(x))).$$
In the case of linear forms, this is the dual mapping to the differential $D^1G$. We can compute directly from the definition that, for instance,
$$G^*(dy_i)\left(\frac{\partial}{\partial x_j}\right) = dy_i\left(G_*\left(\frac{\partial}{\partial x_j}\right)\right) = \frac{\partial g_i}{\partial x_j},$$
and so
(1)    $G^*(dy_i) = \dfrac{\partial g_i}{\partial x_1}\,dx_1 + \dots + \dfrac{\partial g_i}{\partial x_m}\,dx_m,$
which extends to the linear combinations of all $dy_i$ over functions.
Another immediate consequence of the definition is the formula for pullbacks of arbitrary $k$-forms by composing two diffeomorphisms:
(2)    $(G \circ F)^*\alpha = F^*(G^*\alpha).$
Indeed, as a mapping on $k$-tuples of vectors,
$$(G \circ F)^*\alpha = \alpha \circ ((D^1G \circ D^1F) \times \dots \times (D^1G \circ D^1F)) = G^*(\alpha) \circ (D^1F \times \dots \times D^1F) = F^* \circ G^*\alpha$$
as expected.
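9.1.3. Wedge product of exterior forms. Given a $k$-form $\alpha \in \Lambda^k\mathbb{R}^{n*}$ and an $\ell$-form $\beta \in \Lambda^\ell\mathbb{R}^{n*}$, we can create a $(k + \ell)$-form $\alpha \wedge \beta$ by all possible permutations $\sigma$ of the arguments. We just have to alternate the arguments in all possible orders and take the right sign each time:
$$(\alpha \wedge \beta)(X_1, \dots, X_{k+\ell}) = \frac{1}{k!\,\ell!}\sum_{\sigma}\operatorname{sign}(\sigma)\,\alpha(X_{\sigma(1)}, \dots, X_{\sigma(k)})\,\beta(X_{\sigma(k+1)}, \dots, X_{\sigma(k+\ell)}).$$
□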
9.B.6. Calculate the integral
(ex siny — xy2)dx + ( ex cosy — ^
where $c$ is the positively oriented circle $x^2 + y^2 + 4x + 4y + 7 = 0$. ○
9.B.7. Compute
$$\int_c (3y - e^{\sin x})\,dx + \left(7x + \sqrt{y^4 + 1}\right)dy,$$
where $c$ is the positively oriented circle $x^2 + y^2 = 9$. ○
9.B.8. Compute the integral
fA y3,    A    9   x3. ,
/ - + 2xy - 2- )dx + - + x2 + — dy, J   x 3 y 3
where c is the positively oriented boundary of the set D =
{(x,y) e R2 : 4 < x2 +y2 < 9, ^ < y < V^x}. O
9.B.9. Remark. An important corollary of Green's theorem is the formula for computing the area $m(D)$ of a domain $D$ that is bounded by a curve $c$:
$$m(D) = \frac{1}{2}\int_c -y\,dx + x\,dy.$$
9.B.10. Compute the area given by the ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$.
Solution. Using the formula 9.B.9 and the transformation $x = a\cos t$, $y = b\sin t$, we get for $t \in [0, 2\pi]$ that
$$m(D) = \frac{1}{2}\int_c -y\,dx + x\,dy = \frac{1}{2}\int_0^{2\pi} a\cos t\cdot b\cos t\,dt - \frac{1}{2}\int_0^{2\pi} b\sin t\cdot(-a\sin t)\,dt$$
$$= \frac{1}{2}ab\int_0^{2\pi}\cos^2 t\,dt + \frac{1}{2}ab\int_0^{2\pi}\sin^2 t\,dt = \frac{1}{2}ab\int_0^{2\pi}(\cos^2 t + \sin^2 t)\,dt = \frac{1}{2}ab\cdot 2\pi = \pi ab,$$
which is indeed the well-known formula for the area of an ellipse with semi-axes $a$ and $b$.
□
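The area formula of 9.B.9 has a simple discrete counterpart: for a closed polygon, the integral of $-y\,dx + x\,dy$ over each straight edge from $(x_0, y_0)$ to $(x_1, y_1)$ equals $x_0y_1 - x_1y_0$. The sketch below applies this to a polygon inscribed in an ellipse; the semi-axes and the number of vertices are arbitrary sample values, and the function name is hypothetical.

```python
import math


def enclosed_area(points):
    """Discrete form of m(D) = 1/2 * integral of (-y dx + x dy) along a closed polygon."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        # exact value of the integral of -y dx + x dy over one straight segment
        total += x0 * y1 - x1 * y0
    return total / 2


a, b, n = 3.0, 2.0, 100_000            # sample semi-axes, chosen for illustration
ellipse = [(a * math.cos(2 * math.pi * k / n), b * math.sin(2 * math.pi * k / n))
           for k in range(n)]
area = enclosed_area(ellipse)
```

The result is close to $\pi ab$, and the sign is positive because the vertices are listed counterclockwise, which matches the positive orientation required by the formula.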
It is clear from the definition that $\alpha \wedge \beta$ is indeed a $(k + \ell)$-form. In the simplest case of 1-forms, the definition says that
$$(\alpha \wedge \beta)(X, Y) = \alpha(X)\beta(Y) - \alpha(Y)\beta(X).$$
In the case of a 1-form $\alpha$ and a $k$-form $\beta$, we get
$$(\alpha \wedge \beta)(X_0, X_1, \dots, X_k) = \sum_{i=0}^{k}(-1)^i\,\alpha(X_i)\,\beta(X_0, \dots, \hat{X}_i, \dots, X_k),$$
where the hat indicates omission of the corresponding argument. The wedge product of finitely many forms is defined analogously (either directly by a similar formula, or we can notice that the above wedge product of forms is an associative operation - think this out by yourselves!).
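The definition of the wedge product by alternation over all permutations can be tried out directly. The sketch below represents forms as Python functions of tangent vectors and assumes the normalization by $1/(k!\,\ell!)$, which is the one compatible with the 1-form formula $(\alpha \wedge \beta)(X, Y) = \alpha(X)\beta(Y) - \alpha(Y)\beta(X)$; all names are illustrative.

```python
from itertools import permutations
from math import factorial


def sign(perm):
    """Sign of a permutation, computed from the number of inversions."""
    inversions = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def wedge(alpha, k, beta, l):
    """Wedge product of a k-form alpha and an l-form beta as a function of k + l vectors."""
    def form(*vectors):
        total = 0.0
        for perm in permutations(range(k + l)):
            args = [vectors[i] for i in perm]
            total += sign(perm) * alpha(*args[:k]) * beta(*args[k:])
        return total / (factorial(k) * factorial(l))
    return form


def dx(i):
    """The coordinate 1-form dx_i (indices start at 0 here)."""
    return lambda v: v[i]


two_form = wedge(dx(0), 1, dx(1), 1)              # dx_1 ^ dx_2
volume = wedge(two_form, 2, dx(2), 1)             # dx_1 ^ dx_2 ^ dx_3
X, Y = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
value = two_form(X, Y)                            # 1*5 - 4*2 = -3
```

On the standard basis of $\mathbb{R}^3$ the form `volume` returns the determinant of the three vectors, which is the expected behaviour of $dx_1 \wedge dx_2 \wedge dx_3$.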
Next, recall the generators $\frac{\partial}{\partial x_i}$ of all vector fields in $\mathcal{X}(\mathbb{R}^n)$, as well as the generators $dx_i$ of all linear exterior forms in $\Omega^1(\mathbb{R}^n)$. Their wedge products
$$\varepsilon_{i_1 \dots i_k} = dx_{i_1}\wedge\cdots\wedge dx_{i_k}$$
with $k$-tuples of indices $i_1 < i_2 < \cdots < i_k$ generate the whole space $\Omega^k(\mathbb{R}^n)$ by linear combinations with functions standing for coefficients. Indeed, interchanging a pair of adjacent forms in the product merely changes the sign, so the whole expression is identically zero if an index appears twice. Therefore, every $k$-form $\alpha$ is given uniquely by functions $a_{i_1\dots i_k}(x)$ in the expression
$$\alpha(x) = \sum_{i_1<\cdots<i_k} a_{i_1\dots i_k}(x)\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}.$$
Consequently, the vector spaces $\Lambda^k(T^*U)$ are the trivial zero spaces if $k > \dim U$. Thus, $\Omega^k(U)$ contains only the trivial zero form in this case.
Another straightforward consequence of the definition is that the pullback of the wedge product by a smooth mapping $G : V \to U$ satisfies
$$G^*(\alpha\wedge\beta) = G^*\alpha\wedge G^*\beta.$$
We should also notice that the 0-forms $\Omega^0(\mathbb{R}^n)$ are just smooth functions on $\mathbb{R}^n$. The wedge product of a 0-form $f$ and a $k$-form $\alpha$ is just the multiple of the form $\alpha$ by the function $f$. Similarly, the top degree forms in $\Omega^n(U)$ are all generated by the single generator $\varepsilon_{12\dots n}$, since there is just one possibility of $n$ different choices among $n$ coordinates, up to the ordering. This means that the $n$-forms $\omega$ are identified with functions via the formula
$$\omega(x) = f(x)\,dx_1\wedge\cdots\wedge dx_n.$$
At the same time, while the pullback on the functions $f\in\Omega^0(U)$ by a transformation $F : \mathbb{R}^n\to\mathbb{R}^n$, $y = F(x)$, is trivial, i.e. $F^*f(x) = f(y) = f\circ F(x)$, a straightforward computation reveals
$$(1)\qquad F^*\omega(x) = \det(D^1F)(x)\,f(F(x))\,dx_1\wedge\cdots\wedge dx_n\quad\text{for all } \omega = f\,dy_1\wedge\cdots\wedge dy_n.$$
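The antisymmetry of the wedge product and the determinant in formula (1) can be illustrated on the simplest case of constant 1-forms on $\mathbb{R}^2$ and a linear transformation. This is a sketch only; the helper names `one_form`, `wedge`, `pullback_by_matrix` and the sample matrix are hypothetical.

```python
def one_form(coeffs):
    """Constant-coefficient linear form X -> sum(coeffs[i] * X[i])."""
    return lambda X: sum(c * x for c, x in zip(coeffs, X))

def wedge(alpha, beta):
    """Wedge of two 1-forms: (alpha ^ beta)(X, Y) = alpha(X)beta(Y) - alpha(Y)beta(X)."""
    return lambda X, Y: alpha(X) * beta(Y) - alpha(Y) * beta(X)

def pullback_by_matrix(A, form2):
    """Pullback of a 2-form by the linear map X -> A X: (A* w)(X, Y) = w(AX, AY)."""
    def apply(X):
        return tuple(sum(a * x for a, x in zip(row, X)) for row in A)
    return lambda X, Y: form2(apply(X), apply(Y))

dx1 = one_form([1, 0])
dx2 = one_form([0, 1])
area = wedge(dx1, dx2)                 # the top degree form dx1 ^ dx2 on R^2

X, Y = (2, 1), (1, 3)
A = [[1, 2], [3, 4]]                   # sample linear transformation
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
pulled = pullback_by_matrix(A, area)

print(area(X, Y), area(Y, X))          # opposite signs
print(wedge(dx1, dx1)(X, Y))           # a repeated factor gives zero
print(pulled(X, Y), det_A * area(X, Y))  # pullback rescales by the determinant
```

For a linear map the Jacobi matrix is the matrix itself, so the last line is formula (1) with $f = 1$.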
CHAPTER 9. CONTINUOUS MODELS - FURTHER SELECTED TOPICS
9.B.11. Find the area bounded by the cycloid which is given parametrically as $\psi(t) = [a(t - \sin t);\ a(1 - \cos t)]$, for $a > 0$, $t\in(0, 2\pi)$, and the $x$-axis.
Solution. Let the curves that bound the area be denoted by $c_1$ and $c_2$. As for the area, we get
$$m(D) = \frac{1}{2}\int_{c_1} -y\,dx + x\,dy + \frac{1}{2}\int_{c_2} -y\,dx + x\,dy.$$
Now, we will compute the mentioned integrals step by step. The parametric equation of the curve $c_1$ (a segment of the $x$-axis) is $(t; 0)$, $t\in[0; 2a\pi]$, so we obtain for the first integral that
$$\int_{c_1} -y\,dx + x\,dy = -\int_0^{2a\pi} 0\cdot 1\,dt + \int_0^{2a\pi} t\cdot 0\,dt = 0.$$
The parametric equation of the curve $c_2$ is $\psi(t) = (a(t - \sin t),\ a(1 - \cos t))$, $t\in[2\pi; 0]$.
The formula for the area expects a positively oriented curve, which means for the considered parametric equation that we are moving against the parametrization direction, i.e. from the upper bound to the lower one.
We thus get for the area of the cycloid that
$$\frac{1}{2}\int_{c_2} -y\,dx + x\,dy = \frac{1}{2}\int_{2\pi}^{0} a(t - \sin t)\cdot a\sin t\,dt - \frac{1}{2}\int_{2\pi}^{0} a(1 - \cos t)\cdot a(1 - \cos t)\,dt$$
$$= \frac{1}{2}a^2\int_{2\pi}^{0} t\sin t - \sin^2 t - 1 + 2\cos t - \cos^2 t\,dt = \frac{1}{2}a^2\int_{2\pi}^{0} t\sin t + 2\cos t - 2\,dt$$
$$= \frac{1}{2}a^2\bigl[-t\cos t + 3\sin t - 2t\bigr]_{2\pi}^{0} = 3\pi a^2.$$
□
9.B.12. Compute $I = \iint_S x^3\,dy\,dz + y^3\,dx\,dz + z^3\,dx\,dy$, where $S$ is given by the sphere $x^2 + y^2 + z^2 = 1$.
Solution. It is advantageous to work in spherical coordinates
$$x = \rho\sin\varphi\cos\psi,\quad \rho\in[0, 1],$$
$$y = \rho\sin\varphi\sin\psi,\quad \varphi\in[0, \pi],$$
$$z = \rho\cos\varphi,\quad \psi\in[0, 2\pi].$$
9.1.4. Integration of exterior forms on $\mathbb{R}^n$. Once we fix coordinates $(x_1, \dots, x_n)$ on $\mathbb{R}^n$ (e.g. the standard ones), there is the bijection between functions $f$ and top degree forms $\omega(x) = f(x)\,dx_1\wedge\cdots\wedge dx_n$. This can be interpreted as defining the scale with which the standard volume in $\mathbb{R}^n$ is to be taken pointwise due to the function $f$.
Notice that changing the coordinates via a transformation $F$ will rescale this understanding of the forms exactly as in the formula for coordinate substitution in the Riemann integral. We should view this observation as a new interpretation of the integrands in our earlier procedure of integration of functions $f$ on Riemann measurable open subsets $U\subset\mathbb{R}^n$, independent of any coordinate choice.
Let us check this interpretation in more detail. First, we define the $n$-form $\omega_{\mathbb{R}^n}$, giving the standard $n$-dimensional volume of parallelepipeds, i.e. in the standard coordinates we obtain
$$\omega_{\mathbb{R}^n} = dx_1\wedge\cdots\wedge dx_n.$$
If we want to integrate a function $f(x)$ "in the new way", we consider the form $\omega = f\,\omega_{\mathbb{R}^n}$ instead, i.e. $\omega = f(x)\,dx_1\wedge\cdots\wedge dx_n$. We define the integral of the form $\omega$ as
$$\int_U\omega = \int_U f(x)\,dx_1\wedge\cdots\wedge dx_n = \int_U f(x)\,dx_1\cdots dx_n,$$
where the Riemann integral of a function is considered on the right-hand side.
Let us point out that the $n$-form $\omega$ on the left-hand side is well defined, independently of any choice of coordinates. If we want to express the form $\omega$ in different coordinates using a diffeomorphism $G : V\to U$, $G = (g_1, \dots, g_n)$, it means we will evaluate $\omega$ at a point $G(y) = x$ at the values of the vectors $G_*(X_1), \dots, G_*(X_n)$. However, this means we will integrate the form $G^*\omega$ in coordinates $(y_1, \dots, y_n)$, and we already saw in the previous paragraph, cf. 9.1.3(1), that
$$(G^*\omega)(y) = f(G(y))\,\det(D^1G(y))\,dy_1\wedge\cdots\wedge dy_n.$$
Substituting into our interpretation of the integral, we get
$$\int_V G^*(f\,\omega_{\mathbb{R}^n}) = \int_{G^{-1}(U)} f(G(y))\,\det(D^1G(y))\,dy_1\cdots dy_n,$$
which is, by the theorem 8.2.8 on the coordinate substitution in the integral, the same value as $\int_U f\,\omega_{\mathbb{R}^n}$ if the determinant of the Jacobian matrix is positive, and the same value up to the sign if it is negative.
Our new interpretation thus provides the geometrical meaning for the integral of an $n$-form on $\mathbb{R}^n$, supposing the corresponding Riemann integral exists in some (and hence any) coordinates. This integration takes into account the orientation of the area we are integrating over. We shall come back to this point in a moment.
9.1.5. Integrals along curves. Our next goal is to integrate objects over domains which are similar to curves or surfaces in $\mathbb{R}^3$. Let us first shape our mind on the simplest case of the lowest dimension, i.e. the curves in $\mathbb{R}^n$.
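The absolute value of the Jacobian of this transformation is $\rho^2\sin\varphi$. The given integral is then equal to
$$I = \iint_S x^3\,dy\,dz + y^3\,dx\,dz + z^3\,dx\,dy = \iiint_V 3x^2 + 3y^2 + 3z^2\,dx\,dy\,dz$$
$$= 3\int_0^1\int_0^{2\pi}\int_0^{\pi}\rho^2\sin\varphi\,(\rho^2\sin^2\varphi\cos^2\psi + \rho^2\sin^2\varphi\sin^2\psi + \rho^2\cos^2\varphi)\,d\varphi\,d\psi\,d\rho$$
$$= 3\int_0^1\int_0^{2\pi}\int_0^{\pi}\rho^4\sin\varphi\,\bigl(\sin^2\varphi\,(\cos^2\psi + \sin^2\psi) +$$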
Recall the calculation of the length of a curve in $\mathbb{R}^n$ by univariate integrals, which was discussed in paragraph 6.1.6 on page 379. The curve was parametrized as a mapping $c(t) : \mathbb{R}\to\mathbb{R}^n$, and the size of the tangent vector $\|c'(t)\|$ was expressed in the Euclidean vector space. This procedure was given by the universal relation for an arbitrary tangent vector, i.e., we actually found the function $\rho : \mathbb{R}^n\to\mathbb{R}$ which gave the true size when evaluated at $c'(t)$. This mapping satisfied $\rho(a\,v) = |a|\,\rho(v)$ since we ignored the orientation of the curve given by our parametrization. If we wanted a signed length, respecting the orientation, then our mapping $\rho$ would be linear on every one-dimensional subspace $L\subset\mathbb{R}^n$. Of course we could have multiplied the Euclidean size by a positive function and integrated this quantity.
In view of our geometric approach to integration, we should rather integrate linear forms along curves, while the size of vectors is given by a quadratic form, rather than a linear one. However, in dimension one, we take the square root of the values of the (positive definite) quadratic form, in order to get a linear form (up to sign) which is just the size of the vectors.
Let us proceed in a similar way when dealing with linear differential forms $\eta$ on $\mathbb{R}^n$. The simplest ones are the differentials $df$ of functions $f$ on $\mathbb{R}^n$.
In order to motivate our development, let us consider the following task. Imagine we are cycling along a path $c(t)$ in $\mathbb{R}^2$, and the function $f$ is the altitude of the terrain. If we want to compute the total gain of altitude along the path $c(t)$, we should "integrate" the immediate infinitesimal gains, which should be the derivatives of $f$ in the directions of the tangent vectors to the path, i.e. $df(c'(t))$.
Thus, let us consider a differentiable curve $c(t)$ in $\mathbb{R}^n$, $t\in[a, b]$, write $M$ for the image $c([a, b])$, and assume that a differentiable function $f$ is defined on a neighborhood of $M$. The differential of this function gives for every tangent vector the increment of the function in the given direction. It is expressed by the differential of the composite mapping $f\circ c$
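$$+ \cos^2\varphi\bigr)\,d\varphi\,d\psi\,d\rho = 3\int_0^1\int_0^{2\pi}\int_0^{\pi}\rho^4\sin\varphi\,d\varphi\,d\psi\,d\rho = 3\left[\frac{\rho^5}{5}\right]_0^1\cdot 2\pi\cdot\bigl[-\cos\varphi\bigr]_0^{\pi} = \frac{12\pi}{5}.$$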
□
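As a cross-check of the value $\frac{12\pi}{5}$, the surface integral can also be evaluated directly, without passing to the triple integral. The sketch below assumes the outer unit normal $n = (x, y, z)$ on the unit sphere and uses a plain midpoint rule; the function name and grid sizes are illustrative choices only.

```python
import math

def flux_through_unit_sphere(n_phi=400, n_psi=400):
    """Midpoint-rule flux of F = (x^3, y^3, z^3) through the unit sphere.

    On the unit sphere F . n = x^4 + y^4 + z^4 and dS = sin(phi) dphi dpsi,
    where phi in [0, pi] and psi in [0, 2*pi] are the spherical angles.
    """
    h_phi = math.pi / n_phi
    h_psi = 2.0 * math.pi / n_psi
    total = 0.0
    for i in range(n_phi):
        phi = (i + 0.5) * h_phi
        for j in range(n_psi):
            psi = (j + 0.5) * h_psi
            x = math.sin(phi) * math.cos(psi)
            y = math.sin(phi) * math.sin(psi)
            z = math.cos(phi)
            total += (x ** 4 + y ** 4 + z ** 4) * math.sin(phi)
    return total * h_phi * h_psi

flux = flux_through_unit_sphere()
print(flux, 12.0 * math.pi / 5.0)  # both values should be close
```

The agreement of the two numbers is an instance of the Gauss-Ostrogradsky theorem used in the solution.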
9.B.13. The vector form of the Gauss-Ostrogradsky theorem. The divergence of a vector field $F(x, y, z) = f(x, y, z)\frac{\partial}{\partial x} + g(x, y, z)\frac{\partial}{\partial y} + h(x, y, z)\frac{\partial}{\partial z}$ is defined as $\operatorname{div}F := f_x + g_y + h_z$. Then, the Gauss-Ostrogradsky theorem can be formulated as follows:
$$\iiint_V \operatorname{div}F(x, y, z)\,dx\,dy\,dz = \iint_S F(x, y, z)\cdot n(x, y, z)\,dS,$$
where $n(x, y, z)$ is the outer unit normal to the surface $S$ at the point $[x, y, z]\in S$ ($S$ is the boundary of the normal domain $V$).
$$d(f\circ c)(t) = \frac{\partial f}{\partial x_1}(c(t))\,c_1'(t) + \cdots + \frac{\partial f}{\partial x_n}(c(t))\,c_n'(t).$$
We can thus try to define the value of the integral in the following way
9.B.14. Find the flow of the vector field given by the function $F = (xy^2, yz, x^2z)$ over the cylinder $x^2 + y^2 = 4$, $z = 1$, $z = 3$.
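$$\int_M df = \int_a^b\Bigl(\frac{\partial f}{\partial x_1}(c(t))\,c_1'(t) + \cdots + \frac{\partial f}{\partial x_n}(c(t))\,c_n'(t)\Bigr)\,dt,$$
and we immediately verify that the change of the parametrization of the curve has no effect upon the value. Indeed, writing $c(t) = c(\psi(s))$, $a = \psi(\tilde a)$, $b = \psi(\tilde b)$, our procedure yields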
Solution. First of all, we compute the divergence of the vector field:
$$\operatorname{div}F = \nabla\cdot F = \frac{\partial(xy^2)}{\partial x} + \frac{\partial(yz)}{\partial y} + \frac{\partial(x^2z)}{\partial z} = y^2 + z + x^2.$$
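$$\int_{\tilde a}^{\tilde b}\Bigl(\frac{\partial f}{\partial x_1}(c(\psi(s)))\,c_1'(\psi(s)) + \cdots + \frac{\partial f}{\partial x_n}(c(\psi(s)))\,c_n'(\psi(s))\Bigr)\frac{d\psi}{ds}\,ds,$$
and the theorem about coordinate transformations for univariate integrals gives just the same value if we have $\frac{d\psi}{ds} > 0$, i.e.,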
Therefore, the flow $T$ of the vector field is equal to
$$T = \iiint_V y^2 + z + x^2\,dx\,dy\,dz = \int_1^3\int_0^{2\pi}\int_0^2\rho\,(\rho^2\sin^2\varphi + z + \rho^2\cos^2\varphi)\,d\rho\,d\varphi\,dz =$$
if we keep the orientation of the curve, and the same value up to sign if the derivative of the transformation is negative.
If we extend the same definition to an arbitrary linear form $\eta = \eta_1\,dx_1 + \cdots + \eta_n\,dx_n$, we arrive at the same formulae with $\eta_i$ replacing the derivatives $\frac{\partial f}{\partial x_i}$,
$$\int_M\eta = \int_a^b\bigl(\eta_1(c(t))\,c_1'(t) + \cdots + \eta_n(c(t))\,c_n'(t)\bigr)\,dt,$$
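$$\int_1^3\int_0^{2\pi}\int_0^2\rho\,\bigl(\rho^2(\sin^2\varphi + \cos^2\varphi) + z\bigr)\,d\rho\,d\varphi\,dz = \int_1^3\int_0^{2\pi}\int_0^2\rho^3 + \rho z\,d\rho\,d\varphi\,dz$$
$$= 2\pi\int_1^3\int_0^2\rho^3 + \rho z\,d\rho\,dz = 2\pi\int_1^3\left[\frac{\rho^4}{4} + \frac{\rho^2}{2}z\right]_0^2 dz = 2\pi\int_1^3 4 + 2z\,dz$$
$$= 2\pi\,[4z + z^2]_1^3 = 2\pi\,[12 + 9 - 4 - 1] = 32\pi.$$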
□
9.B.15. Find the flow of the vector field given by the function $F = (y, x, z^2)$ over the sphere $x^2 + y^2 + z^2 = 4$.
again independent of the parametrization of the curve $c$ as above.
In the above example with $n = 2$, $f$ was the altitude of the terrain, and the integral of $df$ along the path modelled the total gain of elevation. Thus, we should expect that the total gain along the path should depend on the values $c(a)$ and $c(b)$ only, while different curves with the same boundary points would produce different integrals of $\eta$ for a general 1-form $\eta$. This will indeed be the special claim of the Stokes theorem below.
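The difference between $df$ and a general 1-form can be seen in a small numerical experiment. The following sketch approximates the integral of a linear form along a parametrized curve by the midpoint rule; the sample altitude $f(x, y) = x^2y$, the form $-y\,dx + x\,dy$, and both paths are hypothetical choices made for illustration.

```python
def line_integral(eta, c, dc, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of the 1-form eta along c.

    eta(p) returns the coefficients (eta_1, ..., eta_n) at the point p,
    c(t) is the curve and dc(t) its derivative, t in [a, b].
    """
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        t = a + (i + 0.5) * h
        coeffs, velocity = eta(c(t)), dc(t)
        total += sum(e * v for e, v in zip(coeffs, velocity))
    return total * h

# Sample altitude f(x, y) = x^2 * y with differential df = 2xy dx + x^2 dy.
f = lambda p: p[0] ** 2 * p[1]
df = lambda p: (2 * p[0] * p[1], p[0] ** 2)
# A 1-form which is not the differential of any function: -y dx + x dy.
rot = lambda p: (-p[1], p[0])

# Two different paths from (0, 0) to (1, 1).
straight, d_straight = (lambda t: (t, t)), (lambda t: (1.0, 1.0))
parabola, d_parabola = (lambda t: (t, t * t)), (lambda t: (1.0, 2.0 * t))

gain_straight = line_integral(df, straight, d_straight, 0.0, 1.0)
gain_parabola = line_integral(df, parabola, d_parabola, 0.0, 1.0)
rot_straight = line_integral(rot, straight, d_straight, 0.0, 1.0)
rot_parabola = line_integral(rot, parabola, d_parabola, 0.0, 1.0)

print(gain_straight, gain_parabola, f((1, 1)) - f((0, 0)))  # all three agree
print(rot_straight, rot_parabola)                            # these differ
```

Both integrals of $df$ equal $f(1, 1) - f(0, 0)$, whereas the integrals of $-y\,dx + x\,dy$ depend on the chosen path.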
Before we treat the higher dimensional analogs, we shall look at a more abstract approach to suitable subsets in $\mathbb{R}^n$ and the role of coordinates on them.
9.1.6. Manifolds. The straightforward generalizations of parametrized curves $c(t) : \mathbb{R}\to\mathbb{R}^n$ are the differentiable mappings $\varphi : V\subset\mathbb{R}^k\to\mathbb{R}^n$, $k\le n$, with injective differential $d\varphi(u)$ at every point of its open domain $V$. Such mappings are called immersions.
With the curves, we did not care about their self-intersections etc. Now, for technical reasons, we shall be more demanding.
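Solution. The divergence of the given vector field is:
$$\operatorname{div}F = \nabla\cdot F = \frac{\partial y}{\partial x} + \frac{\partial x}{\partial y} + \frac{\partial z^2}{\partial z} = 2z.$$
Thus, the wanted flow equals
$$\iiint_V 2z\,dx\,dy\,dz = \int_0^2\int_0^{2\pi}\int_0^{\pi}\rho^2\sin\varphi\cdot 2\rho\cos\varphi\,d\varphi\,d\psi\,d\rho = 2\int_0^2\rho^3\,d\rho\int_0^{2\pi}d\psi\int_0^{\pi}\sin\varphi\cos\varphi\,d\varphi$$
$$= 2\left[\frac{\rho^4}{4}\right]_0^2\cdot\bigl[\psi\bigr]_0^{2\pi}\cdot\left[\frac{\sin^2\varphi}{2}\right]_0^{\pi} = 2\cdot\frac{16}{4}\cdot 2\pi\cdot 0 = 0.$$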
Manifolds in $\mathbb{R}^n$
A subset $M\subset\mathbb{R}^n$ is called a manifold of dimension $k$ if every point $x\in M$ has a neighborhood $U\subset\mathbb{R}^n$ which is the image of a diffeomorphism $\tilde\varphi : V\times V'\to\mathbb{R}^n$, $V\subset\mathbb{R}^k$, $V'\subset\mathbb{R}^{n-k}$, such that
• the restriction $\varphi = \tilde\varphi|_V : V\to M$ is an immersion,
• $\tilde\varphi^{-1}(M) = V\times\{0\}\subset\mathbb{R}^n$.
The manifolds $M$ carry the topology inherited from $\mathbb{R}^n$.
□
C. Equation of heat conduction
9.C.1. Find the solution to the so-called equation of heat conduction (equation of diffusion)
./img/0214_eng.png
$$u_t(x, t) = a^2u_{xx}(x, t),\quad x\in\mathbb{R},\ t > 0,$$
satisfying the initial condition $\lim_{t\to 0+}u(x, t) = f(x)$.
Notes: The symbol $u_t = \frac{\partial u}{\partial t}$ stands for the partial derivative of the function $u$ with respect to $t$ (i.e., differentiating with respect to $t$ and considering $x$ to be constant), and similarly, $u_{xx} = \frac{\partial^2u}{\partial x^2}$ denotes the second partial derivative with respect to $x$ (i.e., twice differentiating with respect to $x$ while considering $t$ to be constant). The physical interpretation of this problem is as follows: We are trying to determine the temperature $u(x, t)$ in a thermally isolated and homogeneous bar of infinite length (the range of the variable $x$) if the initial temperature of the bar is given as the function $f$. The section of the bar is constant and the heat can spread in it by conduction only. The coefficient $a^2$ then equals the quotient $\frac{\alpha}{c\varrho}$, where $\alpha$ is the coefficient of thermal conductivity, $c$ is the specific heat and $\varrho$ is the density. In particular, we assume that $a^2 > 0$.
Solution. We apply the Fourier transform to the equation, with respect to the variable $x$. We have
$$\mathcal{F}(u_t)(\omega, t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}u_t(x, t)\,e^{-i\omega x}\,dx = \left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}u(x, t)\,e^{-i\omega x}\,dx\right)',$$
where the prime denotes differentiation with respect to $t$, i.e.,
$$\mathcal{F}(u_t)(\omega, t) = \bigl(\mathcal{F}(u)(\omega, t)\bigr)' = \bigl(\mathcal{F}(u)\bigr)_t(\omega, t).$$
At the same time, we know that
$$\mathcal{F}(a^2u_{xx})(\omega, t) = a^2\mathcal{F}(u_{xx})(\omega, t) = -a^2\omega^2\,\mathcal{F}(u)(\omega, t).$$
Denoting $y(\omega, t) = \mathcal{F}(u)(\omega, t)$, we get the equation
$$y_t = -a^2\omega^2\,y.$$
We already solved a similar differential equation when we were calculating Fourier transforms, so it is now easy for us to determine all of its solutions
$$y(\omega, t) = K(\omega)\,e^{-a^2\omega^2t},\quad K(\omega)\in\mathbb{R}.$$
It remains to determine $K(\omega)$. The transformation of the initial condition gives
$$\mathcal{F}(f)(\omega) = \lim_{t\to 0+}\mathcal{F}(u)(\omega, t) = \lim_{t\to 0+}y(\omega, t) = K(\omega)\,e^0 = K(\omega),$$
hence
$$y(\omega, t) = \mathcal{F}(f)(\omega)\,e^{-a^2\omega^2t}.$$
Now, using the inverse Fourier transform, we can return to the original differential equation with solution
This definition is illustrated by the picture above. Manifolds can be typically (at least locally) given implicitly as the level sets of differentiable mappings, see paragraph 8.1.23 and the discussion in 8.1.25.
The mapping $\varphi$ from the definition is called the local parametrization or local map of the manifold $M$. The manifolds are a straightforward generalization of curves and surfaces in the plane $\mathbb{R}^2$ or the space $\mathbb{R}^3$. We have excluded curves and surfaces which are self-intersecting and even those which are self-approaching.
For instance, we can surely imagine a curve representing the figure 8 parametrized with a mapping $\varphi$ with everywhere injective differential. However, we will be unable to satisfy the second property from the manifold definition in a neighborhood of the point where the two branches of the curve meet.
Tangent and cotangent bundles of manifolds
The tangent bundle $TM$ of the manifold $M$ is the collection of vector subspaces $T_xM\subset T_x\mathbb{R}^n$ which contain all vectors tangent to the curves in $M$. There is the footpoint projection $p : TM\to M$.
Similarly, the cotangent bundle $T^*M$ of the manifold $M$ is the collection of the dual spaces $(T_xM)^*$, together with the footpoint projection.
./img/0214_eng.png
Clearly, every parametrization $\varphi$ defines a diffeomorphism
$$\varphi_* : TV\to T(\varphi(V))\subset TM,\quad \varphi_*(c'(t)) = \frac{d}{dt}\varphi(c(t)).$$
Due to the chain rule, this definition does not depend on the choice of the representing curve $c(t)$. We shall also write $T\varphi$ for the mapping $\varphi_*$.
In particular, the local maps $\varphi$ (extended to $\tilde\varphi$, as in the above definition) induce the local maps $\varphi_* : TV = V\times\mathbb{R}^k\to TM\subset\mathbb{R}^n\times\mathbb{R}^n$ of the tangent bundle. Thus, the tangent bundle $TM$ is again a manifold, which locally looks like $U\times\mathbb{R}^k$ over sufficiently small open subsets $U\subset M$. But we shall see that $TM$ might be quite different from $M\times\mathbb{R}^k$ globally. Dealing with the cotangent bundle, we can use the dual mappings $(T\varphi^{-1})^*$ on the individual fibers $T_x^*M$ to obtain local parametrizations.
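$$u(x, t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}y(\omega, t)\,e^{i\omega x}\,d\omega = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathcal{F}(f)(\omega)\,e^{-a^2\omega^2t}\,e^{i\omega x}\,d\omega$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(s)\,e^{-i\omega s}\,ds\right)e^{-a^2\omega^2t}\,e^{i\omega x}\,d\omega = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(s)\left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-a^2\omega^2t}\,e^{-i\omega(s - x)}\,d\omega\right)ds.$$
Computing the Fourier transform $\mathcal{F}(f)$ of the function $f(t) = e^{-at^2}$ for $a > 0$, we have obtained (while relabeling the variables)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-cp^2}\,e^{-irp}\,dp = \frac{1}{\sqrt{2c}}\,e^{-\frac{r^2}{4c}},\quad c > 0.$$
According to this formula (consider $c = a^2t > 0$, $p = \omega$, $r = s - x$), we have
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-a^2\omega^2t}\,e^{-i\omega(s - x)}\,d\omega = \frac{1}{\sqrt{2a^2t}}\,e^{-\frac{(s - x)^2}{4a^2t}}.$$
Therefore,
$$u(x, t) = \frac{1}{2a\sqrt{\pi t}}\int_{-\infty}^{\infty}f(s)\,e^{-\frac{(x - s)^2}{4a^2t}}\,ds.$$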
□
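The resulting integral formula $u(x, t) = \frac{1}{2a\sqrt{\pi t}}\int_{-\infty}^{\infty}f(s)\,e^{-(x - s)^2/(4a^2t)}\,ds$ can be tested numerically. The sketch below assumes the initial temperature $f(s) = e^{-s^2}$, for which the integral is a convolution of two Gaussians and has the closed form $u(x, t) = (1 + 4a^2t)^{-1/2}e^{-x^2/(1 + 4a^2t)}$. The function names, the truncation of the integration range, and the step count are our own choices.

```python
import math

def heat_solution(f, x, t, a=1.0, half_width=12.0, steps=4000):
    """Midpoint-rule value of 1/(2a sqrt(pi t)) * integral f(s) exp(-(x-s)^2/(4a^2 t)) ds.

    The infinite range is truncated to [x - half_width, x + half_width],
    which is harmless here because the Gaussian kernel decays very fast.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        s = x - half_width + (i + 0.5) * h
        total += f(s) * math.exp(-((x - s) ** 2) / (4.0 * a * a * t))
    return total * h / (2.0 * a * math.sqrt(math.pi * t))

def gaussian_initial(s):
    """Sample initial temperature f(s) = exp(-s^2)."""
    return math.exp(-s * s)

def exact_for_gaussian(x, t, a=1.0):
    """Closed-form solution for the Gaussian initial temperature above."""
    k = 1.0 + 4.0 * a * a * t
    return math.exp(-x * x / k) / math.sqrt(k)

numeric = heat_solution(gaussian_initial, 0.7, 0.5)
print(numeric, exact_for_gaussian(0.7, 0.5))  # the two values should agree
```

As $t\to 0+$ the closed form tends to $e^{-x^2}$, in accordance with the initial condition.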
Notice that two differentiable immersions $\varphi$ and $\psi$ parametrizing the same open subset $U\subset M$ provide the composition $\psi^{-1}\circ\varphi$. We view this as a coordinate change for $U$ and we have just seen that coordinate changes on $M$ induce coordinate changes on $TM$.
Further, if $M$ and $N$ are two manifolds and $F : M\to N$ a mapping, we say that $F$ is differentiable (up to order $r$ or smooth or analytic), if the compositions $\psi^{-1}\circ F\circ\varphi$ with two local parametrizations $\varphi$ of $M$ and $\psi$ of $N$ (of the same order of differentiability as we want to check) are differentiable (up to order $r$ or smooth or analytic). Again, the chain rule property of differentiation shows that this definition does not depend on the particular choice of the parametrizations.
Each differentiable mapping $F : M\to N$ defines the tangent mapping $TF : TM\to TN$ between the tangent spaces, which clearly is differentiable of order one less than the assumed differentiability of $F$.
Vector fields and differential forms on manifolds
Smooth vector fields $X$ on a manifold $M$ are smooth sections $X : M\to TM$ of the footpoint projection $p : TM\to M$.
Smooth $k$-forms $\eta$ on a manifold $M$ are sections $M\to\Lambda^k(TM)^*$ such that the pullback of this form by any parametrization $V\to M$ yields a smooth exterior $k$-form on $V$.
We write $\mathcal{X}(M)$ for the space of smooth vector fields on $M$, while $\Omega^k(M)$ stands for the space of all smooth exterior $k$-forms on $M$.
Notice that all our coordinate formulae for the vector fields, forms, pullbacks etc. on $\mathbb{R}^m$ hold true in the more abstract setting of manifolds and their local parametrizations.¹
9.1.7. Integration of exterior forms on manifolds. Now, we are almost ready for the definition of the integral of $k$-forms on $k$-dimensional manifolds.
For the sake of simplicity, we will examine smooth forms $\omega$ with compact support only. First, let us assume that we are given a $k$-dimensional manifold $M\subset\mathbb{R}^n$ and one of its local parametrizations $\varphi : V\subset\mathbb{R}^k\to U\subset M\subset\mathbb{R}^n$. We consider the standard orientation on $\mathbb{R}^k$ given by the standard basis (cf. 4.1.22 for the definition of the orientation of a vector space). The choice of the parametrization $\varphi$ also fixes the orientation of the manifold $U\subset M$. This orientation will be the same for those choices of local parametrizations which differ by diffeomorphisms with positive determinants of their Jacobi matrices. The orientation will be the other one in the case of negative determinants. The manifold $M$ is called orientable if there
¹ Actually, instead of dealing with manifolds as subsets of $\mathbb{R}^n$, we might use the same concept of local parametrizations of a space $M$ with differentiable transition functions $\psi^{-1}\circ\varphi$. We just need to know what the "open subsets" in $M$ are, thus we could start at the level of topological spaces. On the other hand, there is the general result (the so-called Whitney embedding theorem) that each such abstract $n$-dimensional manifold can be realized as embedded in $\mathbb{R}^{2n}$, so we essentially do not lose any generality here.
is a covering of the entire set $M$ by local parametrizations $\varphi$ such that their orientations coincide.
Therefore, we have exactly two orientations on every connected orientable manifold. Fixing either of them, we thereby restrict the set of parametrizations to those compatible with this orientation. From now on, we will always proceed in this fashion, and we will talk about oriented manifolds only.
Next, let us fix a form $\omega$ with compact support inside the image $U\subset M$ of one parametrization $\varphi$ of an oriented manifold $M$. The pullback form $\varphi^*(\omega)$ is a smooth $k$-form on $V\subset\mathbb{R}^k$ with compact support. The integral of the form $\omega$ on $M$ is defined in terms of the chosen parametrization which is compatible with the orientation as follows:
$$\int_M\omega = \int_{\mathbb{R}^k}\varphi^*(\omega).$$
If we choose a different compatible parametrization $\tilde\varphi = \varphi\circ\psi$ where $\psi$ is a diffeomorphism $\psi : W\to V\subset\mathbb{R}^k$, we can easily compute the result, following the same definition. Let us denote
$$\varphi^*(\omega)(u) = f(u)\,du_1\wedge\cdots\wedge du_k.$$
Invoking the relation 9.1.2(2) for the pullback of a form by a composite mapping, we get
$$\int_M\omega = \int_{\mathbb{R}^k}\tilde\varphi^*(\omega) = \int_{\mathbb{R}^k}\psi^*(\varphi^*\omega) = \int_{\mathbb{R}^k}\psi^*(f\,du_1\wedge\cdots\wedge du_k) = \int_{\mathbb{R}^k}f(\psi(v))\,\det(D^1\psi)(v)\,dv_1\cdots dv_k.$$
Since the determinant is positive for compatible parametrizations, the substitution theorem shows that this is again the same value as $\int_{\mathbb{R}^k}\varphi^*\omega$.
This proves the correctness of our definition of the integral $\int_M\omega$ provided the integrated $k$-form has compact support lying in the image of a single parametrization.
However, typical manifolds $M$ are given by implicit equations. For example, $x^2 + y^2 + z^2 = 1$ defines the surface of the unit ball, i.e., the sphere $S^2\subset\mathbb{R}^3$. If we want to integrate an exterior 2-form on $S^2$, we will have to use several parametrizations. Fortunately, our definition of the integral is additive with respect to disjoint unions of integration domains. Therefore, if we can write
$$M = U_1\cup U_2\cup\cdots\cup U_m\cup B,$$
where $U_i$ are pairwise disjoint images of parametrizations $\varphi_i$, and $B$ is a set whose inverse image in any parametrization is a Riemann measurable set with measure zero, we can compute
$$\int_M\omega = \int_{U_1}\omega + \cdots + \int_{U_m}\omega,$$
and we can easily verify that this value is independent of the choice of the sets $U_i$ and the parametrizations (in particular, we need not be worried by the set $B$ since the result of any integration on it is zero). For example, we can imagine splitting a sphere into the upper and lower hemispheres, leaving the equator $B$ uncovered.
When calculating in practice, we usually divide the entire manifold into several disjoint open areas with compact closures, and we integrate on each of them separately. However, this procedure still does not help if we stick with the strict assumption that the entire support of the integrated form has to be inside the image of one parametrization. Thus, we will develop a global definition of the integral, which is more advantageous from the technical and theoretical point of view (although it usually does not help in computations directly).
9.1.8. Partition of unity. Consider a manifold $M\subset\mathbb{R}^n$ and one of its covers by open images $U_i$ of parametrizations $\varphi_i$. We can surely find a countable cover of each manifold $M$ (it suffices to realize that we can do with parametrizations which map the origin to points with rational coordinates in $\mathbb{R}^n$). Furthermore, we shall assume that any point $x\in M$ belongs to only finitely many sets $U_i$. Such a cover is called a locally finite cover by parametrizations $\varphi_i$.²
Now, recall the smooth variants of indicator functions from paragraph 6.1.6. For every pair of positive numbers $\varepsilon < r$, we constructed a function $f_{\varepsilon,r}(t)$ of one real variable $t$ such that $f_{\varepsilon,r}(t) = 1$ for $|t| < r - \varepsilon$, while $f_{\varepsilon,r}(t) = 0$ for $|t| > r + \varepsilon$, and $0\le f_{\varepsilon,r}(t)\le 1$ everywhere. At the same time, we had $f_{\varepsilon,r}(t)\ne 0$ if and only if $|t| < r + \varepsilon$. Next, if we define
$$\chi_{r,\varepsilon,x_0}(x) = f_{\varepsilon,r}(|x - x_0|),$$
then we get a smooth function which takes the value 1 inside the ball $B_{r-\varepsilon}(x_0)$, with support exactly $B_{r+\varepsilon}(x_0)$, and with values between 0 and 1 everywhere.
Lemma (Whitney's theorem). Every closed set $K\subset\mathbb{R}^n$ is the set of all zero points of some smooth non-negative function.
Proof. The idea of the proof is quite simple. If $K = \mathbb{R}^n$, the zero function fulfills the conditions, so we can further assume that $K\ne\mathbb{R}^n$.
The open set $U = \mathbb{R}^n\setminus K$ can be expressed as the union of (at most) countably many open balls $B_{r_i}(x_i)$, and for each of them, we choose a smooth non-negative function $f_i$ on $\mathbb{R}^n$ whose support is just $B_{r_i}(x_i)$, see the function $\chi_{r,\varepsilon,x_0}$ above. Now, we add up all these functions into an infinite series
$$f(x) = \sum_{k=1}^{\infty}a_kf_k(x),$$
where the positive coefficients $a_k$ are selected so small that this series converges to a smooth function $f(x)$.
For this purpose, it suffices to choose $a_k$ so that all partial derivatives of all functions $a_kf_k(x)$ up to order $k$ (inclusive) are bounded from above by $2^{-k}$. Then, not only is the series $\sum_ka_kf_k$ bounded from above by the series $\sum_k2^{-k}$, hence by the Weierstrass criterion, it converges uniformly on the
² This property is called paracompactness and, actually, each metric space is paracompact. Thus, in particular, all our manifolds enjoy this property too. But we do not want to go into details of the proof.
entire $\mathbb{R}^n$, but we get the same for all series of partial derivatives, since we can always write them as
$$\sum_{k=1}^{r-1}a_k\frac{\partial^rf_k}{\partial x_{i_1}\cdots\partial x_{i_r}} + \sum_{k=r}^{\infty}a_k\frac{\partial^rf_k}{\partial x_{i_1}\cdots\partial x_{i_r}},$$
where the first part is a smooth function as it is a finite sum of smooth functions, and the second part can again be bounded from above by an absolutely converging series of numbers, so this expression will converge uniformly to $\frac{\partial^rf}{\partial x_{i_1}\cdots\partial x_{i_r}}$.
It is apparent from the definition that the function $f(x)$ satisfies the conditions of the lemma. □
Partition of unity on a manifold
Theorem. Consider a manifold $M\subset\mathbb{R}^n$ equipped with a locally finite cover by open images $U_i$ of parametrizations $\varphi_i$. Then, there exists a system of smooth non-negative functions $f_i$ on $M$ such that for every point $x\in M$, we have $\sum_if_i(x) = 1$, and $f_i(x)\ne 0$ if and only if $x\in U_i$.
The system of functions $f_i$ from the theorem is called the partition of unity subordinated to the locally finite cover of the manifold by the open sets $U_i$.
Proof. First, we extend the sets $U_i$ to open sets $\tilde U_i\subset\mathbb{R}^n$ using the extended parametrizations $\tilde\varphi_i$ from the definition of a manifold and its local parametrizations. We can surely do this in such a way that the sets $\tilde U_i$ keep being a locally finite cover of an open neighborhood $\tilde U = \cup_i\tilde U_i\subset\mathbb{R}^n$ of the manifold $M$.
For every open set $\tilde U_i$, we can choose a smooth non-negative function $g_i(x)$ on the whole $\mathbb{R}^n$ so that $g_i(x)\ne 0$ exactly for $x\in\tilde U_i$. This can be done by Whitney's theorem proved in the above Lemma, applied to the closed complement of $\tilde U_i$. Now, the function $g(x) = \sum_ig_i(x)$ is well-defined for all $x\in\tilde U$ and smooth, thanks to the cover
being locally finite (for every fixed point $x$, it is a finite sum of non-zero functions on some of its neighborhoods). The function $g(x)$ is positive for all $x\in M$. Thus, instead of the functions $g_i(x)$ restricted to $M$, we may rather consider the functions $f_i(x) = g_i(x)/g(x)$, which already have both of the required properties of the theorem. □
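The normalization step $f_i = g_i/g$ from the proof is easy to try out on the real line. The sketch below is only an illustration under our own assumptions: the cover consists of the intervals $(i - 1, i + 1)$ with integer $i$, and the function `bump` is one standard choice of a smooth function which is non-zero exactly on a given open interval (it is not necessarily the function constructed in paragraph 6.1.6).

```python
import math

def bump(t, left, right):
    """Smooth non-negative function, non-zero exactly on the open interval (left, right)."""
    if t <= left or t >= right:
        return 0.0
    return math.exp(-1.0 / ((t - left) * (right - t)))

def g(i, t):
    """g_i for the hypothetical locally finite cover of R by the intervals (i - 1, i + 1)."""
    return bump(t, i - 1.0, i + 1.0)

def partition_of_unity(i, t):
    """f_i(t) = g_i(t) / sum_j g_j(t); only the intervals near t contribute to the sum."""
    centre = math.floor(t)
    total = sum(g(j, t) for j in range(centre - 1, centre + 3))
    return g(i, t) / total

for t in (-0.3, 0.0, 0.5, 1.999, 2.0):
    print(t, sum(partition_of_unity(i, t) for i in range(-5, 10)))  # always sums to 1
```

Local finiteness is what makes the denominator a finite sum at every point; here at most two of the functions $g_j$ are non-zero at any given $t$.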
9.1.9. Integration of $k$-forms on manifolds. Now, we are ready for the definition of the integral of $k$-forms on $k$-dimensional manifolds. Let us consider an oriented manifold $M\subset\mathbb{R}^n$ and a form $\omega\in\Omega^k(M)$ with compact support.
Let us choose a locally finite cover of the manifold $M$ by parametrizations $\varphi_i : V_i\to U_i$ such that the closures of all images $\varphi_i(V_i)$ are compact and, finally, choose a partition of unity $f_i$ subordinated to this cover. The integral is defined by the formula
$$\int_M\omega = \int_M\sum_if_i\,\omega = \sum_i\int_{U_i}f_i\,\omega,$$
where the right-hand integrals have already been defined since each of the forms $f_i\,\omega$ has support inside the image of the parametrization $\varphi_i$ (and they equal $\int_Mf_i\,\omega$ for the same reason).
Actually, we can assume that our sum is finite, since it suffices to consider the integrals over the images of the parametrizations covering the compact support of $\omega$. Hence, it is a well-defined number, yet it remains to verify that the resulting value is independent of all our choices.
For this purpose, let us choose another system of parametrizations $\psi_j : \tilde V_j\to\tilde U_j$, again with compatible orientations, providing a locally finite cover of $M$. Let $g_j$ be the corresponding partition of unity. Then the sets $W_{ij} = U_i\cap\tilde U_j$ form again a locally finite covering and the functions $f_ig_j$ provide the partition of unity subordinated to this covering. We arrive at the following equalities:
$$\sum_i\int_Mf_i\,\omega = \sum_i\int_Mf_i\Bigl(\sum_jg_j\Bigr)\omega = \sum_{i,j}\int_Mf_ig_j\,\omega,$$
$$\sum_j\int_Mg_j\,\omega = \sum_j\int_Mg_j\Bigl(\sum_if_i\Bigr)\omega = \sum_{i,j}\int_Mf_ig_j\,\omega,$$
where the potentially infinite sums inside of the integrals are all locally finite, while the sums outside of the integrals can be viewed as finite due to the compactness of the support of $\omega$. Thus, we have checked that the choices of the partition of unity and the parametrizations do not influence the value of the integral.
9.1.10. Exterior differential. As we have seen, the differential of a function can be interpreted as a mapping
$$d : \Omega^0(\mathbb{R}^n)\to\Omega^1(\mathbb{R}^n).$$
By means of parametrizations, this definition extends (in a coordinate free way) to functions $f$ on manifolds $M$, where the differential $df$ is a linear form on $M$. The following theorem extends this differential to arbitrary exterior forms on manifolds $M\subset\mathbb{R}^n$.
Exterior differential
Theorem. For all $m$-dimensional manifolds $M\subset\mathbb{R}^n$ and $k = 0, \dots, m$, there is the unique mapping
$$d : \Omega^k(M)\to\Omega^{k+1}(M),$$
such that
(1) $d$ is linear with respect to multiplication by real numbers;
(2) for $k = 0$, this is the differential of functions;
(3) if $\alpha\in\Omega^k(M)$, $\beta$ arbitrary, then
$$d(\alpha\wedge\beta) = (d\alpha)\wedge\beta + (-1)^k\alpha\wedge(d\beta);$$
(4) $d(df) = 0$ for every function $f$ on $M$.
The mapping $d$ is called the exterior differential. The equality $d\circ d = 0$ is valid for all degrees $k$.
Proof. Each $k$-form can be written locally in the form
$$\alpha = \sum_{i_1<\cdots<i_k}a_{i_1\dots i_k}\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}.$$
If the differential $d$ exists, then by the required properties, it must be equal to
$$(5)\qquad d\alpha = \sum_{i_1<\cdots<i_k}da_{i_1\dots i_k}\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k} = \sum_{i;\ i_1<\cdots<i_k}\frac{\partial a_{i_1\dots i_k}}{\partial x_i}\,dx_i\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k}.$$
Indeed, the generators $dx_i$ of linear forms are in fact the differentials of the coordinate functions, so further differentiation must lead to zero by the last property, while we know the differential of functions. Further, we have $d(f\beta) = df\wedge\beta + f\,d\beta$ by property (3).
Thus, let us define the differential $d$ in coordinates by the formula (5), and we are going to verify all of the required properties. We shall proceed in two steps.
First, we check the requirements in one coordinate patch. The first two requirements are obvious from the formula (5). It is enough to verify the property (3) for the special forms $\alpha = a\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$ and $\beta = b\,dx_{j_1}\wedge\cdots\wedge dx_{j_\ell}$, since then the property must hold for the sums of such forms too.
We compute
$$d(\alpha\wedge\beta) = d(ab\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}\wedge dx_{j_1}\wedge\cdots\wedge dx_{j_\ell})$$
$$= \sum_i\frac{\partial a}{\partial x_i}\,b\,dx_i\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k}\wedge dx_{j_1}\wedge\cdots\wedge dx_{j_\ell} + \sum_ia\,\frac{\partial b}{\partial x_i}\,dx_i\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k}\wedge dx_{j_1}\wedge\cdots\wedge dx_{j_\ell}$$
$$= d\alpha\wedge\beta + (-1)^k\alpha\wedge d\beta,$$
as expected.
The last property can be again verified for simple forms $\alpha = a\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$. Applying the formula (5) twice, we arrive at
$$d(d\alpha) = d\Bigl(\sum_i\frac{\partial a}{\partial x_i}\,dx_i\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k}\Bigr) = \sum_{i,j}\frac{\partial^2a}{\partial x_j\partial x_i}\,dx_j\wedge dx_i\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k}$$
$$= \sum_{i<j}\Bigl(\frac{\partial^2a}{\partial x_i\partial x_j} - \frac{\partial^2a}{\partial x_j\partial x_i}\Bigr)\,dx_i\wedge dx_j\wedge dx_{i_1}\wedge\cdots\wedge dx_{i_k} = 0,$$
where we used the fact that the wedge product $dx_i\wedge dx_i$ vanishes, $dx_j\wedge dx_i = -dx_i\wedge dx_j$, and the second partial derivatives of functions are symmetric.
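The second step in the proof is the verification that the coordinate formula (5) correctly defines a differential operator on general manifolds $M$. In order to achieve this, it is sufficient to show that the coordinate expression of the exterior derivative commutes with the pullbacks of forms. Indeed, we may then define the differential operator around any point in any coordinates and the results will coincide.³
Thus, consider a change of coordinates $G : U\to V$, $x = G(y) = (g_1(y), \dots, g_m(y))$, and compute $G^*(d\alpha)$ of an exterior form $\alpha = a\,dx_{i_1}\wedge\cdots\wedge dx_{i_k}$ (which gives the result for sums of such expressions too). This is straightforward:
$$G^*(d\alpha) = \sum_i\Bigl(\frac{\partial a}{\partial x_i}\circ G\Bigr)\,G^*(dx_i)\wedge G^*(dx_{i_1})\wedge\cdots\wedge G^*(dx_{i_k})$$
$$= \sum_i\Bigl(\frac{\partial a}{\partial x_i}\circ G\Bigr)\Bigl(\frac{\partial g_i}{\partial y_1}\,dy_1 + \cdots\Bigr)\wedge\Bigl(\frac{\partial g_{i_1}}{\partial y_1}\,dy_1 + \cdots\Bigr)\wedge\cdots$$
Now, notice $d(G^*(dx_j)) = G^*(d(dx_j)) = 0$ and thus
$$d(G^*(\alpha)) = d\bigl((a\circ G)\,G^*(dx_{i_1})\wedge\cdots\bigr) = d(a\circ G)\wedge G^*(dx_{i_1})\wedge\cdots\wedge G^*(dx_{i_k})$$
$$= \sum_i\Bigl(\Bigl(\frac{\partial a}{\partial x_i}\circ G\Bigr)\Bigl(\frac{\partial g_i}{\partial y_1}\,dy_1 + \cdots\Bigr)\Bigr)\wedge G^*dx_{i_1}\wedge\cdots,$$
clearly the same expressions. □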
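As a small worked example of the coordinate formula for $d$ (our own illustration, with $P$, $Q$ denoting arbitrary smooth coefficient functions on $\mathbb{R}^2$), consider a 1-form in the plane. The computation uses only $dx\wedge dx = dy\wedge dy = 0$ and $dy\wedge dx = -dx\wedge dy$:

```latex
\begin{aligned}
\alpha &= P\,dx + Q\,dy,\\
d\alpha &= dP\wedge dx + dQ\wedge dy
  = \Bigl(\frac{\partial P}{\partial x}\,dx + \frac{\partial P}{\partial y}\,dy\Bigr)\wedge dx
  + \Bigl(\frac{\partial Q}{\partial x}\,dx + \frac{\partial Q}{\partial y}\,dy\Bigr)\wedge dy\\
  &= \Bigl(\frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y}\Bigr)\,dx\wedge dy .
\end{aligned}
```

In particular, for $\alpha = \frac{1}{2}(-y\,dx + x\,dy)$ we obtain $d\alpha = dx\wedge dy$, the standard area form, which is the integrand behind the area formula of 9.B.9. For $\alpha = df$ we have $P = f_x$, $Q = f_y$, and $d(df) = (f_{yx} - f_{xy})\,dx\wedge dy = 0$ by the symmetry of the second partial derivatives.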
9.1.11. Manifolds with boundary. In practical problems, we often work with manifolds $M$ like an open ball in the three-dimensional space. At the same time, we are interested in the boundaries $\partial M$ of these manifolds, which is a sphere in the case of a ball. The simplest case is the one of connected curves. Such a curve is either closed (like a circle in the plane), and then its boundary is empty, or its boundary is formed by two points. These points will be considered including the orientation inherited from the curve, i.e. the initial point will be taken with the minus sign, and the terminal point with the plus sign.
The curve integral is the easiest one. If we integrate the differential $df$ of a function along the curve $M$ defined as the image of a parametrization $c : [a, b]\to M$, then we get directly from the definition that
$$\int_Mdf = \int_{[a,b]}c^*(df) = \int_a^b\frac{d}{dt}(f\circ c)(t)\,dt = f(c(b)) - f(c(a)).$$
Therefore, the result is not only independent of the selected parametrization, but also of the actual curve. Only the initial and terminal points matter. Splitting the curve into several
³ Such operators are intrinsically defined on all manifolds. Actually, for all $k > 0$, the only operation $d : \Omega^k \to \Omega^{k+1}$ commuting with pullbacks and with values depending only on the behavior of the argument $\alpha$ on any small neighborhood of $x$ (locality of the operator) is the exterior derivative. Thus even the linearity, as well as the dependence on the first derivatives, are direct consequences of naturality. See the book Natural operations in differential geometry, Springer, 1993, by I. Kolář, P. W. Michor and J. Slovák for the full proof of this astonishing claim.
consecutive disjoint intervals, the integral splits into the sum of differences of the values at the splitting points. This sum will be telescoping (i.e., the middle terms cancel out), resulting in the same value again.
Notice, we have already proved the behavior expected in 9.1.5 when dealing with the elevation gain by a cyclist.
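The path independence just described can be checked numerically. The following is a minimal sketch, not part of the original text: the function `f`, the two curves, and the helper name `line_integral_of_df` are our own illustrative choices, and the integral is approximated by the midpoint rule.

```python
import math

def line_integral_of_df(grad_f, c, dc, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of c^*(df) over [a, b]."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        t = a + (i + 0.5) * h
        x, y = c(t)            # point on the curve
        fx, fy = grad_f(x, y)  # partial derivatives of f at that point
        vx, vy = dc(t)         # velocity of the parametrization
        total += (fx * vx + fy * vy) * h
    return total

# Illustrative function and its gradient (computed by hand).
def f(x, y):
    return x * x * y + math.sin(y)

def grad_f(x, y):
    return (2 * x * y, x * x + math.cos(y))

# Two different curves from (0, 0) to (1, 1): (parametrization, derivative).
straight = (lambda t: (t, t), lambda t: (1.0, 1.0))
parabola = (lambda t: (t, t * t), lambda t: (1.0, 2 * t))

expected = f(1.0, 1.0) - f(0.0, 0.0)
along_line = line_integral_of_df(grad_f, *straight, 0.0, 1.0)
along_parabola = line_integral_of_df(grad_f, *parabola, 0.0, 1.0)
```

Both `along_line` and `along_parabola` should agree with `expected` up to the discretization error, since only the endpoints of the curve matter.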
We shall discuss this phenomenon in general dimensions now. To be able to do this, we need to formalize the concept of the boundary of a manifold and its orientation. The simplest case is the closed half-space $M = (-\infty, 0] \times \mathbb{R}^{n-1}$. Its boundary is $\partial M = \{(x_1, x_2, \dots, x_n) \in \mathbb{R}^n;\ x_1 = 0\}$. The orientation on this boundary inherited from the standard orientation is the one determined by the form $dx_2 \wedge \dots \wedge dx_n$.
Oriented boundary of a manifold
Let us consider a closed subset $M \subset \mathbb{R}^n$ such that its interior $M^\circ \subset M$ is an oriented $m$-dimensional manifold covered by compatible parametrizations $\varphi_i$. Further, let us assume that for every boundary point $x \in \partial M = M \setminus M^\circ$, there is a neighborhood in $M$ with a parametrization $\varphi : V \subset (-\infty, 0] \times \mathbb{R}^{m-1} \to M$ such that the points $x \in \partial M \cap \varphi(V)$ form just the image of the boundary of the half-space $(-\infty, 0] \times \mathbb{R}^{m-1}$. The subset $M \subset \mathbb{R}^n$ covered by the above parametrizations with compatible orientations is called an oriented manifold with boundary.
The restrictions of the parametrizations including boundary points to the boundary $\partial M$ define the structure of an $(m-1)$-dimensional oriented manifold on $\partial M$.
Think of closed balls $B(x,r) \subset \mathbb{R}^n$ as such manifolds. Their interiors are $n$-dimensional manifolds, in fact just open subsets of $\mathbb{R}^n$, but their boundaries $S^{n-1}$ are spheres with the inherited structure of $(n-1)$-dimensional manifolds. The inherited orientations are well understood via the outward normals to the spheres. Another example is a plane disc sitting as a 2-dimensional manifold in $\mathbb{R}^3$ with its 1-dimensional boundary being a circle. Here the chosen position of the normal to the plane defines the orientation of the circle, one way or the other.
In practice, we often deal with slightly more general manifolds where we allow for corners of all smaller dimensions in the boundary. A good example is the cube in $\mathbb{R}^3$, having the sides as 2-dimensional parts of the boundary, the edges between them as 1-dimensional parts, and the vertices as 0-dimensional parts of the boundary. Yet another class of examples is formed by all simplexes and their curved embeddings in $\mathbb{R}^n$. Since those lower dimensional parts of the boundary have Riemann measure zero, we can neglect them when integrating over $\partial M$. Thus we shall not go into the details of this technical extension of our definitions.
9.1.12. Stokes' theorem. Now, we get to a very important and useful result. We shall formulate the main theorem about the multidimensional analogy of curve integrals for smooth forms and smooth manifolds. A brief analysis of the proof shows that actually, we need once continuously differentiable exterior forms as integrands on twice continuously differentiable parametrizations of the manifold.
In practice, the boundary of the region often resembles that of the unit cube in $\mathbb{R}^3$, i.e., the derivatives are discontinuous on a Riemann measurable subset of measure zero in the boundary. In such a case, we divide the integration into smooth parts and add the results up. Notice that although new pieces of boundaries appear, they are adjacent and have opposite orientations in the adjacent regions, so their contributions cancel out (just like in the above case of boundary points of a piecewise differentiable curve).
Stokes' theorem
Theorem. Consider a smooth exterior $(k-1)$-form $\omega$ with compact support on an oriented $k$-dimensional manifold $M$ with boundary $\partial M$ with the inherited orientation. Then we have
$$
\int_M d\omega = \int_{\partial M} \omega.
$$
Proof. Using an appropriate locally finite cover of the manifold $M$ and a partition of unity subordinated to it, we can express the integrals on both sides as the sum (even a finite sum, since the support of the considered form $\omega$ is compact) of integrals of forms supported in individual parametrizations. Thus we can restrict ourselves to just two cases: $M = \mathbb{R}^k$ or the half-space $M = (-\infty, 0] \times \mathbb{R}^{k-1}$.
In both cases, $\omega$ will surely be the sum of forms
$$
\omega_j = a_j(x)\,dx_1 \wedge \dots \wedge \widehat{dx_j} \wedge \dots \wedge dx_k,
$$
where the hat indicates the omission of the corresponding linear form, and $a_j(x)$ is a smooth function with compact support. Their exterior differentials are
$$
d\omega_j = (-1)^{j-1}\,\frac{\partial a_j}{\partial x_j}\,dx_1 \wedge \dots \wedge dx_k.
$$
Again, we can verify the claim of the theorem for such forms $\omega_j$ separately. Let us compute the integrals $\int_M d\omega_j$ using Fubini's theorem. This is simplest if $M = \mathbb{R}^k$:
$$
\begin{aligned}
\int_{\mathbb{R}^k} d\omega_j &= (-1)^{j-1} \int_{\mathbb{R}^{k-1}} \Big(\int_{-\infty}^{\infty} \frac{\partial a_j}{\partial x_j}\,dx_j\Big)\,dx_1 \cdots \widehat{dx_j} \cdots dx_k\\
&= (-1)^{j-1} \int_{\mathbb{R}^{k-1}} \big[a_j\big]_{x_j=-\infty}^{\infty}\,dx_1 \cdots \widehat{dx_j} \cdots dx_k = 0.
\end{aligned}
$$
Notice that we are allowed to use Fubini's theorem for the entire $\mathbb{R}^k$, since the support of the integrated function is in fact compact and thus we can replace the integration domain by a large multidimensional interval $I$. At the same time, the forms $\omega_j$ are all zero outside of such a large interval $I$ and thus
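the integrals $\int_{\partial M} \omega_j$ all vanish and the claim of Stokes' theorem is verified in this case. Actually, we may also say that $\partial M = \emptyset$ and thus the integral is zero.
Next, let us assume $M$ is the half-space $(-\infty, 0] \times \mathbb{R}^{k-1}$. If $j > 1$, the form $\omega_j$ evaluates identically to zero on the boundary $\partial M$, since $x_1$ is constant there and thus $dx_1$ is identically zero on all tangent directions to $\partial M$. Integration over the interior of $M$ yields zero, using the same approach as above:
$$
\begin{aligned}
\int_M d\omega_j &= (-1)^{j-1} \int_{(-\infty,0]\times\mathbb{R}^{k-2}} \Big(\int_{-\infty}^{\infty} \frac{\partial a_j}{\partial x_j}\,dx_j\Big)\,dx_1 \cdots \widehat{dx_j} \cdots dx_k\\
&= (-1)^{j-1} \int_{(-\infty,0]\times\mathbb{R}^{k-2}} \big[a_j\big]_{x_j=-\infty}^{\infty}\,dx_1 \cdots \widehat{dx_j} \cdots dx_k = 0,
\end{aligned}
$$
since the function $a_j$ has compact support. So the theorem is also true in this case.
However, if $j = 1$, then we obtain
$$
\begin{aligned}
\int_M d\omega_1 &= \int_{\mathbb{R}^{k-1}} \Big(\int_{-\infty}^{0} \frac{\partial a_1}{\partial x_1}\,dx_1\Big)\,dx_2 \cdots dx_k\\
&= \int_{\mathbb{R}^{k-1}} a_1(0, x_2, \dots, x_k)\,dx_2 \cdots dx_k = \int_{\partial M} \omega_1.
\end{aligned}
$$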
This finishes the proof of Stokes' theorem. □
9.1.13. Green's theorem. We have proved an extraordinarily strong result which covers several standard integral relations from classical vector analysis. For instance, we can notice that by Stokes' theorem, the integral of the exterior differential $d\omega$ of any $k$-form over a compact manifold without boundary is always zero (for example, the integral of any 2-form $d\omega$ over the sphere $S^2 \subset \mathbb{R}^3$ vanishes).
Let us look step by step at the cases of Stokes' theorem with $k$-dimensional boundaries $\partial M$ in $\mathbb{R}^n$ in low dimensions.
Green's theorem
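In the case $n = 2$, $k = 1$, we are examining a domain $M$ in the plane, bounded by a closed curve $C = \partial M$. Differential 1-forms are $\omega(x, y) = f(x, y)\,dx + g(x, y)\,dy$, with the differential $d\omega = \big(-\frac{\partial f}{\partial y} + \frac{\partial g}{\partial x}\big)\,dx \wedge dy$. Therefore, Stokes' theorem yields the formula
$$
\int_C f(x,y)\,dx + g(x,y)\,dy = \int_M \Big(-\frac{\partial f}{\partial y} + \frac{\partial g}{\partial x}\Big)\,dx \wedge dy,
$$
which is one of the standard forms of Green's theorem.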
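Green's formula can be tested numerically on the unit disc. The sketch below is only an illustration under our own assumptions: the form $\omega = -y^3\,dx + x^3\,dy$, the helper names, and the midpoint-rule discretization are not taken from the text. For this form, $-\partial f/\partial y + \partial g/\partial x = 3x^2 + 3y^2$, and both sides equal $3\pi/2$.

```python
import math

def boundary_integral(f, g, steps=4000):
    """Integral of f dx + g dy over the unit circle, counterclockwise."""
    h = 2 * math.pi / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        x, y = math.cos(t), math.sin(t)
        # dx = -sin(t) dt, dy = cos(t) dt along the circle
        total += (f(x, y) * (-math.sin(t)) + g(x, y) * math.cos(t)) * h
    return total

def area_integral(integrand, n_r=400, n_t=400):
    """Integral of integrand(x, y) over the unit disc, in polar coordinates."""
    hr, ht = 1.0 / n_r, 2 * math.pi / n_t
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * hr
        for j in range(n_t):
            t = (j + 0.5) * ht
            total += integrand(r * math.cos(t), r * math.sin(t)) * r * hr * ht
    return total

f = lambda x, y: -y ** 3
g = lambda x, y: x ** 3
# -df/dy + dg/dx, computed by hand for this particular f and g
d_omega = lambda x, y: 3 * y ** 2 + 3 * x ** 2

lhs = boundary_integral(f, g)   # curve integral over C
rhs = area_integral(d_omega)    # integral of d(omega) over M
```

The two numbers `lhs` and `rhs` agree up to the discretization error of the area integral.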
Using the standard scalar product on $\mathbb{R}^2$, we can identify the vector field $X$ with a linear form $\omega_X$ such that $\omega_X(Y) = \langle Y, X\rangle$. In the standard coordinates $(x, y)$, this just means that the field $X = f(x,y)\frac{\partial}{\partial x} + g(x,y)\frac{\partial}{\partial y}$ corresponds to the form $\omega = f(x, y)\,dx + g(x, y)\,dy$ given above.
The integral of $\omega_X$ over a curve $C$ has the physical interpretation of the work done by movement along this curve in the force field $X$.
Green's theorem then says, among other things, that if $\omega_X = dF$ for some function $F$, then the work done along a closed curve is always zero. Such fields are called potential fields
and the function F is the potential of the field X. In other words, the work done when moving in potential fields does not depend on the path, it depends only on the initial and terminal points.
With Green's theorem, we have verified once again that integrating the differential of a function along a curve depends solely on the initial and terminal points of the curve.
9.1.14. The divergence theorem. The next case deals with integrating over some open subset in R3 and it has got a lot of incarnations in practical use. We shall mention a few.
Gauss-Ostrogradsky's theorem
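In the case $n = 3$, $k = 2$ we are examining a region $M \subset \mathbb{R}^3$, bounded by a surface $S$. All 2-forms are of the form
$$
\omega = f(x, y, z)\,dy \wedge dz + g(x, y, z)\,dz \wedge dx + h(x, y, z)\,dx \wedge dy,
$$
and we get $d\omega = \big(\frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} + \frac{\partial h}{\partial z}\big)\,dx \wedge dy \wedge dz$. Stokes' theorem says that
$$
\int_S f\,dy \wedge dz + g\,dz \wedge dx + h\,dx \wedge dy = \int_M \Big(\frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} + \frac{\partial h}{\partial z}\Big)\,dx \wedge dy \wedge dz.
$$
This is the statement of the Gauss-Ostrogradsky theorem.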
This theorem has a very illustrative physical interpretation, too.
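Every vector field $X = f(x,y,z)\frac{\partial}{\partial x} + g(x,y,z)\frac{\partial}{\partial y} + h(x,y,z)\frac{\partial}{\partial z}$ can be plugged into the first argument of the standard volume form $\omega_{\mathbb{R}^3} = dx \wedge dy \wedge dz$ on $\mathbb{R}^3$. Clearly, the result is the 2-form $\omega_X(x, y, z) = f(x, y, z)\,dy \wedge dz + g(x, y, z)\,dz \wedge dx + h(x, y, z)\,dx \wedge dy$.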
The latter 2-form infinitesimally describes the volume of the parallelepiped given by the flux caused by the field X through a linearized piece of surface. If we consider the vector field to be the velocity of the flow of the particular points of the space, this infinitesimally describes the volume transported pointwise by the flow through the given surface S. Thus the left hand side is the total change of volume inside of S, caused by the flow of X.
The integrand on the right-hand side is related to the so-called divergence of the vector field, which is the expression defined by
$$
d(\omega_X) = (\operatorname{div} X)\,dx \wedge dy \wedge dz.
$$
The Gauss-Ostrogradsky theorem says
$$
\int_S i_X\,\omega_{\mathbb{R}^3} = \int_M \operatorname{div} X\;\omega_{\mathbb{R}^3},
$$
i.e. the volume of the total flow through a surface is given as the integral of the divergence of the vector field over the interior. In particular, if $\operatorname{div} X$ vanishes identically, then the total volume flow through the boundary surface of the region is zero as well.
Such fields, with div X = 0, are called divergence free or solenoidal vector fields. They correspond to dynamics without changes of volumes (e.g. modelling dynamics of incompressible liquids).
In order to reformulate the theorem completely in terms of functions, let us observe that the inherited volume form $\omega_S$ on $S$ is defined by the property $\nu^* \wedge \omega_S = \omega_{\mathbb{R}^3}$ at all points of $S$, where $\nu^*$ is the form dual to the oriented (outward) unit normal $\nu$ to $S$. All forms of degree 2 on $S$ are multiples of $\omega_S$ by functions. In particular,
$$
i_X(\nu^* \wedge \omega_S) = (\nu \cdot X)\,\omega_S,
$$
i.e. we have to integrate the scalar product of the vector field $X$ with the unit normal vector with respect to the standard volume on $S$. Thus, we have proved the following result formulated in the classical vector analysis style.
Actually, a simple check reveals that the above arguments work for all open submanifolds $M \subset \mathbb{R}^n$ with boundary hypersurface $S$ and vector fields $X$. The reader should easily verify this in detail.
Divergence theorem
Theorem. Let $X$ be a vector field on an $n$-dimensional manifold $M \subset \mathbb{R}^n$ with hypersurface boundary $S$. Then
$$
\int_M \operatorname{div} X\,dx_1 \dots dx_n = \int_S X \cdot \nu\,dS,
$$
where $\nu$ is the oriented (outward) unit normal to $S$ and $dS$ stands for the volume inherited from $\mathbb{R}^n$ on $S$.
Notice the 2-dimensional case coincides with the Green's theorem above.
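The divergence theorem can also be checked numerically on the unit ball in $\mathbb{R}^3$. The following sketch rests on our own assumptions: the field $X = (x^3, y^3, z^3)$, whose divergence $3(x^2+y^2+z^2)$ is computed by hand, the helper names, and a plain midpoint rule in spherical coordinates. Both sides should be close to $12\pi/5$.

```python
import math

def flux_through_unit_sphere(X, n=200):
    """Integral of X . nu dS over the unit sphere (nu is the outward normal)."""
    h_theta, h_phi = math.pi / n, 2 * math.pi / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h_theta
        for j in range(n):
            phi = (j + 0.5) * h_phi
            # On the unit sphere the position vector is the outward unit normal.
            nu = (math.sin(theta) * math.cos(phi),
                  math.sin(theta) * math.sin(phi),
                  math.cos(theta))
            Xp = X(*nu)
            dot = sum(a * b for a, b in zip(Xp, nu))
            total += dot * math.sin(theta) * h_theta * h_phi
    return total

def integral_over_unit_ball(func, n=60):
    """Integral of func(x, y, z) over the unit ball in spherical coordinates."""
    hr, h_theta, h_phi = 1.0 / n, math.pi / n, 2 * math.pi / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * hr
        for j in range(n):
            theta = (j + 0.5) * h_theta
            for k in range(n):
                phi = (k + 0.5) * h_phi
                x = r * math.sin(theta) * math.cos(phi)
                y = r * math.sin(theta) * math.sin(phi)
                z = r * math.cos(theta)
                total += func(x, y, z) * r * r * math.sin(theta) * hr * h_theta * h_phi
    return total

X = lambda x, y, z: (x ** 3, y ** 3, z ** 3)
div_X = lambda x, y, z: 3 * (x * x + y * y + z * z)   # divergence, by hand

flux = flux_through_unit_sphere(X)
volume_side = integral_over_unit_ball(div_X)
```

The coarse grid in `integral_over_unit_ball` keeps the run time small, so the two sides agree only to a few decimal places.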
9.1.15. The original Stokes theorem. If $\omega$ is any linear form, then the integral of $d\omega$ over a surface depends on the boundary curve only. This is the most classical Stokes' theorem:
The classical Stokes' theorem
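In the case $n = 3$, $k = 1$ we deal with a surface $M$ in $\mathbb{R}^3$ bounded by a curve $C$. The general linear forms are $\omega = f\,dx + g\,dy + h\,dz$, with the integral
$$
\int_C f\,dx + g\,dy + h\,dz = \int_M d\omega,
$$
where
$$
d\omega = \Big(\frac{\partial h}{\partial y} - \frac{\partial g}{\partial z}\Big)\,dy \wedge dz + \Big(\frac{\partial f}{\partial z} - \frac{\partial h}{\partial x}\Big)\,dz \wedge dx + \Big(\frac{\partial g}{\partial x} - \frac{\partial f}{\partial y}\Big)\,dx \wedge dy.
$$
Again, we use the standard scalar product to identify the vector field $X = f\frac{\partial}{\partial x} + g\frac{\partial}{\partial y} + h\frac{\partial}{\partial z}$ with the form $\omega_X = f\,dx + g\,dy + h\,dz$. Finally, reverting the above relation between vector fields and 2-forms on $\mathbb{R}^3$, the 2-form $d\omega_X$ can be identified with the vector field $\operatorname{rot} X$,
$$
d\omega_X = \omega_{\mathbb{R}^3}(\operatorname{rot} X, \ \cdot\ , \ \cdot\ ).
$$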
This field is called the rotation or curl of the vector field $X$. Stokes' theorem now reads:
$$
\int_C \omega_X = \int_M i_{\operatorname{rot} X}\,\omega_{\mathbb{R}^3}.
$$
Consequently, the fields $X$ with the property $\omega_X = dF$ for some function $F$ (the fields of gradients of functions) have the property $\operatorname{rot} X = 0$. They are called conservative (or potential) vector fields.
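A small worked example may help to fix the notation. The field $X$ and the surface below are our own choice for illustration: $M$ is the unit disc in the plane $z = 0$ with the normal $(0,0,1)$, and $C$ is the unit circle oriented counterclockwise.

```latex
\begin{aligned}
X &= -y\,\frac{\partial}{\partial x} + x\,\frac{\partial}{\partial y},
\qquad \omega_X = -y\,dx + x\,dy,\\
d\omega_X &= 2\,dx\wedge dy,
\qquad \operatorname{rot} X = 2\,\frac{\partial}{\partial z},\\
\int_C \omega_X &= \int_0^{2\pi} \big(\sin^2 t + \cos^2 t\big)\,dt = 2\pi,
\qquad \int_M d\omega_X = 2\cdot \operatorname{area}(M) = 2\pi.
\end{aligned}
```

In contrast, for a gradient field with $\omega_X = dF$ we get $d\omega_X = d(dF) = 0$, so both sides vanish for every closed curve $C$.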
9.1.16. Another kind of integration. As we have seen, solutions to ODEs are flows of vector fields. As a modification, we can prescribe one-dimensional linear subspaces $L_x \subset T_xM$ at each point of a manifold $M$ and look for unparameterized curves $P$ tangent to them at all points. This is a coordinate-free version of the ODE theory. Indeed, locally we may always choose a vector field $X$ generating the spaces $L_x$, and in each coordinate patch, the flow of $X$ will provide the parameterized one-dimensional submanifolds $P \subset M$ tangent to $L_x$ at all points. A change of coordinates or of $X$ will change the parameterizations, but not the curves $P$.
If we want to describe an $n$-dimensional submanifold $N \subset M$, $1 < n < m = \dim M$, in a similar way, we define the $n$-dimensional subspaces $D_x \subset T_xM$ for all $x \in M$ and seek a submanifold $N$ with $T_yN = D_y$ at all $y \in N$.
Integrability of distributions
The union $D \subset TM$ of individual linear subspaces $D_x \subset T_xM$, $x \in M$, is called a distribution $D$ on $M$. We say that the distribution is $n$-dimensional and smooth if each point $x$ allows for a neighborhood $U$ and $n$ linearly independent smooth vector fields $X_1, \dots, X_n$ generating $D_y$ at all $y \in U$. The distribution is called integrable if for each point $x \in M$, there is a submanifold $N \subset M$ such that $x \in N$ and $T_yN = D_y$ for all $y \in N$.
Our goal is to give necessary and sufficient conditions for smooth distributions to be integrable. Clearly, the case of $n = 1$ is trivial, since we already know that the conditions are empty: each such distribution is integrable.
The core idea is to use the so-called flow box theorem for vector fields proved in 8.3.15 and to exploit the individual flows of the chosen generators $X_1, \dots, X_n$ in order to "draw" new coordinates, in which the integral submanifold would appear as given by $x_{n+1} = 0, \dots, x_m = 0$. The problem we face is that the flows do not commute in general and thus our idea will not work.
9.1.17. Lie bracket of vector fields. Fortunately, the commutativity of the flows is captured by a simple differential operation.
Consider two vector fields on $\mathbb{R}^m$, $X = X_1(x)\frac{\partial}{\partial x_1} + \dots + X_m(x)\frac{\partial}{\partial x_m}$, $Y = Y_1(x)\frac{\partial}{\partial x_1} + \dots + Y_m(x)\frac{\partial}{\partial x_m}$. The commutator of the derivatives of functions in the directions
of these vector fields is
$$
\begin{aligned}
X(Yf) - Y(Xf) &= \sum_{i,j} \Big(X_j \frac{\partial}{\partial x_j}\Big(Y_i \frac{\partial f}{\partial x_i}\Big) - Y_j \frac{\partial}{\partial x_j}\Big(X_i \frac{\partial f}{\partial x_i}\Big)\Big)\\
&= \sum_{i,j} \Big(X_j \frac{\partial Y_i}{\partial x_j} - Y_j \frac{\partial X_i}{\partial x_j}\Big) \frac{\partial f}{\partial x_i},
\end{aligned}
$$
thanks to the commutativity of the second derivatives of $f$. Thus, the commutator of the two vector fields behaves as the vector field
$$
[X, Y] = \sum_{i,j} \Big(X_j \frac{\partial Y_i}{\partial x_j} - Y_j \frac{\partial X_i}{\partial x_j}\Big) \frac{\partial}{\partial x_i}.
$$
This vector field is called the Lie bracket⁴ of its arguments. It is easy to see that $[\ ,\ ]$ is a bilinear antisymmetric operation (over the real scalars) on the differentiable vector fields, and expanding the commutators explicitly we arrive at the so-called Jacobi identity
$$
[X, [Y, Z]] = [[X, Y], Z] + [Y, [X, Z]],
$$
valid for all triples of vector fields, and the Leibniz derivative property
$$
[X, fY] = (Xf)\,Y + f\,[X, Y].
$$
Remark. In fact, it is quite straightforward to see that the vector fields $X$ and the diffeomorphisms $\mathrm{Fl}^X_t$ in their flows are linked in a very similar manner to the square matrices $A$ and their exponential images $e^{tA}$. The Lie bracket encodes the composition of the diffeomorphisms just as the commutators of matrices encode the matrix multiplication. Thus, it is not surprising that the flows of two vector fields commute if and only if their Lie bracket vanishes. We shall not go into the technical proof here since we shall not need the result explicitly below.
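The coordinate formula for the Lie bracket is easy to evaluate numerically. The sketch below is ours, not the book's: it uses the convention $[X,Y]f = X(Yf) - Y(Xf)$, approximates the partial derivatives by central differences, and the helper name `lie_bracket` and the sample fields on $\mathbb{R}^3$ are hypothetical choices for illustration.

```python
def lie_bracket(X, Y, p, h=1e-5):
    """Components of [X, Y] at the point p.

    Uses [X, Y]_i = sum_j (X_j dY_i/dx_j - Y_j dX_i/dx_j) with the partial
    derivatives approximated by central differences of step h.
    """
    m = len(p)

    def partial(field, i, j):
        # Approximate d(field_i)/dx_j at p.
        plus, minus = list(p), list(p)
        plus[j] += h
        minus[j] -= h
        return (field(plus)[i] - field(minus)[i]) / (2 * h)

    Xp, Yp = X(p), Y(p)
    return [
        sum(Xp[j] * partial(Y, i, j) - Yp[j] * partial(X, i, j) for j in range(m))
        for i in range(m)
    ]

# Sample fields on R^3 with coordinates p = (x, y, z).
X = lambda p: [1.0, 0.0, p[1]]    # d/dx + y d/dz
Y = lambda p: [0.0, 1.0, 0.0]     # d/dy
Z = lambda p: [1.0, 0.0, 0.0]     # d/dx

point = [0.3, -0.7, 1.2]
bracket_XY = lie_bracket(X, Y, point)   # expected to be close to -d/dz
bracket_ZY = lie_bracket(Z, Y, point)   # coordinate fields commute
```

A hand computation gives $[X, Y] = -\frac{\partial}{\partial z}$ for these fields, so the two fields $X$, $Y$ span a distribution which is not closed under the bracket.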
9.1.18. Back to distributions. We say that $D \subset TM$ is an involutive distribution if for all vector fields $X$, $Y$ valued in $D$, their Lie bracket $[X, Y]$ has values in $D$, too.
Frobenius' theorem
Theorem. Let $D \subset TM$ be a smooth $n$-dimensional distribution on an $m$-dimensional manifold $M$. Then $D$ is integrable if and only if it is involutive.
Proof. Recall that integrability means the local existence of integral submanifolds through each point in $M$. One of the implications is nearly trivial.
If $D$ is integrable, then through each $x \in M$ there is an integral submanifold $N$. Consider the embedding $i : N \to M$ and any vector fields $X$, $Y$ on $M$ valued in $D$. Since $D_y = T_yN$ for all $y \in N$, $X$ and $Y$ are tangent to $i(N) \subset$
⁴ Marius Sophus Lie (1842-1899) was an excellent Norwegian mathematician, the father of Lie theory. Originally invented to deal with systems of partial differential equations via continuous groups of their symmetries, the theory of Lie groups and Lie algebras is nowadays at the core of a vast part of mathematics. It is a pity we do not have time and space to devote more attention to this marvelous mathematical story in this textbook.
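$M$. We claim that the restriction of the Lie bracket $[X, Y]$ to $i(N)$ is the image $i_*([\tilde X, \tilde Y])$, where the vector fields $\tilde X$, $\tilde Y$ are the given fields viewed as fields on $N$, i.e., $i_*\tilde X(x) = X(i(x))$, $i_*\tilde Y(x) = Y(i(x))$. Thus, the bracket has to be in the image again. The latter claim is a consequence of a more general statement:
Claim. If $\varphi : N \to M$ is a smooth map and two couples of vector fields $\tilde X$, $\tilde Y$ on $N$ and $X$, $Y$ on $M$ satisfy
$$
T\varphi \circ \tilde X = X \circ \varphi, \qquad T\varphi \circ \tilde Y = Y \circ \varphi,
$$
then their Lie brackets satisfy the same relation:
$$
T\varphi \circ [\tilde X, \tilde Y] = [X, Y] \circ \varphi.
$$
Indeed, consider a smooth function $f$ on $M$ and compute, using $\tilde X(f \circ \varphi)(x) = (T\varphi \circ \tilde X)(x)f = (X \circ \varphi)(x)f = X(\varphi(x))f$, and similarly for $Y$:
$$
\begin{aligned}
[\tilde X, \tilde Y](f \circ \varphi)(x) &= \tilde X \tilde Y(f \circ \varphi)(x) - \tilde Y \tilde X(f \circ \varphi)(x)\\
&= \tilde X((Yf) \circ \varphi)(x) - \tilde Y((Xf) \circ \varphi)(x)\\
&= X(Yf)(\varphi(x)) - Y(Xf)(\varphi(x))\\
&= ([X, Y]f) \circ \varphi(x).
\end{aligned}
$$
Now we employ the latter claim for the inclusion $i$ in the role of $\varphi$, and we see that every integrable distribution must be involutive.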
As we already revealed, each one-dimensional distribution is involutive and locally integrable. The main idea of the proof is to start with any set of (locally) generating vector fields for $D$, to use some nice coordinates with respect to the first vector field, and to employ induction on the dimension for the rest of them.
Assume the theorem is true for dimensions less than $n$ and consider an involutive smooth distribution $D$ of dimension $n$, generated by fields $X_1, \dots, X_n$. Actually, we shall prove a much stronger version of the theorem. We claim that if $D$ is involutive, then there are coordinates $(x_1, \dots, x_m)$ around each point $x \in M$, such that the equations $x_{n+1} = a_{n+1}, \dots, x_m = a_m$ with small constants $a_i$ define all the individual integral submanifolds of $D$ through points close to $x$. This is indeed true in dimension $n = 1$.
By the flow box theorem 8.3.15, for each point $x \in M$ there are coordinate functions $y_1, \dots, y_m$ on a neighborhood $U$ of $x$, for which $X_1 = \frac{\partial}{\partial y_1}$. Let us consider the submanifold $Q \subset M$ defined by $y_1 = 0$ and the "projections" $Y_j$ of the other fields which make them tangent to $Q$. This requires that $Y_j$ leave the coordinate $y_1$ constant, i.e. we set
$$
Y_j = X_j - X_j(y_1)\,X_1, \qquad j = 2, \dots, n.
$$
Indeed, this definition ensures $Y_j(y_1) = 0$ and thus the fields are tangent to $Q$ as required. We leave $Y_1 = X_1$, and clearly $Y_1, \dots, Y_n$ generate the same involutive distribution $D$. Thus
$$
[Y_i, Y_j] = \sum_k c_{ijk}\,Y_k
$$
for some set of functions $c_{ijk}$. Moreover, we may view $Q$ as one leaf of all the subsets defined by $y_1 = b$ with small constants $b$, and there is the projection $p : U \to Q$ forgetting the first coordinate.
On the submanifold $Q$, there is the $(n-1)$-dimensional involutive distribution $\tilde D$ generated by the fields $\tilde Y_i = Y_i|_Q$, $i = 2, \dots, n$ (notice we again use the argument from the beginning of the proof about the brackets of restricted fields). Now, our induction assumption says we find suitable coordinates $(q_2, \dots, q_m)$ on $Q$ around the point $x \in Q$, so that for all small constants $b_{n+1}, \dots, b_m$, the integral submanifolds of $\tilde D$ are defined by $q_{n+1} = b_{n+1}, \dots, q_m = b_m$.
Finally, we need to adjust the original coordinate functions $y_i$ all over the neighborhood $U$ of $x$. The obvious idea is to use the flow of $X_1 = Y_1$ to extend the latter coordinates on $Q$. Thus we define the coordinate functions at all $y \in U$ using the projection $p$,
$$
x_1(y) = y_1(y), \quad x_2(y) = q_2(p(y)), \ \dots, \ x_m(y) = q_m(p(y)).
$$
The hope is that all submanifolds $N$ given by equations $x_{n+1} = b_{n+1}, \dots, x_m = b_m$ (for small $b_j$) will be tangent to all fields $Y_1, \dots, Y_n$. Technically, this means $Y_i(x_j) = 0$ for all $i = 1, \dots, n$, $j = n+1, \dots, m$. By our definition, this is obvious for the restriction to $Q$, and obviously $Y_1(x_j) = 0$ at all other points, too.
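Let us look closely at what is happening with one of our functions $Y_i(x_j)$ along the flows of the field $X_1$. We easily compute with the help of the definition of the Lie bracket
$$
\begin{aligned}
\frac{\partial}{\partial x_1}\big(Y_i(x_j)\big) &= Y_1(Y_i(x_j)) = Y_i(Y_1(x_j)) + [Y_1, Y_i](x_j)\\
&= Y_i(Y_1(x_j)) + c_{1i1}\,Y_1(x_j) + \sum_{k=2}^{n} c_{1ik}\,Y_k(x_j)\\
&= \sum_{k=2}^{n} c_{1ik}\,Y_k(x_j).
\end{aligned}
$$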
This is a system of linear ODEs for the unknown functions Yi(xj) in one variable x1 along the flow lines of Y1. The initial condition at the point in Q is zero and thus this constant zero value has to propagate along the flow lines, as requested. The induction step is complete. □
9.1.19. Formulation via exterior forms. As we know from linear algebra, a vector subspace of codimension $k$ is defined by $k$ independent linear forms. Thus, every smooth $n$-dimensional distribution $D \subset TM$ on an $m$-dimensional manifold $M$ can be (at least locally) defined by $m - n$ linear forms $\omega_j$ on $M$.
A direct computation in coordinates reveals that the differential of a linear form $\omega$ evaluates on two vector fields as follows:
(1)    $d\omega(X, Y) = X(\omega(Y)) - Y(\omega(X)) - \omega([X, Y]).$
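Indeed, if $X = \sum_i X_i \frac{\partial}{\partial x_i}$, $Y = \sum_i Y_i \frac{\partial}{\partial x_i}$, $\omega = \sum_i \omega_i\,dx_i$, then
$$
\begin{aligned}
X(\omega(Y)) - Y(\omega(X)) &= \sum_{i,j} \Big(X_i \frac{\partial}{\partial x_i}(\omega_j Y_j) - Y_i \frac{\partial}{\partial x_i}(\omega_j X_j)\Big)\\
&= d\omega(X, Y) + \omega([X, Y]).
\end{aligned}
$$
Thus, the involutivity of a distribution defined by linear forms $\omega_{n+1}, \dots, \omega_m$ should be closely linked to properties of the differentials on the common kernel. Indeed, there is the following version of the latter theorem: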
Frobenius' theorem
Theorem. The distribution $D$ defined on an $m$-dimensional manifold $M$ by $(m - n)$ independent smooth linear forms $\omega_{n+1}, \dots, \omega_m$ is integrable if and only if there are linear forms $\alpha_{k\ell}$ such that $d\omega_k = \sum_\ell \alpha_{k\ell} \wedge \omega_\ell$.
Proof. Let us write $\omega = (\omega_{n+1}, \dots, \omega_m)$ for the $\mathbb{R}^{m-n}$-valued form. The distribution is $D = \ker \omega$. Now, the formula (1) (applied to all components of $\omega$) implies that involutivity of $D$ is equivalent to $d\omega|_{\ker \omega} = 0$.
If the assumption of the theorem on the forms holds true, $d\omega$ clearly vanishes on the kernel of $\omega$ and therefore $D$ is involutive (hence integrable by the previous theorem), and one of the implications of the theorem is proved.
Next, assume $D$ is integrable. By the stronger claim proved in the latter Frobenius theorem, for each point $x \in M$ there are coordinates $(x_1, \dots, x_m)$ such that $D$ is the common kernel of all $dx_{n+1}, \dots, dx_m$. In particular, our forms $\omega_j$ are linear combinations (over functions) of the latter $(m - n)$ differentials. Moreover, there must be smooth invertible matrices of functions $A = (a_{k\ell})$ such that
$$
dx_k = \sum_\ell a_{k\ell}\,\omega_\ell, \qquad k, \ell = n+1, \dots, m.
$$
Finally, $d\omega_k$ includes only terms $dx_i \wedge dx_j$ with $j > n$, and all such $dx_j$ can be expressed via our forms $\omega_\ell$ from the previous equation. Thus the differentials have the requested form. □
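To see the criterion at work, consider one example which is not in the text: the form $\omega = dz - y\,dx$ on $\mathbb{R}^3$ (so $m = 3$, $n = 2$) and its kernel $D = \ker\omega$. If $d\omega = \alpha \wedge \omega$ for some linear form $\alpha$, then necessarily $\omega \wedge d\omega = 0$, and this fails here.

```latex
\begin{aligned}
\omega &= dz - y\,dx, \qquad d\omega = dx \wedge dy,\\
\omega \wedge d\omega &= dz \wedge dx \wedge dy = dx \wedge dy \wedge dz \neq 0,\\
D &= \ker\omega = \operatorname{span}\Big\{ X = \frac{\partial}{\partial x} + y\,\frac{\partial}{\partial z},\ \ Y = \frac{\partial}{\partial y} \Big\},\\
[X, Y] &= -\frac{\partial}{\partial z}, \qquad
d\omega(X, Y) = X(\omega(Y)) - Y(\omega(X)) - \omega([X, Y]) = 0 - 0 + 1 = 1.
\end{aligned}
```

Thus $d\omega$ does not vanish on $\ker\omega$, the bracket $[X, Y]$ leaves $D$, and this distribution is not integrable. For comparison, $\omega = dz$ has $d\omega = 0$ and its kernel is integrated by the planes $z = \text{const}$.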
2. Remarks on Partial Differential Equations
The aim of our excursion into the landscape of differential equations is modest. We do not have space in this rather elementary guide to come close enough to this subtle, beautiful, and extremely useful part of mathematics dealing with differential equations. Still we mention a few issues.
First, the simplest method reducing the problem to already mastered ordinary differential equations is explained, based on the so called characteristics. Then we show more simple methods how to get some families of solutions.
Next, we present a more complicated theoretical approach dealing with formal solvability of even higher order systems of differential equations and its convergence - the
famous Cauchy-Kovalevskaya theorem. This is the only instance of a general existence and uniqueness theorem for differential equations involving partial derivatives. Unfortunately, it does not cover many interesting problems of practical importance.
Finally, we display a few classical methods to solve boundary problems involving some of the most common equations of second order.
9.2.1. Initial observations. In practical problems, we often meet equations relating unknown functions of more variables and their derivatives. We already handled the very special case where the relations concerned functions $x(t)$ of just one variable $t$. More explicitly, we dealt with vector equations
$$
x^{(k)} = f(t, x, \dot x, \ddot x, \dots, x^{(k-1)}), \qquad f : \mathbb{R}^{nk+1} \to \mathbb{R}^n,
$$
where the dots over $x \in \mathbb{R}^n$ meant the (iterated) derivatives of $x(t)$, up to the order $k$. The goal was to find a (vector) curve $x(t)$ in $\mathbb{R}^n$ which makes this equation valid.
Two more comments are due: 1) we can omit the explicit appearance of $t$ at the cost of adding one more variable $x_0$ and the equation $\dot x_0 = 1$; and 2) giving new names to the iterated derivatives, $x_j = x^{(j)}$, and adding the equations $\dot x_j = x_{j+1}$, $j = 1, \dots, k-1$, we always reduce the problem to a first order system of equations (on a much bigger space).
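Both reductions can be seen on a single scalar second order equation. The equation below is our own illustrative choice, with the new variables $x_0 = t$ and $x_1 = \dot x$.

```latex
\begin{aligned}
&\ddot x = -\sin x + \cos t
&&\text{(second order, explicit dependence on } t\text{)}\\[2pt]
&\Longleftrightarrow\quad
\begin{cases}
\dot x_0 = 1,\\
\dot x = x_1,\\
\dot x_1 = -\sin x + \cos x_0,
\end{cases}
&&\text{(autonomous first order system on } \mathbb{R}^3\text{)}
\end{aligned}
```

The initial condition $x_0(0) = 0$ recovers $x_0 = t$, and the remaining initial values $x(0)$, $x_1(0)$ are the original initial position and velocity.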
Thus, we should like to work similarly with the equations
$$
F(x, y, u, u_x, u_y, u_{xx}, u_{xy}, u_{yy}, \dots) = 0,
$$
where $u$ is an unknown function (possibly vector valued) of two variables $x$ and $y$ (or even more variables) and, as usual, the indices denote the partial derivatives. Even if we expect the implicit equation to be solved in some sense with respect to some of the highest partial derivatives, we cannot hope for a general existence and uniqueness result similar to the ODE case.
Let us start with a most simple example illustrating the general problem related to the choice of the initial conditions.
9.2.2. The simplest linear case. Consider one real function u = u(x, y), subject to the linear homogeneous equation
(1) $a(x,y)\,u_x + b(x,y)\,u_y = 0,$
where $a$ and $b$ are known functions of two variables defined for $(x, y)$ in a domain $\Omega \subset \mathbb{R}^2$. We consider the equation in the tubular domain $\Omega \times \mathbb{R} \subset \mathbb{R}^3$. Usually, $\Omega$ is an open set together with a nice boundary, a curve $\partial\Omega$ in our case.
An obvious simple idea suggests writing $\Omega$ as a union of non-intersecting curves and looking for $u$ constant along those curves. Moreover, if those curves were transversal to the boundary $\partial\Omega$, then initial conditions along the boundary should extend inside of $\Omega$. Thus, consider such a potentially existing curve $c(t) = (x(t), y(t))$ and write
$$
0 = \frac{d}{dt}\,u(c(t)) = u_x(c(t))\,\dot x(t) + u_y(c(t))\,\dot y(t).
$$
This yields the conditions for the requested curves:
(2) $\dot x = a(x,y), \qquad \dot y = b(x,y).$
Since u is considered constant along the curve, we obtain a unique possibility for the function u along the curves for all initial conditions x(0), y(0), and u(x(0), y(0)), if the coefficients a and b are at least Lipschitz in x and y.
The latter curves are called the characteristics of the first order partial differential equation (1) and they are solutions of its characteristic equations (2). If the coefficients are differentiable in all variables, then the solution $u$ will also be differentiable for differentiable choices of initial conditions on a curve transversal to the characteristics, and we might have solved the problem (1) locally. Still it might fail.
Let us look at the homogeneous linear problem
(3) $y\,u_x - x\,u_y = 0, \qquad u(x, 0) = x.$
We saw already the solutions to the characteristic equations
$$
\dot x = y, \qquad \dot y = -x,
$$
and the characteristics are circles centred at the origin, $x(t) = R\sin t$, $y(t) = R\cos t$. If we choose any even differentiable function $\varphi(x) = u(x, 0)$ for the initial conditions at the points $(x, 0)$, we are lucky to see that the solution will work. But for odd functions, e.g. our choice $\varphi(x) = x$, there will be no solution of our problem in any neighbourhood of the origin. Indeed, each circle meets the line $y = 0$ in the two points $(\pm R, 0)$, so any solution must satisfy $u(R, 0) = u(-R, 0)$. Clearly, this failure is linked to the fact that the origin is a singular point of the characteristic equations.
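The compatibility problem for the initial data can be made visible by following one characteristic numerically. This is a sketch under our own assumptions: a classical Runge-Kutta integrator written from scratch, a hypothetical helper name `rk4_flow`, and the starting point $(0.8, 0)$ chosen arbitrarily.

```python
import math

def rk4_flow(field, point, t_end, steps=2000):
    """Follow the flow of a planar vector field from `point` for time t_end."""
    h = t_end / steps
    x, y = point
    for _ in range(steps):
        k1 = field(x, y)
        k2 = field(x + 0.5 * h * k1[0], y + 0.5 * h * k1[1])
        k3 = field(x + 0.5 * h * k2[0], y + 0.5 * h * k2[1])
        k4 = field(x + h * k3[0], y + h * k3[1])
        x += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        y += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return x, y

# Characteristic field of y*u_x - x*u_y = 0, i.e. x' = y, y' = -x.
def characteristic_field(x, y):
    return (y, -x)

x0 = 0.8
# After time pi the characteristic through (x0, 0) returns to the line y = 0.
x_half, y_half = rk4_flow(characteristic_field, (x0, 0.0), math.pi)

# u is constant along the characteristic, so the initial data must agree
# at both intersection points with the line y = 0.
even_data_consistent = abs(x0 ** 2 - x_half ** 2) < 1e-9   # phi(x) = x^2
odd_data_consistent = abs(x0 - x_half) < 1e-9              # phi(x) = x
```

The characteristic lands at (approximately) $(-x_0, 0)$, so the even data pass the consistency check while the data $\varphi(x) = x$ do not.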
9.2.3. The quasi-linear case. The situation seems to get more tricky once we add a nontrivial right-hand value f(x, y, u) to the equation (1), i.e. we try to solve the problem (allowing a and b to depend on u)
(1) $a(x, y, u)\,u_x + b(x, y, u)\,u_y = f(x, y, u).$
But in fact, the very same idea leads to characteristic equations on $\mathbb{R}^3$, writing $z = u(x,y)$ for the unknown function along the characteristics. Geometrically, we seek a vector field tangent to all graphs of solutions in the tubular domain $\Omega \times \mathbb{R}$. Recall that $z = u(x, y)$, restricted to a curve in the graph, implies $\dot z = u_x\,\dot x + u_y\,\dot y$, and thus we may set $\dot z = f(x, y, z)$, $\dot x = a(x, y, z)$, $\dot y = b(x, y, z)$ in order to get such a characteristic vector field.
Characteristic equations and integrals
The characteristic equations of the equation (1) are
(2) $\dot x = a(x,y,z), \qquad \dot y = b(x,y,z), \qquad \dot z = f(x,y,z).$
This autonomous system of three equations is uniquely solvable for each initial condition if $a$, $b$, and $f$ are Lipschitz.
A function $\psi$ on $\Omega \times \mathbb{R}$ which is constant on each flow line of the characteristic vector field, i.e., $\psi(x(t), y(t), z(t)) = \text{const}$ for all solutions of (2), is called an integral of the equation (1). If $\psi_z \neq 0$, then the implicit function theorem guarantees the unique existence of the function $z = u(x,y)$ satisfying the chosen initial conditions.
Check yourself that the latter functions $u$ are solutions to our problem. This approach covers the homogeneous case as well; we just consider the autonomous characteristic equations with $\dot z = 0$ added.
Let us come back to our simple equation 9.2.2(3) and choose $f(x,y,u) = y$ for the right-hand side. The characteristic equations yield $x = R\sin t$, $y = R\cos t$ as before, while $\dot z = y = R\cos t$ and hence $z = R\sin t + z(0)$. Thus, we may choose $\psi(x,y,z) = z - x$ as an integral of the equation, and we obtain the solutions $u(x,y) = x + C$ with any constant $C$.
Notice that there will be plenty of solutions here, since we may add any solution of the homogeneous problem, i.e. all functions of the form
(3) $u(x,y) = h(x^2 + y^2)$
with any differentiable function $h$. Thus, the general solution $u(x, y) = x + h(x^2 + y^2)$ depends on one function of one variable (the above constant $C$ is a special case of $h$).
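The claimed general solution is easy to test by substituting it into the equation $y\,u_x - x\,u_y = y$. In the following sketch the choice $h = \sin$, the sample points, and the helper name `residual` are ours, and the partial derivatives are approximated by central differences.

```python
import math

def residual(u, x, y, step=1e-5):
    """Value of y*u_x - x*u_y - y at (x, y), using central differences."""
    ux = (u(x + step, y) - u(x - step, y)) / (2 * step)
    uy = (u(x, y + step) - u(x, y - step)) / (2 * step)
    return y * ux - x * uy - y

h_fun = math.sin   # an arbitrary differentiable h, chosen only for illustration

def u(x, y):
    return x + h_fun(x * x + y * y)

sample_points = [(0.5, 0.2), (-1.0, 0.7), (0.3, -1.1)]
worst = max(abs(residual(u, x, y)) for x, y in sample_points)
```

The value `worst` stays at the level of the finite-difference error, in agreement with the hand computation $y(1 + 2xh') - x(2yh') = y$.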
We may also conclude that for "reasonable" curves $\partial\Omega \subset \mathbb{R}^2$ (those transversal to the circles centred at the origin and not containing the origin) and "reasonable" initial values $u|_{\partial\Omega}$ (we have to watch the multiple intersections of the circles with $\partial\Omega$) there will be (at least locally) a unique solution extending the initial values to an open neighborhood of $\partial\Omega$.
Of course, we may similarly use characteristics and integrals for any finite number of variables $x = (x_1,\dots,x_n)$ and equations of the form
$a_1(x,u)\frac{\partial u}{\partial x_1} + \dots + a_n(x,u)\frac{\partial u}{\partial x_n} = f(x,u)$
with the unknown function $u = u(x_1,\dots,x_n)$. As we shall see later, typically we obtain generic solutions dependent on one function of $n - 1$ variables, similarly to the above example.
9.2.4. Systems of equations. Let us look at what happens if we add more equations. There are two quite different ways to couple the equations.
We may seek a vector valued function $u = (u_1,\dots,u_m) : \mathbb{R}^n \to \mathbb{R}^m$, subject to $m$ equations
(1) $A_i(x,u) \cdot \nabla u_i = f_i(x,u), \quad i = 1,\dots,m,$
where the left hand side means the scalar product of a vector valued function $A_i : \mathbb{R}^{m+n} \to \mathbb{R}^n$ and the gradient vector of the function $u_i$. Such systems behave similarly to the scalar ones and we shall come back to them later.
The other option leads to the so-called overdetermined systems of equations. Actually we shall not pay more attention to this case in the sequel and so the reader might jump to 9.2.6 if getting lost.
Consider a (scalar) function $u$ on a domain $\Omega \subset \mathbb{R}^n$ and its gradient vector $\nabla u$. For each matrix $A = (a_{ij})$ with $m$ rows and $n$ columns, with differentiable functions $a_{ij}(x, u)$ on $\Omega \times \mathbb{R}$, and the right hand value function $F(x, u) : \Omega \times \mathbb{R} \to \mathbb{R}^m$, we can consider the system of equations
(2) $A(x,u) \cdot \nabla u = F(x,u).$
Of course, in both cases, we have got $m$ individual equations of the type from the previous paragraph and we could apply the same idea of characteristic vector fields to all of them. The problem consists in the coupling of the equations and in obtaining possibly inconsistent necessary conditions from the individual characteristic fields.
Let us look at the overdetermined case now. We get closest to the situation with ordinary differential equations if $A$ is invertible and we move it to the right hand side, arriving at the system of equations
(3) $\nabla u = A^{-1}(x,u) \cdot F(x,u) = G(x,u).$
The simplest non-trivial case consists of two equations in two variables:
$u_x = f(x, y, u), \quad u_y = g(x, y, u).$
Geometrically, we describe the graph of the solution as a surface in $\mathbb{R}^3$ by prescribing its tangent plane through each point. An obvious condition for the existence of such $u$ is obtained by differentiating the equations and employing the symmetry of the higher order partial derivatives, i.e. the condition $u_{xy} = u_{yx}$. Indeed,
$u_{xy} = f_y + f_u g = g_x + g_u f = u_{yx},$
where we substituted the original equations after applying the chain rule. We shall see in a moment that this condition is also sufficient for the existence of the solutions. Moreover, if the solutions exist, then they are determined by their values in one point, similarly to the ordinary differential equations.
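The compatibility condition $f_y + f_u g = g_x + g_u f$ is easy to test for concrete right-hand sides. The sketch below is only an illustration; the helper name `mixed_derivative_gap` and both sample systems are our own choices. The first system, $u_x = yu$, $u_y = xu$, is compatible (it is solved by $u = Ce^{xy}$), while the second one, $u_x = u$, $u_y = xu$, violates the condition.

```python
def mixed_derivative_gap(f, g, x, y, u, eps=1e-6):
    """Return (f_y + f_u*g) - (g_x + g_u*f) at the point (x, y, u)."""
    def partial(F, i):
        # Central difference in the i-th argument of F(x, y, u).
        hi = [x, y, u]
        lo = [x, y, u]
        hi[i] += eps
        lo[i] -= eps
        return (F(*hi) - F(*lo)) / (2 * eps)

    u_xy = partial(f, 1) + partial(f, 2) * g(x, y, u)
    u_yx = partial(g, 0) + partial(g, 2) * f(x, y, u)
    return u_xy - u_yx

# Compatible system: u_x = y*u, u_y = x*u.
gap_ok = mixed_derivative_gap(lambda x, y, u: y * u, lambda x, y, u: x * u, 0.4, -0.3, 2.0)
# Incompatible system: u_x = u, u_y = x*u; here the gap equals -u.
gap_bad = mixed_derivative_gap(lambda x, y, u: u, lambda x, y, u: x * u, 0.4, -0.3, 2.0)
print(gap_ok, gap_bad)
```

A non-zero gap at a single point already excludes the existence of a solution through that point.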
9.2.5. Frobenius' theorem again. Similarly, we can deal with the gradient $\nabla u$ of an $m$-dimensional vector valued function $u$. For example, if $m = 2$ and $n = 2$ we are describing the tangent planes to the two-dimensional graph of the solution $u$. In general we face $mn$ equations
(1) $\frac{\partial u_p}{\partial x_i} = F_{pi}(x,u), \quad i = 1,\dots,n, \ p = 1,\dots,m.$
The necessary conditions imposed by the symmetry of higher order derivatives then read
(2) $\frac{\partial^2 u_p}{\partial x_i \partial x_j} = \frac{\partial F_{pi}}{\partial x_j} + \sum_q \frac{\partial F_{pi}}{\partial u_q} F_{qj} = \frac{\partial F_{pj}}{\partial x_i} + \sum_q \frac{\partial F_{pj}}{\partial u_q} F_{qi}$
for all $i$, $j$ and $p$.
Let us reconsider our problem from the geometric point of view now. We are seeking the graph of the mapping $u : \mathbb{R}^n \to \mathbb{R}^m$. The equations (1) describe the $n$-dimensional distribution $D$ on $\mathbb{R}^{m+n}$ and the graphs of possible solutions $u = (u_1,\dots,u_m)$ are just the integral manifolds of $D$. The distribution $D$ is clearly defined by the $m$ linear forms
$\omega_p = du_p - \sum_i F_{pi}\, dx_i, \quad p = 1,\dots,m,$
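while the vector fields generating the common kernel of all $\omega_p$ can be chosen as
$X_i = \frac{\partial}{\partial x_i} + \sum_p F_{pi} \frac{\partial}{\partial u_p}, \quad i = 1,\dots,n.$
Now we compute the differentials $d\omega_p$ and evaluate them on the fields $X_i$:
$-d\omega_p = \sum_{i,j} \frac{\partial F_{pi}}{\partial x_j}\, dx_j \wedge dx_i + \sum_{i,q} \frac{\partial F_{pi}}{\partial u_q}\, du_q \wedge dx_i,$
$-d\omega_p(X_j, X_i) = \Bigl(\frac{\partial F_{pi}}{\partial x_j} + \sum_q \frac{\partial F_{pi}}{\partial u_q} F_{qj}\Bigr) - \Bigl(\frac{\partial F_{pj}}{\partial x_i} + \sum_q \frac{\partial F_{pj}}{\partial u_q} F_{qi}\Bigr).$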
Thus, vanishing of the differentials on the common kernel is equivalent to the necessary conditions deduced above, and the Frobenius theorem says that the latter conditions are sufficient, too. We have proved the following:
Theorem. The system of equations (1) admits solutions if and only if the conditions (2) are satisfied. Then the solutions are determined uniquely locally around $x \in \Omega$ by the initial conditions $u(x) \in \mathbb{R}^m$.
Remark. The Frobenius theory deals with the so-called overdetermined systems of PDEs, i.e. we have got too many equations and this causes obstructions to their integrability. Although the case in the last paragraph sounds very special, the actual use of the theory consists in considering differential consequences of a given system until we reach a point where the special theorem applies and gives not only further obstructions but also the sufficient conditions.
9.2.6. General solutions to PDE's. In a moment, we shall deal with diverse boundary conditions for the solutions of PDEs. In most cases we shall be happy to have good families of simple "guessed" solutions which are not subject to any further conditions. We talk about general solutions in this context. Unlike the situation with ODEs, we should not hope to get a universal expression for all possible solutions this way (although we can come close to that in some cases, cf. 9.2.3(3)). Instead, we often try to find the right superpositions (i.e. linear combinations) or integrals built on suitable general solutions.
Let us look at the simplest linear second order equations in two variables, homogeneous with constant coefficients:
(1) $A u_{xx} + 2B u_{xy} + C u_{yy} + D u_x + E u_y + F u = 0$
where A, B, C, D, E, F are real constants and at least one of A, B, C is non-zero.
Similarly to the method of characteristics, we try to reduce the problem to ODEs. Let us again assume a solution in the form $u = f(p)$, where $f$ is an unknown function of $p$, and $p(x, y)$ should be nice enough to get close to solutions. The necessary derivatives are $u_x = f' p_x$, $u_y = f' p_y$,
$u_{xx} = f'' p_x p_x + f' p_{xx}, \quad u_{xy} = f'' p_x p_y + f' p_{xy}, \quad u_{yy} = f'' p_y p_y + f' p_{yy}.$
Thus (1) becomes too complicated in general, but restricting to affine $p(x, y) = \alpha x + \beta y$ with constants $\alpha$, $\beta$, we arrive at
(2) $(A\alpha^2 + 2B\alpha\beta + C\beta^2) f'' + (D\alpha + E\beta) f' + F f = 0.$
This is a nice ODE as soon as we fix the values of $\alpha$ and $\beta$. Let us look at several simple cases of special importance.
Assume $D = E = F = 0$, $A \neq 0$. Then, after dividing by $\alpha^2$, we solve the equation $(A + 2B\lambda + C\lambda^2) f'' = 0$ and the right choice of the ratio $\lambda = \beta/\alpha \neq 0$ kills the entire coefficient at $f''$. Thus, (2) will hold true for any (twice differentiable) function $f$ and we arrive at the general solution $u(x,y) = f(p(x, y))$, with $p(x, y) = x + \lambda y$. Of course, the behavior will very much depend on the number of real roots of the quadratic equation
$A + 2B\lambda + C\lambda^2 = 0.$
The wave equation. Put $A = 1$, $C = -\frac{1}{c^2}$, $B = 0$, thus our equation is $u_{xx} = \frac{1}{c^2} u_{yy}$, the wave equation in dimension 1. Then the equation $1 - \frac{1}{c^2}\lambda^2 = 0$ has got two real roots $\lambda = \pm c$, and we obtain $p = x \pm cy$ leading to the general solution
$u(x, y) = f(x - cy) + g(x + cy)$
with two arbitrary twice differentiable functions of one variable $f$ and $g$.
In Physics, the equation models one-dimensional wave development in the space parametrized by $x$, while $y$ stands for the time. Notice $c$ corresponds to the speed of the wave $u(x, 0) = f(x) + g(x)$ initiated in the time $y = 0$, and while the $f$ part moves forwards, the other part moves backwards. Indeed, imagine $u(x,y) = f(x - cy)$ describes the displacement of a string at point $x$ in time $y$. This remains constant along the lines $x - cy = \text{constant}$. Thus, a stationary observer sees the initial displacement $u(x, 0)$ moving along the $x$-axis with the speed $c$.
In particular, we see that the initial condition along a line in the plane is not enough to determine the solution, unless we request that the solution moves only in one of the possible directions (i.e. we posit either $f$ or $g$ to be zero).
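A quick numerical sanity check of the general solution is sketched below. The speed $c = 2$, the profiles $f = \sin$ and $g(s) = e^{-s^2}$, the step size and the sample points are arbitrary choices made for illustration only; the second derivatives are replaced by central second differences.

```python
import math

C = 2.0  # wave speed, an arbitrary illustrative value

def f(s):
    return math.sin(s)       # profile moving forwards

def g(s):
    return math.exp(-s * s)  # profile moving backwards

def u(x, y):
    return f(x - C * y) + g(x + C * y)

def residual(x, y, h=1e-3):
    # Central second differences approximate u_xx and u_yy.
    u_xx = (u(x + h, y) - 2 * u(x, y) + u(x - h, y)) / h**2
    u_yy = (u(x, y + h) - 2 * u(x, y) + u(x, y - h)) / h**2
    return u_xx - u_yy / C**2

worst = max(abs(residual(x, y)) for x in (-1.0, 0.3, 1.7) for y in (0.0, 0.5, 1.2))
print(worst)  # of the order of the discretization error
```

Any other pair of twice differentiable profiles may be substituted for `f` and `g`.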
The Laplace equation. Now we consider $A = C = 1$, $B = 0$, i.e. the equation $u_{xx} + u_{yy} = 0$. This is the Laplace equation in two dimensions and its solutions are called harmonic functions.
Proceeding as before, we obtain two imaginary solutions to the equation $\lambda^2 + 1 = 0$ and our method produces $p = x \pm iy$, a complex valued function instead of the expected real one. This looks ridiculous, but we could consider $f$ to be a mapping $f : \mathbb{C} \to \mathbb{C}$ viewed as a mapping on the complex plane. Recall that some of such mappings have got differentials $D^1 f(p)$ which actually are multiplications by complex numbers at each point, cf. ??. This is in particular true for any polynomial or converging power series. We may request that this property holds true for all iterated derivatives of this kind. In general, we call such functions on $\mathbb{C}$ holomorphic and we discuss them in the last part of this chapter.
Now, assuming $f$ is holomorphic, we can repeat the above computation and arrive again at
$(\lambda^2 + 1) f''(p) = 0$
independently of the choice of $f$ (here $f'(p)$ means the complex number given by the differential $D^1 f$, and $f''(p)$ is the iteration of this kind of derivative). Moreover, the derivatives of vector valued functions are computed for the components separately and thus both the real and the imaginary part of the general solution $f(x + iy) + g(x - iy)$ will be real general solutions.
For example, consider $f(p) = p^2$ leading to
$u(x, y) = (x + iy)^2 = (x^2 - y^2) + i\, 2xy$
and a simple check shows that both terms satisfy the equation separately. Notice the two solutions $x^2 - y^2$ and $xy$ provide a basis of the 2-dimensional vector space of harmonic homogeneous polynomials of degree two.
The diffusion equation. Next assume $A = \kappa$, $B = C = D = F = 0$, and add the first order term with $E = -1$. This provides the equation
$\kappa u_{xx} = u_y,$
the diffusion equation in dimension one.
Applying the same method again, we arrive at the ODE
$\kappa\alpha^2 f'' - \beta f' = 0$
which is easy to solve. We know the solutions are found in the form $f(p) = e^{\nu p}$ with $\nu$ satisfying the condition $\kappa\alpha^2\nu^2 - \beta\nu = 0$. The zero solution is not interesting, thus we are left with $\nu = \frac{\beta}{\kappa\alpha^2}$ and we obtain the general solution to our problem by substituting $p(x,y) = \alpha x + \beta y$:
$u(x,y) = f(p) = e^{\frac{\beta}{\kappa\alpha^2}(\alpha x + \beta y)}.$
Again, a simple check reveals that this is a solution. But it is not very "general": it depends just on the two scalars $\alpha$ and $\beta$. We have to find much better ways to find solutions of such equations.
9.2.7. Nonhomogeneous equations. As always with linear equations, the space of solutions to the homogeneous linear equations is a real vector space (or complex, if we deal with complex valued solutions).
Let us write the equation as $Lu = 0$, where $L$ is the differential operator on the left hand side. For instance,
$L = A\frac{\partial^2}{\partial x^2} + 2B\frac{\partial^2}{\partial x \partial y} + C\frac{\partial^2}{\partial y^2} + D\frac{\partial}{\partial x} + E\frac{\partial}{\partial y} + F$
in the case of the linear equation 9.2.6(1).
The solutions of the corresponding non-homogeneous equation $Lu = f$ with a given function $f$ on the right hand side form an affine space. Indeed, if $Lu_1 = f$, $Lu_2 = f$, $Lu_3 = 0$, then clearly $L(u_1 - u_2) = 0$ while $L(u_1 + u_3) = f$.
Thus, if we succeed in finding a single solution to $Lu = f$, then we can add any general solution to the homogeneous equation to obtain a general solution.
Let us illustrate our observation on some of our basic examples. The non-homogeneous wave equation $u_{xx} - u_{yy} = x + y$ has got the general solution
$u(x,y) = \frac{1}{6}(x^3 - y^3) + f(x - y) + g(x + y)$
depending on two twice differentiable functions.
The non-homogeneous Laplace equation is called the Poisson equation. A general complex valued solution of the Poisson equation $u_{xx} + u_{yy} = x + y$ is
$u(x,y) = \frac{1}{6}(x^3 + y^3) + f(x - iy) + g(x + iy)$
depending on two holomorphic functions $f$ and $g$.
9.2.8. Separation of variables. As we have experienced, a straightforward attempt to get solutions is to expect them in a particular simple form. The method of separation of variables is based on the assumption that the solution will appear as a product of single variable functions in all variables in question. Let us apply this method to our three special examples.
Diffusion equation. We expect to find a general solution of $\kappa u_{xx} = u_t$ in the form $u(x, t) = X(x)T(t)$. Thus the equation says $\kappa X''(x)T(t) = T'(t)X(x)$. Assume further $u \neq 0$ and divide this equation by $\kappa u = \kappa XT$:
$\frac{X''(x)}{X(x)} = \frac{T'(t)}{\kappa T(t)}.$
Now the crucial observation comes. Notice the terms on the left and right are functions of different variables and thus the equation may be satisfied only if both sides are constant. We shall have to distinguish the signs of this separation constant, so let us write it as $-\alpha^2$ (choosing the negative option). Thus we have to solve two independent ODEs
$X'' + \alpha^2 X = 0, \quad T' + \alpha^2\kappa T = 0.$
The general solutions are
$X(x) = A\cos\alpha x + B\sin\alpha x,$
$T(t) = C e^{-\alpha^2\kappa t}$
with free real constants $A$, $B$, $C$. When combining these solutions in the product, we may absorb the constant $C$ into the other ones and thus we arrive at the general solution
$u(x, t) = (A\cos\alpha x + B\sin\alpha x)\, e^{-\alpha^2\kappa t}.$
This solution depends on three real constants.
If we choose a positive separation constant instead, i.e. $\alpha^2$, there will be a sign change in our equations and the resulting general solution is
$u(x,t) = (A\cosh\alpha x + B\sinh\alpha x)\, e^{\alpha^2\kappa t}.$
If the separation constant vanishes, then we obtain just $u(x, t) = A + Bx$, independent of $t$.
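The separated solution for the negative separation constant can be checked numerically, as in the following minimal sketch. The values of $\kappa$, $\alpha$, $A$, $B$, the step size and the sample points are arbitrary illustrative choices; the derivatives are approximated by central differences.

```python
import math

KAPPA, ALPHA, A, B = 0.5, 1.3, 1.0, -2.0  # arbitrary illustrative constants

def u(x, t):
    # Separated solution for the separation constant -ALPHA**2.
    spatial = A * math.cos(ALPHA * x) + B * math.sin(ALPHA * x)
    return spatial * math.exp(-ALPHA**2 * KAPPA * t)

def residual(x, t, h=1e-3):
    # kappa*u_xx - u_t, with central differences in x and t.
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    return KAPPA * u_xx - u_t

worst = max(abs(residual(x, t)) for x in (-1.0, 0.2, 2.5) for t in (0.0, 0.7, 3.0))
print(worst)  # of the order of the discretization error
```

Swapping the trigonometric functions for `math.cosh`, `math.sinh` and flipping the sign in the exponent tests the case of the positive separation constant in the same way.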
The Laplace equation. Assume $u(x, y) = X(x)Y(y)$ satisfies the equation $u_{xx} + u_{yy} = 0$ and proceed exactly as above. Thus, $X''Y + Y''X = 0$ and dividing by $XY$ and choosing the separation constant $\alpha^2$, we arrive at
$X'' = \alpha^2 X, \quad Y'' = -\alpha^2 Y.$
The general solution depends on four real constants $A$, $B$, $C$, $D$:
$u(x, y) = (A\cosh\alpha x + B\sinh\alpha x)(C\cos\alpha y + D\sin\alpha y).$
If the separation constant is negative, i.e. $-\alpha^2$, the roles of $x$ and $y$ swap.
The wave equation. Let us look at how the method works if there are more variables. Consider a solution $u(x, y, z, t) = X(x)Y(y)Z(z)T(t)$ of the 3D wave equation
$\frac{1}{c^2} u_{tt} = u_{xx} + u_{yy} + u_{zz}.$
Playing the same game again, we arrive at the equation
$\frac{1}{c^2} T''XYZ = X''YZT + Y''XZT + Z''XYT.$
Dividing by $u \neq 0$,
$\frac{1}{c^2}\frac{T''}{T} = \frac{X''}{X} + \frac{Y''}{Y} + \frac{Z''}{Z},$
and since all the individual terms depend on different single variables, they have to be constant. Again, we shall have to pay attention to the signs of the separation constants. For instance, let us choose all constants negative and look at the individual four ODEs
$\frac{1}{c^2}\frac{T''}{T} = -\alpha^2, \quad \frac{X''}{X} = -\beta^2, \quad \frac{Y''}{Y} = -\gamma^2, \quad \frac{Z''}{Z} = -\delta^2$
with the constants satisfying $-\alpha^2 = -\beta^2 - \gamma^2 - \delta^2$. The general solution is $u(x, y, z, t) = X(x)Y(y)Z(z)T(t)$ with linear combinations
$T(t) = A\cos c\alpha t + B\sin c\alpha t,$
$X(x) = C\cos\beta x + D\sin\beta x,$
$Y(y) = E\cos\gamma y + F\sin\gamma y,$
$Z(z) = G\cos\delta z + H\sin\delta z$
with eight real constants $A$ through $H$.
If we choose any of the separation constants positive, the corresponding component in the product would display hyperbolic sine and cosine instead. Of course, the relation between the constants reflects the signs as well.
We can also work with complex valued solutions and choose the exponentials as our building blocks (i.e. $X(x) = e^{\pm i\beta x}$ or $X(x) = e^{\pm\beta x}$, etc.). For instance, take one of the solutions with all the separation constants negative
$u(x, y, z, t) = e^{i\beta x} e^{i\gamma y} e^{i\delta z} e^{-ic\alpha t} = e^{i(\beta x + \gamma y + \delta z - c\alpha t)}.$
Similarly to the 1D situation, we can again see a "plane wave" propagating along the direction $(\beta, \gamma, \delta)$ with angular frequency $c\alpha$.
9.2.9. Boundary conditions. We continue with our examples of second order equations and discuss the three most common boundary conditions for them. Let us consider a domain $\Omega \subset \mathbb{R}^n$, bounded or unbounded, and a differential operator $L$ defined on (real or complex valued) functions on $\Omega$. We write $\partial\Omega$ for the boundary of $\Omega$ and assume this is a smooth manifold.
Locally, such a submanifold in $\mathbb{R}^n$ is given by one implicit function $\rho : \mathbb{R}^n \to \mathbb{R}$ and the unit normal vector $\nu(x)$, $x \in \partial\Omega$, to the hypersurface $\partial\Omega$ is given by the normalized gradient
$\nu(x) = \frac{\nabla\rho(x)}{\|\nabla\rho(x)\|}.$
We say that a function $u$ is differentiable on $\Omega$ if it is differentiable on its interior and the directional derivatives $D^1_\nu u(x)$ exist in all points of the boundary. Typically we write $\frac{\partial u}{\partial\nu}$ for the derivative in the normal direction.
For simplicity, let us restrict ourselves to L of the form
and look at the equation $Lu = F(x, y, u, \frac{\partial u}{\partial\nu})$.
Cauchy boundary problem
At each point of the boundary $x \in \partial\Omega$ we prescribe both the value $\varphi(x) = u(x)$ and the derivative $\psi(x) = \frac{\partial u}{\partial\nu}(x)$ in the normal unit direction. The Cauchy problem is to solve the equation $Lu = F$ on $\Omega$, subject to $u = \varphi$ and $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$.
We shall see that the Cauchy problems very often lead locally to unique solutions, subject to certain geometric conditions on the boundary $\partial\Omega$. At the same time, this is often not a convenient setup for practical problems. We shall illustrate this phenomenon on the 2D Laplace equation in the next but one paragraph.
An even simpler possibility is to request only the condition on the values of $u$ on the boundary $\partial\Omega$. Another possibility, often needed in direct applications, is to prescribe the derivatives only. We shall see that this is reasonable for the Laplace and Poisson equations.
Dirichlet and Neumann boundary problems
At each point of the boundary $x \in \partial\Omega$ we prescribe the value $\varphi(x) = u(x)$ or the derivative $\psi(x) = \frac{\partial u}{\partial\nu}(x)$ in the normal unit direction.
The Dirichlet problem is to solve the equation $Lu = F$ on $\Omega$, subject to the condition $u = \varphi$ on $\partial\Omega$.
The Neumann problem is to solve the equation $Lu = F$ on $\Omega$, subject to the condition $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$.
9.2.10. Uniqueness for Poisson equations. Because the proof of the next theorem works in all dimensions $n \geq 2$, we shall formulate it for the general Poisson equation
(1) $\Delta u = \Bigl(\frac{\partial^2}{\partial x_1^2} + \dots + \frac{\partial^2}{\partial x_n^2}\Bigr) u = F(x_1,\dots,x_n).$
Theorem. Assume $u$ is a twice differentiable solution of the Poisson equation (1) on a domain $\Omega \subset \mathbb{R}^n$. If $u$ satisfies the Dirichlet condition $u = \varphi$ on $\partial\Omega$, then $u$ is the only solution of the Dirichlet problem.
If $u$ satisfies the Neumann condition $\frac{\partial u}{\partial\nu} = \psi$ on $\partial\Omega$, then $u$ is the unique solution of the Neumann problem, up to an additive constant.
The proof of this theorem relies on a straightforward consequence of the divergence theorem. Recall 9.1.14, saying that for each vector field $X$ on a domain $\Omega \subset \mathbb{R}^n$ with hypersurface boundary $\partial\Omega$
(2) $\int_\Omega \operatorname{div} X \, dx_1 \dots dx_n = \int_{\partial\Omega} X \cdot \nu \, d\partial\Omega,$
where $\nu$ is the oriented (outward) unit normal to $\partial\Omega$ and $d\partial\Omega$ stands for the volume inherited from $\mathbb{R}^n$ on $\partial\Omega$.
1st and 2nd Green's identity
Let $M \subset \mathbb{R}^n$ be an $n$-dimensional manifold with boundary hypersurface $S$, and consider two differentiable functions $\varphi$ and $\psi$. Then
(3) $\int_M (\varphi\Delta\psi + \nabla\varphi \cdot \nabla\psi) \, dx_1 \dots dx_n = \int_S \varphi\nabla\psi \cdot \nu \, dS.$
This version of the divergence theorem is called the 1st Green's identity.
Next, let us consider one more differentiable function $\mu$ and $X = \varphi\mu\nabla\psi - \psi\mu\nabla\varphi$. Then the divergence theorem yields the so-called 2nd Green's identity
(4) $\int_M \bigl(\varphi(\nabla \cdot (\mu\nabla))\psi - \psi(\nabla \cdot (\mu\nabla))\varphi\bigr) \, dx_1 \dots dx_n = \int_S \mu(\varphi\nabla\psi - \psi\nabla\varphi) \cdot \nu \, dS,$
where $\nabla \cdot (\mu\nabla)$ means the formal scalar product of the two vector valued differential operators.
Proof of the Green's identities. The first claim follows by applying (2) to $X = \varphi\nabla\psi$, where $\varphi$ and $\psi$ are differentiable functions. Indeed,
$X \cdot \nu \, dS = \varphi(\nabla\psi \cdot \nu) \, dS, \quad \operatorname{div} X = \varphi\Delta\psi + \nabla\varphi \cdot \nabla\psi,$
where the dot in the second term denotes the scalar product of the two gradients. Let us also notice that the scalar product $\nabla\psi \cdot \nu$ is just the derivative of $\psi$ in the direction of the oriented unit normal $\nu$.
The second identity is computed the same way and the two terms with the scalar products of two gradients cancel each other. The reader should check the details. □
Remark. A special case of the 2nd Green's identity is worth mentioning. Namely, if $\mu = 1$ and both $\varphi$ and $\psi$ vanish on the boundary $\partial\Omega$, we obtain
$\int_\Omega (\varphi\Delta\psi - \psi\Delta\varphi) \, dx_1 \dots dx_n = 0.$
This means that the Laplace operator is self-adjoint with respect to the $L_2$ scalar product on such functions.
Proof of the uniqueness. Assume $u_1$ and $u_2$ are solutions of the Poisson equation on $\Omega$, thus $u = u_1 - u_2$ is a solution of the homogeneous Laplace equation,
$\Delta u = \Delta u_1 - \Delta u_2 = F - F = 0.$
At the same time, either $u = u_1 - u_2 = 0$ on $\partial\Omega$ or $\frac{\partial u}{\partial\nu} = 0$ on $\partial\Omega$.
Now we exploit the first Green's identity (3) with $\varphi = \psi = u$,
$\int_\Omega (u\Delta u + \nabla u \cdot \nabla u) \, dx_1 \dots dx_n = \int_{\partial\Omega} u \frac{\partial u}{\partial\nu} \, d\partial\Omega.$
In both problems, Dirichlet or Neumann, the right hand side vanishes. The first term in the left hand integrand vanishes, too. We conclude
$\int_\Omega \|\nabla u\|^2 \, dx_1 \dots dx_n = 0,$
but this is possible only if $\nabla u = 0$ since the integrand is continuous. Thus, $u = u_1 - u_2$ is constant. But if we solve a Dirichlet problem, then $u_1$ and $u_2$ coincide on the boundary and thus they are equal. □
9.2.11. Well posed problems. Consider the Cauchy boundary problem for $u_{xx} + u_{yy} = 0$, with $\partial\Omega$ given by $y = 0$ and
$\varphi(x) = u(x, 0) = A_\alpha \sin\alpha x, \quad \psi(x) = u_y(x, 0) = B_\alpha \sin\alpha x$
with the scalar coefficients $A_\alpha$ and $B_\alpha$ depending on the chosen frequency $\alpha$. Simple inspection reveals that we can find such a solution within the result from the separation method:
$u(x, y) = \Bigl(A_\alpha \cosh\alpha y + \frac{1}{\alpha} B_\alpha \sinh\alpha y\Bigr) \sin\alpha x.$
Now, choose $B_\alpha = 0$ and $A_\alpha = \frac{1}{\alpha}$, i.e.
$u(x,y) = \frac{1}{\alpha} \cosh\alpha y \, \sin\alpha x.$
Obviously, when moving $\alpha$ towards infinity, the Cauchy boundary conditions become arbitrarily small, and still such a small change of the boundary data causes an arbitrarily big increase of the values of $u$ in any close vicinity of the line $y = 0$.
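The size of the effect is easy to tabulate. The sketch below compares the maximal size of the boundary data $u(x,0) = \frac{1}{\alpha}\sin\alpha x$ with the maximal size of the solution on the nearby line $y = 0.1$; the distance $0.1$ and the three frequencies are arbitrary illustrative choices.

```python
import math

def data_size(alpha):
    # sup over x of |u(x, 0)| = |sin(alpha*x)| / alpha
    return 1.0 / alpha

def solution_size(alpha, y):
    # sup over x of |u(x, y)| = cosh(alpha*y) / alpha
    return math.cosh(alpha * y) / alpha

y = 0.1  # a line close to the boundary y = 0
rows = [(alpha, data_size(alpha), solution_size(alpha, y)) for alpha in (10, 100, 1000)]
for alpha, data, sol in rows:
    print(f"alpha = {alpha:5d}   data = {data:.1e}   solution at y = {y}: {sol:.3e}")
```

While the data shrink like $1/\alpha$, the values on the line $y = 0.1$ grow roughly like $e^{0.1\alpha}/(2\alpha)$.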
Imagine the equation describes some physical process and the boundary conditions reflect some measurements, including some small periodic errors. The results will be horribly unstable with respect to these errors. We should admit that the problem is in some sense ill-posed, even locally. This motivates the following definition.
Well-posed and ill-posed boundary problems
The problem $Lu = F$ on the domain $\Omega$ with boundary conditions on $\partial\Omega$ is called well-posed if all three conditions hold true:
(1) the boundary problem has got a solution $u$ (a classical solution means $u$ is twice continuously differentiable);
(2) the solution $u$ is unique;
(3) the solution is stable with respect to initial data, i.e. a "small" change of the boundary conditions results in a "small" change of the solution.
The problem is called ill-posed if any of the above conditions fails.
Usually, the stability in the third condition means that the solution is continuously dependent on the boundary conditions in a suitable topology on the chosen space of functions.
Also the uniqueness required in the second condition has to be taken reasonably. For instance, only uniqueness up to some additive constant makes sense for the Neumann problems.
9.2.12. Quasilinear equations. Now we exploit our experience and focus on the (local) Cauchy type problems for equations of arbitrary order. Similarly to the ODEs, we shall deal with problems, where the highest order derivatives are prescribed (more or less) explicitly and the initial conditions are given on a hypersurface up to the order k — 1.
Some notation will be useful. We shall use the multi-indices to express multivariate polynomials and derivatives, cf. 8.1.15. Further we shall write $\nabla^k u = \{\partial^\alpha u;\ |\alpha| = k\}$ for the vector of all derivatives of order $k$. In particular, $\nabla u$ means again the gradient vector of $u$.
Quasi-linear PDEs
For an unknown scalar function $u$ on a domain $\Omega \subset \mathbb{R}^n$ we prescribe its derivatives
(1) $\sum_{|\alpha|=k} a_\alpha(x,u,\dots,\nabla^{k-1}u)\, \partial^\alpha u = b(x,u,\dots,\nabla^{k-1}u),$
where $b$ and $a_\alpha$ are functions on the tubular domain $\Omega \times \mathbb{R}^N$, accommodating all the derivatives, with at least one of the $a_\alpha$ nonzero. We call such equations the (scalar) quasilinear partial differential equations (PDE) of order $k$.
We call (1) semilinear if all $a_\alpha$ do not depend on $u$ and its derivatives (thus all the non-linearity hides in $b$).
The principal symbol of a semi-linear PDE of order $k$ is the symmetric $k$-linear form $P$ on $\Omega$,
$P(x) : (\mathbb{R}^{n*})^k \to \mathbb{R}, \quad P(x, \xi,\dots,\xi) = \sum_{|\alpha|=k} a_\alpha(x)\, \xi^\alpha.$
For instance, the most general Poisson equation $\Delta u = f(x, y, u, \nabla u)$ on $\mathbb{R}^2$ is a semi-linear equation and its principal symbol is the positive definite quadratic form $P(\xi, \eta) = \xi^2 + \eta^2$, independent of $(x, y)$. The diffusion equation $\frac{\partial u}{\partial t} = \Delta u$ on $\mathbb{R}^3$ has got the symbol $P(\tau, \xi, \eta) = \xi^2 + \eta^2$, i.e. a positive semi-definite quadratic form, while the wave equation $\Box u = \frac{\partial^2 u}{\partial t^2} - \frac{\partial^2 u}{\partial x^2} = 0$ has got the indefinite symbol $P(\tau, \xi) = \tau^2 - \xi^2$ on $\mathbb{R}^2$.
We shall focus on the scalar equations and reduce the problem to a special situation which allows a further reduction to a system of first order equations (quite similarly to the ODE theory). Thus we extend the previous definition to systems of equations. Notice, these are systems of the first kind mentioned in 9.2.4.
Systems of quasi-linear PDEs
A system of quasi-linear PDEs determines a vector valued function $u : \Omega \subset \mathbb{R}^n \to \mathbb{R}^m$, subject to the vector equation
(2) $A(x, u,\dots,\nabla^{k-1}u) \cdot \nabla^k u = b(x, u,\dots,\nabla^{k-1}u).$
Here $A$ is a matrix of type $m \times M$ with functions $a_{i,\alpha} : \Omega \times \mathbb{R}^N \to \mathbb{R}$ as entries, $M = \binom{n+k-1}{k}$ is the number of $k$-combinations with repetition from $n$ objects, $\nabla^k u$ is the vector of vectors of all the $k$th-order derivatives of the components of $u$, $b : \Omega \times \mathbb{R}^N \to \mathbb{R}^m$, and $\cdot$ means the scalar products of the individual rows in $A$ with the vectors $\nabla^k u_i$ of the individual components of $u$, matching the individual components in $b$.
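The number $M$ of columns equals the number of multi-indices $\alpha$ with $|\alpha| = k$ in $n$ variables. A small sketch enumerating them and comparing the count with the binomial coefficient $\binom{n+k-1}{k}$ follows; the helper name `multi_indices` is our own.

```python
from itertools import combinations_with_replacement
from math import comb

def multi_indices(n, k):
    """All multi-indices alpha = (alpha_1, ..., alpha_n) with |alpha| = k."""
    result = []
    for combo in combinations_with_replacement(range(n), k):
        alpha = [0] * n
        for i in combo:
            alpha[i] += 1  # each chosen variable raises its exponent by one
        result.append(tuple(alpha))
    return result

for n, k in [(2, 2), (3, 2), (4, 3)]:
    print(n, k, len(multi_indices(n, k)), comb(n + k - 1, k))
```

For instance, a second order equation in three variables involves six second derivatives, in agreement with $\binom{4}{2} = 6$.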
9.2.13. Cauchy data. Next, we have to clarify the boundary condition data. Let us consider a domain $U \subset \mathbb{R}^n$ and a smooth hypersurface $\Gamma \subset U$, e.g. $\Gamma$ given by an implicit equation $f(x_1,\dots,x_n) = 0$ locally. Consider the unit normal vector $\nu(x)$ at each point $x \in \Gamma$ (i.e. $\nu = \frac{1}{\|\nabla f\|}\nabla f$ if $\Gamma$ is given implicitly). We would like to find minimal data along $\Gamma$ determining a solution of 9.2.12(1), at least locally around a given point.
To make things easy, let us first assume that $\Gamma$ is prescribed by $x_n = 0$. Then $\nu(x) = (0,\dots,0,1)$ at all $x \in \Gamma$ and, knowing the restriction of $u$ to $\Gamma$, we also know all derivatives $\partial^\alpha u$ with $\alpha = (\alpha_1,\dots,\alpha_{n-1}, 0)$, $0 < |\alpha|$. Thus, we have to choose reasonably differentiable functions $c_j$ on $\Gamma$, $j = 0,\dots,k-1$, and posit for all $j$
$\partial^\alpha u(x) = c_j(x), \quad \alpha = (0,\dots,0,j), \quad x \in \Gamma.$
All the other derivatives $\partial^\alpha u$ on $\Gamma$, $0 < |\alpha| < \infty$ with $\alpha_n < k$, are computed inductively by the symmetry of partial derivatives.
Moreover, if $a_{(0,\dots,0,k)} \neq 0$, we can establish the remaining $k$-th order derivative by means of the equation 9.2.12(1) and hope to be able to continue inductively. Indeed, writing $a = a_{(0,\dots,0,k)}(x,u,\dots,\nabla^{k-1}u) \neq 0$ (and similarly leaving out the arguments of the other functions $a_\alpha$), the equation 9.2.12(1) can be rewritten as
(1) $\frac{\partial^k u}{\partial x_n^k} = \frac{1}{a}\Bigl(-\sum_{|\alpha|=k,\ \alpha_n<k} a_\alpha \partial^\alpha u + b(x,u,\dots,\nabla^{k-1}u)\Bigr).$
Now, on $\Gamma$ we can use the already known derivatives to compute directly all the $\partial^\alpha u$ with $\alpha_n < k + 1$. But differentiating the latter equation by $\frac{\partial}{\partial x_n}$ we obtain the missing derivative of order $k + 1$ from the known quantities on the right-hand side. By induction, we obtain all the derivatives, as requested.
In the general situation we can iterate the derivative $D^1_{\nu(x)} u$ of $u$ in the direction of the unit normal vector $\nu$ to the hypersurface $\Gamma$.
Cauchy data for scalar PDE
The (smooth or analytic) Cauchy data for the $k$th order quasi-linear PDE 9.2.12(1) consist of a hypersurface $\Gamma \subset U$ and $k$ (smooth or analytic) functions $c_j$, $0 \leq j \leq k-1$, prescribing the derivatives in the normal directions to $\Gamma$
(2) $(D^1_{\nu(x)})^j u(x) = c_j(x), \quad x \in \Gamma.$
A normal direction $\nu(x)$, $x \in \Gamma$, is called characteristic for the given Cauchy data, if
(3) $\sum_{|\alpha|=k} a_\alpha(x,u,\dots,\nabla^{k-1}u)\, \nu(x)^\alpha = 0.$
The Cauchy data are called non-characteristic if there are no characteristic normals to $\Gamma$.
Notice the situation simplifies for the semi-linear equations. Then the characteristic directions do not depend on the chosen functions $c_j$ from the Cauchy data and they are directly related to the properties of the principal symbol of the equation.
For instance, semi-linear equations of first order always admit characteristic directions since their principal symbols are linear forms and so they must have non-trivial kernels (hyperplanes of characteristic directions). In the three second order examples of the Laplace equation, diffusion equation, and wave equation very different phenomena occur. Since the symbol of the Laplace equation is a positive definite quadratic form, characteristic directions can never appear, independently of our choice of $\Gamma$. On the contrary, there are always non-trivial characteristic directions in the other two cases.
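The three cases can be illustrated by evaluating the principal symbols from 9.2.12 on sample unit normals, as in the following sketch. The function names, the tolerance and the sampled directions are our own illustrative choices; the test simply asks whether the symbol vanishes on the given normal.

```python
import math

def laplace_symbol(xi, eta):
    return xi**2 + eta**2      # positive definite

def diffusion_symbol(tau, xi, eta):
    return xi**2 + eta**2      # positive semi-definite, tau does not enter

def wave_symbol(tau, xi):
    return tau**2 - xi**2      # indefinite

def is_characteristic(symbol, normal, tol=1e-12):
    # A normal is characteristic if the principal symbol vanishes on it.
    return abs(symbol(*normal)) < tol

# No unit normal in the plane is characteristic for the Laplace equation.
laplace_hits = [
    k for k in range(360)
    if is_characteristic(laplace_symbol, (math.cos(math.radians(k)), math.sin(math.radians(k))))
]
s = 1 / math.sqrt(2)
print(laplace_hits)
print(is_characteristic(diffusion_symbol, (1.0, 0.0, 0.0)))  # normal to the hyperplanes t = const
print(is_characteristic(wave_symbol, (s, s)), is_characteristic(wave_symbol, (1.0, 0.0)))
```

In particular, the hyperplanes $t = \text{const}$ are characteristic for the diffusion equation, while for the wave equation the characteristic normals are the diagonal directions $\tau = \pm\xi$.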
Characteristic cones of semi-linear PDEs
The characteristic directions of a semi-linear PDE on a domain $\Omega \subset \mathbb{R}^n$ generate the characteristic cone $C(x) \subset T_x\Omega$ in the tangent bundle,
$C(x) = \{\xi \in T_x\Omega;\ P(x)(\xi,\dots,\xi) = 0\}.$
The Cauchy data on a hypersurface $\Gamma$ are non-characteristic if and only if $(T\Gamma)^\perp \cap C = \{0\}$, i.e. the orthogonal complements to the tangent spaces to $\Gamma$ with respect to the standard scalar product on $\mathbb{R}^n$ never meet the characteristic cone.
Notice, cones for linear forms are hyperplanes in the tangent space, quadratic cones appear with second order, etc. As expected, the tangent vectors to characteristics of the first order quasilinear equations are characteristic in our new sense again. We have learned that the first order equations propagate the solutions along the characteristic lines and so we have to expect limitations on the possible Cauchy data along the characteristic cones again.
9.2.14. Cauchy-Kovalevskaya Theorem. As seen so many times already, the analytic mappings are very rigid and most questions related to them boil down to some estimates and smart combinatorial problems. It is time to recall what happens for analytic equations and Cauchy data in the very special case of the ODEs.
For a single scalar autonomous ODE of first order, the Cauchy data consist of a single point "hypersurface" $\Gamma = \{x\}$ in $\Omega \subset \mathbb{R}$ and the value $u(x)$. In particular, the Cauchy data are always non-characteristic in dimension one. Already in 6.2.15 we gave a complete proof that the induced derivatives of $u$ provide a converging power series and thus the only solution, on a certain neighborhood of $x$. In 8.3.13 we extended the same proof to autonomous systems of ODEs, which verified the same phenomenon for general systems of ODEs of any order $k$. Here the Cauchy data consist of one single point in $\Gamma$ and all derivatives of the (vector) curve $u$ of orders less than $k$ (and again, they are always non-characteristic).
In subsequent paragraphs we shall comment on how to extend the ODE proof to the following very famous theorem.
Cauchy-Kovalevskaya theorem
Theorem. The analytic Cauchy problem consisting of the quasi-linear equation 9.2.12(1) with analytic coefficients and right hand side, and analytic non-characteristic Cauchy data 9.2.13(2), has got a unique analytic solution on a neighborhood of each point in $\Gamma$.
Notice that we have computed explicitly the formal power series for the solution (by an inductive procedure) for the special case when $\Gamma$ is defined by $x_n = 0$. In this case, the theorem claims that this formal series always converges with a non-trivial radius of convergence.
The full proof is very technical and we do not have space to bother the readers with all details. In the next paragraphs, we shall provide indications toward the steps in the proof. If the track (or interest) is lost, the reader should rather jump to 9.2.18.
9.2.15. Flattening the non-characteristic data. The first step in the proof is to transform the non-characteristic data to the "flat" hypersurface $\Gamma$ discussed in the beginning of 9.2.13. Recall that for such $\Gamma$ the non-characteristic condition in 9.2.13(3) reads $a_{(0,\dots,0,k)} \neq 0$.
Let us start with the general equation and its analytic Cauchy data on an analytic hypersurface $\Gamma \subset \mathbb{R}^n$ (we omit the arguments of all the functions, and $\ell = 0, \dots, k-1$)
(1)   $\sum_{|\alpha|=k} a_\alpha \partial^\alpha u = b, \qquad \frac{\partial^\ell u}{\partial \nu^\ell}(x) = c_\ell(x), \quad x \in \Gamma.$
We shall work locally around some unspecified fixed point in $\Gamma$. Since $\Gamma$ is an analytic hypersurface in $\mathbb{R}^n$, there are new local coordinates $y = \Psi(x)$ such that
$\Gamma = \{x;\ \Psi_n(x) = 0\}.$
Moreover, $\Psi$ can again be chosen analytic. Thus, the unit normal vector $\nu$ to $\Gamma$ equals $\nabla\Psi_n$, up to a multiple $\mu^{-1}$, at each point of $\Gamma$, i.e. $\nabla\Psi_n = \mu^{-1}\nu$.
Let $\Phi = \Psi^{-1}$, i.e. $x = \Phi(y)$, and $v(y) = u(\Phi(y))$. Then $\Phi$ is analytic and the equation transforms to another equation in the coordinates $y$,
(2)   $\sum_{|\alpha|=k} \tilde a_\alpha \partial^\alpha v = \tilde b$
with analytic coefficients (they can all be expressed by means of the chain rule and the mutually inverse transformations $\Phi$, $\Psi$). By the very definition, $\frac{\partial \Phi}{\partial y_n}$ is a vector (the last column in the matrix $D^1\Phi$) perpendicular to $\Gamma$, and thus it must be $\mu\nu$ (recall that the product of the Jacobi matrices $D^1\Psi \cdot D^1\Phi$ is the identity matrix and the rows $\nabla\Psi_j$, $j = 1, \dots, n-1$, generate the tangent space $T\Gamma$).
Claim 1. The transformed Cauchy data for the equation (2) are analytic.
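The hypersurface $\Gamma$ given by $y_n = 0$, as well as the coefficients of the equation, are analytic. Compute $\frac{\partial^j v}{\partial y_n^j}$ on $\Gamma$:
$v = c_0 \circ \Phi,$
$\frac{\partial v}{\partial y_n} = \nabla u \cdot \frac{\partial \Phi}{\partial y_n} = \nabla u \cdot \mu\nu = \mu \frac{\partial u}{\partial \nu} = \mu c_1,$
$\frac{\partial^2 v}{\partial y_n^2} = \frac{\partial \mu}{\partial y_n}\frac{\partial u}{\partial \nu} + \mu^2 \frac{\partial^2 u}{\partial \nu^2} = \frac{\partial \mu}{\partial y_n} c_1 + \mu^2 c_2,$
$\frac{\partial^3 v}{\partial y_n^3} = \frac{\partial^2 \mu}{\partial y_n^2}\frac{\partial u}{\partial \nu} + 3\mu\frac{\partial \mu}{\partial y_n}\frac{\partial^2 u}{\partial \nu^2} + \mu^3\frac{\partial^3 u}{\partial \nu^3} = \frac{\partial^2 \mu}{\partial y_n^2} c_1 + 3\mu\frac{\partial \mu}{\partial y_n} c_2 + \mu^3 c_3, \quad \dots$
Inductively, we see that the transformed functions $\tilde c_j$ are obtained in an analytic way from the functions $c_i$, $i = 0, \dots, j$.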
Claim 2. The Cauchy data for the equation (1) are non-characteristic if and only if the transformed Cauchy data for (2) are non-characteristic, i.e. $\tilde a_{(0,\dots,0,k)} \neq 0$.
Compute using the chain rule (recall that $\nabla\Psi_n$ is a vector, the gradient of the last coordinate function of $\Psi$, and it is equal to $\nu$ up to the non-zero multiple $\mu^{-1}$)
$\partial^\alpha u = \mu^{-k}\nu^\alpha \frac{\partial^k v}{\partial y_n^k} + \text{terms of lower order in } \frac{\partial}{\partial y_n}.$
Substitute into (1):
$\sum_{|\alpha|=k} a_\alpha \partial^\alpha u = \mu^{-k} \sum_{|\alpha|=k} a_\alpha \nu^\alpha \frac{\partial^k v}{\partial y_n^k} + \dots$
We have computed
$\tilde a_{(0,\dots,0,k)} = \mu^{-k} \sum_{|\alpha|=k} a_\alpha \nu^\alpha,$
which is non-zero if and only if the original Cauchy data were non-characteristic.
Since we have already verified that all the partial derivatives of $v$ along $\Gamma$ can be computed for non-characteristic Cauchy data on the flat hypersurface $\Gamma$, we have actually proved the following claim.
Proposition. The Cauchy data for (1) allow us to compute all partial derivatives of its solution $u$ along the hypersurface $\Gamma$ if and only if the data are non-characteristic.
9.2.16. Reduction to a first-order system. Without loss of generality, we may consider only the Cauchy data of the form discussed in 9.2.13, i.e. the quasi-linear equation on a domain in $\mathbb{R}^n$ is
(1)   $\frac{\partial^k}{\partial x_n^k} u = \sum_{|\alpha|=k,\ \alpha_n < k} a_\alpha \partial^\alpha u + b$
and $\Gamma$ is given by $x_n = 0$ with prescribed normal derivatives $c_0, \dots, c_{k-1}$. Moreover, we can subtract suitable fixed functions from $u$ in order to transform the equation into a new one of the same shape and with all the Cauchy data $c_j$ vanishing on $\Gamma$. Indeed, start with $v = u - c_0$. This transforms the equation and the Cauchy data so that the new $c_0 = 0$. If we have already killed the functions $c_0, \dots, c_{i-1}$, then we may subtract $g(x_1, \dots, x_n) = \frac{1}{i!} x_n^i c_i(x_1, \dots, x_{n-1})$, which kills the next one.
The final reduction step is to introduce new functions for all components of the vector $(u, \nabla u, \dots, \nabla^{k-1}u)$. Write $v_1, \dots, v_N$ for all these functions, and add one more function $v_0(x) = x_n$. Then we can rewrite our equation (1) as a system of quasi-linear equations of first order on the vector function $v = (v_0, \dots, v_N)$,
(2)   $\frac{\partial}{\partial x_n} v_s = \sum_{0 \le r \le N} \sum_{i=1}^{n-1} a_{sri} \frac{\partial}{\partial x_i} v_r + b_s, \qquad s = 0, \dots, N,$
where all the coefficients $a_{sri}$, $b_s$ are functions of $x_1, \dots, x_{n-1}, v_0, \dots, v_N$, and the boundary condition on $\Gamma = \{x_n = 0\}$ is $v|_\Gamma = 0$.
Notice two important facts: the coefficients do not depend on $x_n$, and all derivatives on the left-hand side are $\frac{\partial}{\partial x_n}$. This is a technicality which makes the problem similar to autonomous systems of ODEs.
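The principle is obvious from a simple example. Consider the second-order equation with coefficients $a_{xx}$, $a_{xt}$, $b$
$u_{tt} = a_{xt} u_{xt} + a_{xx} u_{xx} + b$
and view $t, u, u_t, u_x$ as unknown functions. Then our equation is rewritten as the system of four equations
$\frac{\partial}{\partial t} t = 1, \qquad \frac{\partial}{\partial t} u = u_t, \qquad \frac{\partial}{\partial t} u_x = \frac{\partial}{\partial x} u_t,$
$\frac{\partial}{\partial t} u_t = a_{xt} \frac{\partial}{\partial x} u_t + a_{xx} \frac{\partial}{\partial x} u_x + b$
with the boundary condition $t = u = u_t = u_x = 0$ on the line $t = 0$.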
9.2.17. The majorant. The rest of the proof is very technical, but it is based on the straightforward idea of a majorant for the Cauchy problem 9.2.16(2). Recall that we used this method in 8.3.13 when proving the ODE version of the Cauchy-Kovalevskaya theorem.
We shall not go into much detail and only indicate how to generalize the method from 8.3.13. The first step is easy: we have already seen that all derivatives of the solution vector $v$ at a fixed point $x$ in $\Gamma$ are computed by the chain rule from the Cauchy data. This proves the uniqueness of the analytic solution, if such a solution exists.
Again, it turns out that the derivatives are given by universal polynomials in the derivatives of the coefficients $a_{sri}$, $b_s$ of the system 9.2.16(2), and the polynomials have non-negative real coefficients. The reader may easily fill in the details as an exercise. Now, a majorant of the coefficients very similar to the one in the ODE case can be chosen. First, the analyticity of the coefficients ensures the existence of some suitably small $r > 0$ and a (perhaps big) constant $C$ such that $\frac{1}{|\alpha|!}|\partial^\alpha a_{sri}|\, r^{|\alpha|} \le C$, $\frac{1}{|\alpha|!}|\partial^\alpha b_s|\, r^{|\alpha|} \le C$, and thus
$|\partial^\alpha a_{sri}| \le C |\alpha|!\, r^{-|\alpha|}, \qquad |\partial^\alpha b_s| \le C |\alpha|!\, r^{-|\alpha|},$
for all coefficients and multi-indices $\alpha$.
In particular, all the coefficients can be majorized by the function
$h(x_1, \dots, x_{n-1}, v_0, \dots, v_N) = \frac{Cr}{r - \sum_{j=1}^{n-1} x_j - \sum_{s=0}^{N} v_s}.$
Now, the majorizing system for the vector $(V_0, \dots, V_N)$ is
$\frac{\partial}{\partial x_n} V_s = \sum_{0 \le r \le N} \sum_{i=1}^{n-1} h \frac{\partial}{\partial x_i} V_r + h, \qquad s = 0, \dots, N.$
Since the coefficients are completely symmetric in the variables, let us look for the solution in the form
$V_0 = \dots = V_N = W(x_1 + \dots + x_{n-1}, x_n),$
i.e. $W$ is a real function of two variables, say $W(t, y)$. Substituting into the system we arrive at the quasi-linear first-order PDE
$\frac{\partial W}{\partial y} = \frac{Cr}{r - t - N W(t, y)} \left( N(n-1) \frac{\partial W}{\partial t} + 1 \right)$
with the boundary condition $W(t, 0) = 0$. The reader may find the solution (e.g. by the method of characteristics)
$W(t, y) = \frac{1}{nN} \left( r - t - \sqrt{(r-t)^2 - 2nNCry} \right),$
a real analytic function on a neighborhood of the origin. This concludes the proof.
The reader may find a detailed exposition in many basic books; see for instance Chapter 1 of the book "Introduction to Partial Differential Equations" by Gerald B. Folland, Princeton, 1995.
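The closed formula for the majorant can be sanity-checked numerically. The following is a minimal sketch, assuming the PDE $W_y = \frac{Cr}{r - t - NW}(N(n-1)W_t + 1)$ with $W(t,0) = 0$ and the candidate solution $W = \frac{1}{nN}(r - t - \sqrt{(r-t)^2 - 2nNCry})$; the parameter values $r = C = 1$, $n = 3$, $N = 2$, the step `h` and the function names are illustrative choices, and the derivatives are approximated by central differences.

```python
import math

def W(t, y, r=1.0, C=1.0, n=3, N=2):
    """Candidate majorant solution (parameter values are illustrative)."""
    disc = (r - t) ** 2 - 2.0 * n * N * C * r * y
    return (r - t - math.sqrt(disc)) / (n * N)

def pde_residual(t, y, r=1.0, C=1.0, n=3, N=2, h=1e-5):
    """W_y - Cr/(r - t - N W) * (N (n-1) W_t + 1), via central differences."""
    w = W(t, y, r, C, n, N)
    w_t = (W(t + h, y, r, C, n, N) - W(t - h, y, r, C, n, N)) / (2 * h)
    w_y = (W(t, y + h, r, C, n, N) - W(t, y - h, r, C, n, N)) / (2 * h)
    rhs = C * r / (r - t - N * w) * (N * (n - 1) * w_t + 1.0)
    return w_y - rhs

# Sample a few points close to the origin, where the square root is real.
residuals = [abs(pde_residual(t, y))
             for t in (0.0, 0.05, 0.1)
             for y in (0.0, 0.01, 0.02)]
print(max(residuals))
```

The residuals should be at the level of the finite-difference error, which supports the formula but of course does not replace the computation by the method of characteristics.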
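9.2.18. Back to second order equations. Let us finish our brief excursion to the PDE world with more detailed comments on second-order quasi-linear equations. We shall deal with scalar PDEs on a domain $\Omega \subset \mathbb{R}^n$ with one of the usual boundary conditions.
Thus consider a general linear operator
(1)   $L = \sum_{i,j} a_{ij} \frac{\partial^2}{\partial x_i \partial x_j} + \sum_i a_i \frac{\partial}{\partial x_i} + a$
written in coordinates $(x_1, \dots, x_n)$. If we consider any other coordinate system $y = \Phi(x)$, then the chain rule determines how to transform the equation $Lu = f$ in coordinates $x$ into $\tilde L v = f$, where $u(x) = v(\Phi(x))$. Write $\nabla = (\frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_n})$ for the gradient operator, dot for the standard scalar product of vectors, $D^1\Phi$ for the Jacobi matrix of $\Phi$ in the coordinates $y$ and $x$, and $\frac{\partial \Phi}{\partial x_i}$ for the $i$th column of $D^1\Phi$. Then
$\frac{\partial}{\partial x_i} = \sum_k \frac{\partial \Phi_k}{\partial x_i} \frac{\partial}{\partial y_k},$
$\frac{\partial^2}{\partial x_i \partial x_j} = \sum_{k,\ell} \frac{\partial \Phi_k}{\partial x_i} \frac{\partial \Phi_\ell}{\partial x_j} \frac{\partial^2}{\partial y_k \partial y_\ell} + \sum_k \frac{\partial^2 \Phi_k}{\partial x_i \partial x_j} \frac{\partial}{\partial y_k}.$
Thus, the operator (1) transforms into
$\tilde L = \sum_{k,\ell} \Big( \sum_{i,j} a_{ij} \frac{\partial \Phi_k}{\partial x_i} \frac{\partial \Phi_\ell}{\partial x_j} \Big) \frac{\partial^2}{\partial y_k \partial y_\ell} + \sum_k \Big( \sum_{i,j} a_{ij} \frac{\partial^2 \Phi_k}{\partial x_i \partial x_j} + \sum_i a_i \frac{\partial \Phi_k}{\partial x_i} \Big) \frac{\partial}{\partial y_k} + a.$
In particular, the principal symbol transforms pointwise as a quadratic form under the linearized transformation $D^1\Phi$.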
As we know from linear algebra and geometry, the global behavior of real quadratic forms is classified by their signature, i.e. the numbers of positive and negative entries in the diagonalized matrix, cf. 4.2.6. This leads to the following
Classification of 2nd order quasi-linear PDEs
Consider a second order quasi-linear operator (1) with the principal symbol Q. The equations Lu = f and the operator L are called
• elliptic if Q is either positive or negative definite
• hyperbolic if Q has the signature $(n-1, 1)$ (or equivalently $(1, n-1)$)
• parabolic if Q is positive semidefinite with rank $n-1$ and the equation can be rewritten as $\frac{\partial}{\partial t}u = \tilde L(u)$, where the principal symbol of $\tilde L$ depends on the remaining variables only.
Notice that we actually did not include all possibilities in the above list. We omitted the ultra-hyperbolic case, where the rank of Q is maximal, with the remaining possibilities of signatures. Further, the parabolic equations could appear with the minus sign at $\frac{\partial}{\partial t}$ (the so-called "backwards parabolic equations"), and the rank of Q could also drop by more than one. Most of this cannot be seen in low dimensions.
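In two variables the signature can be read off from the determinant of the symmetric coefficient matrix of the principal symbol. The following is a minimal sketch for operators with principal part $a_{11}u_{xx} + 2a_{12}u_{xy} + a_{22}u_{yy}$; the function name `classify_2d` and the tolerance `tol` are illustrative, and the label "parabolic" here only records that the symbol has rank one, without checking the additional normal form required in the definition above.

```python
def classify_2d(a11, a12, a22, tol=1e-12):
    """Classify a11*u_xx + 2*a12*u_xy + a22*u_yy + (lower order terms).

    The sign of det = a11*a22 - a12^2 determines the signature of the
    principal symbol Q in dimension two.
    """
    det = a11 * a22 - a12 * a12
    if det > tol:
        return "elliptic"      # eigenvalues of the same sign: Q is definite
    if det < -tol:
        return "hyperbolic"    # eigenvalues of opposite signs: signature (1, 1)
    if abs(a11) + abs(a22) > tol:
        return "parabolic"     # rank one (normal form not checked here)
    return "degenerate"        # no second-order part at all

print(classify_2d(1.0, 0.0, 1.0))    # Laplace operator u_xx + u_yy
print(classify_2d(1.0, 0.0, -4.0))   # wave operator u_tt - 4 u_xx
print(classify_2d(0.0, 0.0, -1.0))   # second-order part of u_t - u_xx
```

In higher dimensions one would instead count the signs of the eigenvalues of the symmetric matrix $(a_{ij})$.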
If the coefficients $a_{ij}$, $a_i$ and $a$ are constant, then the principal symbol Q is a constant quadratic form, too, and we may choose just linear transformations $T = D^1\Phi$ instead of a general $\Phi$. This will always allow us to transform the quadratic form to its canonical form over the entire domain $\Omega$.
Many particular examples are discussed explicitly in the other column, see ??, in particular in low dimensions.
Let us also stress that while in dimension n = 2 we may even locally integrate the necessary linearized transformations into a genuine mapping $\Phi$ and get the canonical forms of the equations even in a more general context, see ??, this is mostly not possible in higher dimensions.
Before coming to a few of the most important examples, let us look at the characteristic directions in the individual cases. If L is elliptic, then of course there cannot be any characteristic direction. So the (local) Cauchy problem prescribing the analytic value and first normal derivative along any analytic hypersurface $\Gamma$ will have a locally converging analytic solution around any fixed point. Unfortunately, we have already seen that this is not a well-posed problem even for the two-dimensional Laplace equation, cf. 9.2.11.
On the contrary, there will be an $(n-1)$-dimensional cone of characteristic directions at each point for a hyperbolic equation, while the parabolic equations come equipped with a line of characteristic directions.
9.2.19. The wave equation. The wave operator in dimension n is
(1)   $L = \frac{\partial^2}{\partial t^2} - c^2 \Delta,$
where $\Delta$ is the Laplace operator $\Delta = \frac{\partial^2}{\partial x_1^2} + \dots + \frac{\partial^2}{\partial x_n^2}$ and $c^2 > 0$ is a real constant. The operator L lives on domains in $\mathbb{R}^{n+1}$.
Let us first return to the 2D wave equation $u_{tt} = c^2 u_{xx}$. We know the general solution
$u(x,t) = f(x - ct) + g(x + ct),$
cf. 9.2.6, the superposition of the forward and backward waves with the initial shapes $u_1(x, 0) = f(x)$ and $u_2(x, 0) = g(x)$.
This perfectly matches our general expectation from the Cauchy-Kovalevskaya theorem that general solutions to quasi-linear second-order equations in two variables should depend on two real functions of a single variable. Moreover, the characteristic directions are $x \pm ct = 0$ and thus the line $\Gamma = \{t = 0\}$ is non-characteristic.
Given the Cauchy boundary data $u(x,0) = \varphi(x)$, $\frac{\partial}{\partial t}u(x, 0) = \psi(x)$ and substituting the general solution, we arrive at
$\varphi(x) = f(x) + g(x), \qquad \psi(x) = -c f'(x) + c g'(x),$
where the dash stands for the derivative of a function of a single variable, as usual. Thus, for any $s_0$ and $s$,
$\frac{1}{c} \int_{s_0}^{s} \psi(x)\,dx = -f(s) + g(s) + f(s_0) - g(s_0).$
Add and subtract the equations:
$2g(s) = \varphi(s) + \frac{1}{c}\int_{s_0}^{s} \psi(x)\,dx - f(s_0) + g(s_0),$
$2f(s) = \varphi(s) - \frac{1}{c}\int_{s_0}^{s} \psi(x)\,dx + f(s_0) - g(s_0).$
Finally, substitute $s = x - ct$ into $f(s)$, and $s = x + ct$ into $g(s)$ in order to get the value of $u(x, t)$ (notice the integrals add nicely, while the constants depending on the choice of $s_0$ cancel):
(2)   $u(x,t) = \frac{1}{2}\big(\varphi(x - ct) + \varphi(x + ct)\big) + \frac{1}{2c}\int_{x-ct}^{x+ct} \psi(y)\,dy.$
This solution is often called d'Alembert's solution.
The formula also reveals the continuous dependence of the solution on the boundary conditions. We may conclude that the Cauchy problem seems to be the right boundary value problem for the wave equation (although we have fully proved that it is well-posed only in the dimension two and the analytic category).
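D'Alembert's formula is easy to evaluate numerically. The following minimal sketch approximates the integral of $\psi$ by the trapezoidal rule; the function name `dalembert`, the number of integration steps and the sample data $\varphi = \sin$, $\psi = \cos$, $c = 2$ are illustrative assumptions. For these data the formula can be evaluated in closed form, $u(x,t) = \sin x \cos ct + \frac{1}{c}\cos x \sin ct$, which gives a reference value.

```python
import math

def dalembert(phi, psi, c, x, t, steps=2000):
    """Evaluate d'Alembert's formula; the integral uses the trapezoidal rule."""
    a, b = x - c * t, x + c * t
    h = (b - a) / steps
    integral = 0.5 * (psi(a) + psi(b)) + sum(psi(a + i * h) for i in range(1, steps))
    integral *= h
    return 0.5 * (phi(a) + phi(b)) + integral / (2.0 * c)

c = 2.0
x, t = 0.3, 0.7
numeric = dalembert(math.sin, math.cos, c, x, t)
exact = math.sin(x) * math.cos(c * t) + math.cos(x) * math.sin(c * t) / c
print(numeric, exact)
```

At $t = 0$ the integration interval degenerates to a point and the function returns $\varphi(x)$, in accordance with the initial condition.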
In higher dimensions, the situation is much more complicated. One useful option is to employ the method of separation of variables, but splitting only the time and space variables. Consider the n-dimensional wave equation and look for the solution in the form $u(x,t) = F(x)T(t)$ (now $x \in \mathbb{R}^n$, $t \in \mathbb{R}$). Plugging this into (1) and playing the separation method game, we arrive at two equations
$\Delta F + \alpha F = 0, \qquad T'' + \alpha c^2 T = 0,$
where $\alpha$ is the separation constant (usually we consider either $\alpha^2$ or $-\beta^2$ to fix the types of the equations). The first equation is called the Helmholtz equation and we shall come back to it below.
9.2.20. The diffusion equation. In general dimension n the diffusion operator is
(1)   $L = \frac{\partial}{\partial t} - \kappa\Delta,$
with $\kappa > 0$; the diffusion equation is considered on domains in $\mathbb{R}^n \times \mathbb{R}$.
Again, let us have a look at the simplest 1D diffusion equation $u_t = \kappa u_{xx}$. It describes the diffusion process in time in a one-dimensional object with diffusivity $\kappa$ (assumed to be constant here). First of all, let us notice that the usual boundary value prescription of the state at time $t = 0$ does not match the assumptions of the Cauchy-Kovalevskaya theorem. Indeed, taking $\Gamma = \{t = 0\}$, the normal direction vector $\frac{\partial}{\partial t}$ is characteristic.
The intuition about diffusion problems suggests that Dirichlet boundary data should suffice (we just need the initial state, and the diffusion then does the rest), or we can combine them with some Neumann data (if we supply heat at some parts of the boundary). Moreover, the process should not be reversible in time, so we should not expect the solution to extend across the line $t = 0$.
Let us look at a classical example considered already by Kovalevskaya. Posit
$u(0,x) = g(x) = \frac{1}{1 + x^2}$
on a neighborhood of the origin (perfectly analytic boundary data and equation), and look for a solution u of $u_t = u_{xx}$ in the form $u(t, x) = \sum_{k,\ell \ge 0} c_{k,\ell} \frac{t^k}{k!} \frac{x^\ell}{\ell!}$.
The equation obviously implies the relations $c_{k+1,\ell} = c_{k,\ell+2}$ for all $k$, $\ell$. Further, the power series $(1 + x^2)^{-1} = \sum_{\ell} (-1)^\ell x^{2\ell}$ is obtained from the geometric power series with argument $-x$, with $x^2$ substituted for $x$ in the end.
Thus, for all $\ell$, $c_{0,2\ell+1} = 0$, $c_{0,2\ell} = (-1)^\ell (2\ell)!$. By the recurrence, $c_{k,2\ell} = (-1)^{k+\ell} (2(k+\ell))!$.
This growth is too quick for a converging power series. For example, looking at the terms
$\frac{1}{k!\,(2k)!} c_{k,2k} = \frac{(4k)!}{k!\,(2k)!},$
they grow towards infinity as fast as the expression $e^{-k} k^k 8^{2k}$ (up to a factor of order $k^{-1/2}$), by the Stirling formula for the factorial (cf. 6.2.17).
We have learned that there cannot be any analytic solution to our Dirichlet boundary problem at all. This example also shows the relevance of all assumptions in the Cauchy-Kovalevskaya theorem.
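The growth of the coefficients can be observed directly with exact integer arithmetic. The following is a minimal sketch, assuming the initial condition $g(x) = 1/(1+x^2)$ and the recurrence $c_{k+1,\ell} = c_{k,\ell+2}$ from the example; the function names `c` and `term` are illustrative. If the double series had a positive radius of convergence, the $k$-th roots of the coefficients of $t^k x^{2k}$ would stay bounded, while here they keep increasing.

```python
from math import factorial

def c(k, l):
    """c_{k,l}: c_{0,l} comes from 1/(1+x^2), then c_{k+1,l} = c_{k,l+2}."""
    if k == 0:
        if l % 2 == 1:
            return 0
        return (-1) ** (l // 2) * factorial(l)
    return c(k - 1, l + 2)

def term(k):
    """Coefficient of t^k x^(2k) in the formal series: c_{k,2k} / (k! (2k)!)."""
    return c(k, 2 * k) // (factorial(k) * factorial(2 * k))

for k in range(1, 8):
    # k-th root of the coefficient; bounded roots would be needed for convergence
    print(k, term(k), term(k) ** (1.0 / k))
```

The integer division in `term` is exact, since $(4k)!/(k!\,(2k)!)$ is an integer.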
9.2.21. Diffusion via Fourier transform. Fortunately, another straightforward method helps us to solve the simplest diffusion equation with Dirichlet data. Let us assume $u(x, t)$ is a solution of $u_t = \kappa u_{xx}$, $u(x,0) = \varphi(x)$. Recall that the Fourier transform (with respect to $x$) transfers the differentiation $\frac{\partial}{\partial x}$ to algebraic multiplication by $i\xi$, while the other variable $t$ remains as a parameter.
Thus, the Fourier image $\hat u(\xi, t)$ must obey
$\hat u_t = -\kappa \xi^2 \hat u, \qquad \hat u(\xi, 0) = \hat\varphi.$
This is quite a simple ODE problem with the general solution
$\hat u(\xi, t) = C(\xi)\, e^{-\kappa \xi^2 t},$
while the initial condition implies that the integration constant is just $C(\xi) = \hat\varphi$.
Now remember the relation between the Fourier transform and convolution, 7.2.7 at page 478. The image of the convolution is the product of the images, up to the factor $\sqrt{2\pi}$. Thus we shall immediately write down the solution $u$ with $t > 0$, once we find the inverse Fourier image of the Gaussian $f(\xi) = e^{-\kappa t \xi^2}$. But Fourier images $\mathcal F(f)$ of Gaussians $f$ are again Gaussians, up to a constant, see ??,
$\mathcal F(e^{-a x^2})(\xi) = \frac{1}{\sqrt{2a}}\, e^{-\frac{\xi^2}{4a}}$
with any real constant $a > 0$. Thus, we can write for $t > 0$
$\hat u(\xi, t) = \hat\varphi(\xi)\, \mathcal F\Big( \frac{1}{\sqrt{2\kappa t}}\, e^{-\frac{x^2}{4\kappa t}} \Big)(\xi).$
Finally, we obtain $u$ as the convolution of the initial condition with the so-called heat kernel function,
$u(x,t) = \frac{1}{\sqrt{4\pi\kappa t}} \int_{-\infty}^{\infty} e^{-\frac{(x-y)^2}{4\kappa t}} \varphi(y)\,dy.$
Obviously, the solution depends continuously on the boundary condition. We may imagine that it models the dynamics of temperature in an infinite homogeneous bar, with some initial distribution of temperature and no losses or gains of energy in time.
Let us also observe the behavior of the solution for $t$ close to zero. As mentioned in 7.2.9, the Gaussians with variance converging to zero are a good approximation of the so-called Dirac delta function, and indeed the limit of the convolution for $t \to 0_+$ is exactly the function $\varphi$, as expected.
We shall come back to such convolution based principles a few pages later, after investigating simpler methods.
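The convolution with the heat kernel can be tested on a case with a known answer. The following is a minimal sketch, assuming the initial condition $\varphi(y) = e^{-y^2}$, for which the convolution of two Gaussians gives $u(x,t) = (1 + 4\kappa t)^{-1/2} e^{-x^2/(1+4\kappa t)}$; the function names, the truncation of the integral to a finite interval and the trapezoidal rule are illustrative choices.

```python
import math

def heat_kernel(x, t, kappa):
    """1D heat kernel exp(-x^2 / (4 kappa t)) / sqrt(4 pi kappa t), for t > 0."""
    return math.exp(-x * x / (4.0 * kappa * t)) / math.sqrt(4.0 * math.pi * kappa * t)

def solve_heat(phi, x, t, kappa, half_width=20.0, steps=8000):
    """Trapezoidal approximation of the convolution of the kernel with phi,
    with the integral truncated to [x - half_width, x + half_width]."""
    a = x - half_width
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        y = a + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * heat_kernel(x - y, t, kappa) * phi(y)
    return total * h

phi = lambda y: math.exp(-y * y)
kappa, x, t = 0.5, 0.5, 1.0
numeric = solve_heat(phi, x, t, kappa)
exact = math.exp(-x * x / (1.0 + 4.0 * kappa * t)) / math.sqrt(1.0 + 4.0 * kappa * t)
print(numeric, exact)

# For small t the solution stays close to the initial condition.
print(solve_heat(phi, x, 0.01, kappa), phi(x))
```

The second printout illustrates the limit $t \to 0_+$ discussed above: the kernel concentrates around $y = x$ and the convolution approaches $\varphi(x)$.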
9.2.22. Superposition of the solutions. A general idea for solving boundary value problems is to take a good supply of general solutions and try to form linear combinations of them, even of infinitely many. This means we consider the solution in the form of a series. The type of the series is governed by the available solutions.
Let us illustrate the method on the diffusion equation discussed above. Imagine we want to model the temperature of a homogeneous bar of length $d$. Initially, at time $t = 0$, the temperature at all points $x$ is zero. At one of its ends we keep the temperature zero, while the other end is heated with some constant intensity. Set the bar as the interval $x \in [0, d] \subset \mathbb{R}$, and the domain $\Omega = [0, d] \times [0, \infty)$. Our boundary problem is
(1)   $u_t = \kappa u_{xx}, \quad u(x, 0) = 0, \quad u(0, t) = 0, \quad \frac{\partial}{\partial x}u(d, t) = p,$
where $p$ is a constant representing the effect of the heating. The idea is to exploit the general solutions
$u(x, t) = (A\cos\alpha x + B\sin\alpha x)\, e^{-\alpha^2 \kappa t}$
from 9.2.6 with free parameters $\alpha$, $A$, and $B$. We want to consider a superposition of such solutions with properly chosen parameters and get the solution to our boundary problem in a form combining Fourier series terms with the exponentials. This approach is often called the Fourier method.
The condition $u(0,t) = 0$ suggests restricting ourselves to $A = 0$. Then $u_x(x, t) = B\alpha\cos(\alpha x)\, e^{-\alpha^2\kappa t}$. It seems difficult now to guess how to combine such solutions to get something constant in time, as the Neumann part of the boundary condition requests. But we can help ourselves with a small trick. There are some further obvious solutions to the equation: those with $u$ depending on the space coordinate only. We may consider
$v(x, t) = px$
and seek our solution in the form $u(x, t) + v(x, t)$. Then $u$ must again be a solution of the same diffusion equation (1), but the boundary conditions change to $u(x, 0) = -px$, $u(0,t) = 0$, $\frac{\partial}{\partial x}u(d,t) = 0$. Now, we want
$u_x(d, t) = B\alpha\cos(\alpha d)\, e^{-\alpha^2\kappa t} = 0,$
i.e. we should restrict to the frequencies $\alpha = \frac{n\pi}{2d}$ with odd positive integers $n$. This settles the second of the boundary conditions. The remaining one is $u(x, 0) = -px$, which sets the condition on the coefficients $B$ in the superposition
$\sum_{k \ge 0} B_{2k+1} \sin\Big(\frac{(2k+1)\pi x}{2d}\Big) = -px$
on the interval $x \in [0, d]$. This is a simple task of finding the Fourier series of the function $x$, which we handled in 7.1.10. Combining all this, we get the requested solution $u(x, t)$ to our problem:
$u(x,t) = px - \frac{8pd}{\pi^2} \sum_{k \ge 0} \frac{(-1)^k}{(2k+1)^2} \sin\Big(\frac{(2k+1)\pi x}{2d}\Big)\, e^{-\kappa \left(\frac{(2k+1)\pi}{2d}\right)^2 t}.$
Even though our supply of general solutions was not big, superposing countably many of them helped us to solve our problem. Notice the behavior at the heated end. If $t \to \infty$, then all the exponential terms in the sum vanish at least as fast as the very first one, the sine terms are bounded, and thus the entire component with the sum vanishes quite fast. Thus, for big $t$, the temperature profile is nearly linear, increasing towards the heated end with the slope $p$.
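The series solution can be checked against the boundary conditions by summing finitely many terms. The following is a minimal sketch, assuming the solution in the form $u(x,t) = px - \frac{8pd}{\pi^2}\sum_{k\ge 0}\frac{(-1)^k}{(2k+1)^2}\sin\big(\frac{(2k+1)\pi x}{2d}\big)e^{-\kappa((2k+1)\pi/(2d))^2 t}$; the function name `bar_temperature`, the values $p = d = \kappa = 1$ and the number of terms are illustrative.

```python
import math

def bar_temperature(x, t, p=1.0, d=1.0, kappa=1.0, terms=2000):
    """Partial sum of the series solution (parameter values are illustrative)."""
    total = 0.0
    for k in range(terms):
        lam = (2 * k + 1) * math.pi / (2.0 * d)   # admissible frequency alpha
        total += ((-1) ** k / (2 * k + 1) ** 2
                  * math.sin(lam * x) * math.exp(-kappa * lam * lam * t))
    return p * x - 8.0 * p * d / math.pi ** 2 * total

print(bar_temperature(0.5, 0.0))   # initial temperature, should be close to 0
print(bar_temperature(0.0, 0.3))   # the end kept at temperature zero
print(bar_temperature(1.0, 5.0))   # heated end after a long time, close to p*d
```

At $t = 0$ the partial sums converge only like the tail of $\sum 1/(2k+1)^2$, while for $t > 0$ the exponential factors make the convergence very fast.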
9.2.23. Separation in transformed coordinates. As we have seen several times, it is very useful to view a given equation as an independent object expressed in some particular coordinates. Practical problems mostly include some symmetries, and then we should like to find suitable coordinates in which the equation takes a simple form.
As an example, let us look at the Laplace operator $\Delta$ in the polar coordinates in the plane, and in the cylindrical or spherical coordinates in space. Writing as usual $x = r\cos\varphi$, $y = r\sin\varphi$ for the polar transformation, the Laplace operator gets the neat form
(1)   $\Delta = \frac{\partial^2}{\partial r^2} + \frac{1}{r^2}\frac{\partial^2}{\partial\varphi^2} + \frac{1}{r}\frac{\partial}{\partial r} = \frac{1}{r}\frac{\partial}{\partial r}\Big(r\frac{\partial}{\partial r}\Big) + \frac{1}{r^2}\frac{\partial^2}{\partial\varphi^2}.$
The reader should perform the tedious but straightforward computation. Similarly,
(2)   $\Delta = \frac{1}{r}\frac{\partial}{\partial r}\Big(r\frac{\partial}{\partial r}\Big) + \frac{1}{r^2}\frac{\partial^2}{\partial\varphi^2} + \frac{\partial^2}{\partial z^2},$
(3)   $\Delta = \frac{1}{r^2}\frac{\partial}{\partial r}\Big(r^2\frac{\partial}{\partial r}\Big) + \frac{1}{r^2\sin^2\psi}\frac{\partial^2}{\partial\varphi^2} + \frac{1}{r^2\sin\psi}\frac{\partial}{\partial\psi}\Big(\sin\psi\frac{\partial}{\partial\psi}\Big)$
in the cylindrical and spherical coordinates, respectively.
Let us illustrate the use on the following problem. Imagine a twisted circular drum, whose rim suffers a small vertical displacement. We should model the stabilized position of the drumskin.
Intuitively, we should describe the drumskin position by the 2D wave equation, but since we are interested in the state with $\frac{\partial u}{\partial t}$ vanishing, we actually take $u$ as the vertical displacement in the interior of the unit circle, $\Omega = \{x^2 + y^2 < 1\} \subset \mathbb{R}^2$, and request $\Delta u = 0$, subject to the Dirichlet boundary condition prescribing the vertical displacement $u(x,y) = g(x,y)$ of the rim.
Obviously, we want to consider the problem in the polar coordinates, where the boundary condition gets the neat form $u(1,\varphi) = g(\varphi)$. Say $g(\varphi) = \varepsilon\sin\varphi + \varepsilon\sin 5\varphi$ with some small constant $\varepsilon > 0$.
We shall apply the separation of variables method to these data. Looking for the solution in the form $u(r,\varphi) = R(r)\Phi(\varphi)$, the equation implies (after dividing by $R\Phi$)
$\frac{R''}{R} + \frac{1}{r}\frac{R'}{R} + \frac{1}{r^2}\frac{\Phi''}{\Phi} = 0.$
Thus, multiplying by $r^2$ and considering the separation constant $\alpha^2$, we arrive at two ODEs
$\Phi'' + \alpha^2\Phi = 0, \qquad r^2R'' + rR' - \alpha^2R = 0.$
Fortunately, they are both easy to solve. From the first equation (with $\alpha > 0$)
$\Phi(\varphi) = A\cos\alpha\varphi + B\sin\alpha\varphi,$
while the other equation transforms by $S(t) = R(\exp t)$ into an equation with constant coefficients, and its solution yields
$R(r) = Cr^\alpha + Dr^{-\alpha}.$
Actually, with $\alpha = 0$ the solution for $R$ is $R(r) = C\ln r + D$, while this case is included in the solution for $\Phi$ above.
In fact, we insist that the solution $u = R\Phi$ be a single-valued function in the plane, and so we can allow only integer values of $\alpha$, including $\alpha = 0$, when $\Phi$ becomes a constant (again, any non-zero multiple of $\varphi$ would lead to multi-valued solutions $u$). Thus the general solution of the Laplace equation coming from the separation of variables method and superposition is
(4)   $u(r, \varphi) = C_0\ln r + D_0 + \sum_{n=1}^{\infty} (A_n\cos n\varphi + B_n\sin n\varphi)(C_n r^n + D_n r^{-n}).$
In our problem, we clearly insist on having $u$ finite at the origin, and thus all $D_n$ and $C_0$ have to vanish. Now we can employ the boundary condition
$D_0 + \sum_{n=1}^{\infty} C_n(A_n\cos n\varphi + B_n\sin n\varphi) = \varepsilon\sin\varphi + \varepsilon\sin 5\varphi.$
This is a very simple case of a Fourier series, and we see immediately that all the coefficients have to vanish except for $B_1$ and $B_5$, and the requested solution is
$u(r, \varphi) = \varepsilon r\sin\varphi + \varepsilon r^5\sin 5\varphi.$
The higher the frequency of the twist of the rim, the slower the distortion develops towards the center of the drumskin. The method works for every boundary condition $u(1, \varphi) = g(\varphi)$, provided we are able to find its Fourier series.
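The drumskin solution can be verified numerically in Cartesian coordinates. The following is a minimal sketch, assuming both boundary modes have the same amplitude $\varepsilon$, i.e. $u = \varepsilon r\sin\varphi + \varepsilon r^5\sin 5\varphi$; the function names, the value of `eps` and the step `h` of the five-point finite-difference Laplacian are illustrative.

```python
import math

def u(x, y, eps=0.01):
    """u = eps*r*sin(phi) + eps*r^5*sin(5*phi), evaluated at Cartesian (x, y)."""
    r = math.hypot(x, y)
    phi = math.atan2(y, x)
    return eps * (r * math.sin(phi) + r ** 5 * math.sin(5.0 * phi))

def laplacian(f, x, y, h=1e-3):
    """Five-point finite-difference approximation of f_xx + f_yy."""
    return (f(x + h, y) + f(x - h, y) + f(x, y + h) + f(x, y - h)
            - 4.0 * f(x, y)) / (h * h)

# The Laplacian should vanish (up to discretization error) inside the disc.
for (x, y) in [(0.1, 0.2), (-0.4, 0.3), (0.5, -0.6)]:
    print((x, y), laplacian(u, x, y))

# On the rim r = 1 the solution reproduces the prescribed displacement.
a = 0.7
print(u(math.cos(a), math.sin(a)), 0.01 * (math.sin(a) + math.sin(5.0 * a)))
```

Comparing the two modes at a small radius, e.g. $r = 0.5$, shows the observation above: the factor $r^5$ suppresses the higher frequency much more strongly than the factor $r$.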
9.2.24. The Helmholtz equation. In 9.2.19, we looked for solutions of the nD wave equation $u_{tt} - c^2\Delta u = 0$ in the form $u(x,t) = F(x)T(t)$, where $x \in \mathbb{R}^n$, $t \in \mathbb{R}$. This (partial) separation of variables with the negative separation constant $-\alpha^2$ leads to the Helmholtz equation
$\Delta F + \alpha^2 F = 0,$
together with the easily solved $T'' + \alpha^2 c^2 T = 0$.
Let us treat the 2D case in the polar coordinates, again using separation of variables. Thus, the Helmholtz equation gets the form
$\frac{1}{r}\frac{\partial}{\partial r}\Big(r\frac{\partial F}{\partial r}\Big) + \frac{1}{r^2}\frac{\partial^2 F}{\partial\varphi^2} + \alpha^2 F = 0$
and we look for $F(r, \varphi) = R(r)\Phi(\varphi)$. Writing $\beta^2$ for the separation constant now, we arrive at the two equations (the second one is multiplied by $r^2$, for convenience)
$\Phi'' + \beta^2\Phi = 0, \qquad r^2R'' + rR' + (\alpha^2r^2 - \beta^2)R = 0.$
The angular component equation has the obvious solutions $A\cos\beta\varphi + B\sin\beta\varphi$, and we again have to restrict $\beta$ to integers in order to get single-valued solutions. With $\beta = m$, the radial equation is the well-known Bessel's ODE of order $m$ (notice our equation gets the form we had in ?? once we substitute $z = \alpha r$), with the general solution
$R(r) = CJ_m(\alpha r) + DY_m(\alpha r),$
where $J_m$ and $Y_m$ are the special Bessel functions of the first and second kinds.
We have obtained a general solution which is very useful in practical problems, cf. ??.
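For integer order $m \ge 0$, the Bessel function of the first kind has the well-known power series $J_m(z) = \sum_{j \ge 0} \frac{(-1)^j}{j!\,(j+m)!} \big(\frac{z}{2}\big)^{2j+m}$. The following minimal sketch uses a truncation of this series to check numerically that $R(r) = J_m(\alpha r)$ satisfies the radial equation $r^2R'' + rR' + (\alpha^2r^2 - m^2)R = 0$; the function names, the number of series terms and the finite-difference step are illustrative, and the truncated series is meant only for moderate arguments.

```python
import math

def bessel_j(m, z, terms=40):
    """Truncated power series of J_m(z) for integer m >= 0 (moderate z only)."""
    return sum((-1) ** j / (math.factorial(j) * math.factorial(j + m))
               * (z / 2.0) ** (2 * j + m)
               for j in range(terms))

def radial_residual(m, alpha, r, h=1e-3):
    """r^2 R'' + r R' + (alpha^2 r^2 - m^2) R for R(r) = J_m(alpha r),
    with the derivatives approximated by central differences."""
    R = lambda s: bessel_j(m, alpha * s)
    d1 = (R(r + h) - R(r - h)) / (2.0 * h)
    d2 = (R(r + h) - 2.0 * R(r) + R(r - h)) / (h * h)
    return r * r * d2 + r * d1 + (alpha ** 2 * r ** 2 - m ** 2) * R(r)

print(radial_residual(2, 3.0, 1.0))     # close to zero
print(bessel_j(0, 2.404825557695773))   # first positive zero of J_0
```

The functions $Y_m$ are singular at $r = 0$, so for problems on a disc containing the origin only the $J_m$ part is typically kept, in analogy with dropping $r^{-n}$ and $\ln r$ in 9.2.23.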
9.2.25. Non-homogeneous equations. Finally, we add a few comments on non-homogeneous linear PDEs. Although we provide arguments for the claims, we shall not go into the technical details of the proofs because of the lack of space. Still, we hope this limited insight will motivate the reader to seek further sources to learn more.
As always, facing a problem $Lu = f$, we have to find a single particular solution to this problem, and we may then add all solutions of the homogeneous problem $Lu = 0$. Thus, if we have to match, say, Dirichlet conditions $u = g$ on the boundary $\partial\Omega$ of a domain $\Omega$, and we know some solution $w$, i.e. $Lw = f$ (not taking care of the boundary conditions), then we should find a solution $v$ of the homogeneous Dirichlet problem with the boundary condition $g - w|_{\partial\Omega}$. Clearly the sum $u = v + w$ will solve our problem.
In principle, we may always consider superpositions of known solutions as in the Fourier method above. We shall present a more conceptual and general approach now briefly.
Let us come back to the 1D diffusion equation and our solution of a homogeneous problem by means of the Fourier transform in 9.2.21. The solution of $u_t = \kappa u_{xx}$ with $u(x, 0) = \varphi$ is a convolution of the boundary values $u(x, 0)$ with the heat kernel
(1)   $Q(x,t) = \frac{1}{\sqrt{4\pi\kappa t}}\, e^{-\frac{x^2}{4\kappa t}}.$
Now, the crucial observation is that $Q(x,t)$ is a solution of $L(u) = u_t - \kappa u_{xx} = 0$ for all $x$ and $t > 0$, while on a neighborhood of the origin it behaves as the Dirac delta function in the variable $x$.
The latter observation suggests how to find particular solutions to a non-homogeneous problem. Clearly, for any fixed $(y, s)$, the function $Q(x - y, t - s)$ will again be a solution of $L(u) = 0$ and it will behave as the Dirac delta at $(y, s)$. Consider the integral of the convolution
(2)   $u(x,t) = \int_0^t \Big( \int_{-\infty}^{\infty} Q(x-y, t-s) f(y,s)\,dy \Big) ds.$
The derivative $u_t$ will have two terms. In the first one we differentiate with respect to the upper limit of the outer integral, while the other one is the derivative inside the integrals. The derivatives with respect to $x$ are evaluated inside the integrals. Thus, in the evaluation of $L(u) = u_t - \kappa u_{xx}$ the terms inside the integral cancel each other (remember $Q$ is a solution for all $x$ and $t > 0$) and only the first term of $u_t$ survives. It seems obvious that this term is the evaluation of the integrand at $s = t$. Although these values are not properly defined, we may verify this claim by taking the limit $(t - s) \to 0_+$. But this leads to
$\lim_{(t-s)\to 0_+} \int_{-\infty}^{\infty} Q(x-y, t-s) f(y,s)\,dy = f(x,t).$
Thus, (2) is a particular solution and clearly $u(x, 0) = 0$.
The solution of the general Dirichlet problem $L(u) = f$, $u(x, 0) = \varphi$ on $\Omega = \mathbb{R} \times [0, \infty)$ is
(3)   $u(x,t) = \int_{-\infty}^{\infty} Q(x-y, t)\varphi(y)\,dy + \int_0^t \Big( \int_{-\infty}^{\infty} Q(x-y, t-s) f(y,s)\,dy \Big) ds.$
Let us summarize the achievements and try to generalize them to general dimensions.
First, we can generalize the heat kernel function $Q$, writing its nD variant depending only on the distance $r = \|x\|$ from the origin. Consider the formula with $x \in \mathbb{R}^n$ as the product of the 1D heat kernels for each of the variables in $x$:
(4)   $Q(x,t) = \frac{1}{\sqrt{(4\pi\kappa t)^n}}\, e^{-\frac{\|x\|^2}{4\kappa t}}.$
Then taking the n-dimensional (iterated) convolution of $Q$ with the boundary condition $\varphi$ on the hyperplane $t = 0$ provides the solution candidate
(5)   $u(x,t) = \int_{\mathbb{R}^n} Q(x-y, t)\varphi(y)\,dy_1 \dots dy_n.$
Indeed, a straightforward (but tedious) computation reveals that $Q$ is a solution of $L(u) = 0$ at all points $(x, t)$ with $t > 0$, and $Q$ again behaves as the Dirac delta at the origin. In particular, (5) is a solution to the Dirichlet problem $L(u) = 0$, $u(x, 0) = \varphi$, and we can also obtain the non-homogeneous solutions similarly to the 1D case.
9.2.26. The Green's functions. The solutions to the (non-homogeneous) diffusion equation constructed in the last paragraph are built on a very simple idea: we find a solution to our equation which is defined everywhere except at the origin and blows up at the origin at a speed making it into a Dirac delta function at the origin. A convolution with such a kernel is then a good candidate for solutions. Let us try to mimic this approach for the Laplace and Poisson equations now.
3. Remarks on Variational Calculus
ABOUT 15PP - BASIC FORMULATIONS AND RESULTS ON FIRST AND SECOND VARIATIONS, CONSTRAINTS, REMARKS TOWARDS NUMERICS
4. Complex Analytic Functions
ABOUT 15PP - BASICS OF FUNCTION THEORY BUT STARTING EXCLUSIVELY WITH POWER SERIES, INCLUDES CAUCHY INTEGRAL AND FORMULA, RESIDUES, CONFORMAL MAPS ETC.
CHAPTER 10
Statistics and probability methods
Is statistics a part of mathematics ?
- whenever it is so, we need much of mathematics there... !
A. Dots, lines, rectangles
The obtained data from reality can be displayed in many ways. Let us illustrate some of them.
10.A.1. Presenting the collected data. 20 mathematicians were asked about the number of members of their household. The following table displays the frequency of each number of members.
Number of members	1	2	3	4	5	6
Number of households	5	5	1	6	2	1
Create the frequency distribution table. Find the mean, median and mode of the number of members. Build a column diagram of the data.
Solution. Let us begin with the frequency distribution table. There, we write not only the frequencies, but also the cumulative frequencies and relative frequencies (i.e., the probability that there is a given number of members in a randomly picked household). Let us denote the number of members by $x_i$, the corresponding frequency by $n_i$, the relative frequency by $p_i$ ($= n_i/\sum_{j=1}^{6} n_j = n_i/20$), the cumulative frequency by $N_i$ ($= \sum_{j=1}^{i} n_j$), and the relative cumulative frequency by $F_i$
Roughly speaking, statistics is any processing of numerical or other type of data about a population of objects and their presentation. In this context, we talk about descriptive statistics. Its objective is thus to process and comprehensibly represent data about objects of a given "population" — for instance, the annual income of all citizens obtained from the complete data of revenue authorities, or the quality of hotel accommodation in some region. In order to achieve this, we focus on simple numerical characterization and visualization of the data.
Mathematical statistics uses mathematical methods to derive conclusions valid for the whole (potentially infinite) population of objects, based on a "small" sample. For instance, we might want to find out how much a certain disease is spread in the population by collecting data about a few randomly chosen people, but we interpret the results with regard to the entire population. In other words, mathematical statistics makes conclusions about a large population of objects based on the study of a small (usually randomly selected) sample collection. It also estimates the reliability of the resulting conclusions.
Mathematical statistics is based on the tools of probability theory, which is very useful (and amazing) in itself. Therefore, probability theory is discussed first.
This chapter provides an elementary introduction to the methods of probability theory, which should be sufficient for correct comprehension of ordinary statistical information all around us. However, for a serious understanding of a mathematical statistician's work, one must look for other resources.
1. Descriptive statistics
Descriptive statistics alone is not a mathematical discipline although it uses many manipulations with numbers and sometimes even very sophisticated methods. However, it is a good opportunity for illustrating the mathematical approach to building generally useful tools.
At the same time, it should serve as a motivation for studying probability theory because of later applications in statistics.
CHAPTER 10. STATISTICS AND PROBABILITY THEORY
($= N_i/20 = \sum_{j=1}^{i} p_j$):
$x_i$	$n_i$	$p_i$	$N_i$	$F_i$
1	5	1/4	5	1/4
2	5	1/4	10	1/2
3	1	1/20	11	11/20
4	6	3/10	17	17/20
5	2	1/10	19	19/20
6	1	1/20	20	1
Now, we can easily construct the wanted (column) graphs of (relative, cumulative) frequencies:
The mean number of members of a household is:
$$\bar{x} = \frac{5\cdot 1 + 5\cdot 2 + 1\cdot 3 + 6\cdot 4 + 2\cdot 5 + 1\cdot 6}{20} = 2.9.$$
The median is the arithmetic mean of the tenth and eleventh values (having been sorted), which are respectively 2 and 3, i.e., $\tilde{x} = 2.5$.
The mode is the most frequent value, i.e., $\hat{x} = 4$.
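The computation above can be reproduced with a short script. This is only a sketch using Python's standard `statistics` and `fractions` modules; the variable names (`frequencies`, `table`, and so on) are chosen here for illustration and are not part of the exercise.

```python
from fractions import Fraction
from statistics import mean, median, mode

# Household sizes and their frequencies from the table in 10.A.1.
frequencies = {1: 5, 2: 5, 3: 1, 4: 6, 5: 2, 6: 1}

# Expand the table back into the raw data set of 20 observations.
data = [size for size, count in frequencies.items() for _ in range(count)]
n = len(data)

# Frequency distribution table with rows (x_i, n_i, p_i, N_i, F_i).
table = []
cumulative = 0
for size, count in sorted(frequencies.items()):
    cumulative += count
    table.append((size, count, Fraction(count, n), cumulative, Fraction(cumulative, n)))

for row in table:
    print(*row, sep="\t")

print("mean:", mean(data))      # arithmetic mean
print("median:", median(data))  # mean of the 10th and 11th sorted values
print("mode:", mode(data))      # most frequent value
```

Using `Fraction` keeps the relative frequencies exact, so they can be compared directly with the fractions in the table.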
The collected data can also be presented using a box plot:
The upper and lower sides of the "box" correspond respectively to the first (lower) and the third (upper) quartile, so its height is equal to the interquartile range. The thick horizontal line is drawn at the median level; the lower and upper horizontal lines correspond respectively to the minimum and maximum elements of the data set, or to the value that is 1.5 times the interquartile range less than the lower side of the box (and greater than the upper side, respectively). The data outside this range would be shown as circles.
We can also build the histogram of the data:
In our brief introduction, we first introduce the concepts which allow us to measure the positions of data values and the variability of the data values (means, percentiles etc.). We touch on the problem of how to visualize or otherwise present the data sets (diagrams). Then we deal with the potential relations between several data sets (covariance and principal components) and, finally, we deal with data without numerical values, relying just on their frequencies of appearance (entropy).
10.1.1. Probability, or statistics? It is not by accident that we return to a part of the motivating hints from the first chapter as soon as we have managed to gather enough mathematical tools, both discrete and continuous. Nowadays, many communications are of a statistical nature, be it in media, politics, or science. Nevertheless, in order to properly understand the meaning of such a communication and to use particular statistical methods and concepts, one must have a broad knowledge of miscellaneous parts of mathematics. In this subsection, we move away from the mathematical theory and think about the following steps and our objectives.
As an example of a population of objects, consider the students of a given basic course. Then, the examined numerical data can be:
• the "mean number of points" obtained during the course in the previous semester and the "variance" of these values,
• the "mean marks" for the examination of this and other courses and the "correlation" (i.e. mutual dependence) of these results,
• the "correlation" of data about the past results of given students,
• the "correlation" of the number of failed exams of a given student and the number of hours spent in a temporary job,
• ...
With regard to the first item, the arithmetic mean itself does not carry enough information about the quality of the lecture or of the lecturer, nor about the results of particular students. Maybe the value which is "in the middle" of the population, or the number of points achieved by the student who was just better than half of the students, is of more concern. Similarly, the first quarter, the last quarter, the first tenth, etc. may be of interest. Such data are called statistics of the population. Such statistics are interesting for the students in question as well, and it is quite easy to define, compute, and communicate them.
From general experience or as a theoretical result outside mathematics, a reasonable assessment should be "normally" distributed. This is a concept of probability theory, and it requires quite advanced mathematics to be properly defined. Comparing the collected data about even a small random population of students to theoretical results can serve in two ways: We can estimate the parameters of the distribution as well as draw a conclusion whether the assessment is reasonable.
Histogram of x
Note that the frequencies of one- and two-member households were merged into a single rectangle. This is used in order to make the data "easier to read" - there exist (various and ambiguous) rules for the merging.
We simply mention this fact without presenting an exact procedure (it is just as anyone likes). □
10.A.2. Given a data set $x = (x_1, x_2, \ldots, x_n)$, find the mean and variance of the centered values $x_i - \bar{x}$ and the standardized values $\frac{x_i - \bar{x}}{s_x}$.
Solution. The mean of the centered values can be found directly using the definition of arithmetic mean:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{1}{n}\sum_{i=1}^{n}\bar{x} = \bar{x} - \bar{x} = 0.$$
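The variance of the centered values is clearly the same as for the original ones ($s_x^2$). For the standardized values, the mean is equal to zero again, and the variance is
$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)^2 = \frac{s_x^2}{s_x^2} = 1.$$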
□
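As a quick numerical check of 10.A.2, the following sketch centers and standardizes a small data set. The sample values are hypothetical and chosen only so that the arithmetic is easy to follow; the helper names `mean` and `variance` are defined here and use the denominator $n$, as in 10.1.6.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def variance(values):
    """Variance with n in the denominator, as in 10.1.6."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical sample data, chosen only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

x_bar = mean(x)
s_x = sqrt(variance(x))

centered = [v - x_bar for v in x]
standardized = [(v - x_bar) / s_x for v in x]

# Centering moves the mean to zero and keeps the variance.
print(mean(centered), variance(centered))
# Standardizing moves the mean to zero and rescales the variance to one.
print(mean(standardized), variance(standardized))
```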
10.A.3. Prove that the variance satisfies $s_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$.
Solution. Using the definitions of variance and arithmetic mean, we get:
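$$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{2\bar{x}}{n}\sum_{i=1}^{n} x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}^2 + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$$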
At the same time, the numerical values of statistics for a given population can yield qualitative description of the likelihood of our conclusions. We can compute statistics which reflect the variability of the examined values, rather than where these values are positioned within a given population. For instance, if the assessment does not show enough variability, it may be concluded that it is badly designed, because the students' skills are of course different. The same applies if the collected data seem completely random.
In the above paragraph, it is assumed that the examined data is reliable. This is not always the case in practice. On the contrary, the data is often perturbed with errors due to construction of the experiment and the data collection itself.
In many cases, not much is known about the type of the data distribution. Then, methods of non-parametric statistics are often used (to be mentioned at the end of this chapter). Very interesting conclusions can be found if we compare the statistics for different quantities and then derive information about their relations. For example, if there is no evident relation between the history of previous studies and the results in a given course, then it may be that the course is managed wrongly.
These ideas can be summarized as follows:
• In descriptive statistics, there are tools which allow the understanding of the structure and nature of even a huge collection of data;
• in mathematics, one works with an abstract mathematical description of probability, which can be used for analysis of given data. Especially, this is when there is a theoretical model to which the data should correspond;
• conclusions of statistical investigation of samples of particular data sets can be given by mathematical statistics;
• mathematical statistics can also estimate how adequate such a description is for a given data set.
10.1.2. Terminology. Statisticians have introduced a great many concepts which need mastering. The fundamental concept is that of a statistical population, which is an exactly defined set of basic statistical units. These can be given by enumeration or by some rules, in case of a larger population.
On every statistical unit, statistical data is measured, with the "measurement" perceived very broadly.
For instance, the population can consist of all students of a given university. Then, each of the students is a statistical unit and much data can be gathered about these units - the numerical values obtainable from the information system, what is their favorite colour, what they had for dinner before their last test, etc.
The basic object for examining particular pieces of data is a data set. It usually consists of ordered values. The ordering can be either natural (when the data values are real numbers, for example) or we can define it (for instance, when we observe colours, we can express them in the RGB format
□
10.A.4.   The following values have been collected:
10; 7; 7; 8; 8; 9; 10; 9; 4; 9; 10; 9; 11; 9; 7; 8; 3; 9; 8; 7.
Find the arithmetic mean, median, quartiles, variance, and the corresponding box diagram.
Solution. Denoting the individual values by $a_i$ and their frequencies by $n_i$, we can arrange the given data set into the following table.
$a_i$	3	4	7	8	9	10	11
$n_i$	1	1	4	4	6	3	1
From the definition of arithmetic mean, we have
$$\bar{x} = \frac{3 + 4 + 4\cdot 7 + 4\cdot 8 + 6\cdot 9 + 3\cdot 10 + 11}{1 + 1 + 4 + 4 + 6 + 3 + 1} = \frac{162}{20} = 8.1.$$
Since the tenth least collected value is $x_{(10)} = 8$ and the eleventh one is $x_{(11)} = 9$, the median is equal to $\tilde{x} = \frac{8 + 9}{2} = 8.5$. The first quartile is $x_{0.25} = \frac{x_{(5)} + x_{(6)}}{2} = 7$, and the third quartile is $x_{0.75} = \frac{x_{(15)} + x_{(16)}}{2} = 9$. From the definition of variance, we get
$$s_x^2 = \frac{5.1^2 + 4.1^2 + 4\cdot 1.1^2 + 4\cdot 0.1^2 + 6\cdot 0.9^2 + 3\cdot 1.9^2 + 2.9^2}{1 + 1 + 4 + 4 + 6 + 3 + 1} = 3.59.$$
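The values found in 10.A.4 can be checked with the sketch below. The function `quantile` is a hypothetical helper written for this illustration; it implements one reading of the convention of 10.1.5 (take the mean of the two neighbouring sorted values when the quantile is not unique), which is not the only convention used by statistical software.

```python
from math import ceil
from statistics import mean, pvariance

data = [10, 7, 7, 8, 8, 9, 10, 9, 4, 9, 10, 9, 11, 9, 7, 8, 3, 9, 8, 7]

def quantile(values, alpha):
    """Return the alpha-quantile, 0 < alpha < 1, following 10.1.5.

    If n * alpha is a whole number k, the quantile is not unique and the
    mean of x_(k) and x_(k+1) is taken; otherwise x_(ceil(n * alpha)).
    The product n * alpha is a float, which is exact for the values used
    here but may need more care for other choices of alpha.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = n * alpha
    if position == int(position):
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[ceil(position) - 1]

x_bar = mean(data)
x_median = quantile(data, 0.5)
q1 = quantile(data, 0.25)
q3 = quantile(data, 0.75)
s2 = pvariance(data)  # variance with n in the denominator

print(x_bar, x_median, q1, q3, s2)
```

Here `pvariance` is used rather than `variance`, because the latter divides by $n - 1$.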
The histogram and box diagram are shown in the following pictures.
[Figure: Histogram of x (horizontal axis: x, vertical axis: Frequency)]
where we have used "statistics" method to make the histogram "nice" and "clear". You can find a lot of these conventions in the books on statistics, but if you do not know them,
and order them with respect to this sign). We can also work with unordered values.
Since statistical description aims at telling comprehensible information about the entire population, we should be able to compare and take ratios of the data values. Therefore, we need to have a measurement scale at our disposal. In most cases, the data values are expressed as numbers. However, the meaning of the data can be quantified variously, and thus we distinguish between the following types of data measurement scales.
Types of data measurement scales
The data values are called:
• nominal if there is no relation between particular values; they are just qualitative names, i.e. possible values (for instance, political parties or lecturers at a university when surveying how popular they are);
• ordinal the same as above, but with an ordering (for example, number of stars for hotels in guidebooks);
• interval if the values are numbers which serve for comparisons but do not correspond to any absolute value (for example, when expressing temperature in Celsius or Fahrenheit degrees, the position of zero is only conventional);
• ratio if the scale and the position of zero are fixed (most physical and economical quantities).
With nominal types, we can interpret only equalities $x_1 = x_2$; with ordinal types, we can also interpret inequalities $x_1 < x_2$ (or $x_1 > x_2$); with interval types, we can also interpret differences $x_1 - x_2$. Finally, with ratio types, we also have ratios $x_1/x_2$ available.
10.1.3. Data sorting. In this subsection, we work with a data set $x_1, x_2, \ldots, x_n$ whose values can be ordered (thus, their type is not nominal) and which have been obtained through measurement on $n$ statistical units. These values are sorted in a sorted data set
(1) $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
The integer n is called the size of the data set.
When working with large data sets where only a few values occur, the simplest way to represent the data set is to enumerate the values' frequencies.
For instance, when surveying the political party preference or when presenting the quality of a hotel, write only the number of occurrences of each value.
If there are many possible values (or there can even be continuously distributed real values), divide them into a suitable number of intervals and then observe the frequencies in the given intervals. The intervals are also called classes and the frequencies are called class frequencies. We also use cumulative frequencies and cumulative class frequencies which correspond to the sum of frequencies of values not exceeding a given one.
you are lost. This is the default setting of the R program. For example, if you replace just the value 3 by 2, you get a quite different looking histogram:
[Figure: Histogram of x]
□
Most often, the mean $a_i$ of a given class is considered to be its representative, and the value $a_i n_i$ (where $n_i$ is the frequency of the class) is the total contribution of the class. Relative frequencies $n_i/n$ and relative cumulative frequencies can also be considered.
A graph which has the intervals of particular classes on one axis and rectangles above them with height corresponding to the frequency is called a histogram. Cumulative frequency is represented similarly.
The following diagram shows histograms of data sets of size $n = 500$ which were randomly generated with various standard distributions (called normal and $\chi^2$, respectively).
10.1.4. Measures of the position of statistical values. If the magnitude of the values around which the collected data gather is to be expressed, then the concepts of the definition below can be used. There, we work with ratio or interval types of scales.
Consider an (unsorted) data set $(x_1, \ldots, x_n)$ of the values for all examined statistical units and let $n_1, \ldots, n_m$ be the class frequencies of the $m$ distinct values $a_1, \ldots, a_m$ that occur in this set.
Means
10.A.5. 425 carps were fished, and each one was weighed. Then, mass intervals were set, resulting in the following frequency distribution table:
Weight (kg)	0-1	1-2	2-3	3-4	4-5	5-6	6-7
Class midpoint	0.5	1.5	2.5	3.5	4.5	5.5	6.5
Frequency	75	90	97	63	48	42	10
Draw a histogram, find the arithmetic, geometric, and harmonic means of the carps' weights. Furthermore, find the median, quartiles, mode, variance, standard deviation, coefficient of variation, and draw a box plot.
Solution. The histogram looks as follows:
Definition. The arithmetic mean (often only mean) is given as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{j=1}^{m} n_j a_j.$$
The geometric mean is given as
$$\bar{x}_G = \sqrt[n]{x_1 x_2 \cdots x_n}$$
and makes sense for positive values $x_i$ only. The harmonic mean is given as
$$\bar{x}_H = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1}$$
and is also used for positive values $x_i$ only.
The arithmetic mean is the only one of the three above which is invariant with respect to affine transformations. For all scalars a, b,
$$\overline{(a + b\cdot x)} = \frac{1}{n}\sum_{i=1}^{n}(a + b x_i) = a + \frac{b}{n}\sum_{i=1}^{n} x_i = a + b\cdot\bar{x}.$$
Therefore, the arithmetic mean is especially suitable for interval types.
The logarithm of the geometric mean is the arithmetic mean of the logarithms of the values. It is especially suitable for those quantities which cumulate multiplicatively, e.g. interest. If the interest rate for each time period is $x_i\,\%$, then the final result is the same as if the interest rate had the constant value of $\bar{x}_G\,\%$. See 10.A.9 for an example where the harmonic mean is appropriate.
In subsection 8.1.29 (page 547), we use the methods invented there to prove that the geometric mean never exceeds the arithmetic mean. The harmonic mean never exceeds the geometric mean, and so
$$\bar{x}_H \le \bar{x}_G \le \bar{x}.$$
From the definitions of the corresponding concepts in subsection 10.1.4, we can directly compute that the arithmetic mean is $\bar{x} = 2.7$ kg, the geometric mean is $\bar{x}_G = 2.1$ kg, and the harmonic mean is $\bar{x}_H = 1.5$ kg. By the definitions of subsection 10.1.5, the median is equal to $\tilde{x} = x_{0.5} = 2.5$ kg, the lower quartile to $x_{0.25} = 1.5$ kg, the upper quartile to $x_{0.75} = 3.5$ kg, and the mode is $\hat{x} = 2.5$ kg. From the definitions of subsection 10.1.6, we compute the variance of the weights, which is $s_x^2 = 2.7$ kg$^2$, whence it follows that the standard deviation is $s_x = 1.7$ kg, and the coefficient of variation is $V_x = 0.6$. □
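The rounded values stated in the solution of 10.A.5 can be recomputed from the class midpoints and frequencies. The following is a minimal sketch which treats every carp in a class as having the midpoint weight; the variable names are introduced here for illustration only.

```python
from math import exp, log, sqrt

# Class midpoints (kg) and class frequencies from the table in 10.A.5.
midpoints = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
counts = [75, 90, 97, 63, 48, 42, 10]
n = sum(counts)

# Means computed from class frequencies (each midpoint weighted by its count).
arithmetic = sum(a * c for a, c in zip(midpoints, counts)) / n
geometric = exp(sum(c * log(a) for a, c in zip(midpoints, counts)) / n)
harmonic = n / sum(c / a for a, c in zip(midpoints, counts))

# Variance with n in the denominator, standard deviation, coefficient of variation.
variance = sum(c * (a - arithmetic) ** 2 for a, c in zip(midpoints, counts)) / n
std_dev = sqrt(variance)
coeff_var = std_dev / abs(arithmetic)

print(round(arithmetic, 1), round(geometric, 1), round(harmonic, 1))
print(round(variance, 1), round(std_dev, 1), round(coeff_var, 1))
```

The geometric mean is computed through logarithms, which avoids multiplying 425 numbers directly.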
10.A.6. Prove that the entropy is maximal if the nominal values are distributed uniformly, i.e., the frequency of each class is $n_i = 1$.
Solution. By the definition of entropy (see 10.1.11), we are looking for the maximum of the function $H_X = -\sum_{i=1}^{n} p_i \ln p_i$ with respect to the unknown relative frequencies $p_i = \frac{n_i}{n}$, which satisfy $\sum_{i=1}^{n} p_i = 1$. Therefore, this is a typical example of finding constrained extrema, which can be solved using Lagrange multipliers. The corresponding Lagrange function is
$$L(p_1, \ldots, p_n, \lambda) = -\sum_{i=1}^{n} p_i \ln p_i + \lambda\left(\sum_{i=1}^{n} p_i - 1\right).$$
The partial derivatives are $\frac{\partial L}{\partial p_i} = -\ln p_i - 1 + \lambda$, hence the stationary point is determined by the equations $p_i = e^{\lambda - 1}$ for all $i = 1, \ldots, n$. Moreover, we know that the sum of the relative frequencies $p_i$ is equal to one. This means that $n e^{\lambda - 1} = 1$, whence we get $\lambda = 1 - \ln n$. Substitution then
10.1.5. Median, quartile, decile, percentile, ... Another way of expressing the position or distribution of the values is to find, for a number $\alpha$ between zero and one, such a value $x_\alpha$ that $100\alpha\,\%$ of the values from the set are at most $x_\alpha$ and the remaining ones are greater than $x_\alpha$. If such a value is not unique, one can choose the mean of the two nearest possibilities.
The number $x_\alpha$ is called the $\alpha$-quantile. Thus, if the result of a contestant puts him into $x_{1.00}$, it does not mean that he is better than anyone else yet. However, there is surely no one better than him.
The most common values of $\alpha$ are the following:
• The median (also sample median) is defined by
$$\tilde{x} = x_{0.50} = \begin{cases} x_{((n+1)/2)} & \text{for odd } n, \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{for even } n, \end{cases}$$
yields $p_i = \frac{1}{n}$.
□
where $x_{(i)}$ corresponds to the value in the sorted data set 10.1.3(1).
• The first and third quartile are $Q_1 = x_{0.25}$ and $Q_3 = x_{0.75}$, respectively.
• The $p$-th quantile (also sample quantile or percentile) $x_p$, where $0 < p < 1$ (usually rounded to two decimal places).
One can also meet the mode, which is the value $\hat{x}$ that is most frequent in the data set $x$.
The arithmetic mean, median (with ratio types), and mode (with ordinal or nominal types) correspond to the "anticipated" values.
Note that all $\alpha$-quantiles with interval scales are invariant with respect to affine transformations of the values (check this yourselves!).
10.1.6. Measures of the variability. Surely any measure of the variability of a data set $x \in \mathbb{R}^n$ should be invariant with respect to constant translations. In the Euclidean space $\mathbb{R}^n$, both the standard distance and the sample mean have this property. Therefore, choose the following:
10.A.7. The following graphs depict the frequencies of particular amounts of points obtained by students of the MB 104 lecture at the Faculty of Informatics of Masaryk University in 2012. The axes of the cumulative graph are "swapped", as opposed to the previous example.
The frequencies of particular amounts of points are enumerated in the following table:
# of points	# of students
20.5	1
20	1
19	2
18.5	1
18	2
17.5	3
17	2
16.5	4
16	3
15.5	5
15	7
14.5	6
14	14
13.5	21
13	21
12.5	19
12	17
11.5	18
11	31
10.5	22
10	53
9.5	9
9	9
8.5	13
8	8
7.5	13
7	4
6.5	7
6	4
5.5	8
5	7
4.5	9
4	5
3.5	7
3	8
2.5	8
2	14
1.5	8
1	2
0.5	6
0	9
The corresponding histogram looks as follows:
The histogram was obtained from the Information System of Masaryk University. We can see that the data are shown in a somewhat unusual way: individual amounts of points correspond to "double rectangles". It is a matter of taste how to represent the data (it is possible to merge some
Variance and standard deviation
Definition. The variance of a data set $x$ is defined by
$$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
The standard deviation $s_x$ is defined to be the square root of the variance.
As requested, the variability of statistical values is independent of constant translation of all values. Indeed, the unsorted data set
$$y = (x_1 + c, x_2 + c, \ldots, x_n + c)$$
has the same variance $s_y^2 = s_x^2$.
Sometimes, the sample variance is used, where there is $(n - 1)$ in the denominator instead of $n$. The reason will be clear later, cf. 10.3.2.
In case of class frequencies $n_j$ of values $a_j$ for $m$ classes, this expression leads to the value
$$s_x^2 = \frac{1}{n}\sum_{j=1}^{m} n_j (a_j - \bar{x})^2$$
of the variance. In practice, it is recommended to use Sheppard's correction, which decreases $s_x^2$ by $h^2/12$, where $h$ is the width of the intervals that define the classes. Further, one can encounter the data-set range
$$R = x_{(n)} - x_{(1)}$$
and the interquartile range
$$Q_3 - Q_1.$$
One can also use the mean deviation, which is defined as the mean distance of the values from the median:
$$\frac{1}{n}\sum_{i=1}^{n}|x_i - \tilde{x}|.$$
The following theorem clarifies why these measures of variability are chosen:
Theorem. The function $S(t) = (1/n)\sum_{i=1}^{n}(x_i - t)^2$ has the minimum value at $t = \bar{x}$, i.e., at the sample mean.
The function $D(t) = (1/n)\sum_{i=1}^{n}|x_i - t|$ has the minimum value at $t = \tilde{x}$, i.e., at the median.
Proof. The minimum of the quadratic polynomial $f(t) = \sum_{i=1}^{n}(x_i - t)^2$ is at the only root of its derivative:
$$f'(t) = -2\sum_{i=1}^{n}(x_i - t).$$
Since the sum of the signed deviations of all values from the sample mean is zero, $t = \bar{x}$ is the requested root and the first proposition is proved.
As for the second proposition, return to the definition of the median. For this purpose, rearrange the sum so that the first and the last summand is added, then the second and the last-but-one summand, etc. In the first case, this leads to the
values, thereby decreasing the number of rectangles, or to use thinner rectangles).
□
[Figure: cumulative graph of the points obtained, plotted against student rank (axis label "pořadí studentů")]
We can notice that the mode of the values is 10, which, accidentally, was also the number of points necessary to pass the course. The mean of the obtained points is 9.48.
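The mode and the mean quoted above can be recomputed from the frequency table of 10.A.7. The sketch below simply transcribes the table as (points, number of students) pairs; the names `table`, `mean_points` and `mode_points` are introduced here for illustration.

```python
# (points, number of students) pairs transcribed from the table in 10.A.7.
table = [
    (20.5, 1), (20, 1), (19, 2), (18.5, 1), (18, 2), (17.5, 3), (17, 2),
    (16.5, 4), (16, 3), (15.5, 5), (15, 7), (14.5, 6), (14, 14), (13.5, 21),
    (13, 21), (12.5, 19), (12, 17), (11.5, 18), (11, 31), (10.5, 22), (10, 53),
    (9.5, 9), (9, 9), (8.5, 13), (8, 8), (7.5, 13), (7, 4), (6.5, 7), (6, 4),
    (5.5, 8), (5, 7), (4.5, 9), (4, 5), (3.5, 7), (3, 8), (2.5, 8), (2, 14),
    (1.5, 8), (1, 2), (0.5, 6), (0, 9),
]

# Total number of students in the table.
n = sum(count for _, count in table)

# Arithmetic mean computed from class frequencies.
mean_points = sum(points * count for points, count in table) / n

# Mode: the number of points with the highest frequency.
mode_points = max(table, key=lambda row: row[1])[0]

print(n, round(mean_points, 2), mode_points)
```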
10.A.8. Here, we present column diagrams of the amounts of points of MB 101 students in autumn 2010 (the very first semester of their studies). The first one corresponds to all students of the course; the second one corresponds to those who (3 years later) successfully finished their studies and got the bachelor's degree.
expression $|x_{(1)} - t| + |x_{(n)} - t|$, and this is equal to the distance $x_{(n)} - x_{(1)}$ provided $t$ lies inside the range, and it is even greater otherwise. Similarly, the other pair in the sum gives $x_{(n-1)} - x_{(2)}$ if $x_{(2)} \le t \le x_{(n-1)}$, and it is greater otherwise. Therefore, the minimality assumption leads to $t = \tilde{x}$. □
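The theorem can be illustrated numerically by scanning candidate values of $t$ on a grid. The data set below is hypothetical and was chosen so that the mean and the median differ clearly; the grid search is only a crude check, not a replacement for the proof.

```python
from statistics import mean, median

def S(t, xs):
    """Mean squared deviation of the data from t."""
    return sum((x - t) ** 2 for x in xs) / len(xs)

def D(t, xs):
    """Mean absolute deviation of the data from t."""
    return sum(abs(x - t) for x in xs) / len(xs)

# Hypothetical data set with one large value, so that mean and median differ.
xs = [1, 2, 2, 3, 4, 5, 25]

# Candidate values t = 0.00, 0.01, ..., 25.00.
grid = [i / 100 for i in range(0, 2501)]
best_S = min(grid, key=lambda t: S(t, xs))
best_D = min(grid, key=lambda t: D(t, xs))

print(best_S, mean(xs))    # minimizer of S on the grid, and the sample mean
print(best_D, median(xs))  # minimizer of D on the grid, and the median
```

With an odd number of values the minimizer of $D$ is unique; for an even number, every $t$ between the two middle values gives the same minimum, in agreement with the pairing argument in the proof.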
In practice, it is required to compare the variability of data sets of different statistical populations. For this purpose, it is convenient to relativize the scale, and so use the coefficient of variation of a data set $x$:
$$V_x = \frac{s_x}{|\bar{x}|}.$$
This relative measure of variability can be perceived as the deviation expressed as a percentage of the sample mean $\bar{x}$.
10.1.7. Skewness of a data set. If the values of a data set are distributed symmetrically around the mean value, then
$$\bar{x} = \tilde{x}.$$
However, there are distributions where
$$\bar{x} > \tilde{x}.$$
This is common, for instance, with the distribution of salaries in a population where the mean is driven up by a few very large incomes, while much of the population is below the average.
A useful characteristic concerning this is the Pearson coefficient, given by
$$\beta = 3\,\frac{\bar{x} - \tilde{x}}{s_x}.$$
It estimates the relative measure (the absolute value of $\beta$) and the direction of the skewness (the sign). In particular, note that the standard deviation is always positive, so it is already the sign of $\bar{x} - \tilde{x}$ which shows the direction of the skewness.
QUANTILE COEFFICIENTS OF SKEWNESS
More detailed information can be obtained from the quantile coefficients of skewness
$$\beta_p = \frac{x_{1-p} + x_p - 2\tilde{x}}{x_{1-p} - x_p},$$
for each $0 < p < 1/2$. Their meaning is clear when the numerator is expressed as $(x_{1-p} - \tilde{x}) - (\tilde{x} - x_p)$.
In particular, the quartile coefficient of skewness is obtained when selecting p = 0.25.
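The two skewness coefficients of 10.1.7 can be tried out on a small right-skewed data set. This is a sketch under stated assumptions: the data are hypothetical, and `quantile` is a helper written for this illustration, following the convention of 10.1.5 (mean of the two neighbouring sorted values when the quantile is not unique).

```python
from math import ceil
from statistics import mean, median, pstdev

def quantile(values, p):
    """p-quantile, 0 < p < 1, with the tie convention of 10.1.5."""
    ordered = sorted(values)
    n = len(ordered)
    position = n * p  # float product; exact for the values used below
    if position == int(position):
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[ceil(position) - 1]

def pearson_skewness(values):
    """Pearson coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

def quantile_skewness(values, p):
    """Quantile coefficient of skewness for 0 < p < 1/2."""
    med = median(values)
    upper, lower = quantile(values, 1 - p), quantile(values, p)
    return (upper + lower - 2 * med) / (upper - lower)

# Hypothetical right-skewed data: a few large values pull the mean up.
xs = [1, 2, 2, 3, 3, 4, 6, 11]

print(pearson_skewness(xs))
print(quantile_skewness(xs, 0.25))  # quartile coefficient of skewness
```

Both coefficients come out positive here, which matches the situation $\bar{x} > \tilde{x}$ described above.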
Again, the results can be depicted in an alternative way:
10.1.8. Diagrams. People's eyes are well suited for perceiving information with a complicated structure. That is why there exist many standardized tools for displaying statistical data or their correlations. One of them is the box diagram.
[Figure: points obtained plotted against student rank (axis label "pořadí studentů")]
And these are the graphs of amounts of points obtained by those students who continued their studies:
[Figures: column diagram of points (axis label "body") and points obtained plotted against student rank (axis label "pořadí studentů")]
We can see that in the former case, the mode is equal to 0, while in the latter case, it is 10 again. The frequency distribution is close to the one of the MB 104 course, which is recommended for the fourth semester.
BOX DIAGRAM
The diagram illustrates a histogram and a box diagram of the same data set (normal distribution with mean equal to 10 and variance equal to 3, n = 500).
The middle line is the median; the edges of the box are the quartiles; the "paws" extend to 1.5 times the interquartile range beyond the box, but not further than the edges of the sample range.
Common displaying tools allow us to view potential dependencies of two data sets. For instance, in the left-hand diagram below, the coordinates are chosen as the values of two independent normal distributions with mean equal to 10 and variance equal to 3. In the right-hand illustration, the first coordinate is from the same data set, and the second coordinate is given by the formula y = 3x + 4. It is also perturbed with a small error.
10.1.9. Covariance matrix. Actually, the dependencies between several data sets associated to the same statistical units are at the core of our interest in many real world problems. When defining the variance in 10.1.6 above, we employed the Euclidean distance, i.e. we evaluated the scalar product of the vector of deviations from the mean with itself. Thus, having two vectors of data sets, we may define
10.A.9. A car was traveling from Brno to Prague at 160 km/h, and then back from Prague to Brno at 120 km/h. What was its average speed?
Solution. This is an example where one might think of using the arithmetic mean, which is incorrect. The arithmetic mean would be the correct result if the car spent the same period of time going at each speed. However, in this case, it traveled the same distance, not time, at each speed. Denoting by $d$ the distance between Brno and Prague and by $v_p$ the average speed, we obtain
$$\frac{d}{160} + \frac{d}{120} = \frac{2d}{v_p},$$
whence
$$v_p = \frac{2}{\frac{1}{160} + \frac{1}{120}} = \frac{960}{7} \approx 137.1 \text{ km/h}.$$
Therefore, the average speed is the harmonic mean (see 10.1.4) of the two speeds.
□
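The reasoning of 10.A.9 can be checked directly: the average speed is the total distance divided by the total time. In the sketch below, the distance of 200 km is a hypothetical value used only to make the computation concrete; it cancels out of the result.

```python
from statistics import harmonic_mean, mean

speeds = [160, 120]  # km/h, each held over the same distance

# Hypothetical one-way distance in km; any positive value gives the same result.
distance = 200.0

total_time = sum(distance / v for v in speeds)
average_speed = len(speeds) * distance / total_time

print(average_speed)          # total distance divided by total time
print(harmonic_mean(speeds))  # harmonic mean of the two speeds
print(mean(speeds))           # arithmetic mean, which is too large here
```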
B. Visualization of multidimensional data
The above examples were devoted to displaying one numerical characteristic measured for more objects (number of points obtained by individual students, for example). Graphical visualization of data helps us understand them better. However, how can we depict the data if we measure $p$ different characteristics, $p > 3$, of $n$ objects? Such measurements cannot be displayed using the graphs we have met.
10.B.1. One of the possible methods is the so-called principal component analysis. In this method, we use eigenvectors and eigenvalues (see 2.4.2) of the sample covariance matrix (see 10.2.35). We will use the following notation:
• random vectors of the measurement
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, $i = 1, \ldots, n$,
• the mean of the $j$-th component
$m_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, $j = 1, \ldots, p$,
• the sample variance of the $j$-th component
$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - m_j)^2$, $j = 1, \ldots, p$,
• the vector of means $m = (m_1, \ldots, m_p)$,
• the sample covariance matrix
$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)(x_i - m)^T$
(note that each summand is a $p$-by-$p$ matrix).
The covariance matrix is symmetric, hence all its eigenvalues are real and its eigenvectors are pairwise orthogonal. Moreover, considering the eigenvectors of unit length, we
Covariance and covariance matrix
Consider two data sets $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$, and their means $\bar{x}$, $\bar{y}$. We define their covariance by the formula
$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
If there are $k$ sample sets $x^{(1)} = (x^{(1)}_1, \ldots, x^{(1)}_n)$, ..., $x^{(k)} = (x^{(k)}_1, \ldots, x^{(k)}_n)$, then their covariance matrix is the symmetric matrix $C = (c_{ij})$ with $c_{ij} = \operatorname{cov}(x^{(i)}, x^{(j)})$.
Again, the sample covariance and sample covariance matrix are defined by the same formulae with $n$ replaced by $(n - 1)$.
Clearly the covariance matrix has got the variances of the individual data sets on its diagonal.
In order to imagine what the covariance should say, consider the two possible behaviours of two data sets: (a) they deviate from their means in a very similar way (comparing individually $x_i$ and $y_i$), (b) they behave very independently. In the first case, we should assume that the signs of the deviations will mostly coincide and thus the sum in the definition will lead to a quite big positive number. In the other case the signs should be rather independent and thus the positive and negative contributions should effectively cancel each other in the covariance sum.
Thus we expect the covariance of data sets expressing independent features to be close to zero, while the covariance of dependent sets should be far from zero. The sign of the covariance shows the character of the dependence. For example, the two sets of data depicted in the left hand diagram above had covariance about -0.11, while the covariance of the data from the right hand picture was about 25.9.
Similarly to the variance, we are often interested in normalized values. The correlation coefficient takes the covariance and divides it by the standard deviation of each of the data sets. In our two latter cases, the correlation coefficients are about -0.01 and 0.99. As expected, they very clearly indicate which of the data are correlated.
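The behaviour of the covariance and of the correlation coefficient can be seen on a tiny example. The sketch below uses hypothetical data: one data set depends on `x` exactly through $y = 3x + 4$ (without the small error used in the diagram), the other is constructed so that its deviations cancel against those of `x`. The helper names `cov` and `corr` are defined here and use the denominator $n$.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def cov(x, y):
    """Covariance with n in the denominator, as in 10.1.9."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return cov(x, y) / sqrt(cov(x, x) * cov(y, y))

# Hypothetical data, chosen only for illustration.
x = [8.0, 9.0, 10.0, 11.0, 12.0]
y_dependent = [3 * v + 4 for v in x]     # exact linear dependence on x
y_unrelated = [5.0, 1.0, 6.0, 1.0, 5.0]  # deviations cancel against those of x

print(cov(x, x))                                     # variance of x
print(cov(x, y_dependent), corr(x, y_dependent))     # far from zero, correlation 1
print(cov(x, y_unrelated), corr(x, y_unrelated))     # both close to zero
```

Note that `cov(x, x)` is just the variance of `x`, which is why the variances appear on the diagonal of the covariance matrix.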
10.1.10. Principal components analysis. If we deal with statistics involving many parameters and we need to decide quickly about their similarity (correlation) with some given patterns, we might use a simple idea from linear algebra.
Assume we have got $k$ data sets $x^{(i)}$. Since their covariance matrix $C$ is symmetric, there is an orthonormal basis $e$ in $\mathbb{R}^k$ such that in this basis the corresponding quadratic form given by $C$ has a diagonal matrix. The relevant basis $e$ consists of the real eigenvectors $e_i \in \mathbb{R}^k$ for the eigenvalues $\lambda_i$. The bigger the absolute value $|\lambda_i|$ is, the bigger is the variation of the orthogonal projection $x$ of all the $k$ data sets into the one-dimensional subspace spanned by $e_i$.
Thus we may restrict ourselves to just this one data set $x$ and consider the statistics concerning this one set as representing the multi-parametric data sets $x^{(i)}$. Similarly we may
can see that the eigenvalue corresponding to an eigenvector of the covariance matrix yields the variance of (the size of) the projection of the data onto this direction (the projection takes place in the p-dimensional space). The goal of this method is to find the direction (in the p-dimensional space of the measured characteristics) for which the variance of the projections is as great as possible. Thus, this direction corresponds to the eigenvector of the covariance matrix whose eigenvalue is the greatest one. The linear combination given by the components of this vector is called the first principal component. The size of the projection onto this direction estimates the data quite well (the principal component can be viewed as a characteristic which substitutes for the p characteristics, i. e., it is a random vector with n components). If we subtract this projection from the data and consider the direction of the greatest variance again, we get the second principal component. Repeating this procedure further, we obtain the other principal components. The directions of the principal components correspond to the eigenvectors of the covariance matrix in decreasing order with respect to the size of the corresponding eigenvalues.
10.B.2. Find the first principal component of the following simple data and the vector which substitutes them: The height, little finger length, and index finger length of five people were measured. The measured data are shown in the following table (in centimeters).
Solution.
	Martin	Michael	Matthew	John	Peggy
index f.	9	11	8	8	8
little f.	7.5	8	6	6	6.5
height	186	187	173	174	167
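The vectors of the collected data are: $x_1 = (9; 7.5; 186)$, $x_2 = (11; 8; 187)$, $x_3 = (8; 6; 173)$, $x_4 = (8; 6; 174)$, $x_5 = (8; 6.5; 167)$. The covariance matrices of these vectors (the products of the deviations from the mean $(8.8; 6.8; 177.4)$) are:
$$\begin{pmatrix} 0.04 & 0.14 & 1.72 \\ 0.14 & 0.49 & 6.02 \\ 1.72 & 6.02 & 73.96 \end{pmatrix},\quad
\begin{pmatrix} 4.84 & 2.64 & 21.12 \\ 2.64 & 1.44 & 11.52 \\ 21.12 & 11.52 & 92.16 \end{pmatrix},\quad
\begin{pmatrix} 0.64 & 0.64 & 3.52 \\ 0.64 & 0.64 & 3.52 \\ 3.52 & 3.52 & 19.36 \end{pmatrix},$$
$$\begin{pmatrix} 0.64 & 0.64 & 2.72 \\ 0.64 & 0.64 & 2.72 \\ 2.72 & 2.72 & 11.56 \end{pmatrix},\quad
\begin{pmatrix} 0.64 & 0.24 & 8.32 \\ 0.24 & 0.09 & 3.12 \\ 8.32 & 3.12 & 108.16 \end{pmatrix}.$$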
also use several of the largest eigenvalues instead of one and reduce the dimension of our parameter space in this way. Finally, considering the unit-length eigenvector $(\alpha_1, \dots, \alpha_k)$ corresponding to the chosen eigenvalue $\lambda$, the values $\alpha_j$ provide the right coefficients in the orthogonal projection
$$(x^{(1)}, \dots, x^{(k)}) \mapsto x = \alpha_1 x^{(1)} + \dots + \alpha_k x^{(k)}.$$
See the exercise 10.B.2 for an illustration, together with another description of how to proceed with the data in 10.B.1.
The latter approach is called the principal component analysis.
10.1.11. Entropy. We also need to describe the variability of data sets with nominal types, for instance in statistical physics or information theory. The only thing at our disposal is the class frequencies, so the principle of classical probability can be used (see the fourth part of chapter one). There, the relative frequency of the $i$-th class, $p_i = n_i/n$, is understood to be the probability that a random object belongs to this class.
The variance of ratio-type values $a_j$ with class frequencies $n_j$ was given by the formula (see 10.1.6)
$$s_x^2 = \sum_{j} \frac{n_j}{n}\,(a_j - \bar{x})^2 = \sum_{j} p_j\,(a_j - \bar{x})^2,$$
where $p_j$ denotes the (classical) probability that the value is in the $j$-th class. Therefore, it is a weighted mean of the adjusted values where the weight of the term $(a_j - \bar{x})^2$ is $p_j$.
The variability of nominal values is expressed similarly (denote it by $H_X$). Even though there are no numerical values $a_j$ for the indices $j$, we can be interested in functions $F$ that depend on the relative frequencies $p_j$. For a data set $X$ we can define
$$H_X = \sum_{j} p_j F(p_j),$$
where $F$ is an unknown function with some reasonable properties.
If the data set has only one value, i.e. $p_k = 1$ for some $k$ and otherwise $p_j = 0$, then we agree that the variability is zero, and so $F(1) = 0$.
Moreover, $H_X$ is required to have the following property: If a data set $Z$ consists of pairs of values from data sets $X$ and $Y$ (for example, one can observe eye colour and hair colour of people, the statistical units), it is reasonable that the variability of $Z$ be the sum of the variabilities, that is, $H_Z = H_X + H_Y$.
The relative class frequencies $p_i$ for the values of the data set $X$ and $q_j$ for those of $Y$ are known. The relative class frequencies for $Z$ are then
$$\frac{n_i m_j}{n m} = p_i q_j,$$
so we demand the equality (the ranges of the sums are clear from the context)
$$\sum_{i,j} p_i q_j F(p_i q_j) = \sum_{i} p_i F(p_i) + \sum_{j} q_j F(q_j).$$
The sample covariance matrix is then a quarter of their sum, i. e.,
$$S = \begin{pmatrix} 1.70 & 1.075 & 9.35 \\ 1.075 & 0.825 & 6.725 \\ 9.35 & 6.725 & 76.30 \end{pmatrix}.$$
The eigenvalues of the sum of the five matrices are approximately 2.7, 312.2, and 0.38, so the eigenvalues of $S$ are a quarter of these, approximately 0.68, 78.05, and 0.095. The unit eigenvector corresponding to the greatest one is approximately $(0.122; 0.09; 0.989)$. Thus, the first principal component is $(185.5; 186.8; 172.4; 173.4; 166.5)$, which is not far from the people's heights. □
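The computation in 10.B.2 can be retraced numerically. The following is only a minimal sketch in plain Python, assuming the data vectors as listed in the solution; the helper names `sample_covariance` and `dominant_eigenpair` are ours, and the dominant eigenvector is approximated by simple power iteration rather than by whatever method was used for the printed values.

```python
# Measurements from exercise 10.B.2: (index finger, little finger, height) in cm.
DATA = [
    (9.0, 7.5, 186.0),   # Martin
    (11.0, 8.0, 187.0),  # Michael
    (8.0, 6.0, 173.0),   # Matthew
    (8.0, 6.0, 174.0),   # John
    (8.0, 6.5, 167.0),   # Peggy
]


def sample_covariance(rows):
    """Return the sample covariance matrix (divisor n - 1) as a list of lists."""
    n, p = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [row[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                cov[a][b] += dev[a] * dev[b] / (n - 1)
    return cov


def dominant_eigenpair(matrix, steps=200):
    """Power iteration: approximate the largest eigenvalue and a unit eigenvector."""
    p = len(matrix)
    vec = [1.0] * p
    value = 0.0
    for _ in range(steps):
        image = [sum(matrix[a][b] * vec[b] for b in range(p)) for a in range(p)]
        value = sum(c * c for c in image) ** 0.5  # |M v| tends to the top eigenvalue
        vec = [c / value for c in image]
    return value, vec


S = sample_covariance(DATA)
top_value, top_vector = dominant_eigenpair(S)
# Projections of the (uncentred) data vectors onto the dominant direction.
first_component = [sum(v * x for v, x in zip(top_vector, row)) for row in DATA]
print(top_value, top_vector, first_component)
```

Since the height has by far the largest variance, the dominant direction is almost parallel to the height axis, which is why the projections stay close to the measured heights.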
10.B.3. The students of a class had the following marks in various subjects:
Find the first principal component of the following simple data and the vector which substitutes them.
Solution. The vectors of observation are $x_1 = (1,1,2,2,1)$, $\dots$, $x_{10} = (2,3,1,2,1)$. The corresponding covariance matrices are:
$$\begin{pmatrix} 1.21 & 1.10 & -0.33 & -0.33 & 0.11 \\ 1.10 & 1.00 & -0.30 & -0.30 & 0.10 \\ -0.33 & -0.30 & 0.09 & 0.09 & -0.03 \\ -0.33 & -0.30 & 0.09 & 0.09 & -0.03 \\ 0.11 & 0.10 & -0.03 & -0.03 & 0.01 \end{pmatrix}, \quad \dots, \quad
\begin{pmatrix} 0.01 & -0.10 & 0.07 & -0.03 & 0.01 \\ -0.10 & 1.00 & -0.70 & 0.30 & -0.10 \\ 0.07 & -0.70 & 0.49 & -0.21 & 0.07 \\ -0.03 & 0.30 & -0.21 & 0.09 & -0.03 \\ 0.01 & -0.10 & 0.07 & -0.03 & 0.01 \end{pmatrix}.$$
The sample covariance matrix is
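$$\begin{pmatrix} 0.99 & 0.44 & -0.078 & 0.26 & -0.01 \\ 0.44 & 0.89 & -0.22 & 0.22 & -0.11 \\ -0.078 & -0.22 & 0.45 & 0.23 & 0.033 \\ 0.26 & 0.22 & 0.23 & 0.45 & -0.078 \\ -0.01 & -0.11 & 0.033 & -0.078 & 0.10 \end{pmatrix}.$$
Its dominant eigenvalue is about 1.52 (about 13.68 for the sum of the ten matrices, before it is divided by nine), and the corresponding unit eigenvector is approximately $(0.70; 0.65; -0.13; 0.28; -0.07)$. Therefore, the principal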
Student id	Maths	Physics	History	English	PE
1	1	1	2	2	1
2	1	3	1	1	1
3	2	1	1	1	1
4	2	2	2	2	1
5	1	1	3	2	1
6	2	1	2	1	2
7	3	3	2	2	1
8	3	2	1	1	1
9	4	3	2	3	1
10	2	3	1	2	1
Since $p_i$ and $q_j$ are relative frequencies, they sum to 1. So the right-hand side of the equality can be written as
$$\sum_{i,j} p_i q_j \bigl(F(p_i) + F(q_j)\bigr),$$
leading to
$$F(p_i q_j) = F(p_i) + F(q_j).$$
This is satisfied by any constant multiple of a logarithm of any fixed base $a > 1$. It can be shown that no other continuous solution $F$ exists.
Since $p_i \le 1$, we have $\ln p_i \le 0$. The variability must be non-negative, so $F$ is chosen to be a logarithmic function multiplied by $-1$. Such a choice also satisfies $F(1) = 0$, as desired.
Entropy
The measure of variability of nominal values is expressed in terms of entropy. It is given by
$$H_X = -\sum_{i=1}^{k} p_i \ln p_i,$$
where $k$ is the number of sample classes. Sometimes (especially in information theory), the binary logarithm is used instead of the natural logarithm.
One often works with the quantity
$$e^{H_X} = \prod_{i=1}^{k} p_i^{-p_i}$$
(or with another logarithm base).
In this form, for a data set $X$ with $k$ equal class frequencies, compute
$$e^{H_X} = \prod_{i=1}^{k} \left(\frac{1}{k}\right)^{-1/k} = k,$$
which is independent of the sample size. The next illustration shows the base-2 entropy $y$ for the number of occurrences of the letters $a$, $b$ in 10-letter words consisting of these characters, where $x$ is the number of occurrences of $b$.
Note that the maximum entropy 1 occurs for the same number of $a$'s and $b$'s, and indeed $2^1 = 2$ as computed above.
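The values behind such an illustration are easy to recompute. The following short sketch (the function name `entropy2` is our own choice) evaluates the base-2 entropy of a word from its letter frequencies and tabulates it for the 10-letter words over $a$, $b$ with $x$ occurrences of $b$.

```python
from collections import Counter
from math import log2


def entropy2(values):
    """Base-2 entropy of the relative class frequencies of the given values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


# Entropy y as a function of x, the number of b's in a 10-letter word over {a, b}.
curve = [entropy2("a" * (10 - x) + "b" * x) for x in range(11)]
for x, y in enumerate(curve):
    print(x, round(y, 3))

# Eight equally frequent classes reach the theoretical maximum log2(8) = 3.
print(entropy2("abcdefgh"))
```

The curve is symmetric in $x$ and vanishes for $x = 0$ and $x = 10$, where only one class occurs.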
The following illustration displays the entropy of 11 randomly chosen strings of length 10 made of 8 characters. The values are all much less than the theoretical maximal value of 3. This reflects the fact that the number of occurrences of the individual 8 characters cannot be equal (or it could happen with a very small probability if the length of the string was 8
If the same is done with, say, strings of length 10000, we would get very close to 3 (typically the difference would be in the order of $10^{-3}$, if the random string generator was good enough).
component is $(1.58; 2.73; 2.13; 2.93; 1.45; 1.93; 4.28; 3.48; 5.26; 3.71)$. □
Another possible method of visualization of multidimensional data is the so-called cluster analysis, but we will not go into further details here.
C. Classical and conditional probability
In the first chapter, we met the so-called classical probability, see 1.4.1. Just to recall it, let us try to solve the following (a bit more complicated) problem:
10.C.1. Ales wants to buy a new bike, which costs 5100 crowns. He has 2500 crowns left from organizing a camp. Ales is no dope: he took 50 more crowns from his pocket money and went to the casino to play the roulette. Ales always bets on red. This means that the probability of winning is 18/37 and the amount he wins is equal to the amount he has bet. His betting strategy is as follows: The first time, he bets 10 crowns. Each time he has lost, he bets twice the previous bet (if he does not have enough money to make this bet, he leaves the casino, deeply depressed). Each time he has won, he bets 10 crowns again. What is the probability that, using this strategy, he wins the desired 2550 more crowns? (As soon as this happens, he immediately runs to buy the bike.)
Solution. First of all, we calculate how many times Ales can lose in a row. If he bets 10 crowns the first time, then in order to bet n times, he needs
$$10 + 20 + \dots + 10\cdot 2^{n-1} = 10\sum_{i=0}^{n-1} 2^i = 10\cdot\frac{2^n - 1}{2 - 1} = 10\,(2^n - 1)$$
crowns.
As we can see, the number 2550 is of the form $10(2^n - 1)$, for $n = 8$. This means that Ales can bet eight times in a row regardless of the odds. He can never bet nine times in a row, because for that he would have to have $10(2^9 - 1) = 5110$ crowns, which he will never reach (he stops betting as soon as he has 5100 crowns). Therefore, Ales loses the whole game if and only if he loses eight consecutive bets. The probability of losing one bet is $19/37$; hence, the probability of losing eight consecutive (independent) bets is $(19/37)^8$. Thus, the probability that he wins 10 crowns (using his strategy) is $1 - (19/37)^8$. In order to win 2550 crowns, he must win 255 times, and the probability of this is
$$\left(1 - \left(\frac{19}{37}\right)^{8}\right)^{255} \approx 0.29.$$
2. Probability
Before further reading, the reader is advised to go through the fourth part of chapter one (the subsection beginning on page 18). Back then, we worked mainly with classical finite probability. We defined the basics of a formalism which we extend now. The main extension is that the sample space $\Omega$ can be infinite, even uncountable. Recall that when we talked about geometric probability at the end of the fourth part of chapter one, the sample space for description of an event was a part of the Euclidean space, and events were suitable subsets of it. All of those sets were uncountable.
Begin with a simple (infinite, yet still discrete) example, to which we return from time to time throughout this section.
10.2.1. Why infinite sets of events? Imagine an experiment where a coin is repeatedly tossed until it comes up heads. There are many questions to be asked about this experiment: What is the probability of tossing the coin at least 3 times (or exactly a given number of times, or at most 10 times, etc.)? The outcomes of this experiment can be considered in the form $\omega_k$, $k \in \mathbb{N}_{\ge 1} \cup \{\infty\}$, which could be read as "the coin comes up heads for the first time in the $k$-th toss". Note that $k = \infty$ is inserted, since the possibility that the coin always comes up tails must be allowed, too.
This problem is solved if the classical probability 1/2 of the coin coming up heads in one toss is used (and the same for tails). In the abstract model, the total number of tosses cannot be bounded by any natural number $N$. On the other hand, the probability that the coin comes up tails in the first $(k - 1)$ tosses and heads in the $k$-th toss, out of the total number of $n \ge k$ tosses, is given by the fraction
$$\frac{2^{n-k}}{2^{n}} = 2^{-k},$$
where in the numerator, there is the number of favorable possibilities out of $n$ independent tosses (i.e. the number of ways to distribute two values into the $n-k$ remaining positions), while in the denominator, there is the number of
Therefore, the probability of winning using his strategy is much lower than if he bet everything on red straightaway.
□
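The arithmetic of 10.C.1 can be checked exactly with rational numbers. This is a small sketch only; the constant names are ours, and the comparison value $18/37$ is the probability of winning a single bet of the whole 2550 crowns on red.

```python
from fractions import Fraction

P_LOSE = Fraction(19, 37)      # probability of losing a single bet on red
MAX_LOSSES_IN_ROW = 8          # 10 * (2**8 - 1) = 2550 crowns cover eight doubled bets
ROUNDS_NEEDED = 255            # each successful round nets 10 crowns; 255 * 10 = 2550

# A round is won unless eight bets in a row are lost.
p_round = 1 - P_LOSE ** MAX_LOSSES_IN_ROW
# All 255 rounds must be won before the first run of eight losses.
p_strategy = p_round ** ROUNDS_NEEDED
# Alternative: bet all 2550 crowns on red at once.
p_single_bet = Fraction(18, 37)

print(float(p_round), float(p_strategy), float(p_single_bet))
```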
10.C.2. You could try to solve a slight modification of the above problem: Ales stops playing only if he loses all his money; if he still has some money, but not enough to bet twice the previous bet, he bets 10 crowns again.
We also met the conditional probability in the first chapter, see 1.4.8.
10.C.3. Let $A$, $B$ be two events such that $B$ is a disjoint union of events $B_1, B_2, \dots, B_n$. Using the definition of conditional probability (see 10.2.6), prove that
$$P(A|B) = \sum_{i=1}^{n} P(A|B_i)\,P(B_i|B). \tag{1}$$
Solution. First, note that the events $A\cap B_1, A\cap B_2, \dots, A\cap B_n$ are also disjoint. Therefore, we can write
$$\begin{aligned}
P(A|B_1\cup\dots\cup B_n) &= \frac{P\bigl(A\cap(B_1\cup\dots\cup B_n)\bigr)}{P(B_1\cup\dots\cup B_n)}
= \frac{P\bigl((A\cap B_1)\cup(A\cap B_2)\cup\dots\cup(A\cap B_n)\bigr)}{P(B)} \\
&= \sum_{i=1}^{n} \frac{P(A\cap B_i)}{P(B_i)}\cdot\frac{P(B_i)}{P(B)}
= \sum_{i=1}^{n} P(A|B_i)\,P(B_i|B).
\end{aligned}$$
□
10.C.4. We have four bags with balls: In the first bag, there are four white balls. In the second bag, there are three white balls and one black ball. In the third bag, there are two white and two black balls. Finally, in the fourth bag, there are four black balls. We randomly pick a bag and take two balls out of it (without putting the first one back). Find the probability that
a) the balls are of different colors;
b) the second ball is white provided the first ball was white.
Solution. Since there is the same number of balls in each of the bags, any ball has the same probability of being taken (similarly for any pair of balls lying in the same bag). Therefore, we can solve this problem using classical probability.
a) Altogether, there are 24 pairs of balls that can be taken. Out of them, 7 consist of balls of different colors. Therefore, the wanted probability is 7/24.
all possible outcomes. As expected, this probability is independent of the chosen $n$, and $\sum_{k=1}^{\infty} 2^{-k} = 1$. Therefore, the probability of tossing only tails is zero.
Thus we can define probability on the sample space $\Omega$ with sample points (outcomes) $\omega_k$, whose probability is $2^{-k}$. This leads to a probability according to the following definitions.
We return to this example throughout this section.
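The questions posed in 10.2.1 can be answered directly from the probabilities $2^{-k}$ of the sample points $\omega_k$. A brief sketch with exact fractions follows; the function names are ours.

```python
from fractions import Fraction


def p_first_head_at(k):
    """Probability that a fair coin comes up heads for the first time in the k-th toss."""
    return Fraction(1, 2 ** k)


def p_at_most(m):
    """Probability that at most m tosses are needed."""
    return sum(p_first_head_at(k) for k in range(1, m + 1))


p_at_least_3 = 1 - p_at_most(2)   # the outcome "tails forever" has probability zero
p_at_most_10 = p_at_most(10)
print(p_at_least_3, p_at_most_10)
```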
10.2.2. $\sigma$-fields. Work with a fixed non-empty set $\Omega$, which contains the possible outcomes of the experiment and which is called the sample space. The possible outcomes $\omega \in \Omega$ are also called sample points. In probability models, not all subsets of outcomes need be admitted. In particular, the singletons $\{\omega\}$ need not be considered. Those subsets whose probability we want to measure are required to satisfy the axioms of the so-called $\sigma$-algebras.
The axioms listed below are chosen from a larger collection of natural requirements in a minimal form. The first one is based on the assumption that the universal event should be a measurable set. The second one is forced by the assumption that events can be negated. The third one reflects the necessity to examine the event of the occurrence of at least one event from a countably infinite collection. (For instance, in the example from the previous subsection, the coin is tossed only finitely many times, but there is no upper bound on the number of tosses.)
$\sigma$-ALGEBRAS OF SUBSETS
A collection $\mathcal{A}$ of subsets of the sample space is called a $\sigma$-algebra or $\sigma$-field, and its elements are called events or measurable sets, if and only if
• $\Omega \in \mathcal{A}$, i.e., the sample space is an event;
• if $A, B \in \mathcal{A}$, then $A \setminus B \in \mathcal{A}$, i.e., the set difference of two events is also an event;
• if $A_i \in \mathcal{A}$, $i \in I$, is a countable collection of events, then their union is also an event, i.e., $\bigcup_{i\in I} A_i \in \mathcal{A}$.
As usual, the basic axioms imply simple corollaries which describe further (intuitively required) properties in the form of mathematical theorems. The reader should check carefully that both following properties hold.
• The complement $A^c = \Omega \setminus A$ of an event $A$ is again an event.
• The intersection of two events is again an event since for any two subsets $A, B \subset \Omega$,
$$A \setminus (\Omega \setminus B) = A \cap B.$$
Actually, for any countable system of events $A_i$, $i \in I$, the event
$$\Omega \setminus \bigcup_{i\in I} A_i^c = \bigcap_{i\in I} A_i$$
is also in the $\sigma$-algebra $\mathcal{A}$.
Altogether, a $\sigma$-algebra is a collection of subsets of the sample space which is closed with respect to set differences, countable unions, and countable intersections.
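For a finite sample space the axioms can be tested mechanically, because countable unions reduce to unions of two events. The sketch below rests on that assumption (it says nothing about infinite $\Omega$); the function names are ours.

```python
from itertools import combinations


def is_sigma_algebra(omega, events):
    """Check the three axioms for a collection of subsets of a FINITE set omega."""
    omega = frozenset(omega)
    events = {frozenset(e) for e in events}
    if omega not in events or any(not e <= omega for e in events):
        return False
    for a in events:
        for b in events:
            # closure under set difference and under (pairwise, hence finite) unions
            if a - b not in events or a | b not in events:
                return False
    return True


def power_set(omega):
    """All subsets of omega, the largest sigma-algebra on a finite sample space."""
    items = list(omega)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


print(is_sigma_algebra({1, 2, 3}, power_set({1, 2, 3})))              # all subsets
print(is_sigma_algebra({1, 2, 3}, [set(), {1}, {2, 3}, {1, 2, 3}]))   # generated by {1}
print(is_sigma_algebra({1, 2, 3}, [set(), {1}, {1, 2, 3}]))           # complement of {1} missing
```

The second collection shows that the singletons $\{2\}$ and $\{3\}$ need not be events, in line with the remark on singletons in 10.2.2.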
b) Let $A$ denote the event that the first ball is white and $B$ denote the event that the second ball is white. Then, $P(B\cap A)$ is the probability that both balls are white, and this is equal to $10/24 = 5/12$ since there are 10 such pairs. Again, we can use classical probability to calculate $P(A)$: there are 16 balls in total, and 9 of them are white. Altogether, we have
$$P(B|A) = \frac{P(B\cap A)}{P(A)} = \frac{5/12}{9/16} = \frac{20}{27}.$$
Another solution. The event $A$ can be viewed as the union of three mutually exclusive events $A_1, A_2, A_3$ that we took a white ball from the first, second, and third bag, respectively. Since there is the same number of balls in each of the bags, the probability of taking any (white) ball is also the same (independent of which ball it is), so we get $P(A) = \frac{9}{16}$ and $P(A_1|A) = \frac{4}{9}$, $P(A_2|A) = \frac{3}{9} = \frac{1}{3}$,
10.2.3. Probability space. Now introduce probability in the mathematical model, recalling the concepts used already in the first chapter.
Elementary concepts
Use the following terminology in connection with events:
• the entire sample space $\Omega$ is called the universal event; the empty set $\emptyset \in \mathcal{A}$ is called the null event;
• the singletons $\omega \in \Omega$ are called elementary events (note that $\{\omega\}$ may not even be an event in $\mathcal{A}$);
• the intersection of events $\bigcap_{i\in I} A_i$ corresponds to the simultaneous occurrence of all the events $A_i$, $i \in I$;
• the union of events $\bigcup_{i\in I} A_i$ corresponds to the occurrence of at least one of the events $A_i$, $i \in I$;
• if $A \cap B = \emptyset$, then $A, B \in \mathcal{A}$ are called exclusive events or disjoint events;
• if $A \subset B$, then the event $A$ implies the event $B$;
• if $A \in \mathcal{A}$, then the event $B = \Omega \setminus A$ is called the complementary event to $A$ and denoted $B = A^c$.
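$P(A_3|A) = \frac{2}{9}$. Applying (1) from 10.C.3, we obtain
$$\begin{aligned}
P(B|A) &= P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) \\
&= P(B|A_1)\frac{P(A_1)}{P(A)} + P(B|A_2)\frac{P(A_2)}{P(A)} + P(B|A_3)\frac{P(A_3)}{P(A)} \\
&= 1\cdot\frac{4}{9} + \frac{2}{3}\cdot\frac{3}{9} + \frac{1}{3}\cdot\frac{2}{9} = \frac{20}{27}.
\end{aligned}$$
□
We have seen an example of probability defined on an infinite sample space in 10.2.1 above. In general, probability is defined as follows:
Probability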
10.C.5. We have four bags with balls: In the first bag, there are four white balls. In the second bag, there are three white balls and one black ball. In the third bag, there are two white and two black balls. Finally, in the fourth bag, there are one white and three black balls. We randomly pick a bag and take a ball out of it, finding out that it is black. Then we throw away this bag, pick another one and take a ball out of it. What is the probability that it is white?
Solution. Similarly as in the above exercise, let $A$ denote the event that the very first ball is black. This event can be viewed as the union of mutually exclusive events $A_i$, $i = 2, 3, 4$, where $A_i$ is the event of picking the $i$-th bag and taking a black ball from there. Again, the probability of picking any (black) ball is the same. Hence, $P(A_2|A) = \frac{1}{6}$, $P(A_3|A) = \frac{2}{6} = \frac{1}{3}$, and $P(A_4|A) = \frac{3}{6} = \frac{1}{2}$. Let $B$ denote the event that the second ball is white. If the thrown bag is the second one, then there are a total of 7 white balls remaining, so the probability of taking one of them is $P(B|A_2) = \frac{7}{12}$ (we can use classical probability again because each of the bags contains the same number of balls, so any ball has the same probability of being taken). Similarly, $P(B|A_3) = \frac{8}{12} = \frac{2}{3}$
Definition. A probability space is the $\sigma$-algebra $\mathcal{A}$ of subsets of the sample space $\Omega$ on which there is a scalar function $P: \mathcal{A} \to \mathbb{R}$ with the following properties:
• $P$ is non-negative, i.e., $P(A) \ge 0$ for all events $A$;
• $P$ is countably additive, i.e.,
$$P\Bigl(\bigcup_{i\in I} A_i\Bigr) = \sum_{i\in I} P(A_i)$$
for every countable collection of mutually exclusive events;
• the probability of the universal event is 1.
The function $P$ is called the probability function on $(\Omega, \mathcal{A})$.
Immediately from the definition, the complementary event satisfies
$$P(A^c) = 1 - P(A).$$
In chapter one, theorems on addition of probabilities were derived. Although dealing with finite sample spaces, the arguments remain the same now. In particular, the inclusion and exclusion principle says for any finite collection of k events Ai that
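$$\begin{aligned}
P\Bigl(\bigcup_{i=1}^{k} A_i\Bigr) = {} & \sum_{i=1}^{k} P(A_i) - \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} P(A_i \cap A_j) \\
& + \sum_{i=1}^{k-2}\sum_{j=i+1}^{k-1}\sum_{\ell=j+1}^{k} P(A_i \cap A_j \cap A_\ell) - \dots \\
& + (-1)^{k-1} P(A_1 \cap A_2 \cap \dots \cap A_k).
\end{aligned}$$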
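The inclusion and exclusion principle is easy to test on a classical finite probability space. The sketch below uses an example of our own choosing (a random number from 1 to 30 and the events of divisibility by 2, 3, and 5) and compares the formula with the directly counted probability of the union.

```python
from fractions import Fraction
from itertools import combinations


def classical_probability(event, omega):
    """Classical probability |A| / |Omega| on a finite sample space."""
    return Fraction(len(event), len(omega))


def union_by_inclusion_exclusion(events, omega):
    """Alternating sum over all non-empty sub-collections of the given events."""
    total = Fraction(0)
    for size in range(1, len(events) + 1):
        sign = (-1) ** (size - 1)
        for chosen in combinations(events, size):
            common = set(omega).intersection(*chosen)
            total += sign * classical_probability(common, omega)
    return total


OMEGA = set(range(1, 31))
EVENTS = [{n for n in OMEGA if n % d == 0} for d in (2, 3, 5)]
direct = classical_probability(set().union(*EVENTS), OMEGA)
via_formula = union_by_inclusion_exclusion(EVENTS, OMEGA)
print(direct, via_formula)
```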
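and $P(B|A_4) = \frac{9}{12} = \frac{3}{4}$. Applying (1) from 10.C.3, we get that the wanted probability is
$$P(B|A) = P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) + P(B|A_4)P(A_4|A) = \frac{7}{12}\cdot\frac{1}{6} + \frac{2}{3}\cdot\frac{1}{3} + \frac{3}{4}\cdot\frac{1}{2} = \frac{25}{36}.$$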
□
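The result of 10.C.5 can be cross-checked by summing over the bags with exact fractions. The routine below is only a sketch under the assumptions stated in the exercise (each bag is picked with the same probability, and after discarding the first bag one of the remaining bags is again picked uniformly); the function name is ours. The same routine also accepts the bag contents of 10.C.6.

```python
from fractions import Fraction


def second_white_given_first(bags, first_colour):
    """bags: list of (white, black) counts. Returns P(second ball white | first ball colour)."""
    n = len(bags)
    joint = Fraction(0)   # P(first ball has the given colour and the second ball is white)
    first = Fraction(0)   # P(first ball has the given colour)
    for i, (white, black) in enumerate(bags):
        hits = white if first_colour == "white" else black
        p_first = Fraction(1, n) * Fraction(hits, white + black)
        # the i-th bag is thrown away; one of the others is picked uniformly
        p_second = sum(
            Fraction(1, n - 1) * Fraction(w, w + b)
            for j, (w, b) in enumerate(bags)
            if j != i
        )
        first += p_first
        joint += p_first * p_second
    return joint / first


answer_c5 = second_white_given_first([(4, 0), (3, 1), (2, 2), (1, 3)], "black")
answer_c6 = second_white_given_first([(1, 1), (3, 1), (1, 2), (1, 3)], "white")
print(answer_c5, answer_c6)
```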
10.C.6. We have four bags with balls: In the first bag, there are a white ball and a black ball. In the second bag, there are three white balls and one black ball. In the third bag, there are one white and two black balls. Finally, in the fourth bag, there are one white and three black balls. We randomly pick a bag and take a ball out of it, finding out that it is white. Then we throw away this bag, pick another one and take a ball out of it. What is the probability that it is white?
Solution. Similarly as in the above exercise, we view the event $A$ of the first ball being white as the union of four mutually exclusive events $A_1$, $A_2$, $A_3$, and $A_4$ that we take a white ball from the first, second, third, and fourth bag, respectively. The probability of taking a white ball out of the first bag is $P(A_1) = \frac{1}{4}\cdot\frac{1}{2}$ (the probability of $A_1$ is the product of the probability that we pick the first bag and the probability that we take a white ball from there); similarly, $P(A_2) = \frac{1}{4}\cdot\frac{3}{4}$, $P(A_3) = \frac{1}{4}\cdot\frac{1}{3}$, $P(A_4) = \frac{1}{4}\cdot\frac{1}{4}$. Hence
$$P(A) = P(A_1) + P(A_2) + P(A_3) + P(A_4) = \frac{11}{24}.$$
Note that the probability $P(A)$ cannot be calculated classically, i. e., by simply dividing the number of white balls by the total number of the balls, because, for instance, the probability of taking a white ball from the first bag is twice as great as from the fourth bag. As for the conditional probabilities, we have $P(A_1|A) = P(A_1)/P(A) = \frac{3}{11}$, $P(A_2|A) = \frac{9}{22}$, $P(A_3|A) = \frac{2}{11}$, $P(A_4|A) = \frac{3}{22}$. Now, let $B$ denote the event that we take another white ball after we have thrown away the first bag. We want to apply (1) from 10.C.3 again. It remains to compute $P(B|A_i)$, $i = 1, \dots, 4$. The probability $P(B|A_1)$ can be computed as the sum of the probabilities of the mutually exclusive events $B_2$, $B_3$, $B_4$ (given $A_1$) that the second white ball comes from the second, third, fourth bag, respectively. Altogether, we have
$$P(B|A_1) = P(B_2|A_1) + P(B_3|A_1) + P(B_4|A_1) = \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{1}{4} = \frac{4}{9}.$$
Similarly,
The reader should look back at 1.4.5 and think about the details.
10.2.4. Independent events. The definition of stochastically independent events also remains unchanged. It reflects the intuition that the probability of the simultaneous occurrence of independent events is equal to the product of the particular probabilities.
Stochastic independence
Events A, B are said to be stochastically independent if and only if
$$P(A \cap B) = P(A)P(B).$$
Of course, the universal event and the null event are stochastically independent of any event.
Recall that replacing an event $A_i$ with the complementary event $A_i^c$ in a collection of stochastically independent events $A_1, A_2, \dots$ again results in a collection of stochastically independent events, and (see ??, page ??)
$$P(A_1 \cup \dots \cup A_k) = 1 - P(A_1^c \cap \dots \cap A_k^c) = 1 - (1 - P(A_1)) \cdots (1 - P(A_k)).$$
Classical finite probability remains the fundamental example of probability, used as the inspiration during creation of the mathematical model. Recall that in this case, $\Omega$ is a finite set, the $\sigma$-algebra $\mathcal{A}$ is the collection of all subsets of $\Omega$, and the classical probability is the probability space $(\Omega, \mathcal{A}, P)$ with probability function $P: \mathcal{A} \to \mathbb{R}$,
$$P(A) = \frac{|A|}{|\Omega|}.$$
This corresponds precisely to the intuition about the relative frequency $p_A$ of an event $A$ when drawing a random element from the sample set $\Omega$.
This definition of probability guarantees reasonable behaviour of monotone sequences of events:
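10.2.5. Theorem. Consider a probability space $(\Omega, \mathcal{A}, P)$ and a non-decreasing sequence of events $A_1 \subset A_2 \subset \dots$. Then,
$$P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \lim_{i\to\infty} P(A_i).$$
Similarly, if $A_1 \supset A_2 \supset A_3 \supset \dots$, then
$$P\Bigl(\bigcap_{i=1}^{\infty} A_i\Bigr) = \lim_{i\to\infty} P(A_i).$$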
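$$P(B|A_2) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{1}{3} + \frac{1}{3}\cdot\frac{1}{4} = \frac{13}{36}, \qquad
P(B|A_3) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{4} = \frac{1}{2},$$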
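Proof. The considered union $A = \bigcup_{i=1}^{\infty} A_i$ can be written in terms of mutually exclusive events
$$\tilde{A}_i = A_i \setminus A_{i-1},$$
defined for all $i = 2, 3, \dots$. Set $\tilde{A}_1 = A_1$. Then,
$$P(A) = P\Bigl(\bigcup_{i=1}^{\infty} \tilde{A}_i\Bigr) = \sum_{i=1}^{\infty} P(\tilde{A}_i) = \lim_{k\to\infty} \sum_{i=1}^{k} P(\tilde{A}_i).$$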
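$$P(B|A_4) = \frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} + \frac{1}{3}\cdot\frac{1}{3} = \frac{19}{36}.$$
Altogether, we get
$$\begin{aligned}
P(B|A) &= P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + P(B|A_3)P(A_3|A) + P(B|A_4)P(A_4|A) \\
&= \frac{4}{9}\cdot\frac{3}{11} + \frac{13}{36}\cdot\frac{9}{22} + \frac{1}{2}\cdot\frac{2}{11} + \frac{19}{36}\cdot\frac{3}{22} = \frac{19}{44}.
\end{aligned}$$
□
For the finite sums,
$$\sum_{i=1}^{n} P(\tilde{A}_i) = P(A_1) + \sum_{i=2}^{n} \bigl(P(A_i) - P(A_{i-1})\bigr) = P(A_n).$$
This proves the first part of the theorem.
In the second part, consider the complements $B_i = A_i^c$ instead of the events $A_i$. They satisfy the assumptions of the first part of this theorem. Then, the complement of the considered intersection is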
10.C.7. Two shooters shoot at a target, each makes two shots. Their respective accuracies are 80 % and 60 %. We have found two hits in the target. What is the probability that they belong to the first shooter?
Solution. The probability of hitting the target is 4/5 for the first shooter, and 3/5 for the second one. Consider the events: A... there are two hits in the target, both of the first shooter,
B ... there are two hits in the target.
Our task is to find $P(A|B)$. We can divide the event $B$ into six disjoint events according to which shot(s) of each shooter was/were successful. We enumerate the events in a table and, for each of them, we compute its probability. This is easy as each of the events is the intersection of four independent events (results of the four shots). A hit is denoted by 1, a miss by 0.
Event	Shooter 1, shot 1	Shooter 1, shot 2	Shooter 2, shot 1	Shooter 2, shot 2	Probability
B1	0	1	0	1	1/5 · 4/5 · 2/5 · 3/5 = 24/625
B2	0	1	1	0	24/625
B3	1	0	1	0	24/625
B4	1	0	0	1	24/625
B5	1	1	0	0	64/625
B6	0	0	1	1	9/625
Adding up the probabilities of these disjoint events, we get:
$$P(B) = \sum_{i=1}^{6} P(B_i) = \frac{169}{625}.$$
Now, we can compute the conditional probability, using the formula of subsection 10.2.6:
$$P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B_5)}{P(B)} = \frac{64/625}{169/625} = \frac{64}{169} \approx 0.38.$$
□
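The table of 10.C.7 can be reproduced by enumerating all sixteen outcomes of the four independent shots. This is a small sketch with exact fractions; the variable names are ours.

```python
from fractions import Fraction
from itertools import product

HIT = (Fraction(4, 5), Fraction(3, 5))   # accuracies of shooter 1 and shooter 2

p_two_hits = Fraction(0)      # P(B): exactly two hits in the target
p_both_first = Fraction(0)    # P(A and B): exactly two hits, both by shooter 1
# Each outcome is (shooter 1 shot 1, shooter 1 shot 2, shooter 2 shot 1, shooter 2 shot 2).
for shots in product((0, 1), repeat=4):
    prob = Fraction(1)
    for k, hit in enumerate(shots):
        accuracy = HIT[k // 2]
        prob *= accuracy if hit else 1 - accuracy
    if sum(shots) == 2:
        p_two_hits += prob
        if shots == (1, 1, 0, 0):
            p_both_first += prob

p_a_given_b = p_both_first / p_two_hits
print(p_two_hits, p_a_given_b, float(p_a_given_b))
```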
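$$B = A^c = \Bigl(\bigcap_{i=1}^{\infty} A_i\Bigr)^{c} = \bigcup_{i=1}^{\infty} B_i.$$
The desired statement follows from the fact that
$$P(A) = 1 - P(B) = 1 - \lim_{i\to\infty} P(B_i) = \lim_{i\to\infty} \bigl(1 - P(B_i)\bigr) = \lim_{i\to\infty} P(A_i),$$
which completes the proof. □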
10.2.6. Conditional probability. Consider the following problem: On average, 40% of students succeed in course X and 80% of students succeed in course Y. If a random student enrolled in both these courses says that he has passed one of them (but we do not catch which one), what is the probability that he meant course X?
As mentioned in subsection 1.4.8 (page 24), such problems can be formalized in the way described below.
Conditional probability
Definition. Let $H$ be an event with non-zero probability in the $\sigma$-algebra $\mathcal{A}$ of a probability space $(\Omega, \mathcal{A}, P)$. The conditional probability $P(A|H)$ of an event $A \in \mathcal{A}$ with respect to the hypothesis $H$ is defined as
$$P(A|H) = \frac{P(A \cap H)}{P(H)}.$$
The definition corresponds to the intuition from the classical probability that the probability of events $A$ and $H$ occurring simultaneously, provided the event $H$ has occurred, is $P(A\cap H)/P(H)$.
Directly from the definition, the hypothesis H and the event A are independent if and only if P(A) = P(A\H).
At first sight, it may seem that introducing conditional probability does not add anything new. Actually, it is a very important type of approach which is needed in statistics as well. The hypothesis can carry the a priori probability (i.e. the prior belief assumed beforehand), and the resulting probability is said to be posterior (i.e., it is considered to be a consequence of the assumption). This is the core of the Bayesian approach to statistics, as is seen later.
The definition also implies the following result.
10.C.8. We toss a coin. If it comes up heads, we put a white ball into an (initially empty) bag; otherwise we put a black ball there. This is repeated n times. Then, we take a ball randomly from the bag (without replacement). Suppose it is white. What is the probability that another ball we take randomly from the bag is black?
Solution. We will solve the problem for a general (possibly biased) coin. In particular, we assume that the individual tosses are independent and that there exists a fixed probability of the coin coming up heads, which we denote $p$. The event "a ball in the bag is white" corresponds to the event "the coin came up heads in the corresponding toss". Since the first ball was white, we deduce that $p > 0$. We can also see that the probability space "taking a random ball from the bag" is isomorphic to the probability space "tossing a coin". Since we assume that the individual tosses are independent, we also get the independence of the colors of the selected balls. This leads to the conclusion that the probability in question is $1 - p$.
Is this reasoning correct? Do we not expect the probability of taking a black ball to be greater than $1 - p$? See, there were approximately $np$ white and $n(1 - p)$ black balls in the bag, so if we had removed one white ball, the probability of selecting a black one should increase, shouldn't it? Before reading further, try to figure out which (if any) of these two presented reasonings is correct, and whether the probability is also dependent on $n$ (the number of balls in the bag before any were removed).
Now, we select a more sophisticated approach to the problem. Let $B_i$ denote the event "there were $i$ white balls in the bag" (before any were removed), $i \in \{0, 1, 2, \dots, n\}$. Further, let $A$ denote the event "the first ball is white" and $C$ denote the event "the second ball is black". Actually, the event $B_i$ says that the coin came up heads $i$ times out of $n$; hence, its probability is
$$P(B_i) = \binom{n}{i} p^{i} (1 - p)^{n-i}.$$
The conditional probability of taking a white ball provided there are exactly $i$ white balls in the bag is equal to
$$P(A|B_i) = \frac{i}{n}.$$
We are interested in the probability of $C$, knowing that $A$ has occurred, i. e., we want to know $P(C|A)$. Since the events $B_i$ are pairwise disjoint, this is also true for the events $C \cap B_i$. Since $C$ can be decomposed as the disjoint union $\bigcup_{i=0}^{n}$ ($C\,\cap$
Lemma. Let an event $B$ be the union of mutually exclusive events $B_1, B_2, \dots, B_n$. Then,
$$P(A|B) = \sum_{i=1}^{n} P(A|B_i)\,P(B_i|B). \tag{1}$$
Proof. The events $A \cap B_1, A \cap B_2, \dots, A \cap B_n$ are also mutually exclusive. Therefore,
$$\begin{aligned}
P(A|B_1\cup\dots\cup B_n) &= \frac{P\bigl(A\cap(B_1\cup\dots\cup B_n)\bigr)}{P(B_1\cup\dots\cup B_n)}
= \frac{P\bigl((A\cap B_1)\cup(A\cap B_2)\cup\dots\cup(A\cap B_n)\bigr)}{P(B)} \\
&= \sum_{i=1}^{n} \frac{P(A\cap B_i)}{P(B_i)}\cdot\frac{P(B_i)}{P(B)}
= \sum_{i=1}^{n} P(A|B_i)\,P(B_i|B).
\end{aligned}$$
□
Consider the special case $B = \Omega$. Then, the events $B_i$ can be considered the "possible states of the universe", $P(A|B_i)$ expresses the probability of $A$ provided the universe is in its $i$-th state, and $P(B_i|\Omega) = P(B_i)$ is the probability of the universe being in its $i$-th state. By the above lemma,
$$P(A) = P(A|\Omega) = \sum_{i=1}^{n} P(A|B_i)\,P(B_i).$$
This formula is called the law of total probability.
10.2.7. Bayes' theorem. Simple rearrangement of the conditional probability formula leads to
$$P(A \cap B) = P(B \cap A) = P(A)P(B|A) = P(B)P(A|B).$$
There are two important corollaries:
Bayes' rules
Theorem. The probabilities of events $A$ and $B$ satisfy
$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}, \tag{1}$$
$$P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(A^c)P(B|A^c)}. \tag{2}$$
The first proposition is called the inverse probability formula. The second proposition is called the first Bayes' formula.
Proof. The first statement is a mere rearrangement of the formula above the theorem. To obtain the second statement, note that
$$P(B) = P(B \cap A) + P(B \cap A^c).$$
Applying the law of total probability, $P(B) = P(A)P(B|A) + P(A^c)P(B|A^c)$ can be substituted into the inverse probability formula, thereby obtaining the second statement of the theorem. □
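$B_i$), we can write
$$\begin{aligned}
P(C|A) &= P\Bigl(\bigcup_{i=0}^{n} (C \cap B_i) \,\Big|\, A\Bigr) = \sum_{i=0}^{n} P(C \cap B_i | A) \\
&= \frac{1}{P(A)} \sum_{i=0}^{n} P(C \cap A \cap B_i)
= \frac{1}{P(A)} \sum_{i=0}^{n} P(B_i)\,P(A|B_i)\,P(C|A\cap B_i).
\end{aligned}$$
We use the law of total probability and substitute for $P(A)$, which leads to
$$P(C|A) = \frac{\sum_{i=0}^{n} P(B_i)P(A|B_i)P(C|A\cap B_i)}{P(A)} = \frac{\sum_{i=0}^{n} P(B_i)P(A|B_i)P(C|A\cap B_i)}{\sum_{i=0}^{n} P(B_i)P(A|B_i)}. \tag{1}$$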
This formula is sometimes called the second Bayes' formula; it holds in general provided the space $\Omega$ is a disjoint union of the events $B_i$.
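Before evaluating the sums by hand, the second Bayes' formula can be tried out for concrete values. The sketch below assumes $n \ge 2$ (so that a second ball exists) and takes $P(C|A\cap B_i) = \frac{n-i}{n-1}$, the share of black balls among the $n-1$ remaining ones; the function name is ours. For the tested values the quotient equals $1 - p$, independently of $n$.

```python
from fractions import Fraction
from math import comb


def second_black_given_first_white(n, p):
    """Evaluate the second Bayes' formula for n >= 2 tosses and heads probability p."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for i in range(n + 1):
        p_bi = comb(n, i) * p ** i * (1 - p) ** (n - i)   # P(B_i): i white balls in the bag
        p_a = Fraction(i, n)                               # P(A | B_i): first ball white
        p_c = Fraction(n - i, n - 1)                       # P(C | A and B_i): second ball black
        numerator += p_bi * p_a * p_c
        denominator += p_bi * p_a
    return numerator / denominator


p = Fraction(1, 3)
results = {n: second_black_given_first_white(n, p) for n in range(2, 8)}
print(results)
```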
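Since we tossed the coin at least once, we have $n \ge 1$. Now, we can calculate:
$$\begin{aligned}
\sum_{i=0}^{n} P(B_i)P(A|B_i) &= \sum_{i=0}^{n} \binom{n}{i} p^{i}(1-p)^{n-i}\cdot\frac{i}{n}
= \sum_{i=1}^{n} \frac{(n-1)!}{(i-1)!\,(n-i)!}\, p^{i}(1-p)^{n-i} \\
&= \sum_{j=0}^{n-1} \frac{(n-1)!}{j!\,(n-j-1)!}\, p^{j+1}(1-p)^{n-j-1}
= p \sum_{j=0}^{n-1} \binom{n-1}{j} p^{j}(1-p)^{n-1-j} \\
&= p\,\bigl(p + (1-p)\bigr)^{n-1} = p,
\end{aligned}$$
$$\begin{aligned}
\sum_{i=0}^{n} P(B_i)P(A|B_i)P(C|A\cap B_i) &= \sum_{i=0}^{n} \binom{n}{i} p^{i}(1-p)^{n-i}\cdot\frac{i}{n}\cdot\frac{n-i}{n-1}
= \sum_{i=1}^{n-1} \frac{(n-2)!}{(i-1)!\,(n-i-1)!}\, p^{i}(1-p)^{n-i} \\
&= \sum_{i=0}^{n-2} \frac{(n-2)!}{i!\,(n-2-i)!}\, p^{i+1}(1-p)^{n-i-1}
\end{aligned}$$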
Bayes' rule is sometimes formulated in a somewhat more general form, proved similarly as (2):
Let the sample space $\Omega$ be the union of mutually exclusive events $A_1, \dots, A_n$. Then, for any $i \in \{1, \dots, n\}$,
$$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}. \tag{3}$$
10.2.8. Example and remarks. Now, the introductory question from 10.2.6 can be dealt with easily. Consider the event $A$ which corresponds to "the student having passed an exam" and the event $B$ which corresponds to "the exam in question concerning course X". Assume that the probabilities of the exam concerning either course are the same, i.e., $P(B) = P(B^c) = 0.5$. While the wanted probability $P(B \mid A)$ is unclear, the probability $P(A \mid B) = 0.4$ is given, as well as $P(A \mid B^c) = 0.8$.
This is a typical application of Bayes' formula 10.2.7(2). There is no need to calculate $P(A)$ at all:
$$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(B)\,P(A \mid B) + P(B^c)\,P(A \mid B^c)} = \frac{0.5 \cdot 0.4}{0.5 \cdot 0.4 + 0.5 \cdot 0.8} = \frac{1}{3}.$$
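The computation above can be replayed with exact rational arithmetic. The following is only a minimal sketch; the function name `bayes_posterior` and its parameter names are illustrative choices, not notation from the text.

```python
from fractions import Fraction


def bayes_posterior(prior, lik_given_b, lik_given_not_b):
    """Return P(B|A) from P(B), P(A|B) and P(A|B^c) (first Bayes' formula)."""
    numerator = prior * lik_given_b
    # Denominator: P(A) expanded by the law of total probability.
    evidence = numerator + (1 - prior) * lik_given_not_b
    return numerator / evidence


# Exam example: P(B) = 1/2, P(A|B) = 0.4, P(A|B^c) = 0.8.
print(bayes_posterior(Fraction(1, 2), Fraction(2, 5), Fraction(4, 5)))  # 1/3
```

Using `Fraction` avoids rounding, so the result can be compared directly with the value 1/3 obtained by hand.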
In order to better understand the role of the prior probability hypothesis, here is another example.
Consider a university using entrance exams with the following reliability: 99% of intelligent people pass them, while concerning non-intelligent people, only 0.5% are able to pass. It is desired to find the probability that a random student (accepted applicant) of the university is intelligent.
Thus, let A be the event "a random person is intelligent" and B be the event "the person passed the exams successfully". Using Bayes' formula, the probability that A occurs provided B has occurred can be computed. It is only necessary to supply the general probability p = P(A) that a random applicant is intelligent.
$$P(A \mid B) = \frac{p \cdot 0.99}{p \cdot 0.99 + (1-p) \cdot 0.005}.$$
The following table presents the result for various values of p. The first column corresponds to the case that every other applicant is intelligent, etc.
p       | 0.5  | 0.1  | 0.05 | 0.01 | 0.001 | 0.0001
P(A|B)  | 0.99 | 0.96 | 0.91 | 0.67 | 0.17  | 0.02
Therefore, if every other applicant is intelligent, then 99% of the students are intelligent. If only 1% of the population meets an expectation of "intelligence" and the applicants form a good random sample, then only about two thirds of the students are intelligent, etc.
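The values in the table can be recomputed directly from the formula for $P(A \mid B)$. This is a small sketch; the function and parameter names are made up for the illustration, and the default values 0.99 and 0.005 are the reliabilities stated in the example.

```python
def posterior_intelligent(p, pass_if_intelligent=0.99, pass_if_not=0.005):
    """P(A|B): probability that an accepted applicant is intelligent."""
    return p * pass_if_intelligent / (
        p * pass_if_intelligent + (1 - p) * pass_if_not
    )


PRIORS = (0.5, 0.1, 0.05, 0.01, 0.001, 0.0001)

for p in PRIORS:
    # Each line reproduces one column of the table, rounded to two decimals.
    print(f"p = {p:<7} P(A|B) = {posterior_intelligent(p):.2f}")
```

The loop makes the dependence on the prior visible: the test itself is unchanged, yet the posterior drops from 0.99 to 0.02 as $p$ decreases.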
Consider similar tests for the occurrence of a disease, say HIV. Suppose there is a test with the same reliability as the one above, and it is used to test all students present at the university. In this case, assume that the parameter $p$ is close to the one for the entire population (say 1 out of 10000 people is infected, on average), which corresponds to the last column
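$$= p(1-p) \sum_{i=0}^{n-2} \binom{n-2}{i} p^i (1-p)^{n-2-i} = \begin{cases} p(1-p), & n > 1, \\ 0, & n = 1. \end{cases}$$
Substituting this into the second Bayes' formula, we obtain the wanted probability
$$P(C \mid A) = \begin{cases} 0, & n = 1, \\ 1-p, & n > 1. \end{cases}$$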
Thus, the simple reasoning about the probability spaces being isomorphic led to the correct result. The second reasoning was wrong because it omitted the fact that since the first ball was white, the expected number of white balls in the bag (before removing the first one) was greater than np. The calculation highlights the singular case n = 1. □
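The result can be cross-checked by evaluating the second Bayes' formula term by term. The sketch below assumes, as the sums in the solution suggest, that $P(B_i) = \binom{n}{i} p^i (1-p)^{n-i}$, $P(A \mid B_i) = i/n$ and $P(C \mid A \cap B_i) = (n-i)/(n-1)$; the function name `second_bayes` is hypothetical.

```python
from fractions import Fraction
from math import comb


def second_bayes(n, p):
    """P(C|A) via the second Bayes' formula, summed over the states B_0..B_n."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for i in range(n + 1):
        p_bi = comb(n, i) * p**i * (1 - p) ** (n - i)  # assumed P(B_i)
        p_a = Fraction(i, n)                            # assumed P(A|B_i)
        # Assumed P(C|A and B_i); for n = 1 no second ball exists, so it is 0.
        p_c = Fraction(n - i, n - 1) if n > 1 else Fraction(0)
        numerator += p_bi * p_a * p_c
        denominator += p_bi * p_a
    return numerator / denominator


p = Fraction(1, 3)
print(second_bayes(5, p), 1 - p)  # both equal 2/3
print(second_bayes(1, p))         # the singular case n = 1 gives 0
```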
10.C.9. Once upon a time, there was a quiz where the first prize was a Ferrari 599 GTB Fiorano. The contestant who won the final round was taken into a room with three identical doors. Behind two of them, there were goats, while the car was behind the third one. In order to win the car, the contestant had to guess the correct door. First, the contestant pointed at one of the three doors. Then, an assistant opened one of the other two doors, behind which there was a goat. Now, the contestant is given the option to change his guess. Should he do so?
Solution. Of course, we assume that the contestant wants to win the car. First of all, try to examine your intuition for random events. For example, you can reason as follows: "One of the two remaining doors contains the car, each with the same probability. Therefore, it does not matter which door we choose." Or: "The probability of choosing the correct door at the beginning is $\frac{1}{3}$. The shown goat changes nothing, so the probability that the guess is wrong is $\frac{2}{3}$. Therefore, we should change the door, thereby winning with probability $\frac{2}{3}$."
Evidently, it is wise to change the door only if the probability of the car being behind that door is greater than behind the initially chosen one. We consider the following events: $H$ stands for "the initial guess is correct", $A$ stands for "we have changed the door", and $C$ for "we have won". We are thus interested in the probabilities $P(C \mid A)$ and $P(C \mid A^c)$.
First, we choose one of three doors, and the Ferrari is behind one of them, so $P(H) = \frac{1}{3}$ and $P(H^c) = \frac{2}{3}$.
of the table above. Clearly the result of the test is catastrophically unreliable. Only about 2% of the students who are tested positive are really infected!
Note that the problem with both tests is the same one. It is clear that real entrance exams require good selectivity and reliability. So the university marketing must ensure that the actual applicants do not provide a good random sample of the population. Perhaps the university should try to discourage "non-intelligent" people from applying and thus secure a sufficiently low number of such applicants. With diseases, even the very rare occurrence of healthy people testing positive can be devastating. If the test were improved so that it is 100% reliable for positive people, it would have almost no impact on the resulting probabilities in the table.
Thus, if a person is tested positive when diagnosing a rare disease, it is necessary to make further tests. Then, the result P(A\B) of the first test plays the role of the prior probability P(A) during the second test, etc. This approach allows one to "cumulate the experience".
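The idea of "cumulating the experience" can be sketched as a loop in which each posterior becomes the next prior. This sketch assumes that repeated tests are independent given the true state of the person, which the text does not state explicitly; the names `update` and `repeated_positive_tests` are illustrative.

```python
def update(prior, true_positive=0.99, false_positive=0.005):
    """One Bayes update after a positive result of the test from the example."""
    return prior * true_positive / (
        prior * true_positive + (1 - prior) * false_positive
    )


def repeated_positive_tests(prior, count):
    """Posteriors after 1, 2, ..., count positive results in a row.

    Assumption: the test results are conditionally independent given
    whether the person is infected.
    """
    posteriors = []
    for _ in range(count):
        prior = update(prior)  # the posterior plays the role of the new prior
        posteriors.append(prior)
    return posteriors


for k, value in enumerate(repeated_positive_tests(0.0001, 3), start=1):
    print(f"after {k} positive test(s): {value:.4f}")
```

Under this assumption, the posterior starts near 0.02 after the first positive result and rises sharply with each further positive result.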
10.2.9. Borel sets. In practice, one is interested in the probability of events which are expressed by asking whether some numerical quantity falls into a given interval. We illustrate this on the example dealing with the results of students in a given course, measured for instance by the number of points in a written exam (cf. 10.1.1).
On one hand, there is only a finite number of students, and there are only a finite number of possible results (say, the numbers of points in the written exam can be the integers 0 through 20). On the other hand, imagining the results of the students as an analogy to independent rolls of a regular die is inappropriate. Even if a regular 21-hedron existed (it cannot, see chapter 13), that would be somewhat weird.
Thus it is better to focus on the assessing function $X : \Omega \to \mathbb{R}$ on the sample space $\Omega$ of all students and model the probability that its value falls into a fixed interval when a random student is picked. For instance, if the table transferring points into marks A through F is fixed, the probability that the student obtained an A or a B can be modeled.
In the case of a reasonable course, we should expect that the most probable results are somewhere in the middle of the "interval of success", while the ideal result of the full number of points is not very probable. Similarly, if many values of $X$ lie in the interval of failure, this may be perceived at most universities as a significant failure of the lecturer. This is a typical example of the random variables or random vectors, as defined below (it depends on whether the result of just one or several students is chosen randomly).
One way to proceed is to model the behaviour of $X$ as a probability defined for all intervals. This requires the following $\sigma$-algebra:¹
¹ In this connection, we also talk about the $\sigma$-algebra of Borel-measurable sets on $\mathbb{R}^k$, and then the following definition says that random variables are Borel-measurable functions.
Borel sets
The Borel sets in $\mathbb{R}$ are all those subsets that can be obtained from intervals using complements, countable unions, and countable intersections.
More generally, on the sample space $\Omega = \mathbb{R}^k$, one considers the smallest $\sigma$-algebra $\mathcal{B}$ which contains all $k$-dimensional intervals.
The sets in $\mathcal{B}$ are called the Borel sets on $\mathbb{R}^k$.
10.2.10. Random variables. The probabilities of the individual intervals in the Borel algebra are usually given as follows. Consider a numerical quantity $X$ on any sample space, that is, a function $X : \Omega \to \mathbb{R}$. Since it is desired to work with the probability of $X$ taking on values from any fixed interval, the probability space and the properties of the function $X$ have to allow this.
Notice that working with finite probability spaces where all subsets are events, every function $X : \Omega \to \mathbb{R}$ is a random variable in the following sense.
Random variables and vectors
Definition. A random variable $X$ on a probability space $(\Omega, \mathcal{A}, P)$ is a function $X : \Omega \to \mathbb{R}$ such that the inverse image $X^{-1}(B)$ lies in $\mathcal{A}$ for every Borel set $B \in \mathcal{B}$ on $\mathbb{R}$. The real-valued function $P_X(B) = P(X^{-1}(B))$ defined on all intervals $B \subset \mathbb{R}$ is called the (probability) distribution of a random variable $X$.
A random vector $X = (X_1, \dots, X_k)$ on $(\Omega, \mathcal{A}, P)$ is a $k$-tuple of random variables $X_i : \Omega \to \mathbb{R}$ defined on the same probability space $(\Omega, \mathcal{A}, P)$.
We assume that the event of changing the door is independent of the original guess, hence
$$P(A \mid H) = P(A \mid H^c) = P(A), \qquad P(A^c \mid H) = P(A^c \mid H^c) = P(A^c).$$
If the original guess is correct and it is changed, then we surely lose; while if it is originally wrong and then it is changed, then we surely win. Therefore, we have
$$P(C \mid A \cap H) = 0 = P(C \mid A^c \cap H^c),$$
$$P(C \mid A^c \cap H) = 1 = P(C \mid A \cap H^c).$$
It follows from the second Bayes' formula (1) that
$$P(C \mid A) = \frac{P(H)\,P(A \mid H)\,P(C \mid A \cap H) + P(H^c)\,P(A \mid H^c)\,P(C \mid A \cap H^c)}{P(A)} = P(H^c) = \frac{2}{3}$$
and, analogously,
$$P(C \mid A^c) = \frac{P(H)\,P(A^c \mid H)\,P(C \mid A^c \cap H) + P(H^c)\,P(A^c \mid H^c)\,P(C \mid A^c \cap H^c)}{P(A^c)} = P(H) = \frac{1}{3}.$$
We have thus obtained $P(C \mid A) > P(C \mid A^c)$, which means that it is wise to change the door.
Note that the solution is based upon the assumption that the assistant deliberately opens a door behind which there is a goat. If the contestant believes it was an accident or if instead, say, he happens to see (or hear) a goat behind one of the two not chosen doors, then the first reasoning is correct and the probability remains $\frac{1}{2}$.
□
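The two probabilities can also be estimated by simulation. The sketch below models the assistant who deliberately opens a goat door; the function names, the number of trials and the fixed seed are arbitrary choices made for reproducibility.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    car = rng.randrange(3)
    guess = rng.randrange(3)
    # The assistant deliberately opens a goat door other than the guess.
    # When two such doors exist, the lowest is taken; this choice does not
    # affect the winning probability of either strategy.
    opened = next(d for d in range(3) if d != guess and d != car)
    if switch:
        guess = next(d for d in range(3) if d != guess and d != opened)
    return guess == car


def win_rate(switch, trials=100_000, seed=0):
    """Relative frequency of wins for the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("switch:", win_rate(True))   # close to 2/3
print("stay:  ", win_rate(False))  # close to 1/3
```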
10.C.10. We have two bags. The first one contains two white and two black balls, while the second one contains one white and two black balls. We randomly select one of the bags and take two balls out of it (without replacement). What is the probability that the second ball is black provided the first one is white? O
D. What is probability?
First of all, recall the geometric probability, which was introduced in ??.
10.D.1. Buffon's needle. A plane is covered with parallel lines, creating bands of width $l$. Then, a needle of length $l$ is thrown onto the plane. What is the probability that the needle crosses one of the lines?
If intervals $I_1, \dots, I_k$ in $\mathbb{R}$ are chosen, then the probability of simultaneous occurrence of all of the $k$ events $X_i \in I_i$ must exist. Thus, as in the scalar case, there is a real-valued function defined on the $k$-dimensional intervals $B = I_1 \times \cdots \times I_k$, $P_X(B) = P(X^{-1}(B))$ (and thus also for all Borel sets $B \subset \mathbb{R}^k$). It is called the probability distribution of the random vector $X$.
10.2.11. Distribution function. The distribution of random variables is usually given by a rule which shows how the probability grows as the interval $B$ is extended.
In particular, consider the intervals $I$ with endpoints $a$, $b$, $-\infty \le a \le b \le \infty$. Denote by $P(a < X < b)$ the probability of $X$ lying in $I = (a, b)$, or $P(X < b)$ if $a = -\infty$; and analogously for other types of intervals. In the special case of a singleton, write $P(X = a)$.
In the case of a random vector $X = (X_1, \dots, X_k)$, write $P(a_1 < X_1 < b_1, \dots, a_k < X_k < b_k)$ for the probability of simultaneous occurrence of the events where the values of $X_i$ fall into the corresponding intervals (which may also be closed, unbounded, etc.).
Solution. The position of the needle is given by two independent parameters: the distance $d$ of the needle's center from the closest line ($d \in [0, l/2]$) and the angle $\alpha$ ($\alpha \in [0, \pi/2]$) between the lines and the needle's direction. The needle crosses one of the lines if and only if $\frac{l}{2} \sin\alpha > d$. The space of all events $(\alpha, d)$ is a rectangle $\pi/2 \times l/2$. The favorable events $(\alpha, d)$ (i.e. those for which $\frac{l}{2} \sin\alpha > d$) correspond to those points in the rectangle which lie under the curve $\frac{l}{2} \sin\alpha$ ($\alpha$ being the variable of the $x$-axis). By 6.2.21, the area of the figure is
$$\int_0^{\pi/2} \frac{l}{2} \sin\alpha \, d\alpha = \frac{l}{2}.$$
Thus, the wanted probability is (see ??)
$$\frac{l/2}{(\pi/2)(l/2)} = \frac{2}{\pi}.$$
□
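The geometric model of the solution translates directly into a Monte Carlo estimate: sample the point $(\alpha, d)$ uniformly in the rectangle and count how often it falls under the curve. The following is a sketch with illustrative names and an arbitrary fixed seed.

```python
import math
import random


def buffon_estimate(trials=200_000, length=1.0, seed=0):
    """Estimate the crossing probability for needle length = band width."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(trials):
        d = rng.uniform(0, length / 2)       # distance of the center to the closest line
        alpha = rng.uniform(0, math.pi / 2)  # angle between the needle and the lines
        if (length / 2) * math.sin(alpha) > d:
            crossings += 1
    return crossings / trials


print(buffon_estimate(), 2 / math.pi)  # the estimate should be close to 0.6366
```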
The following (known) problem, which also deals with geometric probability, illustrates that we must be cautious about what is assumed to be "clear".
10.D.2. Bertrand's paradox. What is the probability that a random chord of a given circle is longer than the side of an equilateral triangle inscribed into the circle?
Solution. We will show three ways to find "this" probability.
1) Every chord is determined by its center. Thus, a random choice of the chord is given by a random choice of the center. The chord is longer than the side of the inscribed equilateral triangle if and only if its center lies inside the concentric circle with half radius. The center is chosen "randomly" from the whole inside of the circle. Therefore, the probability that it will lie in the inner disc is given by the ratio of the areas of these discs, which is $\frac{1}{4}$.
2) Unlike above, we claim that the wanted probability does not change if the direction of the chord is fixed. Then, the centers of such chords lie on a fixed diameter of the circle. The favorable centers are those which lie inside the inner circle (see 1)), i.e., inside a fixed diameter of the inner circle. The ratio of the diameters is 1 : 2, hence the wanted probability is $\frac{1}{2}$.
3) Now, we observe that a chord is determined by its endpoints (which must lie on the circle). Let us fix one of the endpoints (call it $A$); thanks to the apparent symmetry, this should not affect the resulting probability. Then, the chord satisfies the given condition if and only if the other endpoint
Distribution function
Definition. The distribution function or cumulative distribution function of a random variable $X$ is the function $F_X : \mathbb{R} \to [0, 1]$ defined for all $x \in \mathbb{R}$ by
$$F_X(x) = P(X < x).$$
The distribution function of a random vector $(X_1, \dots, X_k)$ is the function $F_X : \mathbb{R}^k \to \mathbb{R}$ defined for all vectors $x = (x_1, \dots, x_k) \in \mathbb{R}^k$ by
$$F_X(x) = P(X_1 < x_1, \dots, X_k < x_k).$$
If it is clear from the context which distribution function is discussed, omit the random variable name and write simply $F(x)$.
The following theorem guarantees that, for every random variable, the probability that the value of $X$ falls into any (fixed) interval (and thus into any Borel set $B$) can be calculated purely from the knowledge of its distribution function.²
10.2.12. Theorem. For every random variable $X$, its distribution function $F : \mathbb{R} \to [0, 1]$ has the following properties:
(1) $F$ is a non-decreasing function;
(2) $F$ has both side-limits at every point $x \in \mathbb{R}$, yet these limits may differ;
(3) $F$ is left-continuous;
(4) at the infinite points, the limits of $F$ are
$$\lim_{x \to \infty} F(x) = 1, \qquad \lim_{x \to -\infty} F(x) = 0;$$
(5) the probability of $X$ taking on the value $x$ is given by
$$P(X = x) = \lim_{y \to x^+} F(y) - F(x);$$
(6) the distribution function of a random variable always has only countably many points of discontinuity.
Proof. The proof consists of quite simple and straightforward calculations. In particular, note that the events $a \le X < b$ and $X < a$ are exclusive, so
$$P(a \le X < b) = P(X < b) - P(X < a) = F(b) - F(a).$$
Hence the first property follows immediately from the definition of probability.
The next two statements follow from the probability of monotone sequences of events, discussed in 10.2.5. Fix a non-increasing sequence of numbers $r_n > 0$ which converges to 0, and consider the events $A_n$ given by $X < x - r_n$. The union of these events is exactly the event $A$ given by $X < x$. Of course, the event $A$ does not depend on the choice of the sequence $r_n$. By the first proposition of 10.2.5,
$$P(A) = \lim_{n \to \infty} P(A_n).$$
² In the literature, the definition with the non-strict inequality $F(x) = P(X \le x)$ is often met. In this case, the probability $P(X = x)$ is also included in $F_X(x)$. Then, the distribution function has similar properties as those in 10.2.12, only it is right-continuous instead of left-continuous, etc.
lies on the shorter arc $BC$, where $ABC$ is the inscribed equilateral triangle. However, the length of this arc is one third of the length of the entire circle, which means that the wanted probability is equal to $\frac{1}{3}$.
How is it possible that we came to three different probabilities? It is caused by a hidden ambiguity in the statement of the problem. It is necessary to specify what exactly it means to choose a chord "randomly". Each of the three results is correct provided the chord is chosen in the corresponding way. However, these ways are not equivalent; this is apparent not only from the different results, but also from the distribution of the chords' centers. In the first case, they are distributed uniformly throughout the inside of the circle. In the second and third cases, the centers are concentrated more towards the center of the circle.
□
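The three ways of choosing a "random" chord can be compared numerically. The sketch below works in the unit circle, where the inscribed equilateral triangle has side $\sqrt{3}$; the method labels `"area"`, `"diameter"` and `"endpoints"` are names invented for this illustration.

```python
import math
import random

SIDE = math.sqrt(3)  # side of the equilateral triangle inscribed in the unit circle


def chord_length(center_distance):
    """Length of the chord whose center lies at the given distance from the origin."""
    return 2 * math.sqrt(1 - center_distance**2)


def estimate(method, trials=200_000, seed=0):
    """Relative frequency of chords longer than SIDE under one sampling method."""
    rng = random.Random(seed)
    longer = 0
    for _ in range(trials):
        if method == "area":         # 1) center uniform in the disc
            length = chord_length(math.sqrt(rng.random()))
        elif method == "diameter":   # 2) center uniform on a fixed diameter
            length = chord_length(rng.random())
        elif method == "endpoints":  # 3) second endpoint uniform on the circle
            length = 2 * math.sin(rng.uniform(0, 2 * math.pi) / 2)
        else:
            raise ValueError(f"unknown method: {method}")
        longer += length > SIDE
    return longer / trials


for name in ("area", "diameter", "endpoints"):
    print(name, estimate(name))  # close to 1/4, 1/2 and 1/3 respectively
```

In the first method, the distance of the center from the origin is sampled as the square root of a uniform number, which makes the center uniform with respect to area.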
10.D.3. Two envelopes. There are two envelopes, each contains a certain amount of money. We know that the amount in one of them is twice as great as in the other one. We can choose either of the envelopes (and take its contents). As soon as we choose one, we are allowed to change our mind and take the other envelope instead. Is it advantageous to do so?
Solution. At first sight, it should not matter which envelope we choose. The probability of choosing the one which contains more is 1/2, so there is no point in changing our choice.
However, consider the following reasoning: the envelope we have chosen contains $a$. Therefore, the other one contains $a/2$ or $2a$, each with probability 1/2. This means that if we change the envelope, then we get $a/2$ with probability 1/2 and $2a$ with probability 1/2, i.e., the expected outcome is
$$\frac{1}{2} \cdot \frac{a}{2} + \frac{1}{2} \cdot 2a = \frac{5}{4}a.$$
Therefore, it is wise to change the envelope. What is wrong with this reasoning?
There are several issues. Mainly, it is not generally true that if there is amount a in one of the envelopes, then the second one contains a/2 with probability 1/2. This depends on the initial distribution of the amounts that have been put into the envelopes, which is not precisely stated in the problem.
However, the paradox is rooted not only in the concealed a priori distribution. There are (even discrete) distributions for which the choice of changing the envelope always produces greater expected outcome than that of not changing it. Nevertheless, any distribution with this property must have
The distribution function is non-decreasing, and thus the left-sided limit equals the supremum. Thus, the left-sided limit of $F_X$ at $x$ exists and equals $P(A)$. This proves one half of proposition (2) as well as all of proposition (3).
Similarly, the above sequence $r_n$ can be used to define the events $A_n$ by $X < x + r_n$. This time, it is a non-increasing sequence $A_1 \supset A_2 \supset \dots$, and its intersection is the event $A$ given by $X \le x$. By the second property of 10.2.5,
$$P(A) = \lim_{n \to \infty} P(A_n) = P(X \le x),$$
which verifies that the right-sided limit of $F$ at $x$ exists. At the same time, property (5) is proved.
The limit values of property (4) can be derived similarly by applying theorem 10.2.5, as shown for the one-sided limits above. In the first case, use the events $A_n$ given by $X < r_n$, for an arbitrary increasing sequence $r_n \to \infty$. Their union is the universal event $\Omega$. In the second case, use the events $A_n$ given by $X < r_n$, for any decreasing sequence $r_n \to -\infty$, and their intersection is the null event.
It remains to prove the last statement. As already shown, the discontinuity points of the distribution function are exactly those values $x$ which the random variable has with nonzero probability, i.e., $P(X = x) \ne 0$. Now, let $M_n$ denote the set of points $x$ for which $P(X = x) > \frac{1}{n}$. Clearly, the set $M$ of all discontinuity points equals the union of the sets $M_n$: $M = \bigcup_{n=2}^{\infty} M_n$. Since the sum of probabilities of mutually exclusive events cannot exceed 1, $M_n$ can contain no more than $n - 1$ elements. Thus, $M$ is a countable union of finite sets, and hence it is countable. □
10.2.13. Probability measure. The probability that a random variable has a value lying in an arbitrarily chosen interval can be computed purely from the knowledge of its distribution function. The distribution function $F_X$ thus defines the entire probability distribution of the random variable $X$.
How a particular random variable $X$ is defined can be ignored. $X$ can be viewed directly as a probability defined on the $\sigma$-algebra of all the Borel sets in $\mathbb{R}$.
In this sense, every function $F : \mathbb{R} \to \mathbb{R}$ satisfying the first four properties of the latter theorem is a distribution function of a unique random variable. Check the properties of the probability function defined on all intervals this way!
The probability obtained in this way is also called a probability measure on $\mathbb{R}$. Similarly one deals with probability measures on the algebra of Borel sets in $\mathbb{R}^k$ in terms of the distribution functions of the random vectors.
In this sense, a random variable or random vector can be considered without any explicit link to a probability space $(\Omega, \mathcal{A}, P)$.
10.2.14. Discrete random variables. Random variables behave substantially differently according to whether the non-zero probability is "concentrated in isolated points" or it is "continuously distributed" along (a part of) the real axis.
infinite expected value (if the expectation is finite, then there is always a value which, when seen in the envelope, it is more advantageous to keep), so it is dubious to say that it is better to get "greater" infinity on average. □
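To see concretely why the "5/4" reasoning fails, one can fix some prior distribution and enumerate all cases exactly. The prior below (smaller amount uniform on 1, 2, 4, 8) is purely hypothetical and is not part of the problem statement; it only serves to show that the conditional probability of "the other envelope holds twice as much" depends on the amount seen.

```python
from fractions import Fraction

SMALLER = (1, 2, 4, 8)  # hypothetical prior: the smaller amount is uniform on these values


def outcomes():
    """Yield (probability, amount held, amount in the other envelope)."""
    weight = Fraction(1, 2 * len(SMALLER))
    for s in SMALLER:
        yield weight, s, 2 * s  # we picked the envelope with the smaller amount
        yield weight, 2 * s, s  # we picked the envelope with the larger amount


def prob_other_is_double(a):
    """P(the other envelope contains 2a | our envelope contains a)."""
    total = sum(w for w, held, _ in outcomes() if held == a)
    double = sum(w for w, held, other in outcomes() if held == a and other == 2 * a)
    return double / total


expected_keep = sum(w * held for w, held, _ in outcomes())
expected_switch = sum(w * other for w, _, other in outcomes())

print(expected_keep, expected_switch)  # identical: blind switching gains nothing
print([prob_other_is_double(a) for a in (1, 2, 16)])  # not always 1/2
```

For this prior, the probability equals 1/2 only for the intermediate amounts; seeing the smallest or the largest possible amount determines the other envelope completely.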
E. Random variables, density, distribution function
10.E.1. Consider rolling a die. The set of sample points is $\Omega = \{\omega_1, \dots, \omega_6\}$, where $\omega_i$ means that we have rolled number $i$. Further, consider the $\sigma$-field
$$\mathcal{A} = \{\emptyset, \{\omega_1, \omega_2\}, \{\omega_3, \omega_4, \omega_5, \omega_6\}, \Omega\}.$$
Find whether the mapping $X : \Omega \to \mathbb{R}$ defined by
i) $X(\omega_i) = i$ for each $i \in \{1, 2, 3, 4, 5, 6\}$,
ii) $X(\omega_1) = X(\omega_2) = -2$, $X(\omega_3) = X(\omega_4) = X(\omega_5) = X(\omega_6) = 3$
is a random variable with respect to $\mathcal{A}$.
Solution. First of all, we should make sure that the set $\mathcal{A}$ really satisfies all axioms of 10.2.2, i.e., that it is a well-defined $\sigma$-field. Then, by definition 10.2.10, a random variable is any function $X : \Omega \to \mathbb{R}$ such that the preimage of every Borel-measurable set $B \subset \mathbb{R}$ lies in $\mathcal{A}$. As for the first case, consider the interval $[2, 3]$. Since $X^{-1}([2, 3]) = \{\omega_2, \omega_3\} \notin \mathcal{A}$, we can see that the function $X$ is not a random variable. In the second case, we can easily see that $X$ is a random variable: Consider any interval in $\mathbb{R}$. Then, exactly one of the following four cases occurs: 1) If the interval contains neither $-2$ nor 3, then the preimage in $X$ is the empty set. 2) If it contains $-2$ but not 3, then the preimage is $\{\omega_1, \omega_2\}$. 3) On the other hand, if it contains 3 but not $-2$, then the preimage is $\{\omega_3, \omega_4, \omega_5, \omega_6\}$. 4) Finally, if it contains both these numbers, then the preimage is the whole sample space $\Omega$. In each case, the preimage lies in the $\sigma$-field $\mathcal{A}$. □
10.E.2. Consider a $\sigma$-field $(\Omega, \mathcal{A})$, where $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4, \omega_5\}$ and
$$\mathcal{A} = \{\emptyset, \{\omega_1, \omega_2\}, \{\omega_3\}, \{\omega_4, \omega_5\}, \{\omega_1, \omega_2, \omega_3\}, \{\omega_1, \omega_2, \omega_4, \omega_5\}, \{\omega_3, \omega_4, \omega_5\}, \Omega\}.$$
Find a mapping $X : \Omega \to \mathbb{R}$, as general as possible, which is a random variable with respect to $\mathcal{A}$.
Solution. Since the events $\omega_1$, $\omega_2$ do not occur individually in $\mathcal{A}$, the random variable $X$ must map them to the same number, i.e. $X(\omega_1) = X(\omega_2) = a$ for an $a \in \mathbb{R}$. For the same reason, we must have $X(\omega_4) = X(\omega_5) = b$ for a $b \in \mathbb{R}$. If an interval contains both $a$ and $b$, then its preimage
Discrete random variables
If a random variable $X$ assumes only finitely many values $x_1, x_2, \dots, x_n \in \mathbb{R}$ or countably infinitely many values $x_1, x_2, \dots$, it is called a discrete random variable.
One can define its probability mass function $f(x)$ by
$$f(x) = \begin{cases} P(X = x_i) & x = x_i, \\ 0 & \text{otherwise.} \end{cases}$$
Since the probability is countably additive and the singleton events $X = x_i$ are mutually exclusive, the sum of all values $f(x_i)$ is given by either a finite sum or an absolutely convergent series
$$\sum_i f(x_i) = 1.$$
The probability distribution of a random variable $X$ satisfies
$$P(X \in B) = \sum_{x_i \in B} f(x_i).$$
In particular, the distribution function is of the form
$$F_X(t) = \sum_{x_i < t} f(x_i).$$
Note that the distribution function $F(x)$ of a discrete random variable is piecewise constant. $F(x) = 1$ for those $x$ which are greater than all the $x_i$'s.
Every random variable defined on a classical finite probability space is discrete.
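The passage from the probability mass function to the distribution function is a plain summation. The following sketch uses a fair die as the example and follows the strict-inequality convention $F(t) = P(X < t)$ of 10.2.11; the dictionary representation and the function name are illustrative choices.

```python
from fractions import Fraction


def distribution_function(pmf, t):
    """F(t) = P(X < t) for a discrete variable given as {value: probability}."""
    return sum((prob for value, prob in pmf.items() if value < t), Fraction(0))


die = {i: Fraction(1, 6) for i in range(1, 7)}  # fair die as a sample pmf

print(distribution_function(die, 1))    # 0: no value is strictly below 1
print(distribution_function(die, 1.5))  # 1/6
print(distribution_function(die, 6))    # 5/6: the value 6 itself is not counted
print(distribution_function(die, 7))    # 1
```

With this convention the function is left-continuous, and the jump just to the right of each value $x_i$ equals $f(x_i)$.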
10.2.15. Continuous random variables. Even if the values of a random variable $X$ are not discrete, one can proceed similarly. Intuitively, increasing the value of $x$ infinitesimally by $dx$, the density function $f(x)$ of the random variable $X$ can be perceived as
$$P(x < X < x + dx) = f(x)\,dx.$$
This means that whenever $-\infty \le a \le b \le \infty$, it is required that
$$P(a < X < b) = \int_a^b f(x)\,dx.$$
Continuous random variables
A random variable $X$ for which there exists a function $f$ satisfying
$$F_X(b) = \int_{-\infty}^{b} f(x)\,dx$$
is said to be a continuous random variable, and the function $f$ is called its density function.
It is convenient to view the random variables as probability measures, cf. 10.2.13. Generally, this means not referring to any other sample space $\Omega$ on which $X$ would be defined as a function. $X$ is represented just by its density or distribution function.
is $\{\omega_1, \omega_2, \omega_4, \omega_5\} \in \mathcal{A}$, which is okay. Clearly, the event $\omega_3$ may be mapped to an arbitrary $c \in \mathbb{R}$. Then, we can easily verify that the $X$-preimage of every interval is contained in $\mathcal{A}$, i.e., $X$ is a random variable with respect to $\mathcal{A}$. □
10.E.3. Consider a random variable $X$ which takes on value $i$ with probability $P(X = i) = \frac{1}{6}$, for each $i = 1, \dots, 6$. Find the distribution function $F_X(x)$ and draw its graph.
Solution. By definition 10.2.11, the distribution function is
Fx(x) = p(X < x). This means that Fx (x) = 0 for x < 1, Fx(x) = for 1 < x < 6 (where [x\ stands for the floor of x), and Fx (x) = 1 for x > 6. The graph looks as follows:
□
10.E.4. An archer keeps shooting at a target until he hits. He has 4 arrows at his disposal. In each attempt, the probability that he hits the target is 0.6. Let X be the random variable which gives the number of unused arrows. Find the probability mass function and the distribution function of X and draw their graphs.
Solution. Clearly, the probability of $k$ consecutive misses followed by a hit is equal to $0.4^k \cdot 0.6$. Therefore, $f_X(x) = P(X = x) = 0.4^{3-x} \cdot 0.6$ for $x \in \{1, 2, 3\}$. If the archer misses three times, then there will be no arrow left at the end, no matter whether he hits the last time or not. Thus, $f_X(0) = P(X = 0) = 0.4^3$.
Note that the distribution function $F(x)$ of a continuous random variable $X$ is differentiable at every point where the density $f$ is continuous. There, its derivative is the density function of $X$, i.e., $F'(x) = f(x)$.
10.2.16. The general case. Of course, there are also random variables with mixed behaviour, where some part of the probability is distributed continuously, while there are values that are taken on with non-zero probability. This means that the probability measure of some singletons $x \in \mathbb{R}$ is non-zero and still $X$ is not a discrete random variable.
For instance, consider a chaotic lecturer who remains standing at his laptop with probability $p$ throughout the entire lecture, but once he decides to move, he happens to be at any position in front of the lecture room with equal probability.
Then, the random variable which corresponds to his position (assume that the desk with the laptop is at position 0 and the lecture room is bounded by the values ±1) has the following distribution function:
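$$F(t) = \begin{cases} 0 & \text{if } t \le -1, \\ \frac{1-p}{2}(t+1) & \text{if } t \in (-1, 0), \\ p + \frac{1-p}{2}(t+1) & \text{if } t \in [0, 1), \\ 1 & \text{if } t \ge 1. \end{cases}$$
The distribution function of all such variables can be expressed directly using the Riemann-Stieltjes integral
$$F(t) = \int_{-\infty}^{t} f(x)\,d(g(x)),$$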
developed in subsection 6.3.15 (page 432). In the example above, choose j(x) = \ and
for x < — 1 for -1 < x < 0 for 0 < x < 1 for x > 1.
This corresponds again to the idea, that the distribution function is equivalent to a probability measure. Thus the measure of any interval is given by integrating its indicator function with respect to this measure. This is what the Riemann-Stieltjes integral achieves.
The Riemann integral corresponds to the choice $g(x) = x$. One could add only the jump $p$ at $x = 0$ (i.e. $g(x) = x$ for $x < 0$, while $g(x) = x + p$ otherwise) and leave the constant density to $f(x)$, which would be nonzero only on $[-1, 1]$. This corresponds to splitting the probability measure into its discrete part (hidden in $g$) and continuous part (expressed by the probability density).
Notice that any distribution function can have only countably many points of discontinuity.
10.2.17. Basic discrete distributions. The requirements on the properties of probability distributions of random variables are based on the modeled situations. Here is a list of the simplest discrete distributions.
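By the definition of the distribution function (see 10.2.11), we have
$$F_X(x) = P(X < x) = \begin{cases} 0 & \text{for } x \le 0, \\ 0.4^3 = 0.064 & \text{for } x \in (0, 1], \\ 0.4^3 + 0.4^2 \cdot 0.6 = 0.16 & \text{for } x \in (1, 2], \\ 0.4^3 + 0.4^2 \cdot 0.6 + 0.4 \cdot 0.6 = 0.4 & \text{for } x \in (2, 3], \\ 1 & \text{for } x > 3. \end{cases}$$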
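The numbers in the archer's distribution function can be recomputed from the probability mass function. This is a sketch only; the function names and the parameters `arrows` and `p_hit` are illustrative, with defaults taken from the exercise.

```python
def archer_pmf(arrows=4, p_hit=0.6):
    """pmf of the number of unused arrows, as a dict {x: P(X = x)}."""
    miss = 1 - p_hit
    # x unused arrows (x >= 1): arrows - 1 - x misses followed by a hit.
    pmf = {x: miss ** (arrows - 1 - x) * p_hit for x in range(1, arrows)}
    # No arrow left: the first arrows - 1 shots all missed.
    pmf[0] = miss ** (arrows - 1)
    return pmf


def archer_cdf(x, pmf):
    """F(x) = P(X < x), the strict-inequality convention of 10.2.11."""
    return sum(prob for value, prob in pmf.items() if value < x)


pmf = archer_pmf()
for x in (0, 1, 2, 3, 4):
    print(x, round(archer_cdf(x, pmf), 3))
```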
Degenerate distribution
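The distribution which corresponds to a constant random variable $X = \mu$ is called the degenerate distribution $\mathrm{Dg}(\mu)$.
Its distribution function $F_X$ and probability mass function $f_X$ are given by
$$F_X(t) = \begin{cases} 0 & t \le \mu, \\ 1 & t > \mu, \end{cases} \qquad f_X(t) = \begin{cases} 1 & t = \mu, \\ 0 & \text{otherwise.} \end{cases}$$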
The graphs of the probability mass function and the distribution function are as follows:
□
10.E.5. The distribution function of a random variable $X$ is
$$F_X(x) = \begin{cases} 0 & \text{for } x \le 3, \\ \frac{1}{3}x - 1 & \text{for } 3 < x \le 6, \\ 1 & \text{for } 6 < x. \end{cases}$$
i) Justify that $F_X$ is indeed a distribution function.
Here follows a description of an experiment with two possible outcomes called success and failure. If the probability of success is p, then the probability of failure must be 1 — p.
It is convenient to take the values 0 and 1 for the two possible results.
Bernoulli distribution
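The distribution of a random variable $X$ which is 0 (failure) with probability $q = 1 - p$ and 1 (success) with probability $p$ is called the Bernoulli distribution $A(p)$.
Its distribution function $F_X$ and probability mass function $f_X$ are given by
$$F_X(t) = \begin{cases} 0 & t \le 0, \\ q & 0 < t \le 1, \\ 1 & t > 1, \end{cases} \qquad f_X(t) = \begin{cases} p & t = 1, \\ q & t = 0, \\ 0 & \text{otherwise.} \end{cases}$$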
Further, consider a random variable X which corresponds to n independent experiments described by the Bernoulli distribution, where X measures the number of successes. Clearly the probability mass function is non-zero exactly at the integers t = 0,..., n, which correspond to the total number of successes in the experiments (the order does not matter).
The probability that successes are encountered exactly in $t$ chosen experiments out of $n$ is $p^t (1-p)^{n-t}$. It is necessary to sum over all the $\binom{n}{t}$ possibilities. This leads to the binomial distribution of $X$:
Binomial distribution
The binomial distribution $\mathrm{Bi}(n, p)$ has probability mass function
$$f_X(t) = \begin{cases} \binom{n}{t} p^t (1-p)^{n-t} & t \in \{0, 1, \dots, n\}, \\ 0 & \text{otherwise.} \end{cases}$$
The illustration shows the probability mass functions for Bi(50,0.2), and Bi(50,0.9). The distribution of the probability corresponds to the intuition that most outcomes occur near the value np:
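The claim that most outcomes occur near $np$ can be checked numerically for the two distributions mentioned. The sketch below evaluates the probability mass function and locates its maximum; the helper names are illustrative.

```python
from math import comb


def binomial_pmf(n, p):
    """List of P(X = t) for t = 0, ..., n under Bi(n, p)."""
    return [comb(n, t) * p**t * (1 - p) ** (n - t) for t in range(n + 1)]


def mode(pmf):
    """Index of the largest probability."""
    return max(range(len(pmf)), key=pmf.__getitem__)


for n, p in ((50, 0.2), (50, 0.9)):
    pmf = binomial_pmf(n, p)
    print(f"Bi({n}, {p}): most probable value {mode(pmf)}, np = {n * p}")
```

For both parameter choices the most probable value coincides with $np$ (10 and 45).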
ii) Find the density of the random variable X.
iii) Compute P(2 < X < 4).
Solution. a) Clearly, $F_X$ is continuous and non-decreasing. Moreover, we have $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$, as needed.
b) By 10.2.14, the density of a continuous random variable is the derivative of its distribution function. We can see that on the interval $(3, 6)$, the density is equal to $f(x) = \frac{1}{3}$, while on the intervals $(-\infty, 3)$ and $(6, \infty)$, it is equal to zero. Therefore, the variable $X$ has uniform distribution, see 10.2.20.
c) We have from the definition of the distribution function that $P(2 < X < 4) = F_X(4) - F_X(2) = \left(\frac{4}{3} - 1\right) - 0 = \frac{1}{3}$. □
10.E.6. Consider a random variable $X$ and a function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) = \frac{a}{1+x^2}$ for $x \in \mathbb{R}$, where $a$ is a parameter. Suppose that $f$ is the density of $X$. Find
i) the value of $a$,
ii) the distribution function of $X$,
iii) $P(-1 < X < 1)$.
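Solution. a) If the function $f$ is to be a probability density, then its integral over $\mathbb{R}$ must be equal to one. This yields the condition
\[ 1 = \int_{-\infty}^{\infty} \frac{a}{1+x^2}\,dx = a\,[\operatorname{arctg} x]_{-\infty}^{\infty} = a\pi. \]
Hence $a = \frac{1}{\pi}$.
b) By 10.2.14, the distribution function is given by the following integral:
\[ F_X(x) = \int_{-\infty}^{x} f(t)\,dt = \frac{1}{\pi}\int_{-\infty}^{x} \frac{dt}{1+t^2} = \frac{1}{\pi}\operatorname{arctg} x + \frac12. \]
c) By b) and the definition of the distribution function, we have
\[ P(-1 < X < 1) = F_X(1) - F_X(-1) = \frac{1}{\pi}\cdot\frac{\pi}{4} - \frac{1}{\pi}\cdot\Bigl(-\frac{\pi}{4}\Bigr) = \frac12. \]
□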
10.E.7. The joint probability mass function of a discrete random vector is given by the following table:
Next, consider distributions similar to the Bernoulli process referred to in 10.2.1. Consider independent experiments with the Bernoulli distribution $A(p)$, as in the case of the binomial distribution, and fix a positive integer $r$. Repeat the experiment until $r$ successes occur.
The random variable $X$ is defined as the number of failures before the $r$-th success. In the case of $r = 1$, it is exactly the example from 10.2.1. The event $X = k$ occurs if and only if there are exactly $r - 1$ successes in the first $k + r - 1$ experiments and the $(k + r)$-th experiment also ends with a success. Thus the following probability mass function is arrived at:
Geometric distribution
The random variable $X$ which corresponds to the number of failures before reaching the $r$-th success has probability distribution
\[ P(X = k) = \binom{k+r-1}{r-1}\,p^r(1-p)^k, \qquad k = 0, 1, 2, \dots \]
X \ Y	2	5	6
1	1/5	1/10	1/20
2	1/10	1/20	0
3	3/10	1/20	3/20
This is called the negative binomial distribution. In the case of r = 1, it is the geometric distribution.
Often the same definition is used with the successes and failures interchanged. This results in the same formula for the probability mass function with $p$ and $1 - p$ interchanged.
The geometric distribution appears in physics in connection with Einstein-Bose statistics.
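The counting argument for the negative binomial distribution can be checked numerically. The following is a minimal Python sketch under our own naming (`neg_binomial_pmf` and `geometric_pmf` are hypothetical helper names); it only restates the formula for $P(X = k)$ and its special case $r = 1$.

```python
from math import comb


def neg_binomial_pmf(k: int, r: int, p: float) -> float:
    """P(X = k): exactly k failures occur before the r-th success."""
    if k < 0:
        return 0.0
    # r - 1 successes among the first k + r - 1 trials, then one more success
    return comb(k + r - 1, r - 1) * p**r * (1 - p) ** k


def geometric_pmf(k: int, p: float) -> float:
    """Special case r = 1: k failures before the first success."""
    return neg_binomial_pmf(k, 1, p)


if __name__ == "__main__":
    # the probabilities over k = 0, 1, 2, ... should add up to one
    print(sum(neg_binomial_pmf(k, 3, 0.4) for k in range(400)))
```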
10.2.18. Poisson distribution. In practice, the binomial distribution often leads to further model problems.
Consider the situation that $r$ (mutually indistinguishable) objects are to be divided into $n$ (distinguishable) boxes, and each object is equally probable (i.e., has probability $1/n$) to fall into any of the boxes.
The random variable which describes the number $X$ of objects in one fixed box can be described as follows: The admissible values are $X = k$, where $k = 0, \dots, r$, and the individual probabilities are
\[ P(X = k) = \binom{r}{k}\Bigl(\frac1n\Bigr)^{k}\Bigl(1 - \frac1n\Bigr)^{r-k} = \binom{r}{k}\frac{(n-1)^{r-k}}{n^r}. \]
Find
i) the marginal distribution and probability mass functions;
Thus, the distribution of X is of the type Bi(r, 1/n).
Such a variable can be encountered, for example, when describing a physical system with a huge number of gas molecules. The boxes represent small volumes of the space.
ii) the joint distribution function and draw it in a suitable way;
iii) P(Y > 3X).
Solution. a) By 10.2.22, the marginal distribution of the random variable $X$ is obtained by summing up the joint probability mass function over all possible values of $Y$ in each row. Similarly, the marginal distribution of $Y$ is obtained by summing up the entries in each column. Thus, we get the following:
X	1	2	3
f_X	7/20	3/20	1/2
and
Y	2	5	6
f_Y	3/5	1/5	1/5
b) The joint distribution function at a point $(a, b)$ is equal to the sum of all values of the joint probability mass function $f(x,y)$ such that $x \le a$ and $y \le b$. This corresponds to the values of the subtable whose lower-right corner is $(a, b)$. Precisely, the joint distribution function $F(x,y)$ looks as follows (rows give the interval containing $x$, columns the interval containing $y$):
F(x,y)	[2,5)	[5,6)	[6,∞)
[1,2)	1/5	3/10	7/20
[2,3)	3/10	9/20	1/2
[3,∞)	3/5	4/5	1
and on the sets $(-\infty, 1) \times \mathbb{R}$ and $\mathbb{R} \times (-\infty, 2)$, $F$ is clearly zero.
c) Apparently, $P(Y > 3X) = P(X = 1, Y = 5) + P(X = 1, Y = 6) = \frac{1}{10} + \frac{1}{20} = \frac{3}{20}$. □
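The bookkeeping in this solution is easy to reproduce with exact fractions. The sketch below is only an illustration; the dictionary `joint` encodes the table from 10.E.7, and the names `f_x`, `f_y`, `joint_cdf` and `prob` are our own.

```python
from fractions import Fraction as Fr

xs, ys = [1, 2, 3], [2, 5, 6]
# joint probability mass function, keyed by (x, y)
joint = {
    (1, 2): Fr(1, 5), (1, 5): Fr(1, 10), (1, 6): Fr(1, 20),
    (2, 2): Fr(1, 10), (2, 5): Fr(1, 20), (2, 6): Fr(0),
    (3, 2): Fr(3, 10), (3, 5): Fr(1, 20), (3, 6): Fr(3, 20),
}

# marginals: sum over the other variable
f_x = {x: sum(joint[x, y] for y in ys) for x in xs}
f_y = {y: sum(joint[x, y] for x in xs) for y in ys}


def joint_cdf(a, b):
    """F(a, b) = P(X <= a, Y <= b), the convention used in the table of F."""
    return sum(p for (x, y), p in joint.items() if x <= a and y <= b)


# P(Y > 3X): add up the entries of the table satisfying the condition
prob = sum(p for (x, y), p in joint.items() if y > 3 * x)
```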
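10.E.8. Find the probability $P(2X > Y)$ provided the density of the random vector $(X, Y)$ is given by:
\[ f_{(X,Y)}(x,y) = \begin{cases} \frac16(4x - y) & \text{for } 1 < x < 2,\ 2 < y < 4, \\ 0 & \text{otherwise.} \end{cases} \]
Solution. By definition, we have
\[ P(2X > Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{2x} f_{(X,Y)}(x,y)\,dy\,dx = \int_1^2\int_2^{2x} \frac16(4x - y)\,dy\,dx = \int_1^2 \Bigl[\frac23xy - \frac{1}{12}y^2\Bigr]_{2}^{2x} dx \]
\[ = \int_1^2 \Bigl(x^2 - \frac43x + \frac13\Bigr) dx = \Bigl[\frac13x^3 - \frac23x^2 + \frac13x\Bigr]_1^2 = \frac23. \]
□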
Observe the distribution of the molecules. Of interest is the behaviour of $X_n$ as the number $n$ of boxes and the number $r_n$ of objects increase so that their ratio $r_n/n = \lambda$ remains constant. In other words, every box is to contain (approximately) the same number $\lambda$ of elements, on average.
We are interested in the asymptotic behaviour of the variables $X_n$ as $n \to \infty$. Letting $\lim_{n\to\infty} r_n/n = \lambda$, the standard procedure (with details to be added - take it as a challenge to recall the methods from the analysis of univariate functions!) leads to:
\[ \lim_{n\to\infty} P(X_n = k) = \lim_{n\to\infty} \binom{r_n}{k}\frac{(n-1)^{r_n-k}}{n^{r_n}} = \lim_{n\to\infty} \frac{r_n(r_n-1)\cdots(r_n-k+1)}{(n-1)^k}\,\frac{1}{k!}\Bigl(1 - \frac1n\Bigr)^{r_n} \]
\[ = \frac{\lambda^k}{k!}\lim_{n\to\infty}\Bigl(1 + \frac{-r_n/n}{r_n}\Bigr)^{r_n} = \frac{\lambda^k}{k!}\lim_{n\to\infty}\Bigl(1 + \frac{-\lambda}{r_n}\Bigr)^{r_n} = \frac{\lambda^k}{k!}\,e^{-\lambda}, \]
since the functions $(1 + x/n)^n$ converge uniformly to the function $e^x$ on every bounded interval in $\mathbb{R}$.
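Poisson distribution
The Poisson distribution $\mathrm{Po}(\lambda)$ describes the random variables with probability mass function
\[ f_X(k) = \begin{cases} \frac{\lambda^k}{k!}\,e^{-\lambda} & k \in \mathbb{N} \\ 0 & \text{otherwise.} \end{cases} \]
Of course,
\[ \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\,e^{-\lambda} = e^{-\lambda+\lambda} = 1. \]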
As seen above, this discrete distribution $\mathrm{Po}(\lambda)$ with an arbitrary $\lambda > 0$ (distributed into infinitely many points) is a good approximation of the binomial distributions $\mathrm{Bi}(n, \lambda/n)$, for large values of $n$.
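The quality of this approximation can be observed numerically. The following Python sketch is an illustration only; the helper names `poisson_pmf`, `binomial_pmf` and `max_gap` are our own. It compares $\mathrm{Bi}(n, \lambda/n)$ with $\mathrm{Po}(\lambda)$ pointwise for growing $n$.

```python
from math import comb, exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Po(lam)."""
    return lam**k / factorial(k) * exp(-lam)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bi(n, p); math.comb returns 0 when k > n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def max_gap(n: int, lam: float, k_max: int = 20) -> float:
    """Largest pointwise difference between Bi(n, lam/n) and Po(lam) for k <= k_max."""
    return max(
        abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, max_gap(n, 3.0))
```

The printed gaps shrink as $n$ grows, in agreement with the limit computed above.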
10.2.19. Two examples. Besides the physical model mentioned above, such a behaviour can be encountered when observing occurrences of events in a space with constant expected density in a unit volume. Observing bacteria under a microscope, when the bacteria are expected to occur in any part of the image with the same probability, provides an example. If the "mean density of occurrence" in a unit area is $\lambda$ and the whole region is divided into $n$ identical parts, then the occurrence of $k$ events in a fixed part is modeled by a random variable $X$ with the Poisson distribution. When diagnosing in practice, such an observation allows us to compute the total number of bacteria with a relatively good accuracy from the actual numbers in only several randomly chosen samples.
10.E.9. Find the marginal distribution functions and the joint and marginal densities of the random vector $(X, Y)$ provided
\[ F_{(X,Y)}(x,y) = \begin{cases} 0 & \text{for } x < 0,\ y < 0 \\ \frac14x^2y^2 & \text{for } 0 < x < 1,\ 0 < y < 2 \\ 1 & \text{for } x > 1,\ y > 2. \end{cases} \]
Solution. The density of the random vector $(X, Y)$ is obtained by differentiation with respect to $x$ and $y$. Thus, for $0 < x < 1$, $0 < y < 2$, we have $f_{(X,Y)}(x,y) = xy$, and elsewhere the density is zero. The marginal density of the random variable $X$ is then
\[ f_X(x) = \int_{-\infty}^{\infty} f_{(X,Y)}(x,y)\,dy = \int_0^2 xy\,dy = \Bigl[\tfrac12xy^2\Bigr]_0^2 = 2x. \]
Similarly, for $Y$, we get $f_Y(y) = \frac12y$. The marginal distribution functions are
\[ F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_0^x 2t\,dt = x^2 \]
and
\[ F_Y(y) = \int_{-\infty}^{y} f_Y(t)\,dt = \int_0^y \tfrac12t\,dt = \tfrac14y^2. \]
□
10.E.10. In a bag, there are 14 balls: 4 red, 5 white, and 5 blue ones. We randomly take 6 balls out of the bag (without replacement). Find the distribution of the random vector $(X, Y)$ where $X$ stands for the number of red balls taken and $Y$ for the number of white balls. In addition, find the marginal distributions of $X$ and $Y$. Then, compute $P(X \le 3)$, $P(1 \le Y \le 4)$.
Solution. The value of the probability mass function at a point $(x, y)$ is defined as the probability $P(X = x, Y = y)$, i.e. the probability of taking $x$ red balls and $y$ white balls. The number of ways to take $x$ red balls is $\binom{4}{x}$; for $y$ white balls, it is $\binom{5}{y}$; and the remaining $6 - x - y$ blue balls can be selected in $\binom{5}{6-x-y}$ ways. Altogether, there are $\binom{4}{x}\binom{5}{y}\binom{5}{6-x-y}$ possibilities. The values of this expression for all $x$, $y$ are in the following table.
x\y	0	1	2	3	4	5	Σ
0	0	5	50	100	50	5	210
1	4	100	400	400	100	4	1008
2	30	300	600	300	30	0	1260
3	40	200	200	40	0	0	480
4	10	25	10	0	0	0	45
Σ	84	630	1260	840	180	9	3003
The values in the last column and row are the sums over all values of $y$ and $x$, respectively. Then, the values of the probability mass function are obtained after dividing by the number of all possibilities to take the 6 balls, i.e. $\binom{14}{6} = 3003$.
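The table can be regenerated mechanically from the product of binomial coefficients. The Python sketch below is an illustration with our own names (`ways`, `table`, `row_sums`, `col_sums`); the constants encode the composition of the bag in 10.E.10.

```python
from math import comb

RED, WHITE, BLUE, DRAWN = 4, 5, 5, 6


def ways(x: int, y: int) -> int:
    """Number of draws of 6 balls with exactly x red and y white balls."""
    blue = DRAWN - x - y
    if blue < 0 or blue > BLUE:
        return 0
    return comb(RED, x) * comb(WHITE, y) * comb(BLUE, blue)


# rows are indexed by x = 0..4, columns by y = 0..5
table = [[ways(x, y) for y in range(WHITE + 1)] for x in range(RED + 1)]
row_sums = [sum(row) for row in table]        # marginal counts for X
col_sums = [sum(col) for col in zip(*table)]  # marginal counts for Y
total = comb(RED + WHITE + BLUE, DRAWN)       # number of all possible draws

if __name__ == "__main__":
    for row, s in zip(table, row_sums):
        print(row, s)
    print(col_sums, total)
```

Dividing any entry by `total` gives the corresponding value of the probability mass function.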
The second example is more challenging. We describe events which occur randomly at time $t > 0$. Here, the probability of an occurrence in the following small time period of length $h$ does not depend on what had happened before and equals the same value $h\lambda$ for a fixed $\lambda > 0$. At the same time, the probability that the event occurs more than once in a given time period is small.
Let $X_t$ denote the random variable which corresponds to the number of occurrences of the examined event in the interval $[0, t)$. The requirements are expressed infinitesimally. We want:
• the probability of exactly one event in each time period of length $h$ to equal $h\lambda + \alpha(h)$, where the function $\alpha(h)$ satisfies $\lim_{h\to0+} \frac{\alpha(h)}{h} = 0$;
• the probability $\beta(h)$ of more than one event occurring in a time period of length $h$ to satisfy $\lim_{h\to0+} \frac{\beta(h)}{h} = 0$;
• the events $X_t = j$ and $X_{t+h} - X_t = k$ to be independent for all $j, k \in \mathbb{N}$ and $t, h > 0$.
Use the notation $p_k(t) = P(X_t = k)$, $k \in \mathbb{N}$, and set the initial conditions $p_0(0) = 1$ and $p_k(0) = 0$ for $k > 0$. Compute directly
\[ p_0(t+h) = p_0(t)\,P(X_{t+h} - X_t = 0) = p_0(t)\bigl(1 - h\lambda - \alpha(h) - \beta(h)\bigr) \]
and similarly,
\[ \begin{aligned} p_k(t+h) &= P(X_t = k,\ X_{t+h} - X_t = 0) + P(X_t = k-1,\ X_{t+h} - X_t = 1) + P(X_t \le k-2,\ X_{t+h} = k) \\ &= p_k(t)\,P(X_{t+h} - X_t = 0) + p_{k-1}(t)\,P(X_{t+h} - X_t = 1) + \sum_{i=0}^{k-2} P(X_t = i,\ X_{t+h} - X_t = k-i) \\ &= p_k(t)\bigl(1 - h\lambda - \alpha(h) - \beta(h)\bigr) + p_{k-1}(t)\bigl(h\lambda + \alpha(h)\bigr) + \sum_{i=0}^{k-2} p_i(t)\,P(X_{t+h} - X_t = k-i). \end{aligned} \]
Hence (similarly to 6.1.16, page 391, the symbol $o(h)$ is written for expressions which, when divided by $h$, approach zero as $h \to 0+$)
\[ \frac{p_0(t+h) - p_0(t)}{h} = -\lambda p_0(t) + \frac1h\,o(h), \qquad \frac{p_k(t+h) - p_k(t)}{h} = -\lambda p_k(t) + \lambda p_{k-1}(t) + \frac1h\,o(h). \]
Letting $h \to 0+$, an (infinite!) system of ordinary differential equations is obtained:
\[ p_0'(t) = -\lambda p_0(t), \quad p_0(0) = 1, \]
\[ p_k'(t) = -\lambda p_k(t) + \lambda p_{k-1}(t), \quad p_k(0) = 0, \]
for all $t > 0$ and $k \in \mathbb{N}$, with an initial condition. The first equation has a unique solution
\[ p_0(t) = e^{-\lambda t}. \]
The marginal distributions of $X$ and $Y$ correspond to the last column and row, respectively.
The probability $P(X \le 3)$ can be calculated easily from the marginal distribution of $X$:
\[ P(X \le 3) = F_X(3) = \frac{1}{3003}(210 + 1008 + 1260 + 480) = 0.985. \]
Similarly, for the probability $P(1 \le Y \le 4)$, we have
\[ P(1 \le Y \le 4) = F_Y(4) - F_Y(0) = \frac{1}{3003}(630 + 1260 + 840 + 180) = 0.969. \]
□
The solution $p_0(t) = e^{-\lambda t}$ of the first equation can be immediately substituted into the second equation. This leads to
\[ p_1(t) = \lambda t\,e^{-\lambda t}. \]
A trivial induction argument shows that the system has a unique solution
\[ p_k(t) = \frac{(\lambda t)^k}{k!}\,e^{-\lambda t}, \qquad t > 0,\ k \in \mathbb{N}. \]
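The closed form $p_k(t) = \frac{(\lambda t)^k}{k!}e^{-\lambda t}$ can be compared with a direct numerical integration of the system $p_0' = -\lambda p_0$, $p_k' = -\lambda p_k + \lambda p_{k-1}$. The sketch below uses the explicit Euler method, which is our choice for simplicity and not a method taken from the text; the names `euler_poisson_system` and `poisson_pmf` are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k: int, mean: float) -> float:
    """Closed form (mean^k / k!) * exp(-mean)."""
    return mean**k / factorial(k) * exp(-mean)


def euler_poisson_system(lam: float, t_end: float, k_max: int, steps: int) -> list:
    """Integrate p_0' = -lam p_0, p_k' = -lam p_k + lam p_{k-1} by explicit Euler steps."""
    h = t_end / steps
    p = [1.0] + [0.0] * k_max  # initial conditions p_0(0) = 1, p_k(0) = 0
    for _ in range(steps):
        new = [p[0] - h * lam * p[0]]
        for k in range(1, k_max + 1):
            new.append(p[k] + h * (-lam * p[k] + lam * p[k - 1]))
        p = new
    return p


if __name__ == "__main__":
    lam, t = 2.0, 1.5
    numeric = euler_poisson_system(lam, t, k_max=6, steps=20_000)
    for k, value in enumerate(numeric):
        print(k, value, poisson_pmf(k, lam * t))
```

With a small step the two columns agree to several decimal places, as expected for the distribution $\mathrm{Po}(\lambda t)$.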
It is thus verified that for every process which satisfies the three properties above, the random variable $X_t$ which corresponds to the number of occurrences in the time period $[0, t)$ has distribution $\mathrm{Po}(\lambda t)$.
In practice, these processes are connected with the failure rate of machines.
10.2.20. Continuous distributions. The simplest example of a continuous distribution is the uniform distribution of the probability throughout a fixed interval. This is also a good illustration of the fact that a simply formulated requirement does not leave many free choices in the definition. Now, the probability of $X$ taking on a value inside an interval which is included in the sample interval $(a, b) \subset \mathbb{R}$ is required to be dependent only on the length of the interval, but not on its actual position. This means that the density $f_X$ of the random variable $X$ should be constant and the value of this constant is given by the requirement $P(a < X < b) = 1$.
Uniform distribution
For any real numbers $a$, $b$, $-\infty < a < b < \infty$, define the density and distribution function as follows:
\[ f_X(t) = \begin{cases} 0 & t < a \\ \frac{1}{b-a} & t \in (a, b) \\ 0 & t > b, \end{cases} \qquad F_X(t) = \begin{cases} 0 & t < a \\ \frac{t-a}{b-a} & t \in (a, b) \\ 1 & t > b. \end{cases} \]
10.E.11. The density of a random vector $(X, Y, Z)$ is
\[ f(x,y,z) = \begin{cases} c(x + y + z) & \text{for } 0 < x < 1,\ 0 < y < 1,\ 0 < z < 1, \\ 0 & \text{otherwise.} \end{cases} \]
Find the value of the parameter $c$ as well as the distribution function of the vector, and compute $P(0 < X < \frac12,\ 0 < Y < \frac12,\ 0 < Z < \frac12)$.
Solution. The integral of the density over the entire space must be equal to one. This gives us
\[ 1 = \int_0^1\int_0^1\int_0^1 c(x + y + z)\,dz\,dy\,dx = c\int_0^1\int_0^1 \bigl(x + y + \tfrac12\bigr)\,dy\,dx = c\int_0^1 (x + 1)\,dx = \tfrac32c. \]
Hence, $c = \frac23$. By definition, the distribution function is equal to
\[ F(x,y,z) = \frac23\int_0^x\int_0^y\int_0^z (r + s + t)\,dt\,ds\,dr = \frac23\int_0^x\int_0^y \bigl(rz + sz + \tfrac12z^2\bigr)\,ds\,dr = \frac23\int_0^x \bigl(rzy + \tfrac12y^2z + \tfrac12z^2y\bigr)\,dr \]
\[ = \frac23\bigl(\tfrac12x^2zy + \tfrac12y^2zx + \tfrac12z^2yx\bigr) = \frac13\bigl(x^2zy + y^2zx + z^2yx\bigr) \]
for $x, y, z \in (0, 1)$, so the wanted probability is
\[ P\bigl(0 < X < \tfrac12,\ 0 < Y < \tfrac12,\ 0 < Z < \tfrac12\bigr) = F\bigl(\tfrac12, \tfrac12, \tfrac12\bigr) = \frac{1}{16}. \]
10.E.12. Find the value of the parameter $a$ so that the function
\[ f(x) = \begin{cases} 0 & \text{for } x < 1 \\ a\ln(x) & \text{for } 1 < x < 2 \\ 0 & \text{for } 2 < x \end{cases} \]
would be the probability density of a random variable.
Solution. We know that the condition for the function to be a density is
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \]
Thus, we have to calculate $\int \ln(x)\,dx$:
\[ \int \ln(x)\,dx = x\ln(x) - \int 1\,dx = x\ln(x) - x = x(\ln(x) - 1). \]
Altogether,
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_1^2 a\ln(x)\,dx = a\bigl[x(\ln(x) - 1)\bigr]_1^2 = a(2\ln(2) - 1), \]
so $a = \frac{1}{2\ln(2) - 1}$.
□
Here, the random variable $X$ has uniform distribution.
The next distribution is similar to the discrete Poisson distribution. Suppose the occurrence of a random event is observed such that its occurrences in non-overlapping intervals are independent. Thus, if $p(t)$ is the probability of the event not occurring during an interval of length $t$, then of necessity $p(t + s) = p(t)p(s)$ for all $t, s > 0$. Moreover, assume that $p$ is differentiable and $p(0) = 1$.
Then, $\ln p(t + s) = \ln p(t) + \ln p(s)$. Letting $s \to 0+$ (and applying l'Hospital's rule),
\[ (\ln p)'(t) = \lim_{s\to0+} \frac{\ln p(t+s) - \ln p(t)}{s} = \lim_{s\to0+} \frac{\ln p(s)}{s} = \lim_{s\to0+} \frac{(\ln p(s))'}{1} = \frac{p'(0)}{p(0)} = p'(0). \]
Thus, $p'(0) = -\lambda \in \mathbb{R}$. (Note: $\lambda > 0$, and $p'(0)$ cannot be positive as $p(0) = 1$.)
Then, $p(t)$ satisfies $\ln p(t) = -\lambda t + C$. The initial condition leads to the only solution
\[ p(t) = e^{-\lambda t}. \]
10.E.13. A child has become lost in a forest whose shape is that of a regular hexagon. Suppose that the probability that the child happens to be in a given part of the forest is directly proportional to the size of that part, but independent of its position in the forest.
• What is the probability distribution of the distance of the child from a given side (extended to a straight line) of the forest?
• What is the probability distribution of the distance of the child from the closest side of the forest?
Solution.
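• Let $a$ be the length of the sides of the hexagon (forest). Then, the probability distribution satisfies
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{4}{9a^2}x + \frac{2}{3\sqrt3\,a} & \text{for } 0 < x < \frac12\sqrt3\,a \\ -\frac{4}{9a^2}x + \frac{2}{\sqrt3\,a} & \text{for } \frac12\sqrt3\,a < x < \sqrt3\,a \\ 0 & \text{for } x > \sqrt3\,a \end{cases} \]
as for the first question.
• First, let us compute the distribution function $F$ of the wanted random variable $X$ that corresponds to the distance of the child from the closest side. The distance can be anywhere in the interval $I = (0, \frac{\sqrt3}{2}a)$. Then, for $y \in I$, we have
\[ F(y) = P[X < y] = 1 - \frac{\bigl(\frac{\sqrt3}{2}a - y\bigr)^2}{\frac34a^2} = 1 - \frac{4\bigl(\frac{\sqrt3}{2}a - y\bigr)^2}{3a^2}. \]
Altogether,
\[ F(y) = \begin{cases} 0 & \text{for } y < 0 \\ 1 - \frac{4\left(\frac{\sqrt3}{2}a - y\right)^2}{3a^2} & \text{for } y \in (0, \frac{\sqrt3}{2}a) \\ 1 & \text{for } y > \frac{\sqrt3}{2}a. \end{cases} \]
Thus, the density, being the derivative of the distribution function, satisfies:
\[ f(y) = \begin{cases} 0 & \text{for } y < 0 \\ \frac{8\left(\frac{\sqrt3}{2}a - y\right)}{3a^2} & \text{for } y \in (0, \frac{\sqrt3}{2}a) \\ 0 & \text{for } y > \frac{\sqrt3}{2}a. \end{cases} \]
□
10.E.14. Let a random variable $X$ have uniform distribution on an interval $(0, r)$. Find the distribution function and probability density of the volume of the ball whose radius is equal to $X$.
Now, consider the random variable $X$ which corresponds to a (random) moment when the event occurs for the first time. Apparently, the distribution function of $X$ is given by
\[ F_X(t) = 1 - p(t) = \begin{cases} 1 - e^{-\lambda t} & t > 0 \\ 0 & t \le 0. \end{cases} \]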
This function has the desired properties: It has values between zero and one, it is increasing and it has the required behaviour at ±oo. The density of this random variable can be obtained by differentiation of the distribution function.
Exponential distribution
The distribution corresponding to the continuous random variable $X$ with density
\[ f_X(t) = \begin{cases} \lambda e^{-\lambda t} & t > 0 \\ 0 & t \le 0 \end{cases} \]
is called the exponential distribution $\mathrm{ex}(\lambda)$.
The exponential distribution belongs to the more general family of important distributions with the densities of the form
\[ c\,x^{a-1}e^{-bx} \]
for $x > 0$, with given constants $a > 0$, $b > 0$, while the constant $c$ is to be computed. The following expression is required to equal one:
\[ c\int_0^{\infty} x^{a-1}e^{-bx}\,dx = c\int_0^{\infty} \Bigl(\frac{t}{b}\Bigr)^{a-1}e^{-t}\,\frac1b\,dt = \frac{c}{b^a}\,\Gamma(a). \]
$\Gamma$ is the famous transcendental function providing the analytic extension of the factorial function, discussed in 6.2.17 on page 411.
Gamma distribution
The distribution whose density is zero for $x \le 0$, while for $x > 0$ it is given by
\[ f(x) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}, \]
is called the gamma distribution $\Gamma(a, b)$ with parameters $a > 0$, $b > 0$.
Thus, the exponential distribution is the special case of this one for the value a = 1.
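The normalizing constant $c = b^a/\Gamma(a)$ can be sanity-checked numerically. The sketch below is only an illustration: it uses `math.gamma` from the Python standard library and a plain midpoint rule, and the names `gamma_density` and `midpoint_integral` are our own.

```python
from math import exp, gamma


def gamma_density(x: float, a: float, b: float) -> float:
    """Density of the gamma distribution with parameters a > 0, b > 0."""
    if x <= 0:
        return 0.0
    return b**a / gamma(a) * x ** (a - 1) * exp(-b * x)


def midpoint_integral(f, lo: float, hi: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    # the integral over (0, 40) captures essentially all of the mass here
    print(midpoint_integral(lambda x: gamma_density(x, 2.5, 1.5), 0.0, 40.0))
    # for a = 1 the density reduces to the exponential density b * exp(-b x)
    print(gamma_density(0.7, 1.0, 2.0), 2.0 * exp(-2.0 * 0.7))
```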
10.2.21. Normal distribution. Recall the binomial distribution. If the success rate $p$ is left constant, but the number $n$ of experiments is increased, the probability mass function keeps its shape (although the scale changes). As $n$ increases, the values of the probability mass function merge into a curve that should correspond to the density of a continuous distribution which is a good approximation for $\mathrm{Bi}(n, p)$ for large values of $n$.
Recall the smooth function $y = e^{-x^2/2}$, mentioned in subsection 6.1.5 (page 377) as an appropriate tool for the construction of functions which are smooth but not analytic. The
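Solution. First, we find the distribution function $F$. For $0 < d < \frac43\pi r^3$, we have
\[ F(d) = P\Bigl[\tfrac43\pi X^3 < d\Bigr] = P\Bigl[X < \sqrt[3]{\tfrac{3d}{4\pi}}\Bigr] = \sqrt[3]{\frac{3d}{4\pi r^3}}. \]
Altogether,
\[ F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \sqrt[3]{\frac{3x}{4\pi r^3}} & \text{for } 0 < x < \frac43\pi r^3 \\ 1 & \text{for } x > \frac43\pi r^3. \end{cases} \]
Differentiating this, we obtain the density:
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac13\sqrt[3]{\frac{3}{4\pi r^3}}\;x^{-2/3} & \text{for } 0 < x < \frac43\pi r^3 \\ 0 & \text{for } x > \frac43\pi r^3. \end{cases} \]
□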
10.E.15. Find the value(s) of the parameter $a \in \mathbb{R}$ so that the function
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ ax^2 & \text{for } 0 < x < 3 \\ 0 & \text{for } x > 3 \end{cases} \]
defines the probability density of a random variable $X$. Then, find its distribution function, probability density, and the expected value of the volume of the cube whose edge-length has probability density determined by $f$.
Solution. Simply, $a = \frac19$. Thus, the distribution function of the random variable $X$ is $F_X(t) = \frac{1}{27}t^3$ for $t \in (0, 3)$, zero for smaller values of $t$, and one for greater. Let $Z = X^3$ denote the random variable corresponding to the volume of the considered cube. It lies in the interval $(0, 27)$. Thus, for $t \in (0, 27)$ and the distribution function $F_Z$ of the random variable $Z$, we can write $F_Z(t) = P[Z < t] = P[X^3 < t] = P[X < \sqrt[3]{t}] = F_X(\sqrt[3]{t}) = \frac{1}{27}t$. Then, the density is $f_Z(t) = \frac{1}{27}$ on the interval $(0, 27)$ and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 13.5. □
10.E.16. Find the value(s) of the parameter $a \in \mathbb{R}$ so that the function
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ ax & \text{for } 0 < x < 3 \\ 0 & \text{for } x > 3 \end{cases} \]
defines the probability density of a random variable $X$. Then, find its distribution function, probability density, and the expected value of the area of the square whose side-length has probability density determined by $f$.
illustration compares this curve (in the right hand part) to the values of $\mathrm{Bi}(40, 0.5)$.
This suggests looking for a convenient continuous distribution whose density would be given by a suitably adjusted variation of this function.
The function $e^{-x^2/2}$ is everywhere positive, so it suffices to compute $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. If this results in a finite value, just multiply the function by its reciprocal value. Unfortunately, this integral cannot be computed in terms of elementary functions. Luckily, multidimensional integration and Fubini's theorem can be used. Transform to polar coordinates, to obtain
\[ \Bigl(\int_{-\infty}^{\infty} e^{-x^2/2}\,dx\Bigr)^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\,r\,dr\,d\theta = 2\pi \]
(cf. the notes at the end of subsection 8.2.5, verify that the integrated function satisfies the conditions given there, and compute that thoroughly!). Hence the integral results in $\sqrt{2\pi}$, so the function $f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$ is a well-defined density of a random variable.
Normal distribution
The distribution of the random variable $Z$ with density
\[ \varphi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2} \]
is called the (standard) normal distribution $N(0,1)$. The corresponding distribution function
\[ F_Z(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx \]
cannot be expressed in terms of elementary functions.
The density $\varphi$ is called the Gaussian function and the graph of $\varphi(x)$ is often called the Gaussian curve.
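Although the distribution function of $N(0,1)$ is not elementary, it is easy to evaluate numerically. The Python sketch below is an illustration under our own naming (`phi`, `normal_cdf`, `midpoint_integral`). It relies on the standard identity expressing the normal distribution function through the error function `math.erf`, which is not discussed in the text, and it checks the value $\sqrt{2\pi}$ of the Gaussian integral with a midpoint rule.

```python
from math import erf, exp, pi, sqrt


def phi(z: float) -> float:
    """Density of the standard normal distribution N(0, 1)."""
    return exp(-z * z / 2) / sqrt(2 * pi)


def normal_cdf(z: float) -> float:
    """Distribution function of N(0, 1) via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def midpoint_integral(f, lo: float, hi: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    # the tails beyond |x| = 10 are negligible at this precision
    print(midpoint_integral(lambda x: exp(-x * x / 2), -10.0, 10.0), sqrt(2 * pi))
    print(midpoint_integral(phi, -10.0, 1.0), normal_cdf(1.0))
```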
So far, the correct density which approximates the binomial distribution is not found. The diagram that compares the probability function of the binomial distribution to the Gaussian curve shows that the position of the maximum must be moved and that the curve must be shrunk or stretched horizontally. The first goal is easily reached by a constant shift $\mu$ of the variable $z$, while scaling the difference $x - \mu$ by a coefficient $\sigma > 0$ does the rest. Thus, there are two real parameters $\mu$ and $\sigma > 0$ and the density function is proportional to
\[ e^{-(x-\mu)^2/(2\sigma^2)}. \]
Simple variable substitution leads to
\[ \int_{-\infty}^{\infty} e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma\sqrt{2\pi}. \]
Thus there is an entire two-parametric class of densities
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/(2\sigma^2)} \]
of random variables. The corresponding distributions are denoted by $N(\mu, \sigma^2)$.
We return to the asymptotic closeness of the normal and binomial distributions for $n \to \infty$ after creating suitable tools. The following illustration reveals how well this works. The discrete values correspond to $\mathrm{Bi}(40, 0.5)$, while the curve depicts the density of $N(20, 10)$.
10.2.22. Distributions of random vectors. As for the scalar random variables, one defines the distribution functions and the density or the probability mass function for continuous and discrete random vectors. There are joint probability mass functions and densities.
For two discrete random variables, i.e. a discrete vector $(X, Y)$ of random variables, define their (joint) probability mass function
\[ f(x,y) = \begin{cases} P(X = x_i \wedge Y = y_j) & x = x_i,\ y = y_j \\ 0 & \text{otherwise.} \end{cases} \]
A random vector $(X, Y)$ is called continuous, if its distribution function is defined as for continuous random variables. This means, for all $a, b \in \mathbb{R}$,
\[ F(a,b) = P(X < a,\ Y < b) = \int_{-\infty}^{b}\int_{-\infty}^{a} f(x,y)\,dx\,dy, \]
and the function $f(x, y)$ is called the (joint) density of the random vector $(X, Y)$.
Solution. We proceed similarly as in the previous example. Again, we can easily find that $a = \frac29$. Thus, the distribution function of the random variable $X$ is $F_X(t) = \frac19t^2$ for $t \in (0, 3)$, zero for smaller values of $t$, and one for greater. Let $Z = X^2$ denote the random variable corresponding to the area of the considered square. It lies in the interval $(0, 9)$. Thus, for $t \in (0, 9)$ and the distribution function $F_Z$ of the random variable $Z$, we can write $F_Z(t) = P[Z < t] = P[X^2 < t] = P[X < \sqrt{t}] = F_X(\sqrt{t}) = \frac19t$. Then, the density is $f_Z(t) = \frac19$ on the interval $(0, 9)$ and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 4.5. □
10.E.17. Find the value(s) of the parameter $a \in \mathbb{R}$ so that the function
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \dots & \text{for } 0 < x < 2 \\ 0 & \text{for } x > 2 \end{cases} \]
defines the probability density of a random variable $X$. Then, find its distribution function, probability density, and the expected value of the volume of the cube whose edge-length has probability density determined by $f$. O
10.E.18. We randomly cut a line segment of length $l$ into two pieces. Find the distribution function and the density of the area of the rectangle whose side-lengths are equal to the obtained pieces.
Solution. Let us compute the distribution function: Let $X$ denote the random variable with uniform distribution on the interval $(0, l)$, which corresponds to the length of one of the pieces (then, the length of the other piece is $l - X$). The area $S = x(l - x)$ of the rectangle, for $x \in (0, l)$, can lie anywhere in the interval $(0, l^2/4)$. Setting $d \in (0, l^2/4)$, we can write
\[ F(d) = P[S < d] = P[X(l - X) < d]. \]
Thus, we are looking for those values of $x$ for which $x(l - x) < d$, which is a quadratic inequality. The roots of the corresponding quadratic equation are
\[ \frac{l - \sqrt{l^2 - 4d}}{2} \quad\text{and}\quad \frac{l + \sqrt{l^2 - 4d}}{2}. \]
The inequality is satisfied by exactly those values of $x$ which lie outside the interval between these roots. Therefore,
\[ P[X(l - X) < d] = P\Bigl[X \in (0, l) \setminus \Bigl(\frac{l - \sqrt{l^2 - 4d}}{2}, \frac{l + \sqrt{l^2 - 4d}}{2}\Bigr)\Bigr] = \frac{l - \sqrt{l^2 - 4d}}{l} = 1 - \frac{\sqrt{l^2 - 4d}}{l}. \]
Altogether,
\[ F(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - \frac{\sqrt{l^2 - 4x}}{l} & \text{for } 0 < x < \frac{l^2}{4} \\ 1 & \text{for } x > \frac{l^2}{4}. \end{cases} \]
The density is obtained by differentiation:
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{2}{l\sqrt{l^2 - 4x}} & \text{for } 0 < x < \frac{l^2}{4} \\ 0 & \text{for } x > \frac{l^2}{4}. \end{cases} \]
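10.E.19. Independent random variables $X$ and $Y$ have the following probability densities:
\[ f_X(t) = \begin{cases} 0 & \text{for } t < 0, \\ 1 & \text{for } 0 < t < 1, \\ 0 & \text{for } 1 < t, \end{cases} \qquad f_Y(t) = \begin{cases} 0 & \text{for } t < 0, \\ 2t & \text{for } 0 < t < 1, \\ 0 & \text{for } 1 < t. \end{cases} \]
Find the distribution function of the random variable which gives the area of the rectangle with sides $X$ and $Y$.
Solution.
\[ F_{XY}(t) = \begin{cases} 0 & \text{for } t < 0 \\ 2t - t^2 & \text{for } 0 < t < 1 \\ 1 & \text{for } 1 < t \end{cases} \]
□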
10.E.20. Let $X$, $Y$ be independent random variables, where $X$ has uniform distribution on the interval $(0, 2)$ and $Y$ is given by its density function:
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 2x & \text{for } 0 < x < 1 \\ 0 & \text{for } x > 1. \end{cases} \]
Find the probability that $Y$ is less than $X^2$.
Solution. Since X and Y are independent random variables,
the joint density $f_{(X,Y)}$ of the vector $(X, Y)$ is given by the product of the densities $f_X$ and $f_Y$ of the individual random variables. Thus, we have
\[ f_{(X,Y)}(u,v) = \begin{cases} f_X(u)\cdot f_Y(v) = \frac12\cdot 2v = v & \text{for } (u,v) \in (0, 2) \times (0, 1) \\ 0 & \text{otherwise.} \end{cases} \]
Then, the wanted probability $P$ is the integral of the density $f_{(X,Y)}$ over the part $O$ of the plane where $Y < X^2$:
\[ P = \iint_{O} f_{(X,Y)}\,dx\,dy = 1 - \iint_{\mathbb{R}^2 \setminus O} f_{(X,Y)}\,dx\,dy = 1 - \int_0^1\int_{x^2}^{1} y\,dy\,dx = \frac35. \]
□
For a general continuous random vector $X = (X_1, \dots, X_n)$, define
\[ F(a_1, \dots, a_n) = P(X_1 < a_1, \dots, X_n < a_n) = \int_{-\infty}^{a_n}\cdots\int_{-\infty}^{a_1} f(x_1, \dots, x_n)\,dx_1 \cdots dx_n, \]
and similarly for discrete random vectors with more components.
A random vector $(X, Y)$ with both $X$ and $Y$ continuous is not always a continuous vector in the above sense. For example, taking a continuous variable $X$, the random vector $(X, 2X)$ is neither continuous nor discrete, since the entire probability mass is concentrated along the line $y = 2x$ in the plane, but not in individual points.
The marginal distribution for one of the variables can be obtained by summation or integration over the others.
For instance, in the case of a discrete random vector $(X, Y)$, the events $(X = x_i, Y = y_j)$ for all possible values $x_i$ and $y_j$ with non-zero probabilities for $X$ and $Y$, respectively, form an exhaustive collection of events for the vector $(X, Y)$. Thus
\[ P(X = x_i) = \sum_{j=1}^{\infty} P(X = x_i, Y = y_j), \]
which relates the marginal probability distribution of the random variable $X$ to the joint probability distribution of the random vector $(X, Y)$. In the case of continuous random vectors, proceed similarly using integrals instead of sums.
10.2.23. Stochastic independence. It is known from subsection 10.2.3 what (in)dependence means for events. Random variables $X_1, \dots, X_n$ are (stochastically) independent if and only if for any $a_i \in \mathbb{R}$, the events $X_1 < a_1$, ..., $X_n < a_n$ are independent. In view of the definition of the distribution function $F$ of the random vector $(X_1, \dots, X_n)$, this is equivalent to
\[ F(x_1, \dots, x_n) = F_{X_1}(x_1) \cdots F_{X_n}(x_n), \]
where $F_{X_i}$ are the distribution functions of the individual components.
It follows that the events corresponding to $X_k \in I_k$ for arbitrarily chosen intervals $I_k$ are also independent. The probability of $X_1 \in [a, b)$ and simultaneously $X_i \in (-\infty, c_i)$ for the other components is
\[ F(b, c_2, \dots, c_n) - F(a, c_2, \dots, c_n) = \bigl(F_{X_1}(b) - F_{X_1}(a)\bigr)F_{X_2}(c_2) \cdots F_{X_n}(c_n), \]
and so on. The densities and probability mass functions behave well too:
Proposition. For any random vector $(X_1, \dots, X_n)$, the following two conditions are equivalent:
• The random variables $X_1, \dots, X_n$ are stochastically independent.
• The joint distribution function $F$ of the random vector $(X_1, \dots, X_n)$ is the product of the marginal distribution functions $F_{X_i}$ of the individual components.
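10.E.21. Let $X$, $Y$ be independent random variables, where $X$ has density function
\[ f_1(x) = \begin{cases} 0 & \text{for } x < 0 \\ 2x & \text{for } 0 < x < 1 \\ 0 & \text{for } x > 1, \end{cases} \]
and $Y$ has density function
\[ f_2(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{x}{2} & \text{for } 0 < x < 2 \\ 0 & \text{for } x > 2. \end{cases} \]
Find the probability that $Y$ is greater than $X^2$. O
Solution. $f_{(X,Y)}(u,v) = uv$ for $(u,v) \in (0,1) \times (0,2)$, $f_{(X,Y)}(u,v) = 0$ otherwise. Then, the wanted probability is
\[ P = \int_0^1\int_{x^2}^{2} xy\,dy\,dx = \frac{11}{12}. \]
□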
10.E.22. Let $X$, $Y$ be independent random variables, where $X$ has density function
C 0    for i<0
f(x) =1  ?f   for   0 < x < 3 { 0    for   x > 1,
and $Y$ has density function
C 0   for s<0
f(x) = 1   §   for   0 < x < 2 [ 0   for x>2.
Find the probability that $Y$ is greater than $X^3$.
Solution.
p = J<T &xydydx = #
o
□
F. Expected value, correlation
Compute the expected value and variance of the binomial distribution.
Solution. The direct calculation from the definitions is a nice exercise on combinatorics. We instead use the properties of the expected value and variance. Using the definition of the binomial distribution (see 10.2.17), we can view the random variable $X \sim \mathrm{Bi}(n,p)$ as the sum $X = \sum_{k=1}^{n} Y_k$, where $Y_1, \dots, Y_n \sim A(p)$ are independent random variables saying whether the $k$-th experiment was successful. Clearly, the Bernoulli distribution has expected value $E\,Y_k = p$, hence by theorem 10.2.29, we have $E\,X = \sum_{k=1}^{n} E\,Y_k = np$. Similarly, we compute $E(Y_k^2) = 1^2 \cdot p + 0^2 \cdot (1 - p) = p$, so $\operatorname{var} Y_k = E(Y_k^2) - (E\,Y_k)^2 = p - p^2$. By theorem 10.2.33, we have $\operatorname{var} X = \sum_{k=1}^{n} \operatorname{var} Y_k = np(1 - p)$. □
Moreover, if all $X_i$ are discrete random variables, then they are independent if and only if the joint probability mass function $f$ of the random vector $(X_1, \dots, X_n)$ is the product of the marginal probability mass functions $f_{X_i}$ of the individual components.
Similarly, if all $X_i$ are continuous random variables, then they are independent if and only if the joint density function $f$ of the random vector $(X_1, \dots, X_n)$ exists and it is the product of the marginal density functions $f_{X_i}$ of the individual components.
In particular, any random vector with independent continuous components is again a continuous random vector.
Proof. Many of the claims are already verified. The only nontrivial implication left is the one assuming the product formula for the joint distribution function and deriving the claim on the probability function or the density. The argument for n = 2 is shown below, the general case is analogous.
Consider first two discrete independent random variables $X$, $Y$. Then $f_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j) = P(X = x_i)P(Y = y_j) = f_X(x_i)f_Y(y_j)$. The joint distribution function is
\[ F_{X,Y}(x,y) = \sum_{x_i < x,\ y_j < y} f_X(x_i)f_Y(y_j) = \Bigl(\sum_{x_i < x} f_X(x_i)\Bigr)\Bigl(\sum_{y_j < y} f_Y(y_j)\Bigr), \]
which is equivalent to $f_{X,Y}(x, y) = f_X(x)f_Y(y)$.
Similarly, assuming that the joint distribution function $F_{X,Y}$ is the product of the distribution functions of two continuous random variables $X$, $Y$, then its mixed partial derivatives exist. Thus we set:
\[ f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y}F_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y}\bigl(F_X(x)F_Y(y)\bigr) = f_X(x)f_Y(y), \]
which is the requested joint density function for $F_{X,Y}$.
All the other implications are either direct consequences of the definitions or are obvious. □
10.2.24. Example. Consider a simple example which illustrates that it is not a good idea to view a random vector only as a pair of random variables. Consider stochastic properties of a random vector $(X, Y)$ which has continuous uniform distribution on the unit disc in the plane $\mathbb{R}^2$, centered at the origin. Then, its (joint) density function is
\[ f(x,y) = \begin{cases} \frac{1}{\pi} & x^2 + y^2 \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
The components $X$ and $Y$ of this random vector (in the usual Euclidean coordinates) cannot be independent random variables: For instance, note that the probability of $(X, Y)$ falling outside the unit disc but inside the square with vertices at
10.F.1. An archer shoots five arrows at a target. Each time, the probability he hits is 0.6, and the individual results are independent. Let X be the random variable which corresponds to the number of hits. Determine its distribution and find its expected value and variance.
Solution. Clearly, the shots are independent experiments with the Bernoulli distribution. Thus, by the definition of the binomial distribution, we have $X \sim \mathrm{Bi}(5, \frac{3}{5})$. Recall that the expected value and variance of $\mathrm{Bi}(n, p)$ are equal to $np$ and $np(1-p)$, respectively, which gives $\mathrm{E}X = 3$ and $\operatorname{var} X = \frac{6}{5}$ for our case.
□
10.F.2. Consider the discrete random variable $X$ which takes on the values $k = 0,1,2,3,\dots$ with probabilities $P(X = k) = p(1-p)^k$ (geometric distribution). Find $\mathrm{E}X$ (the expected number of failures before the first success) and $\operatorname{var} X$.
Solution. Using the definition of the expected value and the formula for summing the derivative of a geometric series, we calculate
$$\mathrm{E}X = \sum_{k=0}^{\infty} k\,p(1-p)^k = p(1-p)\sum_{k=0}^{\infty} k(1-p)^{k-1} = p(1-p)\frac{1}{p^2} = \frac{1-p}{p}.$$
Similarly, using the formula for summing the second derivative of a geometric series, we compute
$$\mathrm{E}(X^2) = \sum_{k=0}^{\infty} k^2 p(1-p)^k = \frac{(1-p)(2-p)}{p^2},$$
hence the variance is $\operatorname{var} X = \mathrm{E}(X^2) - (\mathrm{E}X)^2 = \frac{1-p}{p^2}$.
□
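10.F.3. A random variable $X$ is defined by its density $f_X(x) = \frac{3}{x^4}$ for $x \in (1, \infty)$ and $f_X(x) = 0$ elsewhere. Find its distribution function, expected value, and variance.
Solution. By the definition of the distribution function, we have, for $x \in (1, \infty)$,
$$F_X(x) = \int_1^x \frac{3}{t^4}\,dt = \left[-\frac{1}{t^3}\right]_1^x = 1 - \frac{1}{x^3}.$$
The expected value of $X$ is equal to
$$\mathrm{E}X = \int_1^\infty \frac{3}{x^3}\,dx = \left[-\frac{3}{2x^2}\right]_1^\infty = \frac{3}{2},$$
and the expected value of $X^2$ is
$$\mathrm{E}(X^2) = \int_1^\infty \frac{3}{x^2}\,dx = \left[-\frac{3}{x}\right]_1^\infty = 3.$$
Therefore, $\operatorname{var} X = 3 - (\frac{3}{2})^2 = \frac{3}{4}$.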
□
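$(\pm 1, \pm 1)$ is zero, while the marginal densities are non-zero for the values $|x| < 1$ and $|y| < 1$.
Expressing this random vector in polar coordinates $(R, \Phi)$, we obtain
$$P(R \le r_0, \Phi \le \varphi_0) = \int_0^{r_0}\int_0^{\varphi_0} \frac{1}{\pi}\,r\,d\varphi\,dr = \frac{1}{2\pi}r_0^2\varphi_0.$$
The joint density of the vector $(R, \Phi)$ is thus $f(r, \varphi) = \frac{r}{\pi}$ for $0 < r < 1$, $0 < \varphi < 2\pi$, and it is zero otherwise. The marginal densities are
$$f_R(r) = \int_0^{2\pi} \frac{r}{\pi}\,d\varphi = 2r \quad \text{if } 0 < r < 1, \qquad f_\Phi(\varphi) = \int_0^1 \frac{r}{\pi}\,dr = \frac{1}{2\pi} \quad \text{if } 0 < \varphi < 2\pi,$$
and zero otherwise. Therefore, the random variables $R$ and $\Phi$ are independent, since $f(r, \varphi) = f_R(r)f_\Phi(\varphi)$. This is a very important feature in mathematical statistics.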
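The contrast between Cartesian and polar components of the uniform distribution on the disc can be checked empirically. The following is only a rough sketch, not part of the original exposition: the function names, the sample size, the threshold 0.8 and the seed are arbitrary choices made for illustration.

```python
import math
import random

def sample_disc(rng):
    """Draw one point uniformly from the unit disc by rejection sampling."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

def disc_experiment(n=20000, t=0.8, seed=0):
    """Estimate a few probabilities for Cartesian and polar components."""
    rng = random.Random(seed)
    pts = [sample_disc(rng) for _ in range(n)]
    # Cartesian components: X > t and Y > t cannot both happen when 2*t*t > 1.
    p_x = sum(x > t for x, _ in pts) / n
    p_y = sum(y > t for _, y in pts) / n
    p_xy = sum(x > t and y > t for x, y in pts) / n
    # Polar components: compare P(R <= 1/2, Phi <= pi) with the product.
    polar = [(math.hypot(x, y), math.atan2(y, x) % (2 * math.pi)) for x, y in pts]
    p_r = sum(r <= 0.5 for r, _ in polar) / n
    p_phi = sum(phi <= math.pi for _, phi in polar) / n
    p_rphi = sum(r <= 0.5 and phi <= math.pi for r, phi in polar) / n
    return p_x, p_y, p_xy, p_r, p_phi, p_rphi
```

The joint frequency `p_xy` is exactly zero although `p_x` and `p_y` are positive, while `p_rphi` should be close to the product `p_r * p_phi`, in agreement with the independence of $R$ and $\Phi$.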
10.2.25. Functions of random variables. In practice, random vectors are encountered in two quite different roles. Firstly, we can observe several random variables which describe more or less related events. As an example, examine various numerical parameters connected to individual students (their results in particular courses, weight, height, age, annual income, etc.). In this case, tools are needed which allow the examination of differences or dependencies between these variables.
Secondly, we can examine only one of the parameters on a large collection of objects and select only a small number $n$ of them. This procedure is described by an $n$-dimensional vector $(X_1, \dots, X_n)$ where all the random variables $X_k$ have the same probability distribution. In this case, we are more interested in the quantities that correspond to the statistical numerical characteristics discussed in the previous part of this chapter.
Both cases can be dealt with using one simple concept. Instead of the given random variable or random vector, consider functions of those.
This is a useful tool even in the case of one random variable. As an example, consider the random variable $X$ with uniform distribution over the interval $[1,2] \subset \mathbb{R}$ giving the length of the side of a square, asking for the random variable $Y = X^2$ describing the area of such a square. The problem is to see the stochastic behaviour of $Y$ in terms of the known parameters of $X$.
The obvious technical condition on a function $\psi$ is to guarantee that $Y = \psi \circ X$ is again a random variable according to the definition. This means the preimage $\psi^{-1}(A)$ of a Borel-measurable set $A$ should again be a Borel-measurable set.
Elementary arguments reveal that $\psi^{-1}(A \setminus B) = \psi^{-1}(A) \setminus \psi^{-1}(B)$ and $\psi^{-1}(\bigcup_i A_i) = \bigcup_i \psi^{-1}(A_i)$, for subsets $A, B, A_i \subset \mathbb{R}$. Since each open subset in $\mathbb{R}^n$ is a countable union of intervals, and the pre-images $\psi^{-1}(U)$ are open for continuous functions $\psi$ and open sets $U$, a continuous function $\psi$ always satisfies the condition.
10.F.4. A random variable $X$ is defined by its density $f_X(x) = \cos x$ for $x \in (0, \frac{\pi}{2})$ and $f_X(x) = 0$ elsewhere. Find its expected value, variance, and median.
Solution. Using the definition and integration by parts, we get
$$\mathrm{E}X = \int_0^{\pi/2} x\cos x\,dx = \bigl[x\sin x + \cos x\bigr]_0^{\pi/2} = \frac{\pi}{2} - 1.$$
Using double integration by parts, we obtain
$$\mathrm{E}(X^2) = \int_0^{\pi/2} x^2\cos x\,dx = \bigl[x^2\sin x + 2x\cos x - 2\sin x\bigr]_0^{\pi/2} = \Bigl(\frac{\pi}{2}\Bigr)^2 - 2,$$
so the variance is equal to $\operatorname{var} X = (\frac{\pi}{2})^2 - 2 - (\frac{\pi}{2} - 1)^2 = \pi - 3$. By definition, the distribution function is equal to $F_X(x) = \int_0^x \cos t\,dt = \sin x$ for $x \in (0, \frac{\pi}{2})$, and the median is $F_X^{-1}(0.5) = \frac{\pi}{6}$. □
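10.F.5. A random variable $X$ is defined by its density $f_X(x) = \lambda e^{-\lambda x}$ for $x > 0$, and $f_X(x) = 0$ elsewhere (the so-called exponential distribution; $\lambda > 0$ is a fixed parameter). Find its expected value, variance, mode (the real number where the density reaches its maximum), and median.
Solution. Using the definition and integration by parts, we get
$$\mathrm{E}X = \int_0^\infty x\lambda e^{-\lambda x}\,dx = \bigl[-xe^{-\lambda x}\bigr]_0^\infty + \int_0^\infty e^{-\lambda x}\,dx = \frac{1}{\lambda},$$
$$\mathrm{E}(X^2) = \int_0^\infty x^2\lambda e^{-\lambda x}\,dx = \bigl[-x^2e^{-\lambda x}\bigr]_0^\infty + \frac{2}{\lambda}\int_0^\infty x\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2},$$
hence $\operatorname{var} X = \mathrm{E}(X^2) - (\mathrm{E}X)^2 = \frac{1}{\lambda^2}$. Since $f_X'(x) = -\lambda^2 e^{-\lambda x} < 0$ for $x > 0$, the density keeps decreasing. Therefore, its maximum is at zero. By definition, we have
$$F(x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x},$$
so the median is equal to $F^{-1}(0.5) = -\frac{1}{\lambda}\ln\frac{1}{2} = \frac{\ln 2}{\lambda}$.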
□
10.F.6. The joint probability mass function of a discrete random vector $(X_1, X_2)$ is defined by $\pi(0, -1) = c$, $\pi(1, 0) = \pi(1, 1) = \pi(2, 1) = 2c$, $\pi(2, 0) = 3c$ and zero elsewhere. Find the parameter $c$ and compute the covariance $\operatorname{cov}(X_1, X_2)$.
Solution. If $\pi$ is to be a probability mass function, then the sum of its values over the entire domain must be equal to 1, i.e.,
$$\sum_{i,j} \pi(i, j) = c + 3\cdot 2c + 3c = 10c = 1,$$
In the sequel, we restrict ourselves to this case.
Functions of random variables and vectors
For a continuous function $\psi: \mathbb{R} \to \mathbb{R}$ and a random variable $X$, there is also the random variable $Y = \psi(X)$. $Y$ is said to be a function of the random variable $X$.
In the case of a random vector $(X_1, \dots, X_n)$ and a continuous function $\psi: \mathbb{R}^n \to \mathbb{R}$, we talk about a function $Y = \psi(X_1, \dots, X_n)$ of the random vector.
It is useful to know whether independent random variables remain independent after transformations. The answer and its verification are simple:
Proposition. Consider two independent random variables $X$ and $Y$ and two functions $g$ and $h$ such that $U = g(X)$, $V = h(Y)$ are random variables. Then $U$ and $V$ are independent, too.
Proof. For fixed reals $u$, $v$, write
$$A_u = \{x;\ g(x) \le u\}, \qquad B_v = \{y;\ h(y) \le v\}.$$
Then the joint distribution function for the vector $(U, V)$ is
$$F_{U,V}(u, v) = P(U \le u, V \le v) = P(X \in A_u, Y \in B_v) = P(X \in A_u)P(Y \in B_v) = P(U \le u)P(V \le v) = F_U(u)F_V(v).$$
Thus, the transformed random variables are stochastically independent, as expected. □
10.2.26. Affine transformations and sums. The simplest function (except constants) is an affine dependency
$$\psi(X) = a + bX$$
with constants $a, b \in \mathbb{R}$, $b \ne 0$.
If $f_X(x)$ is the probability mass function of a random variable with discrete distribution, it is easily computed that
(1)  $$f_{\psi(X)}(y) = P(\psi(X) = y) = \sum_{\psi(x_i) = y} f_X(x_i).$$
Thus, in the case of the affine dependency $Y = a + bX$, the probability mass function is non-zero exactly at the points $y_i = a + bx_i$.
As an example of a function of a random vector $X = (X_1, \dots, X_n)$, consider the sum of $n$ independent random variables with the Bernoulli distribution $X_i \sim A(p)$. Of course, this leads just to the binomial distribution $\mathrm{Bi}(n, p)$. The above formula for $f_{\psi(X)}(y)$ reveals the already known probability function for $Y = X_1 + \dots + X_n$. Only $y \in \{0, \dots, n\}$ can be reached. Collect all the possibilities of summing $y$ ones, when each of them appears with probability $p^y(1-p)^{n-y}$.
Similarly, proceed with continuous random variables. The two-parameter family $Y = \mu + \sigma Z$ is met already, where
so $c = \frac{1}{10}$. The probability mass function $\pi_1$ of $X_1$ is given by the sum of the joint function over all possible values of $X_2$, i.e., $\pi_1(i) = \sum_j \pi(i, j)$. Thus, we have $\pi_1(0) = c$, $\pi_1(1) = 4c$, $\pi_1(2) = 5c$ and zero elsewhere. Similarly, for the probability mass function $\pi_2$ of $X_2$, we get $\pi_2(-1) = c$, $\pi_2(0) = 5c$, $\pi_2(1) = 4c$ and zero elsewhere. Hence, $\mathrm{E}X_1 = \sum_i i\,\pi_1(i) = 14c = 1.4$ and $\mathrm{E}X_2 = \sum_j j\,\pi_2(j) = 3c = 0.3$. By the definition of the covariance, we have
$$\operatorname{cov}(X_1, X_2) = \sum_{i,j} (i - 1.4)(j - 0.3)\,\pi(i, j) = 0.18.$$
□
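The computation in 10.F.6 is easy to reproduce with exact rational arithmetic. The sketch below is merely an illustration; the dictionary `weights` (the multiples of $c$ from the problem statement) and all variable names are chosen here for convenience.

```python
from fractions import Fraction

# Multiples of c assigned to the points (i, j) in the problem statement.
weights = {(0, -1): 1, (1, 0): 2, (1, 1): 2, (2, 1): 2, (2, 0): 3}

# Normalization: the values of the probability mass function must sum to 1.
c = Fraction(1, sum(weights.values()))
pmf = {point: w * c for point, w in weights.items()}

# Expected values of the components and the covariance by definition.
e1 = sum(i * p for (i, j), p in pmf.items())
e2 = sum(j * p for (i, j), p in pmf.items())
cov = sum((i - e1) * (j - e2) * p for (i, j), p in pmf.items())
```

With these definitions, `c`, `e1`, `e2` and `cov` come out as the fractions 1/10, 7/5, 3/10 and 9/50, matching the values 0.1, 1.4, 0.3 and 0.18 above.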
10.F.7. In many scientific fields, the behavior of a random variable which is bounded to an interval is modeled using the so-called beta distribution. This is a continuous distribution defined by its density on the interval $[0,1]$:
$$f_X(x) = \frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1},$$
where $\alpha$, $\beta$ are fixed parameters, chosen suitably for the description of the given random variable, and $B(\alpha,\beta)$ is a normalizing constant, guaranteeing that the integral of $f_X(x)$ over $[0,1]$ is equal to one. Find its a) mode, b) expected value, and c) variance.
Solution. a) By definition, the mode is the value where $f_X(x)$ reaches its maximum. Thus, let us look at its stationary points. We can easily calculate that the equation $f_X'(x) = 0$ is equivalent to
$$(\alpha - 1)(1 - x) - x(\beta - 1) = 0,$$
and this is satisfied for $x = \frac{\alpha-1}{\alpha+\beta-2}$. Since $f_X(0) = f_X(1) = 0$ and the function is positive, it must be the wanted maximum.
b) By definition, we have
$$\mathrm{E}X = \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha}(1-x)^{\beta-1}\,dx.$$
Integrating by parts, we get
$$\mathrm{E}X = -\frac{1}{B(\alpha,\beta)\beta}\bigl[x^{\alpha}(1-x)^{\beta}\bigr]_0^1 + \frac{\alpha}{B(\alpha,\beta)\beta}\int_0^1 x^{\alpha-1}(1-x)^{\beta}\,dx.$$
Clearly, the first term is equal to zero. Refining the second one, we obtain
$$\mathrm{E}X = \frac{\alpha}{B(\alpha,\beta)\beta}\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx - \frac{\alpha}{B(\alpha,\beta)\beta}\int_0^1 x^{\alpha}(1-x)^{\beta-1}\,dx.$$
Now, the integral in the first term is, thanks to the normalization, equal to $B(\alpha,\beta)$, and the second integral is the expected
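$Z \sim N(0,1)$ in 10.2.21. This is verified easily:
$$F_Y(y) = P(Y \le y) = P(\mu + \sigma Z \le y) = \int_{-\infty}^{\frac{y-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx,$$
where the substitution $x = \mu + \sigma z$ is used in the last step. This is exactly what is wanted.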
More generally, the above formula (1) has a straightforward analog for the density of $Y = \psi(X)$ for a continuous $X$ in the case when $\psi$ has a non-zero derivative (thus $\psi$ is invertible):
(2)  $$f_Y(y) = f_X(\psi^{-1}(y))\,\left|\frac{d\psi^{-1}(y)}{dy}\right|.$$
Check the formula yourself! (Start with the case when the derivative $\psi'$ is always positive.)
It is more complicated with more general sums of independent random variables. Consider two such continuous random variables $X$ and $Y$ with densities $f_X$ and $f_Y$, respectively. The distribution function of the random variable $V = X + Y$ is computed directly (exploit the independence of $X$ and $Y$ to write the joint density of $(X, Y)$ as a product):
$$F_V(u) = \iint_{x+y\le u} f_X(x)f_Y(y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{u-x} f_X(x)f_Y(y)\,dy\,dx = \int_{-\infty}^{u}\left(\int_{-\infty}^{\infty} f_X(x)f_Y(v-x)\,dx\right)dv,$$
where the substitution $v = x + y$ is used together with the Fubini theorem. Thus, the density of the sum of two independent random variables is just the convolution of their densities
(3)  $$f_V = f_X * f_Y,$$
already met in subsection 7.2.2 (page 473). Similarly, there is a discrete convolution of probability mass functions in the case of discrete random variables.
In the seventh chapter, we viewed the convolution as a kind of blurry picture of one of the functions with the help of the kernel expressed by the other. This should be the right intuition for the density of the sum of independent random variables as well. Of course, this also suggests that the convolution must be symmetric in the arguments.
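The discrete convolution of probability mass functions can be sketched in a few lines. The following is an illustrative sketch only; the function names `convolve_pmf` and `bernoulli_sum` are made up here, and probability mass functions are represented as dictionaries mapping values to exact fractions.

```python
from fractions import Fraction
from math import comb

def convolve_pmf(f, g):
    """PMF of X + Y for independent X ~ f and Y ~ g (dicts value -> probability)."""
    h = {}
    for x, px in f.items():
        for y, py in g.items():
            h[x + y] = h.get(x + y, 0) + px * py
    return h

def bernoulli_sum(n, p):
    """PMF of the sum of n independent Bernoulli variables with parameter p."""
    pmf = {0: Fraction(1)}      # the empty sum equals 0 with probability 1
    bernoulli = {0: 1 - p, 1: p}
    for _ in range(n):
        pmf = convolve_pmf(pmf, bernoulli)
    return pmf

# Five independent trials with success probability 3/5.
p = Fraction(3, 5)
pmf5 = bernoulli_sum(5, p)
binomial5 = {k: comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(6)}
```

Here `pmf5` coincides with the binomial probabilities `binomial5`, illustrating that the sum of independent Bernoulli variables has the distribution $\mathrm{Bi}(n, p)$.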
10.2.27. Numerical characteristics. When examining some values (of a measurement, for example) from the statistical point of view, important numerical characteristics like the arithmetic mean and the standard deviation are looked for. Now we introduce similar characteristics for random variables and
value, too. Thus, the above equation can be written as
$$\mathrm{E}X = \frac{\alpha}{\beta} - \frac{\alpha}{\beta}\,\mathrm{E}X.$$
Hence it immediately follows that $\mathrm{E}X = \frac{\alpha}{\alpha+\beta}$.
c) In order to compute the variance, we must calculate
random vectors. The first one is an analogy of the arithmetic mean.
Expected value
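$$\mathrm{E}X^2 = \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha+1}(1-x)^{\beta-1}\,dx.$$
This integral can be computed similarly as in b):
$$\mathrm{E}X^2 = \frac{\alpha+1}{B(\alpha,\beta)\beta}\int_0^1 x^{\alpha}(1-x)^{\beta}\,dx = \frac{\alpha+1}{\beta}\,\mathrm{E}X - \frac{\alpha+1}{\beta}\,\mathrm{E}X^2.$$
Hence, $\mathrm{E}X^2 = \frac{(\alpha+1)\alpha}{(\alpha+\beta+1)(\alpha+\beta)}$. Substituting the expected value, we obtain
$$\operatorname{var} X = \mathrm{E}X^2 - (\mathrm{E}X)^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}.$$
□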
10.F.8. We toss three coins. Let X denote the total number of heads on the first and second coins, and Y denote the total number of heads on the second and third coins.
Solution. First of all, we build the table describing the joint probability mass function of the discrete random vector (X, Y), whence we can get the probability distributions of the variables we will need:
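Y \ X	0	1	2
0	1/8	1/8	0
1	1/8	1/4	1/8
2	0	1/8	1/8
The discrete variables $X$ and $Y$ have the same probability distribution: they take on the value 0 with probability 1/4, 1 with probability 1/2, and 2 with probability 1/4. (Of course, we could have come to this even without the table.) The variable $XY$ takes on the values 0, 1, 2, 4 with probabilities 3/8, 1/4, 1/4, 1/8, respectively. Now, we compute the expected values of the variables $X$, $X^2$, $Y$, $Y^2$, $XY$:
$$\mathrm{E}(X) = \mathrm{E}(Y) = 0\cdot\tfrac{1}{4} + 1\cdot\tfrac{1}{2} + 2\cdot\tfrac{1}{4} = 1,$$
$$\mathrm{E}(X^2) = \mathrm{E}(Y^2) = 0\cdot\tfrac{1}{4} + 1\cdot\tfrac{1}{2} + 4\cdot\tfrac{1}{4} = \tfrac{3}{2},$$
$$\mathrm{E}(XY) = 0\cdot\tfrac{3}{8} + 1\cdot\tfrac{1}{4} + 2\cdot\tfrac{1}{4} + 4\cdot\tfrac{1}{8} = \tfrac{5}{4}.$$
Thus, we have
$$\sigma^2(X) = \sigma^2(Y) = \mathrm{E}(X^2) - [\mathrm{E}(X)]^2 = \tfrac{1}{2},$$
$$\operatorname{cov}(X, Y) = \mathrm{E}(XY) - \mathrm{E}(X)\mathrm{E}(Y) = \tfrac{1}{4}.$$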
For any random variable $X$, define its expected value $\mathrm{E}X$ by
$$\mathrm{E}X = \begin{cases} \sum_i x_i f_X(x_i) & \text{for a discrete variable,} \\ \int_{-\infty}^{\infty} x f_X(x)\,dx & \text{for a continuous variable,} \end{cases}$$
provided the sum or integral converges absolutely. If not, the random variable $X$ is said to have no expected value. The expected value of a random vector is simply the vector of the expected values of the individual components.
The expected value can also be expressed directly for functions $Y = \psi(X)$ of a random variable or vector $X$. Recall that we consider only those functions $\psi$ for which $Y$ is again a random variable.
In the discrete case, compute
$$\mathrm{E}Y = \sum_j y_j P(Y = y_j) = \sum_i \psi(x_i)P(X = x_i),$$
provided the sum converges absolutely. Of course, it is not guaranteed that a function of a random variable which has an expected value also has one.
Similarly, the expected value of a function of a continuous random variable is
$$\mathrm{E}\,\psi(X) = \int_{-\infty}^{\infty} \psi(x)f_X(x)\,dx,$$
provided the integral converges absolutely.
Note that the random variable $Y = \psi(X)$ does not have to be continuous even if the original variable $X$ is. Nevertheless, if $\psi$ is a differentiable monotone function with non-zero derivative, it is an easy exercise to verify that the definition of $\mathrm{E}\,\psi(X)$ coincides with $\mathrm{E}Y$. We do not go into further details here.
Shortly, in the part devoted to statistics, it is shown that the expected value has a direct connection with the arithmetic mean of the corresponding vector of values.
10.2.28. St Petersburg paradox. We return to the example used as motivation for the need of discrete random variables in subsection 10.2.1. Reformulate the model as potential rules for a casino. This results in a good example of a situation where the expected value of the examined random variable does not exist at all.
The gambler pays an initial amount $C$ and then keeps tossing a coin until it comes up heads. Denoting the number of tosses he has made by $T$, he wins $2^T$. The problem is to determine a "reasonable value" for the initial amount $C$. If $X$
Altogether,
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma(X)\cdot\sigma(Y)} = \frac{1}{2}.$$
□
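The three-coin computation of 10.F.8 can be double-checked by enumerating all eight equally likely outcomes. This is a small sketch written for illustration; the function name `three_coin_moments` and its return format are assumptions of this example.

```python
import math
from fractions import Fraction
from itertools import product

def three_coin_moments():
    """Return cov(X, Y), var X, var Y for X = c1 + c2, Y = c2 + c3."""
    p = Fraction(1, 8)  # each of the 8 outcomes of three fair coins
    ex = ey = ex2 = ey2 = exy = Fraction(0)
    for c1, c2, c3 in product((0, 1), repeat=3):
        x, y = c1 + c2, c2 + c3
        ex += p * x
        ey += p * y
        ex2 += p * x * x
        ey2 += p * y * y
        exy += p * x * y
    return exy - ex * ey, ex2 - ex**2, ey2 - ey**2

cov, var_x, var_y = three_coin_moments()
rho = float(cov) / math.sqrt(float(var_x * var_y))
```

The enumeration gives covariance 1/4, both variances 1/2, and hence the correlation coefficient 1/2.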
10.F.9. Consider random variables $U$, $V$, defined by their joint probability mass function ($U$ takes on 1 or 2, $V$ takes on 1, 2 or 3):
U \ V	1	2	3
1	0.1	0.2	0.3
2	0.2	0.1	0.1
Find the marginal distributions of both variables, their expected values, variances, and correlation coefficient. O
10.F.10. Find the expected value and variance of the random variable $X^2$ provided $X$ has the uniform distribution on the interval $(-1, 1)$. O
10.F.11. We roll two dice. Let X denote how many times we got an even number, and Y denote how many times we got an odd number. Find their correlation coefficient. O
10.F.12. Consider random variables $U$, $V$, defined by their joint probability mass function ($U$ takes on 1 or 2, $V$ takes on 1, 2 or 3):
U \ V	1	2	3
1	0.1	0.1	0.4
2	0.2	0.1	0.1
Find the marginal distributions of both variables, their expected values, variances, and correlation coefficient. O
G. Transformations of random variables
Consider a continuous function of a random variable, $Y = \psi(X)$. If the transformation $\psi$ is increasing, then the resulting distribution function satisfies
$$F_Y(y) = P(Y \le y) = P(\psi(X) \le y) = P(X \le \psi^{-1}(y)) = F_X(\psi^{-1}(y)),$$
where $F_X$ is the distribution function of $X$ (analogously for decreasing $\psi$). Thus, the density of the transformed random variable $Y$ satisfies
$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X(\psi^{-1}(y))\,\frac{d\psi^{-1}(y)}{dy}.$$
Applying the rule for transformation of coordinates, we can compute the expected value of $Y$ as
$$\mathrm{E}Y = \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_{-\infty}^{\infty} \psi(x)f_X(x)\,dx,$$
and similarly for the variance of $Y$.
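As a numerical sanity check of the density formula and of the two ways of computing the expected value, take $X$ uniform on $[1, 2]$ and $Y = X^2$ (the side and the area of a square). The sketch below uses a simple midpoint rule; the helper `midpoint`, the number of subintervals, and the variable names are assumptions made for this illustration.

```python
from math import sqrt

def midpoint(f, a, b, n=20000):
    """Approximate the integral of f over [a, b] by the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def f_X(x):
    """Density of the uniform distribution on [1, 2]."""
    return 1.0 if 1 <= x <= 2 else 0.0

def f_Y(y):
    """Density of Y = X**2 on [1, 4]: f_X(sqrt(y)) times d(sqrt(y))/dy."""
    return f_X(sqrt(y)) / (2 * sqrt(y))

total_mass = midpoint(f_Y, 1, 4)                      # should be close to 1
ey_via_fy = midpoint(lambda y: y * f_Y(y), 1, 4)      # integral of y f_Y(y)
ey_via_fx = midpoint(lambda x: x * x * f_X(x), 1, 2)  # integral of psi(x) f_X(x)
```

Both `ey_via_fy` and `ey_via_fx` approximate the same value 7/3, and `total_mass` confirms that the transformed density integrates to one.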
is the random variable corresponding to the won amount, it seems that the correct answer is "anything below the expected value $\mathrm{E}X$".
As derived in 10.2.1, $P(T = k) = 2^{-k}$, provided that the coin is fair. Sum up all the probabilities multiplied by $2^k$, to obtain $\sum_{k=1}^{\infty} 1 = \infty$. Therefore, the expected value does not exist. So it seems that it is advantageous for the gambler to play even if the initial amount is very high...
Simulate the game for a while, to obtain that the amount won is somewhere around $2^4$. The reason is that no one is able to play infinitely long, hence the extremely high amounts are not feasible enough to be won, so such amounts cannot be taken seriously. In decision theory, these cases (when the expected value does not directly correspond to the evaluated utility) are called the St. Petersburg paradox, and much literature has been devoted to this topic.³
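A simulation of the game is straightforward. The sketch below is an illustration under stated assumptions: the function names, the number of games and the seed are arbitrary, and a fair coin is modelled by `random.Random`.

```python
import random

def truncated_expectation(max_tosses):
    """Sum of 2**k * P(T = k) over k = 1..max_tosses; every term equals 1."""
    return sum(2**k * 2.0 ** (-k) for k in range(1, max_tosses + 1))

def play_once(rng):
    """Toss until heads; return the win 2**T, where T is the number of tosses."""
    tosses = 1
    while rng.random() < 0.5:  # tails with probability 1/2, so toss again
        tosses += 1
    return 2**tosses

def average_win(games, seed=0):
    """Average win over the given number of simulated games."""
    rng = random.Random(seed)
    return sum(play_once(rng) for _ in range(games)) / games
```

The truncated sums grow linearly with the allowed number of tosses, which is the divergence described above, while `average_win` for a moderate number of games typically stays at a modest value dominated by a few long runs.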
10.2.29. Properties of the expected value. In the case of simple distributions, compute the expected value directly from the definition. For instance, for the Bernoulli distribution $A(p)$, it is immediate that
$$\mathrm{E}X = (1 - p)\cdot 0 + p\cdot 1 = p.$$
Similarly, compute the expected value $np$ of the binomial distribution $\mathrm{Bi}(n, p)$. This requires more thought. The result is a direct corollary of the following general theorem since $\mathrm{Bi}(n, p)$ is the sum of $n$ random variables with the Bernoulli distributions $A(p)$.
For any random variables $X$, $Y$ and real constants $a$, $b$, consider the expected values of the functions of random variables $X + Y$ and $a + bX$, provided the expected values $\mathrm{E}X$ and $\mathrm{E}Y$ exist.
It follows directly from the definition that the constant random variable $a$ has expected value $a$. Further,
$$\mathrm{E}(bX) = b\,\mathrm{E}X,$$
since the constant $b$ can be factored out from the sums or integrals.
More generally, the expected value of the product of independent random variables $X$ and $Y$ can be computed as follows. Suppose the components of the vector $(X, Y)$ are discrete and independent, with probability mass functions $f_X(x_i)$, $f_Y(y_j)$. Then,
$$\mathrm{E}(XY) = \sum_i\sum_j x_i y_j f_X(x_i)f_Y(y_j) = \Bigl(\sum_i x_i f_X(x_i)\Bigr)\Bigl(\sum_j y_j f_Y(y_j)\Bigr) = \mathrm{E}X\,\mathrm{E}Y.$$
Similarly, verify the equality $\mathrm{E}(XY) = \mathrm{E}X\,\mathrm{E}Y$ for independent continuous random variables.
³ Going back to Bernoulli, 1738, the real value is given by the utility, rather than the price.
10.G.1. Consider a random variable $X$ with density $f(x)$. Find the density of the random variable $Y$ defined by
i) $Y = e^X$, $X > 0$,
ii) $Y = \sqrt{X}$, $X > 0$,
iii) $Y = \ln X$, $X > 0$,
iv) Y = A,x > 0.
Solution. We can simply apply the formula for the density of a transformed random variable, which yields a) $f_Y(y) = f(\ln y)\frac{1}{y}$, b) $f_Y(y) = 2f(y^2)\,y$, c) $f_Y(y) = f(e^y)e^y$, d)
10.G.2. Consider a random variable $X$ which has the uniform distribution on the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. Find the density of $X$ as well as the densities of the transformed variables $Y = \sin X$, $Z = \tan X$.
Solution. Since the length of the interval where the density of $X$ is non-zero is $\pi$, the density of $X$ is $f_X(x) = \frac{1}{\pi}$ for $x \in (-\frac{\pi}{2}, \frac{\pi}{2})$ and zero elsewhere. Applying the formula for the density of a transformed random variable and the derivatives of elementary functions, we get
$$f_Y(y) = f_X(\arcsin y)\,\arcsin'(y) = \frac{1}{\pi\sqrt{1-y^2}}, \quad y \in (-1, 1),$$
and
$$f_Z(z) = f_X(\arctan z)\,\arctan'(z) = \frac{1}{\pi(1+z^2)}.$$
□
10.G.3. Consider a random variable $X$ whose density is $\cos x$ for $x \in (0, \frac{\pi}{2})$ and zero elsewhere. Find the density of the random variable $Y = X^2$ and calculate $\mathrm{E}Y$ and $\operatorname{var} Y$.
Solution. Applying the formula for the density of a transformed random variable, we get
$$f_Y(y) = f_X(\sqrt{y})\,(\sqrt{y})' = \frac{\cos\sqrt{y}}{2\sqrt{y}}.$$
It is simpler to compute the expected value and variance of $Y$ directly from the density of $X$. We have $\mathrm{E}Y = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx$. Thus,
$$\mathrm{E}Y = \int_0^{\pi/2} x^2\cos x\,dx = \bigl[x^2\sin x + 2x\cos x - 2\sin x\bigr]_0^{\pi/2} = \Bigl(\frac{\pi}{2}\Bigr)^2 - 2.$$
The integral was computed by parts. Applying this method again, we obtain
$$\mathrm{E}(Y^2) = \int_0^{\pi/2} x^4\cos x\,dx = \bigl[(x^4 - 12x^2 + 24)\sin x + 4(x^3 - 6x)\cos x\bigr]_0^{\pi/2}.$$
Now compute $\mathrm{E}(X + Y)$ for arbitrary random variables. For discrete distributions of $X$ and $Y$,
$$\mathrm{E}(X + Y) = \sum_i\sum_j (x_i + y_j)P(X = x_i, Y = y_j) = \sum_i x_i\sum_j P(X = x_i, Y = y_j) + \sum_j y_j\sum_i P(X = x_i, Y = y_j) = \sum_i x_i P(X = x_i) + \sum_j y_j P(Y = y_j),$$
where absolute convergence of the first double sum follows from the triangle inequality and the absolute convergence of the sums that stand for the expected values of the particular random variables. Absolute convergence is used in order to interchange the sums.
Dealing with continuous variables $X$ and $Y$, whose expected values exist, proceed analogously:
$$\mathrm{E}(X + Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x + y)f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} x\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy\,dx + \int_{-\infty}^{\infty} y\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} x f_X(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y)\,dy = \mathrm{E}X + \mathrm{E}Y,$$
where absolute convergence of the integrals of the expected values $\mathrm{E}X$ and $\mathrm{E}Y$ is used to interchange the integrals by Fubini's theorem.
Altogether, the expected formula
$$\mathrm{E}(X + Y) = \mathrm{E}X + \mathrm{E}Y$$
is obtained, whenever the expected values $\mathrm{E}X$ and $\mathrm{E}Y$ exist.
Straightforward application of this result leads to the following:
Affine nature of expected values
For any constants $a, b_1, \dots, b_k$ and random variables $X_1, \dots, X_k$,
$$\mathrm{E}(a + b_1X_1 + \dots + b_kX_k) = a + b_1\,\mathrm{E}X_1 + \dots + b_k\,\mathrm{E}X_k.$$
The following theorem extends this behaviour with respect to affine transformations of random vectors, and shows that the expected value is invariant with respect to affine transformations, as is the arithmetic mean:
Theorem. Let $X = (X_1, \dots, X_n)$ be a random vector with expected value $\mathrm{E}X$, $a \in \mathbb{R}^m$, and $B \in \mathrm{Mat}_{m,n}(\mathbb{R})$ a matrix. Then,
$$\mathrm{E}(a + B\cdot X) = a + B\cdot\mathrm{E}X.$$
Proof. There is almost nothing remaining to be proved. Since the expected value of a vector is defined as the vector of the expected values of the components, it suffices to restrict attention to a single item in $\mathrm{E}(a + B\cdot X)$. Thus, it can be assumed that $a$ is a scalar and $B$ is a matrix with a single row.
Hence, $\mathrm{E}(Y^2) = (\frac{\pi}{2})^4 - 12(\frac{\pi}{2})^2 + 24$, so $\operatorname{var} Y = \frac{\pi^4}{16} - 3\pi^2 + 24 - \frac{\pi^4 - 16\pi^2 + 64}{16} = 20 - 2\pi^2$.
□
10.G.4. Let $X$ be a random variable which takes on the values 0 and 1, each with probability $\frac{1}{2}$. Similarly, let $Y$ take on the values $-1$ and 1, each with probability $\frac{1}{2}$. Show that the random variables $X$ and $Z = XY$ are uncorrelated, yet dependent. Give an example of two continuous random variables with this property.
Solution. First of all, we compute the expected values of our random variables: $\mathrm{E}X = 0\cdot\frac{1}{2} + 1\cdot\frac{1}{2} = \frac{1}{2}$, $\mathrm{E}Z = \mathrm{E}(XY) = 0\cdot\frac{1}{2} + (-1)\cdot\frac{1}{4} + 1\cdot\frac{1}{4} = 0$. As for the expected value of their product, we have $\mathrm{E}(XZ) = \mathrm{E}(X^2Y) = 1\cdot\frac{1}{4} + (-1)\cdot\frac{1}{4} = 0$. By theorem 10.2.33, the covariance is equal to $\operatorname{cov}(X, Z) = 0 - \frac{1}{2}\cdot 0 = 0$. Thus, the variables $X$ and $Z$ are uncorrelated. At the same time, the conditional probability $P(Z = 1 \mid X = 0)$ is clearly zero, i.e., we also have $P(Z = 1, X = 0) = 0$, while $P(Z = 1) = \frac{1}{4}$ and $P(X = 0) = \frac{1}{2}$, so $P(Z = 1)\cdot P(X = 0) = \frac{1}{8} \ne 0$. We can see that $P(Z = 1)\cdot P(X = 0) \ne P(Z = 1, X = 0)$, which means that $X$ and $Z$ are dependent.
It can be easily verified from the corresponding definitions that if X is any random variable with zero expected value, finite second moment and zero third moment, then X and Y = X2 are dependent, but uncorrelated. □
H. Inequalities and limit theorems
Markov's inequality provides a rough estimate of the behavior of a non-negative random variable if we know nothing more than its expected value. In exact words, for any non-negative random variable $X$ and any $a > 0$, it holds that
$$P(X \ge a) \le \frac{\mathrm{E}X}{a}.$$
10.H.1. Consider a non-negative random variable $X$ with expected value $\mu$. With no further information about $X$, bound $P(X \ge 3\mu)$. Then, compute $P(X \ge 3\mu)$ if you know that $X \sim \mathrm{Ex}(\frac{1}{\mu})$.
Solution. If the non-negative random variable $X$ does not take zero with probability 1, then its expected value $\mu$ is positive. Therefore, the wanted probability can be bounded using Markov's inequality as
$$P(X \ge 3\mu) \le \frac{\mu}{3\mu} = \frac{1}{3}.$$
Then the expected value of a finite sum of random variables is obtained; by the above results, it exists and is given as the sum of the expected values of the individual items. This is exactly what was to be proved. □
10.2.30. Quantiles and critical values. Introduce numerical characteristics that are analogous to those from descriptive statistics. There, the next useful characteristics are the quantiles, cf. 10.1.5. Consider a random variable $X$ whose distribution function $F_X$ is strictly monotone. This is satisfied by any random variable whose density is nowhere equal to zero, which is the case for the normal distribution, for example. In this case, define the quantile function $F_X^{-1}$ simply as the inverse function $(F_X)^{-1}: (0,1) \to \mathbb{R}$. This means that the value $y = F_X^{-1}(\alpha)$ is such that $P(X \le y) = \alpha$. This corresponds precisely to the quantiles from descriptive statistics using relative frequencies for the probabilities.
Quantile function
For any random variable $X$ with distribution function $F_X(x)$, define its quantile function
$$F_X^{-1}(\alpha) = \inf\{x \in \mathbb{R};\ F_X(x) \ge \alpha\}, \quad \alpha \in (0, 1).$$
Clearly, this is a generalization of the previous definition in the case the distribution function is strictly monotone.
As seen in descriptive statistics, the most used quantiles are for $\alpha = 0.5$ (the median), $\alpha = 0.25$ (the first quartile), $\alpha = 0.75$ (the third quartile). Similarly for deciles and percentiles when $\alpha$ is equal to (integer) multiples of tenths and hundredths, respectively.
It follows directly from the definition that the quantile function for a given random variable $X$ allows the determination of intervals into which the values of $X$ fall with a chosen probability. For instance, the value $F^{-1}(0.975)$, approximately 1.96, corresponds to percentile 97.5 for the normal distribution $N(0, 1)$. This says that with the probability of 2.5 %, the value of such a random variable $Z \sim N(0, 1)$ is at least 1.96. Since the density of the variable $Z$ is symmetric with respect to the origin, this observation can be interpreted as that there is only a 5 % probability that the value of $|Z|$ is greater than 1.96.
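The quoted quantile of the standard normal distribution can be reproduced with the Python standard library. This is only an illustrative sketch; `statistics.NormalDist` (available since Python 3.8) is used as one possible tool, and the variable names are chosen here.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)

# Quantile function: the value y with P(Z <= y) = 0.975.
q975 = Z.inv_cdf(0.975)

# Probability that Z is at least 1.96, and that |Z| exceeds 1.96.
upper_tail = 1 - Z.cdf(1.96)
two_sided = 2 * upper_tail
```

Here `q975` is close to 1.96, `upper_tail` is close to 0.025 and `two_sided` is close to 0.05, in line with the interpretation given above.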
There are similar intervals and values when discussing the reliability of estimates of characteristics of random variables.
Critical values
For a random variable $X$ and a real number $0 < \alpha < 1$, define its critical value $x(\alpha)$ at level $\alpha$ by
$$P(X \ge x(\alpha)) = \alpha.$$
This means that $x(\alpha) = F_X^{-1}(1 - \alpha)$, where $F_X^{-1}$ is the quantile function of the random variable $X$.
If we know that $X \sim \mathrm{Ex}(\frac{1}{\mu})$, then
$$P(X \ge 3\mu) = 1 - P(X < 3\mu) = 1 - F(3\mu),$$
where $F$ is the distribution function of the exponential distribution. By definition, this is
$$F(x) = \int_0^x \frac{1}{\mu}\,e^{-\frac{t}{\mu}}\,dt = \Bigl[-e^{-\frac{t}{\mu}}\Bigr]_0^x = 1 - e^{-\frac{x}{\mu}}.$$
Hence, $P(X \ge 3\mu) = e^{-3}$.
□
10.H.2. At a particular place, the average speed of wind is 20 kilometers per hour.
• Regardless of the distribution of the speed as a random variable, bound the probability that in a given observation, the speed does not exceed 60 km/h.
• Find the interval in which the speed lies with probability at least 0.9 if you know that the standard deviation is $\sigma = 1$ km/h.
Solution. Let $X$ denote the random variable that corresponds to the speed. In the first case, we can only use Markov's inequality, leading to
$$P(X \le 60) = 1 - P(X > 60) \ge 1 - \frac{20}{60} = \frac{2}{3}.$$
In the second case, we know the variance (or standard deviation) of the speed, so we can use Chebyshev's inequality (see 10.2.32):
$$P(|X - 20| < x) = 1 - P(|X - 20| \ge x) \ge 1 - \frac{1}{x^2},$$
and it suffices to require $1 - \frac{1}{x^2} \ge 0.9$. Hence, $x \ge \sqrt{10} \approx 3.2$. Thus, the wanted interval is (16.8 km/h, 23.2 km/h). □
10.H.3. Each yogurt of an undisclosed company contains a photo of one of 26 ice-hockey world champions. Suppose the players are distributed uniformly at random. How many yogurts must Vera buy if she wants the probability of getting at least 5 photos of Jaromir Jagr to be at least 0.95?
Solution. Let $X$ denote the random variable that corresponds to the number of obtained photos of Jaromir Jagr (parametrized by the number $n$ of yogurts bought). Clearly, $X \sim \mathrm{Bi}(n, \frac{1}{26})$. We are looking for the value of $n$ for which $P(X \ge 5) = 0.95$, i.e., $F_X(4) = P(X \le 4) = 0.05$. In order to find it, we use the de Moivre-Laplace theorem and approximate the binomial distribution with the normal distribution (we assume that $n$ is large, so the approximation error will be small). The expected value of $X$ is $\mathrm{E}X = \frac{n}{26}$ and its variance is $\operatorname{var} X = \frac{25n}{676}$. Denoting the corresponding
10.2.31. Variance and standard deviation. The simple numerical characteristics concerning the variability of sample values in descriptive statistics were the variance and the standard deviation. Define them similarly for random variables.
Variance of a random variable
Given a random variable $X$ with finite expected value, its variance is defined as
$$\operatorname{var} X = \mathrm{E}\bigl((X - \mathrm{E}X)^2\bigr),$$
provided the right-hand expected value exists. Otherwise, the variance of $X$ does not exist.
The square root $\sqrt{\operatorname{var} X}$ of the variance is called the standard deviation of the random variable $X$.
Using the properties of the expected value, a simpler formula can be derived for the variance of a random variable $X$ whose expected value exists:
$$\operatorname{var} X = \mathrm{E}(X - \mathrm{E}X)^2 = \mathrm{E}\bigl(X^2 - 2X(\mathrm{E}X) + (\mathrm{E}X)^2\bigr) = \mathrm{E}X^2 - 2(\mathrm{E}X)^2 + (\mathrm{E}X)^2 = \mathrm{E}X^2 - (\mathrm{E}X)^2.$$
Consider how affine transformations change the variance of a random variable. Given real numbers $a$, $b$ and a random variable $X$ with expected value and variance, consider the random variable $Y = a + bX$. Compute
$$\operatorname{var} Y = \mathrm{E}\bigl((a + bX) - \mathrm{E}(a + bX)\bigr)^2 = \mathrm{E}\bigl(b(X - \mathrm{E}X)\bigr)^2 = b^2\operatorname{var} X.$$
Thus the following useful formulae are derived:
Properties of variance
(1)  $\operatorname{var} X = \mathrm{E}(X^2) - (\mathrm{E}X)^2$
(2)  $\operatorname{var}(a + bX) = b^2\operatorname{var} X$
(3)  $\sqrt{\operatorname{var}(a + bX)} = |b|\sqrt{\operatorname{var} X}$
Given a random variable $X$ with expected value and non-zero variance, define its standardization as the random variable
$$Z = \frac{X - \mathrm{E}X}{\sqrt{\operatorname{var} X}}.$$
Thus, the standardized variable is the affine transformation of the original variable whose expected value equals zero and variance equals one.
10.2.32. Chebyshev's inequality. A good illustration of the usefulness of variance is Chebyshev's inequality. This connects the variance directly to the probability that the random variable assumes values that are distant from its expected value.
standardized variable by $Z$, we can reformulate the condition as
$$0.05 = P(X \le 4) = P\left(Z \le \frac{4 - \frac{n}{26}}{\frac{5}{26}\sqrt{n}}\right) = F_Z\left(\frac{104 - n}{5\sqrt{n}}\right),$$
where, by the approximation assumption, $F_Z \approx \Phi$ is the distribution function of the normal distribution $N(0, 1)$. Since we must have $n > 104$, using $\Phi(-x) = 1 - \Phi(x)$, the above equation gives $n - 104 = \Phi^{-1}(0.95)\cdot 5\sqrt{n}$. Using a table of the normal distribution or appropriate software, we can learn that $\Phi^{-1}(0.95) \approx 1.65$. Solving this quadratic equation (in $\sqrt{n}$), we get $n \approx 228.8$. Thus, Vera must buy at least 229 yogurts. □
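The final quadratic equation of 10.H.3 can be solved numerically as follows. This is a hedged sketch: the function name `yogurts_needed` is hypothetical, and it only solves $n - 104 = 5z\sqrt{n}$ for a given quantile $z$ under the normal approximation used in the solution; it does not evaluate the exact binomial probability.

```python
from math import ceil, sqrt
from statistics import NormalDist

def yogurts_needed(z):
    """Solve n - 104 = 5*z*sqrt(n) for n, given the normal quantile z."""
    # With s = sqrt(n) the equation reads s**2 - 5*z*s - 104 = 0;
    # only the positive root is meaningful.
    s = (5 * z + sqrt(25 * z * z + 416)) / 2
    return s * s

n_table = yogurts_needed(1.65)                           # rounded table value
n_precise = yogurts_needed(NormalDist().inv_cdf(0.95))   # quantile from software
```

Both `n_table` (about 228.8) and `n_precise` (slightly smaller) round up to 229.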
10.H.4. We roll a die 1200 times. Find the probability that the number of 6s lies between 150 and 250 (inclusive) using Chebyshev's inequality, and then using the de Moivre-Laplace theorem.
Solution. Let $X$ denote the random variable which corresponds to the number of 6s. Clearly, $X \sim \mathrm{Bi}(1200, \frac{1}{6})$.
Chebyshev's inequality
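Theorem. Consider a random variable $X$ with finite variance, and fix an arbitrary $\varepsilon > 0$. Then,
$$P(|X - \mathrm{E}X| \ge \varepsilon) \le \frac{\operatorname{var} X}{\varepsilon^2}.$$
Proof. Suppose $X$ is continuous. Set $\mu = \mathrm{E}X$ and compute, using the definition:
$$\operatorname{var} X = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{|x-\mu|\ge\varepsilon} (x - \mu)^2 f(x)\,dx + \int_{|x-\mu|<\varepsilon} (x - \mu)^2 f(x)\,dx \ge \int_{|x-\mu|\ge\varepsilon} \varepsilon^2 f(x)\,dx = \varepsilon^2 P(|X - \mu| \ge \varepsilon).$$
□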
By F, we have $EX = \frac{1200}{6} = 200$ and $\operatorname{var} X = 200\,(1 - \frac16) = \frac{500}{3}$. The condition on the number of 6s says that $150 \le X \le 250$, which can be written as $|X - 200| \le 50$. Using Chebyshev's inequality 10.2.32, we get
$$P(|X - 200| \le 50) = 1 - P(|X - 200| \ge 51) \ge 1 - \frac{500}{3 \cdot 51^2} \approx 0.94.$$
(2) The exact value of the wanted probability is given by the expression
$$P(150 \le X \le 250) = F_X(250) - F_X(150),$$
where $F_X$ is the distribution function of the binomial distribution. By definition,
$$P(150 \le X \le 250) = \sum_{k=150}^{250} \binom{1200}{k} \Big(\frac16\Big)^k \Big(\frac56\Big)^{1200-k}.$$
This expression is hard to evaluate without a computer, so we use the de Moivre-Laplace theorem. Replace $X$ with the standardized random variable
$$Z = \frac{\sqrt{3}\,(X - 200)}{10\sqrt{5}}.$$
The analogous proof for discrete random variables is left as an exercise for the reader.
Since the variance is the square of the standard deviation $\sigma$, the choice $\varepsilon = k\sigma$ yields
$$P(|X - EX| \ge k\sigma) \le \frac{1}{k^2}.$$
Chebyshev's inequality helps in understanding asymptotic descriptions of limit processes. For instance, consider the sequence of random variables $X_1, X_2, \dots$ with probability distributions $X_n \sim \mathrm{Bi}(n,p)$, with a fixed value of $p$, $0 < p < 1$. Intuitively, it is expected that the relative frequency of success should approach the probability $p$ as $n$ increases, i.e., that the values of the random variables $Y_n = \frac1n X_n$ should approach $p$. Clearly,
$$E Y_n = \frac{np}{n} = p, \qquad \operatorname{var} Y_n = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}.$$
Direct application of Chebyshev's inequality yields, for any fixed $\varepsilon > 0$, that
$$P(|Y_n - p| \ge \varepsilon) \le \frac{p(1-p)}{n\varepsilon^2}.$$
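By 10.2.40, the standardized variable $Z = \sqrt{3}\,(X - 200)/(10\sqrt{5})$ is approximately $N(0,1)$-distributed, i.e., $F_Z \approx \Phi$. Thus,
$$P(150 \le X \le 250) = P\Big(\frac{\sqrt3\,(150 - 200)}{10\sqrt5} \le Z \le \frac{\sqrt3\,(250 - 200)}{10\sqrt5}\Big) \approx \Phi(\sqrt{15}) - \Phi(-\sqrt{15}) = 2\Phi(\sqrt{15}) - 1.$$
We learn that $\Phi(\sqrt{15}) \approx 0.99994$, so the wanted probability is approximately 99.988 %. □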
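The three quantities in exercise 10.H.4 can be compared numerically. The following Python sketch is only an illustration and not part of the original solution; the names `normal_cdf`, `chebyshev_bound`, `normal_approx` and `exact` are chosen here for convenience. It evaluates the Chebyshev bound, the de Moivre-Laplace approximation, and the exact binomial sum with integer arithmetic.

```python
from math import comb, erf, sqrt

def normal_cdf(x: float) -> float:
    """Distribution function Phi of N(0, 1), expressed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n = 1200                       # number of rolls, X ~ Bi(1200, 1/6)
mean = n / 6                   # E X = 200
var = mean * (1 - 1 / 6)       # var X = 500/3

# Chebyshev: P(|X - 200| <= 50) = 1 - P(|X - 200| >= 51) >= 1 - var / 51^2
chebyshev_bound = 1 - var / 51**2

# de Moivre-Laplace: 2 * Phi(50 / sqrt(var)) - 1 = 2 * Phi(sqrt(15)) - 1
normal_approx = 2 * normal_cdf(50 / sqrt(var)) - 1

# Exact value: sum_{k=150}^{250} C(1200, k) * 5^(1200-k) / 6^1200
numerator = sum(comb(n, k) * 5 ** (n - k) for k in range(150, 251))
exact = numerator / 6**n

print(chebyshev_bound, normal_approx, exact)
```

The Chebyshev bound is far from sharp here, which is typical: it uses only the variance, whereas the normal approximation uses the shape of the distribution.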
Hence it is clear that, for any fixed $\varepsilon > 0$,
$$\lim_{n\to\infty} P\Big(\Big|\frac{X_n}{n} - p\Big| \ge \varepsilon\Big) = 0.$$
This result is known as Bernoulli's theorem (one of many).
This type of limit behaviour is called convergence in probability. Thus it is proved (as a corollary of Chebyshev's inequality) that the random variables Yn converge in probability to the constant random variable p.
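Convergence in probability can be observed numerically. The sketch below is an illustration under assumed values $p = 0.3$ and $\varepsilon = 0.05$ (both chosen here, not taken from the text); the helper names are hypothetical. It computes the exact probability $P(|X_n/n - p| \ge \varepsilon)$ for $X_n \sim \mathrm{Bi}(n,p)$ and compares it with the Chebyshev bound $p(1-p)/(n\varepsilon^2)$.

```python
from fractions import Fraction
from math import comb

def tail_probability(n: int, p: Fraction, eps: Fraction) -> float:
    """Exact P(|X_n/n - p| >= eps) for X_n ~ Bi(n, p).

    The condition on k is tested with exact fractions; the probabilities
    themselves are summed in floating point.
    """
    q = float(p)
    return sum(
        comb(n, k) * q**k * (1 - q) ** (n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) >= eps
    )

def chebyshev_bound(n: int, p: Fraction, eps: Fraction) -> float:
    """Upper bound p(1-p)/(n eps^2) from Chebyshev's inequality."""
    return float(p * (1 - p) / (n * eps**2))

p, eps = Fraction(3, 10), Fraction(1, 20)   # illustrative values only
results = [(n, tail_probability(n, p, eps), chebyshev_bound(n, p, eps))
           for n in (50, 200, 800)]
for n, tail, bound in results:
    print(n, tail, bound)
```

Both columns decrease as $n$ grows; the bound may exceed 1 for small $n$, in which case it carries no information.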
10.H.5. At the Faculty of Informatics, 10 % of students have a grade average below 1.2 (let us call them successful). How many students must we meet if the probability that there are 8-12 % successful ones among them is to be at least 0.95? Solve this problem using Chebyshev's inequality, and then using the de Moivre-Laplace theorem.
Solution. Let $X$ denote the random variable that corresponds to the number of successful students among the $n$ students we meet. Since a randomly met student has probability 10 % of being successful, we have $X \sim \mathrm{Bi}(n, \frac{1}{10})$. By F, we have $EX = 0.1n$ and $\operatorname{var} X = 0.09n$. By Chebyshev's inequality 10.2.32, the wanted probability satisfies
$$P(|X - 0.1n| \le 0.02n) = 1 - P(|X - 0.1n| > 0.02n) \ge 1 - \frac{0.1 \cdot 0.9\,n}{(0.02n)^2} = 1 - \frac{225}{n}.$$
The inequality $1 - \frac{225}{n} \ge 0.95$, and hence
$$P(|X - 0.1n| \le 0.02n) \ge 0.95,$$
holds for $n \ge 4500$. The exact value of the probability is given in terms of the distribution function $F_X$ of the binomial distribution:
$$P(0.08n \le X \le 0.12n) = F_X(0.12n) - F_X(0.08n).$$
Using the de Moivre-Laplace theorem (see 10.2.40), we can approximate the distribution of the standardized random variable $Z = \frac{X - 0.1n}{0.3\sqrt{n}}$ with the standard normal distribution, $F_Z \approx \Phi$, so
$$0.95 = P(0.08n \le X \le 0.12n) = P\Big(-\frac{\sqrt n}{15} \le Z \le \frac{\sqrt n}{15}\Big) \approx \Phi\Big(\frac{\sqrt n}{15}\Big) - \Phi\Big(-\frac{\sqrt n}{15}\Big) = 2\Phi\Big(\frac{\sqrt n}{15}\Big) - 1.$$
Hence $\sqrt{n} = 15\,z(0.975)$, and we learn $n \approx 864.4$. Thus, it is sufficient to meet 865 students. □
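Both sample sizes from exercise 10.H.5 can be recomputed in a few lines. This is a minimal sketch, assuming Python 3.8 or newer for `statistics.NormalDist`; the variable names are chosen here and do not come from the text.

```python
from fractions import Fraction
from math import ceil
from statistics import NormalDist

p = Fraction(1, 10)          # probability that a student is successful
rel_eps = Fraction(2, 100)   # allowed deviation of the relative frequency
level = Fraction(95, 100)    # required probability

# Chebyshev: 1 - p(1-p) n / (rel_eps n)^2 >= level
#   <=>  n >= p(1-p) / (rel_eps^2 (1 - level))
n_chebyshev = ceil(p * (1 - p) / (rel_eps**2 * (1 - level)))

# de Moivre-Laplace: 2 Phi(rel_eps sqrt(n) / sqrt(p(1-p))) - 1 >= level
#   <=>  n >= z((1 + level)/2)^2 p(1-p) / rel_eps^2
z = NormalDist().inv_cdf(float(1 + level) / 2)      # the quantile z(0.975)
n_normal = ceil(z**2 * float(p * (1 - p)) / float(rel_eps) ** 2)

print(n_chebyshev, n_normal)
```

The Chebyshev estimate is computed with exact fractions so that the boundary value is not distorted by floating-point rounding.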
10.H.6. The probability that a planted tree will grow is 0.8. What is the probability that out of 500 planted trees, at least 380 trees will grow?
Solution. The random variable $X$ that corresponds to the number of trees that will grow has binomial distribution $X \sim \mathrm{Bi}(500, \frac45)$. By F, we have $EX = 400$ and $\operatorname{var} X = 80$.
10.2.33. Covariance. We return to random vectors. In the case of the expected value, the situation is very simple: just take the vector of expected values. When characterizing the variability, the dependencies between the individual components are also of much interest. We follow the idea from 10.1.9 again.
Covariance
Given random variables $X$, $Y$ whose variances exist, define their covariance as
$$\operatorname{cov}(X, Y) = E\big((X - EX)(Y - EY)\big).$$
The basic properties of the concept can be derived very easily:
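Theorem. For any random variables $X$, $Y$, $Z$ whose variances exist and real numbers $a, b, c, d$,
(1) $\operatorname{cov}(X, Y) = \operatorname{cov}(Y, X)$
(2) $\operatorname{cov}(X, Y) = E(XY) - (EX)(EY)$
(3) $\operatorname{cov}(X + Y, Z) = \operatorname{cov}(X, Z) + \operatorname{cov}(Y, Z)$
(4) $\operatorname{cov}(a + bX, c + dY) = bd \operatorname{cov}(X, Y)$
(5) $\operatorname{var}(X + Y) = \operatorname{var} X + \operatorname{var} Y + 2 \operatorname{cov}(X, Y)$.
Moreover, if $X$ and $Y$ are independent, then $\operatorname{cov}(X, Y) = 0$, and consequently
(6) $\operatorname{var}(X + Y) = \operatorname{var} X + \operatorname{var} Y$.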
The standardized random variable is $Z = \frac{X - 400}{\sqrt{80}}$. By the de
Proof. Directly from the definition, the covariance is symmetric in the arguments. The second proposition follows immediately from the properties of the expected value:
$$\operatorname{cov}(X, Y) = E\big((X - EX)(Y - EY)\big) = E(XY) - (EY)(EX) - (EX)(EY) + EX\,EY = E(XY) - EX\,EY.$$
The next proposition also follows easily if the definition is expanded and the fact that the expected value of the sum of random variables equals the sum of their expected values is used.
The next proposition can be computed directly:
cov(a + bX, c + dY) =
= E((a + bX - E(a + bX))(c+ dY - E(c + dY)))
= E((bX - bE(X))(dY - dE(Y)))
= E(bd(X-E(X))(Y-E(Y)))
= bdE((X-EX)(Y-EY)) = bdcov(X,Y).
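Moivre-Laplace theorem, we have $F_Z \approx \Phi$, so
$$P(X \ge 380) = P\Big(Z \ge \frac{380 - 400}{\sqrt{80}}\Big) \approx 1 - \Phi\Big(-\frac{20}{\sqrt{80}}\Big) = \Phi(\sqrt5) \approx 0.987.$$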
□
10.H.7. Using the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 1600 tosses of a coin is at least 82.
Solution. Let $X$ denote the random variable that corresponds to the number of times the coin came up heads. Then $X$ has binomial distribution $\mathrm{Bi}(1600, 1/2)$ (with expected value 800 and standard deviation 20), so for a large value of $n = 1600$, by the de Moivre-Laplace theorem, the distribution function of the variable $\frac{X - 800}{20}$ can be approximated with the distribution function $\Phi$ of the standard normal distribution. Thus, the wanted probability is
$$1 - P[759 < X < 841] = 1 - P\Big[-2.05 < \frac{X - 800}{20} < 2.05\Big] \approx 2\Phi(-2.05) = 0.0404.$$
□
10.H.8. Using the distribution function of the standard normal distribution, find the probability that the absolute difference between the heads and the tails in 3600 tosses of a coin is at most 66.
Solution. Let $X$ denote the random variable that corresponds to the number of times the coin came up heads. Then $X$ has binomial distribution $\mathrm{Bi}(3600, 1/2)$ (with expected value 1800 and standard deviation 30), so for a large value of $n = 3600$,
The other propositions about the variance are quite simple corollaries:
$$\operatorname{var}(X + Y) = E\big((X + Y) - E(X + Y)\big)^2 = E\big((X - EX) + (Y - EY)\big)^2 = E(X - EX)^2 + 2E\big((X - EX)(Y - EY)\big) + E(Y - EY)^2 = \operatorname{var} X + 2\operatorname{cov}(X, Y) + \operatorname{var} Y.$$
Furthermore, if $X$ and $Y$ are independent, then $E(XY) = EX\,EY$, and hence their covariance is zero. □
Directly from the definition,
var(X) = cov(X,X).
The latter theorem claims that covariance is a symmetric bilinear form on the real vector space of random variables whose variance exists. The variance is the corresponding quadratic form. The covariance can be computed from the variance of the particular random variables and of their sum, as seen in linear algebra, see the property (5).
Notice that the sum of $n$ independent and identically distributed random variables $Y_1, \dots, Y_n$, each distributed as $Y$, behaves very differently from the multiple $nY$. In fact,
$$\operatorname{var}(Y_1 + \dots + Y_n) = n \operatorname{var} Y, \qquad \operatorname{var}(nY) = n^2 \operatorname{var} Y.$$
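The covariance identities above can be checked exactly on a small example. The joint probability function below is hypothetical, chosen only for illustration, and the helper names `expect` and `cov` are not from the text. Exact fractions avoid any rounding.

```python
from fractions import Fraction

# Hypothetical joint probability function of (X, Y), for illustration only.
joint = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}

def expect(g):
    """E g(X, Y) with respect to the joint distribution above."""
    return sum(prob * g(x, y) for (x, y), prob in joint.items())

def cov(g, h):
    """cov(g(X,Y), h(X,Y)) computed as E(gh) - E(g) E(h), i.e. property (2)."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

X = lambda x, y: x
Y = lambda x, y: y
S = lambda x, y: x + y

var_X, var_Y, var_S = cov(X, X), cov(Y, Y), cov(S, S)

# Property (5): var(X + Y) = var X + var Y + 2 cov(X, Y)
print(var_S == var_X + var_Y + 2 * cov(X, Y))
# Property (4) with a = 1, b = 2, c = 5, d = -3
print(cov(lambda x, y: 1 + 2 * x, lambda x, y: 5 - 3 * y) == -6 * cov(X, Y))
```

Here $X$ and $Y$ are dependent, so $\operatorname{cov}(X,Y) \ne 0$ and formula (6) does not apply.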
10.2.34. Correlation of random variables. To a certain extent, covariance corresponds to dependency between the random variables. Its relative version is called the correlation of random variables and, similarly as for the standard deviation, the following concept is defined:
Correlation coefficient
The correlation coefficient of random variables $X$ and $Y$ whose variances are finite and non-zero is defined as
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var} X}\,\sqrt{\operatorname{var} Y}}.$$
As seen from theorem 10.2.33, the correlation coefficient of random variables equals the covariance of the standardized variables $\frac{1}{\sqrt{\operatorname{var} X}}(X - EX)$ and $\frac{1}{\sqrt{\operatorname{var} Y}}(Y - EY)$.
the distribution function of the variable $\frac{X - 1800}{30}$ can be approximated, by the de Moivre-Laplace theorem, with the distribution function $\Phi$ of the standard normal distribution. Thus, the wanted probability is
$$P[1767 \le X \le 1833] = P\Big[-1.1 \le \frac{X - 1800}{30} \le 1.1\Big] \approx \Phi(1.1) - \Phi(-1.1) = 0.7287.$$
□
The following equalities hold (here, $a, b, c, d$ are real constants, $bd \ne 0$, and $X$, $Y$ are random variables with finite non-zero variances):
$$\rho_{a+bX,\,c+dY} = \operatorname{sgn}(bd)\,\rho_{X,Y}, \qquad \rho_{X,X} = 1.$$
Moreover, if $X$ and $Y$ are independent, then $\rho_{X,Y} = 0$.
Note that if the variance of a random variable $X$ is zero, then it assumes the value $EX$ with probability 1. Indeed, if the value of $X$ falls into an interval $I$ not containing $EX$ with probability $p \ne 0$, then the expression $\operatorname{var} X = E(X - EX)^2$ is positive. Stochastically, random variables with zero variance behave as constants.
10.H.9. The probability that a seed will grow is 0.9. How many seeds must we plant if we require that, with probability at least 0.995, the relative number of grown items differs from 0.9 by at most 0.034?
Solution. The random variable $X$ that corresponds to the number of grown seeds, out of $n$ planted ones, has binomial distribution $X \sim \mathrm{Bi}(n, \frac{9}{10})$. By F, we have $EX = 0.9n$ and $\operatorname{var} X = 0.09n$, so the standardized variable is $Z = \frac{X - 0.9n}{\sqrt{0.09n}}$.
The condition in question can be written as
$$P(|X - 0.9n| \le 0.034n) = P\Big(|Z| \le \frac{0.034n}{\sqrt{0.09n}}\Big) = P\Big(|Z| \le \frac{0.34}{3}\sqrt n\Big) \ge 0.995.$$
By the de Moivre-Laplace theorem, for large $n$, the distribution function of $Z$ can be approximated by the distribution function $\Phi$ of the normal distribution. Thus,
$$P\Big(|Z| \le \frac{0.34}{3}\sqrt n\Big) \approx \Phi\Big(\frac{0.34}{3}\sqrt n\Big) - \Phi\Big(-\frac{0.34}{3}\sqrt n\Big) = 2\Phi\Big(\frac{0.34}{3}\sqrt n\Big) - 1.$$
Altogether, we get the condition
$$2\Phi\Big(\frac{0.34}{3}\sqrt n\Big) - 1 \ge 0.995.$$
From this, we compute
$$n \ge \Big(\frac{3\,z(0.9975)}{0.34}\Big)^2 \approx 615.$$
□
10.H.10. The service life (in hours) of a certain kind of gadget has exponential distribution with parameter $\lambda = \frac{1}{10}$. Using the central limit theorem, bound the probability that the total service life of 100 such gadgets lies between 900 and 1050 hours.
Solution. In exercise 10.F.5, we computed that the expected value and variance of a random variable $X_i$ with exponential distribution are equal to $E X_i = \frac{1}{\lambda}$ and $\operatorname{var} X_i = \frac{1}{\lambda^2}$, respectively. Thus, the expected service life of each gadget is $E X_i = \mu = 10$ hours, with variance $\operatorname{var} X_i = \sigma^2 = 100$ hours$^2$. By the central limit theorem, the distribution
If the covariance were a positive-definite symmetric bilinear form, then it would follow from the Cauchy-Schwarz inequality (see 3.4.3) that
(1) $|\rho_{X,Y}| \le 1$.
The following theorem claims more. It shows that full correlation or anti-correlation, i.e. $\rho_{X,Y} = \pm 1$, of random variables $X$ and $Y$ says that they are bound by an affine relation $Y = kX + c$, where the sign of $k$ corresponds to the sign in $\rho_{X,Y} = \pm 1$. On the other hand, a zero correlation coefficient says that the (potential) dependency between the variables is very far from any affine relation of the mentioned type. (Note, however, that this does not mean that the variables must be independent.)
For instance, consider random variables $Z \sim N(0,1)$ and $Z^2$. Then $\operatorname{cov}(Z, Z^2) = E Z^3 - EZ\,EZ^2 = E Z^3 = 0$, since the density of $Z$ is an even function and thus the expected value of an odd power of $Z$ is zero, if it exists.
Theorem. If the correlation coefficient is defined, then $|\rho_{X,Y}| \le 1$. Equality holds if and only if there are constants $k$, $c$ such that $P(Y = kX + c) = 1$.
Proof. A stochastic affine relation between $Y$ and $X$ with nonzero coefficient at $Y$ is sought. This is equivalent to $Y + sX \sim D(c)$ for some fixed value of the parameter $s$ and constant $c$. In such a case the variance vanishes. Thus one considers the following non-negative quadratic expression:
$$0 \le \operatorname{var}\Big(\frac{Y - EY}{\sqrt{\operatorname{var} Y}} + t\,\frac{X - EX}{\sqrt{\operatorname{var} X}}\Big) = 1 + 2t\rho_{X,Y} + t^2.$$
of the transformed random variable $\frac{1}{\sqrt n}\sum_{i=1}^n \frac{X_i - \mu}{\sigma} = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{X_i - 10}{10}$ approaches the standard normal distribution as $n$ tends to infinity. Thus, the wanted probability for the service life of 100 gadgets,
$$P\Big(900 \le \sum_{i=1}^{100} X_i \le 1050\Big) = P\Big(-1 \le \frac{1}{10}\sum_{i=1}^{100} \frac{X_i - 10}{10} \le 0.5\Big),$$
can be approximated with the distribution function of the normal distribution:
The right-hand quadratic expression in $t$ is non-negative, so it does not have two distinct real roots; hence its discriminant cannot be positive. So $4(\rho_{X,Y})^2 - 4 \le 0$. Hence the desired inequality is obtained, and the discriminant vanishes if and only if $\rho_{X,Y} = \pm 1$. For the only (double) root $t_0$, the corresponding random variable has zero variance; thus it assumes a fixed value with probability 1. This yields the affine relation as expected. □
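Both extreme cases of the theorem can be seen on a tiny discrete example. In the sketch below, $X$ is assumed to be uniform on $\{-1, 0, 1\}$ (a distribution chosen here for illustration, replacing the normal variable $Z$ of the text by a discrete analogue); the helper names are hypothetical. An affine image of $X$ has correlation $-1$ with $X$, while $X^2$ is uncorrelated with $X$ although it is a function of $X$.

```python
from fractions import Fraction
from math import sqrt

values = [-1, 0, 1]          # X is uniform on these values (illustrative choice)
prob = Fraction(1, 3)

def expect(g):
    """E g(X) for the uniform distribution above."""
    return sum(prob * g(x) for x in values)

def corr(g, h):
    """Correlation coefficient of g(X) and h(X)."""
    cov = expect(lambda x: g(x) * h(x)) - expect(g) * expect(h)
    var_g = expect(lambda x: g(x) ** 2) - expect(g) ** 2
    var_h = expect(lambda x: h(x) ** 2) - expect(h) ** 2
    return float(cov) / sqrt(float(var_g) * float(var_h))

identity = lambda x: x
rho_affine = corr(identity, lambda x: 5 - 2 * x)   # Y = -2X + 5, negative slope
rho_square = corr(identity, lambda x: x * x)       # Y = X^2, dependent on X

print(rho_affine, rho_square)
```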
10.2.35. Covariance matrix. The variability of a random vector must be considered. This suggests considering the covariances of all pairs of components. The following definition and theorem show that this leads to an analogy of the variance for vectors, including the behaviour of the variance under affine transformations of the random variables.
$$P\Big(900 \le \sum_{i=1}^{100} X_i \le 1050\Big) \approx \Phi(0.5) - \Phi(-1) \approx 0.533.$$
covariance matrix
□
10.H.11. We keep putting items into a chest. The expected mass of an item is 3 kg and the standard deviation is 0.8 kg. What is the maximum number of items that we can put into the chest so that with probability at least 99%, the total mass does not exceed one ton?
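Solution. Let $X_i$ denote the random variable that corresponds to the mass of the $i$-th item. Then, we have $\mu = E X_i = 3$ and $\sigma = \sqrt{\operatorname{var} X_i} = 0.8$ (in kilograms), and we want to have
$$P\Big(\sum_{i=1}^n X_i \le 1000\Big) = 0.99.$$
By the central limit theorem 10.2.40, the distribution of the random variable
$$S_n = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{X_i - 3}{0.8} = \frac{1}{0.8\sqrt n}\sum_{i=1}^n X_i - \frac{3\sqrt n}{0.8}$$
can be approximated by the standard normal distribution. Thus, we get
$$P\Big(\sum_{i=1}^n X_i \le 1000\Big) = P\Big(S_n \le \frac{1000}{0.8\sqrt n} - \frac{3\sqrt n}{0.8}\Big) \approx \Phi\Big(\frac{1000}{0.8\sqrt n} - \frac{3\sqrt n}{0.8}\Big).$$
We learn that $z(0.99) \approx 2.326$, so the wanted $n$ satisfies the equation
$$\frac{1000}{0.8\sqrt n} - \frac{3\sqrt n}{0.8} = 2.326,$$
which is quadratic in $\sqrt n$, whence we get $n \approx 322$. □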
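The equation of exercise 10.H.11 is quadratic in $s = \sqrt n$ and can be solved directly. The following sketch is an illustration only (it assumes Python 3.8 or newer for `statistics.NormalDist`, and its names are chosen here); it also checks the result against the normal approximation of the total mass.

```python
from math import floor, sqrt
from statistics import NormalDist

mu, sigma = 3.0, 0.8        # expected mass and standard deviation of one item (kg)
capacity = 1000.0           # one ton
level = 0.99                # required probability

z = NormalDist().inv_cdf(level)          # the quantile z(0.99)

# (capacity - mu n) / (sigma sqrt(n)) = z  is, with s = sqrt(n),
#   mu s^2 + z sigma s - capacity = 0;  take the positive root.
s = (-z * sigma + sqrt((z * sigma) ** 2 + 4 * mu * capacity)) / (2 * mu)
n_max = floor(s**2)

def prob_within_capacity(n: int) -> float:
    """Normal approximation of P(X_1 + ... + X_n <= capacity)."""
    return NormalDist(mu * n, sigma * sqrt(n)).cdf(capacity)

print(n_max, prob_within_capacity(n_max), prob_within_capacity(n_max + 1))
```

Adding one more item beyond `n_max` already pushes the approximate probability below the required level.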
I. Testing samples from the normal distribution
In subsection 10.3.4, we introduced the so-called two-sided interval estimate of an unknown parameter $\mu$ of the normal distribution $N(\mu, \sigma^2)$. In some cases, we may be interested only in an upper or lower estimate, i.e. a statistic $U$ or $L$ for which $P(\mu < U) = 1 - \alpha$ or $P(L < \mu) = 1 - \alpha$, respectively. Then, we talk about a one-sided confidence interval $(-\infty, U)$ or $(L, \infty)$. The formula for these intervals can be derived similarly as for the two-sided interval. Now, we have for the random variable $Z = \sqrt n\,\frac{\bar X - \mu}{\sigma} \sim N(0,1)$ that
$$1 - \alpha = \Phi(z(1 - \alpha)) = P(Z < z(1 - \alpha)).$$
Hence it immediately follows that
$$1 - \alpha = P\Big(\bar X - \frac{\sigma}{\sqrt n}\,z(1 - \alpha) < \mu\Big),$$
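Consider a random vector $X = (X_1, \dots, X_n)^T$ all of whose components have finite variances.
The covariance matrix of the random vector $X$ is defined in terms of the expected value as (notice the vector $X$ is viewed as a column of random variables now)
$$\operatorname{var} X = E\big((X - EX)(X - EX)^T\big).$$
Using the definition of the expected value of a vector and expanding the matrix multiplication, it is immediate that the covariance matrix $\operatorname{var} X$ is the symmetric matrix
$$\begin{pmatrix} \operatorname{var} X_1 & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_n) \\ \operatorname{cov}(X_2, X_1) & \operatorname{var} X_2 & \cdots & \operatorname{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(X_n, X_1) & \operatorname{cov}(X_n, X_2) & \cdots & \operatorname{var} X_n \end{pmatrix}.$$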
Theorem. Consider a random vector $X = (X_1, \dots, X_n)^T$ all of whose components have finite variances. Further, consider the transformed random vector $Y = BX + c$, where $B$ is an $m$-by-$n$ matrix of real constants and $c \in \mathbb{R}^m$ is a vector of constants. Then,
$$\operatorname{var}(Y) = \operatorname{var}(BX + c) = B(\operatorname{var} X)B^T.$$
Proof. The claim follows from direct computation, using the properties of the expected value:
$$\operatorname{var}(Y) = E\Big(\big((BX + c) - E(BX + c)\big)\big((BX + c) - E(BX + c)\big)^T\Big) = E\Big(\big(B(X - EX)\big)\big(B(X - EX)\big)^T\Big) = B\,E\big((X - EX)(X - EX)^T\big)B^T = B(\operatorname{var} X)B^T.$$
The constant part of the transformation has no impact, while with respect to the linear part of the transformation, the covariance matrix behaves as the matrix of a quadratic form.
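The transformation rule $\operatorname{var}(BX + c) = B(\operatorname{var} X)B^T$ can be verified exactly on a small example. The distribution, the matrix `B` and the vector `c` below are hypothetical values chosen for illustration, and the helper functions are written here in plain Python rather than taken from any library.

```python
from fractions import Fraction

# Hypothetical distribution of a random vector X = (X1, X2)^T: outcome -> probability.
dist = {(0, 0): Fraction(1, 2), (1, 2): Fraction(1, 4), (3, 1): Fraction(1, 4)}

def mean(d):
    """Vector of expected values of the components."""
    dim = len(next(iter(d)))
    return [sum(p * v[i] for v, p in d.items()) for i in range(dim)]

def covariance_matrix(d):
    """Matrix with entries E((X_i - E X_i)(X_j - E X_j))."""
    m = mean(d)
    dim = len(m)
    return [[sum(p * (v[i] - m[i]) * (v[j] - m[j]) for v, p in d.items())
             for j in range(dim)] for i in range(dim)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

B = [[1, 2], [0, -1], [3, 1]]    # a 3-by-2 matrix of constants
c = [5, -7, 2]                   # a constant vector in R^3

# Distribution of Y = BX + c, obtained by transforming each outcome of X.
dist_Y = {}
for v, p in dist.items():
    y = tuple(sum(B[i][j] * v[j] for j in range(2)) + c[i] for i in range(3))
    dist_Y[y] = dist_Y.get(y, 0) + p

lhs = covariance_matrix(dist_Y)
rhs = matmul(matmul(B, covariance_matrix(dist)), transpose(B))
print(lhs == rhs)
```

Changing `c` leaves both sides unchanged, in agreement with the remark that the constant part of the transformation has no impact.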
10.2.36. Moments and moment function. The expected value and variance reflect the square of the deviation of values of a random variable from the average. In descriptive statistics, one also examines the skewness of the data, and it is natural to examine the variability of random variables in terms of higher powers of the given random variable X.
The characteristic $E(X^k)$ is called the $k$-th moment; the characteristic $\mu_k = E\big((X - EX)^k\big)$ is called the $k$-th central moment of a random variable $X$. What also comes in handy is the $k$-th absolute moment, given by $E|X|^k$.
From the definition it follows that for a continuous random variable $X$,
$$E X^k = \int_{-\infty}^{\infty} x^k f_X(x)\,dx.$$
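so $L = \bar X - \frac{\sigma}{\sqrt n}\,z(1 - \alpha)$. Similarly, we find $U = \bar X + \frac{\sigma}{\sqrt n}\,z(1 - \alpha)$, and for a distribution with unknown variance, $\mu > \bar X - \frac{S}{\sqrt n}\,t_{n-1}(1 - \alpha)$ and $\mu < \bar X + \frac{S}{\sqrt n}\,t_{n-1}(1 - \alpha)$.
If we want to estimate the variance $\sigma^2$ of the distribution, then we use theorem 10.3.3, similarly as when we derived it for the expected value. This time, we use the second part of the theorem, by which the random variable $\frac{n-1}{\sigma^2} S^2$ has distribution $\chi^2_{n-1}$. Then, we can immediately see that
$$1 - \alpha = P\Big(\chi^2_{n-1}(\alpha/2) \le \frac{n-1}{\sigma^2}\,S^2 \le \chi^2_{n-1}(1 - \alpha/2)\Big).$$
Thus, the two-sided $100(1 - \alpha)\%$ confidence interval for the variance is
$$\Big(\frac{(n-1)S^2}{\chi^2_{n-1}(1 - \alpha/2)},\ \frac{(n-1)S^2}{\chi^2_{n-1}(\alpha/2)}\Big),$$
and similarly for the one-sided upper and lower estimates, we get
$$\sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1}(\alpha)}, \quad\text{resp.}\quad \frac{(n-1)S^2}{\chi^2_{n-1}(1 - \alpha)} < \sigma^2.$$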
10.I.1. We roll a die 600 times, obtaining only 45 sixes. Is it possible to say that the die is ideal at level $\alpha = 0.01$?
Solution. For an ideal die, the probability of rolling a six is always $p = \frac16$. The number of sixes in 600 rolls is given by a random variable $X$ with binomial distribution $X \sim \mathrm{Bi}(600, \frac16)$. By 10.2.40, this distribution can be approximated by the distribution $N(100, \frac{500}{6})$. The measured value $X = 45$ can be considered a random sample consisting of one item. Assuming that the variance is known and applying 10.3.4, we get that the 99% (two-sided) confidence interval for the expected value $\mu$ equals $\big(45 - \sqrt{\tfrac{500}{6}}\,z(0.995),\ 45 + \sqrt{\tfrac{500}{6}}\,z(0.995)\big)$. We learn that the quantile is approximately $z(0.995) \approx 2.58$, which gives the interval $(21, 69)$. However, for an ideal die, we clearly have $\mu = 100$, so our die is not ideal at level $\alpha = 0.01$. □
10.I.2. Suppose the height of 10-year-old boys has normal distribution $N(\mu, \sigma^2)$ with unknown expected value $\mu$ and variance $\sigma^2 = 39.112$. Taking the height of 15 boys, we get the sample mean $\bar X = 139.13$. Find
a) the 99% two-sided confidence interval for the parameter $\mu$,
b) the lower estimate for $\mu$ at significance level 95 %.
Similarly, for a discrete random variable $X$ whose probability is concentrated into points $x_i$,
$$E X^k = \sum_i x_i^k f_X(x_i).$$
The next theorem shows that all the moments completely describe the distribution of the random variable, as a rule.
For the sake of computations, it is advantageous to work with a power series in which the moments appear in the coefficients. Since the coefficients of the Taylor series of a function at a given point can be obtained using differentiation, it is easy to guess the right choice of such a function:
Moment generating function
Given a random variable $X$, consider the function $M_X(t)\colon \mathbb{R} \to \mathbb{R}$ defined by
$$M_X(t) = E\,e^{tX} = \begin{cases} \sum_i e^{t x_i} f_X(x_i) & \text{if $X$ is discrete,} \\ \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx & \text{if $X$ is continuous.} \end{cases}$$
If this expected value exists, we speak of the moment generating function of the random variable $X$.
It is clear that this function $M_X(t)$ is always analytic in the case of discrete random variables with finitely many values.
Theorem. Let $X$ be a random variable whose moment generating function exists and is analytic on an interval $(-a, a)$. Then, $M_X(t)$ is given on this interval by the absolutely convergent series
$$M_X(t) = \sum_{k=0}^{\infty} \frac{t^k}{k!}\,E X^k.$$
If two random variables X and Y share their moment generating functions over a nontrivial interval (—a, a), then their distribution functions coincide.
Proof. The verification of the first statement is a simple exercise on the techniques of differential and integral calculus. In the case of discrete variables, there are either finite sums or absolutely and uniformly converging series. In the case of continuous variables, there are absolutely converging integrals. Thus, the limit process and the differentiation can be interchanged. Since $\frac{d}{dt} e^{tx} = x\,e^{tx}$, it is immediate that
$$\frac{d^k}{dt^k} M_X(t)\Big|_{t=0} = E X^k,$$
as expected.
The second claim is obvious for two discrete variables $X$ and $Y$ with only a finite number of values $x_1, \dots, x_k$ for which either $f_X(x_i) \ne 0$ or $f_Y(x_i) \ne 0$. Indeed, the functions $e^{t x_i}$ are linearly independent functions and thus their coefficients in the common moment function
$$M(t) = e^{t x_1} f(x_1) + \dots + e^{t x_k} f(x_k)$$
must be the shared probability function values for both random variables $X$ and $Y$.
Solution. a) By 10.3.4, the $100(1 - \alpha)\%$ two-sided confidence interval for the unknown expected value $\mu$ of the normal distribution is
(1) $\mu \in \big(\bar X - \frac{\sigma}{\sqrt n}\,z(1 - \alpha/2),\ \bar X + \frac{\sigma}{\sqrt n}\,z(1 - \alpha/2)\big)$,
where $\bar X$ is the sample mean of $n$ items, $\sigma^2$ is the known variance, and $z(1 - \alpha/2)$ is the corresponding quantile. Substituting the given values $n = 15$, $\sigma \approx 6.254$ and the learned $z(0.995) \approx 2.576$, we get $\frac{\sigma}{\sqrt n}\,z(1 - \alpha/2) \approx 4.16$, i.e., $\mu \in (134.97, 143.29)$.
b) The lower estimate $L$ for the parameter $\mu$ at significance level 95 % is given by the expression $L = \bar X - \frac{\sigma}{\sqrt n}\,z(0.95)$. We learn that $z(0.95) \approx 1.645$, and direct substitution leads to $\mu \in (136.474, \infty)$. □
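The two estimates of exercise 10.I.2 follow from formula (1) and its one-sided analogue, so they are easy to recompute. This is a minimal sketch with function names chosen here (`two_sided_ci`, `lower_bound`), assuming Python 3.8 or newer for `statistics.NormalDist`; the quantiles come from the library instead of a table.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_ci(sample_mean, variance, n, alpha):
    """100(1 - alpha)% two-sided interval for the mean, variance known."""
    half = sqrt(variance / n) * NormalDist().inv_cdf(1 - alpha / 2)
    return sample_mean - half, sample_mean + half

def lower_bound(sample_mean, variance, n, alpha):
    """Lower estimate L with P(L < mu) = 1 - alpha, variance known."""
    return sample_mean - sqrt(variance / n) * NormalDist().inv_cdf(1 - alpha)

low, high = two_sided_ci(139.13, 39.112, 15, 0.01)
L = lower_bound(139.13, 39.112, 15, 0.05)
print(round(low, 2), round(high, 2), round(L, 3))
```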
10.I.3. A customer tests the quality of bought products by examining 21 randomly chosen ones. He will accept the delivery if the sample standard deviation does not exceed 0.2 mm. We know that the pursued property of the products has normal distribution of the form $N(10\ \text{mm};\ 0.0734\ \text{mm}^2)$. Using statistical tables, find the probability that the delivery will be accepted. How does the answer change if the customer, in order to save expenses, tests only 4 products?
Solution. The problem asks for the probability $P(S \le 0.2)$. By theorem 10.3.3, when sampling $n$ products, the random variable $\frac{n-1}{\sigma^2} S^2$ has distribution $\chi^2_{n-1}$. In our case, $n = 21$ and $\sigma^2 = 0.0734$, so
In the case of continuous variables $X$ and $Y$ sharing their moment generating function $M(t)$, the argument is more involved and only an indication is provided. Notice that $M(t)$ is analytic and thus it is defined for all complex numbers $t$, $|t| < a$. In particular,
$$M(it) = \int_{-\infty}^{\infty} e^{itx} f(x)\,dx,$$
which is the inverse Fourier transform of $f(x)$, up to the constant multiple $\sqrt{2\pi}$, see 7.2.5 (on page 475). If this works for all $t$, then clearly $f$ is obtained by the Fourier transform of $(\sqrt{2\pi})^{-1} M(it)$ and thus must be the same for both $X$ and $Y$. Further details, in particular covering general random variables, would need much more input from measure theory and Fourier analysis, and thus they are not provided here. □
It can be also shown that the assumptions of the theorem are true whenever both Mx (—a) < oo and Mx (a) < oo.
10.2.37. Properties of moment function. By the properties of the exponential functions, it is easy to compute the behaviour of the moment function under affine transformations and sums of independent random variables.
Proposition. Let $a, b \in \mathbb{R}$ and $X$, $Y$ be independent random variables with moment generating functions $M_X(t)$ and $M_Y(t)$, respectively. Then, the moment generating functions of the random variables $V = a + bX$ and $W = X + Y$ are
$$M_{a+bX}(t) = e^{at} M_X(bt), \qquad M_{X+Y}(t) = M_X(t)\,M_Y(t).$$
Proof. The first formula can be computed directly from the definition:
$$M_V(t) = E\,e^{(a + bX)t} = E\,e^{at} e^{btX} = e^{at} M_X(bt).$$
As for the second formula, recall that $e^{tX}$ and $e^{tY}$ are independent variables. Use the fact that the expected value of the product of independent random variables equals the product of the expected values:
$$M_W(t) = E\,e^{t(X+Y)} = E\,e^{tX} e^{tY} = E\,e^{tX}\,E\,e^{tY} = M_X(t)\,M_Y(t).$$
□
$$P(S \le 0.2) = P\Big(\frac{20}{0.0734}\,S^2 \le \frac{20}{0.0734} \cdot 0.2^2\Big) = \chi^2_{20}\Big(\frac{20 \cdot 0.2^2}{0.0734}\Big).$$
The expression in the argument of the distribution function is approximately 10.9, and we can learn from the table of the $\chi^2$ distribution that $\chi^2_{20}(10.9) \approx 0.05$. Thus, the probability that the delivery will be accepted is only 5 %. We could have expected the probability to be low: indeed, $E S^2 = \sigma^2 = 0.0734 > 0.2^2$. If the customer tests only 4 products, then the probability of acceptance is given by the expression $\chi^2_3\big(\frac{3 \cdot 0.2^2}{0.0734}\big) \approx \chi^2_3(1.63)$. The value of the distribution function of $\chi^2_3$ at this argument cannot be found in most tables. Therefore, we estimate it using linear interpolation. For instance, if the nearest known points are $\chi^2_3(0.58) = 0.1$ and $\chi^2_3(6.25) = 0.9$, then
$$\chi^2_3(1.63) \approx (1.63 - 0.58)\,\frac{0.9 - 0.1}{6.25 - 0.58} + 0.1 \approx 0.25.$$
10.2.38. Normal and binomial distributions. As an illustrating example, compute the moment generating functions of two random variables $X \sim N(\mu, \sigma^2)$ and $X \sim \mathrm{Bi}(n,p)$.
Moment generating function for $N(\mu, \sigma^2)$
Proposition. If $X \sim N(\mu, \sigma^2)$, then
$$M_X(t) = e^{\mu t}\,e^{\frac{\sigma^2 t^2}{2}}.$$
In particular, it is an analytic function on all of $\mathbb{R}$.
Although this result is only an estimate, we can be sure that the probability of acceptance is much greater than when testing 21 products. □
10.I.4. From a population with distribution $N(\mu, \sigma^2)$, where $\sigma^2 = 0.06$, we have sampled the values 1.3; 1.8; 1.4; 1.2; 0.9; 1.5; 1.7. Find the two-sided 95% confidence interval for the unknown expected value.
Solution. We have a random sample of size $n = 7$ from the normal distribution with known variance $\sigma^2 = 0.06$. The sample mean is
$$\bar X = \frac17\,(1.3 + 1.8 + 1.4 + 1.2 + 0.9 + 1.5 + 1.7) = 1.4,$$
and we can learn for the given level $\alpha = 0.05$ that $z(1 - \alpha/2) = z(0.975) \approx 1.96$. Substituting into (1), we immediately obtain the wanted interval $(1.22, 1.58)$. □
10.I.5. Let $X_1, \dots, X_n$ be a random sample from the distribution $N(\mu, 0.04)$. Find the least number of measurements that are necessary so that the length of the 95% confidence interval for $\mu$ would not exceed 0.16.
Solution. Since we have a normal distribution with known variance, we know from (1) that the length of the $100(1 - \alpha)\%$ confidence interval is $\frac{2\sigma}{\sqrt n}\,z(1 - \alpha/2)$. Substituting the given values, we get that the number $n$ of measurements satisfies the inequality
$$\frac{2 \cdot 0.2}{\sqrt n}\,z(0.975) \le 0.16.$$
Since $z(0.975) \approx 1.96$, we obtain $n \ge 24.01$. Thus, at least 25 measurements are necessary. □
10.I.6. Consider a random variable $X$ with distribution $N(\mu, \sigma^2)$, where $\mu$, $\sigma^2$ are unknown. The following table shows the frequencies $n_i$ of the individual values $x_i$ of this random variable:
x_i	8	11	12	14	15	16	17	18	20	21
n_i	1	2	3	4	7	5	4	3	2	1
Calculate the sample mean, sample variance, sample standard deviation, and find the 99% confidence interval for the expected value $\mu$.
Solution. The sample mean is given by the expression $\bar X = \sum n_i x_i / \sum n_i$. Substituting the given values, we get $\bar X = 490/32 \approx 15.3$. By definition, the sample variance is $S^2 = \sum n_i (x_i - \bar X)^2 / (\sum n_i - 1)$. Substituting the given values, we get $S^2 = 1943/248 \approx 7.8$, so the sample standard deviation is $S \approx 2.8$. The formula for the two-sided $100(1 - \alpha)\%$ confidence interval for the expected value $\mu$, when the variance
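Proof. Suppose $Z \sim N(0,1)$. Then
$$M_Z(t) = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}\,dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2 - 2tx + t^2 - t^2}{2}}\,dx = e^{\frac{t^2}{2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(x - t)^2}{2}}\,dx = e^{\frac{t^2}{2}},$$
where use is made of the fact that in the last-but-one expression, for every fixed $t$, the density of a continuous random variable is integrated; hence this integral equals one.
Substitute into the formula for the moment generating function $M_{\mu + \sigma Z}$ from 10.2.37 to obtain for $X \sim N(\mu, \sigma^2)$ that
$$M_X(t) = e^{\mu t}\,e^{\frac{\sigma^2 t^2}{2}},$$
again a function analytic over the entire $\mathbb{R}$.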
□
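In particular, the moments of $Z$ of all orders exist. Substitute $\frac12 t^2$ into the power series for the exponential function, and calculate them all:
$$M_Z(t) = \sum_{k=0}^{\infty} \frac{1}{k!}\Big(\frac{t^2}{2}\Big)^k = \sum_{k=0}^{\infty} \frac{1}{2^k k!}\,t^{2k} = 1 + 0t + \frac12 t^2 + 0t^3 + \frac{3}{4!}\,t^4 + \dots$$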
In particular, the expected value of Z is E Z = 0, and its variance is varZ = EZ2 - (EZ)2 = 1. Further, all moments of odd orders vanish, EZ4 =3, etc.
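Comparing the series above with the general form $M_Z(t) = \sum_k \frac{t^k}{k!} E Z^k$ gives a closed formula for all moments of $Z$: the coefficient of $t^{2j}$ is $\frac{1}{2^j j!}$, so $E Z^{2j} = \frac{(2j)!}{2^j j!}$. The sketch below evaluates it; the function name `normal_moment` is chosen here for illustration.

```python
from math import factorial

def normal_moment(k: int) -> int:
    """E Z^k for Z ~ N(0, 1), read off from M_Z(t) = sum_j t^(2j) / (2^j j!)."""
    if k % 2 == 1:
        return 0                      # odd powers of t do not occur in the series
    j = k // 2
    # coefficient of t^k equals E Z^k / k!, hence E Z^k = k! / (2^j j!)
    return factorial(k) // (2**j * factorial(j))

print([normal_moment(k) for k in range(1, 9)])
```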
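Moreover, by proposition 10.2.37 and the formula for $M_X$, the sum of independent normally distributed random variables $X \sim N(\mu, \sigma^2)$ and $Y \sim N(\mu', \sigma'^2)$ again has a normal distribution, $X + Y \sim N(\mu + \mu', \sigma^2 + \sigma'^2)$.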
Similarly, for the discrete random variable $X \sim \mathrm{Bi}(n,p)$, we compute
$$M_X(t) = E\,e^{tX} = \sum_{k=0}^{n} (p e^t)^k \binom{n}{k} (1 - p)^{n-k} = (p e^t + (1 - p))^n = (p(e^t - 1) + 1)^n = 1 + npt + \frac12\big(n(n-1)p^2 + np\big)t^2 + \dots$$
Of course, the same can be computed even more easily using proposition 10.2.37, since $X$ is the sum of $n$ independent variables $Y \sim A(p)$ with the Bernoulli distribution. Therefore,
$$E\,e^{tX} = (E\,e^{tY})^n = (p e^t + (1 - p))^n.$$
Hence all the moments of the variable $Y$ equal $p$. Therefore, $EY = p$, while $\operatorname{var} Y = p(1 - p)$. From the moment function $M_X(t)$, $EX = np$ and $\operatorname{var} X = EX^2 - (EX)^2 = np(1 - p)$.
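is unknown, was derived at the end of subsection 10.3.4:
$$\mu \in \Big(\bar X - \frac{S}{\sqrt n}\,t_{n-1}(1 - \alpha/2),\ \bar X + \frac{S}{\sqrt n}\,t_{n-1}(1 - \alpha/2)\Big).$$
Substituting $\bar X = 15.3$, $n = 32$, $S \approx 2.8$, $\alpha = 0.01$, and the learned quantile $t_{31}(0.995) \approx 2.75$, we find that the 99% confidence interval is $\mu \in (14.0, 16.7)$. □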
10.I.7. Using the following table of the distribution function of the normal distribution, find the probability that the absolute difference between the heads and the tails in 3600 tosses of a coin is greater than 90.
[Table: Standard Normal Distribution Table]
Solution. Let $X$ denote the random variable that corresponds to the number of heads. Then, $X$ has binomial distribution $\mathrm{Bi}(3600, 1/2)$ (with expected value 1800 and standard deviation 30), so by the de Moivre-Laplace theorem, for large values of $n$, the distribution function of the variable $\frac{X - 1800}{30}$ can be approximated by the distribution function $\Phi$ of the normal distribution. Thus, the wanted probability is
$$P = 1 - P[1755 \le X \le 1845] = 1 - P\Big[-1.5 \le \frac{X - 1800}{30} \le 1.5\Big]$$
10.2.39. Skewness and kurtosis. Since the third central moment is given in terms of third powers of deviations from the expected value, it expresses to a certain extent the symmetry of the distribution of the random variable around the expected value. In descriptive statistics, we describe this by the coefficient of skewness. For random variables, we similarly use the characteristic
$$\gamma_1 = \frac{E(X - EX)^3}{(\sqrt{\operatorname{var} X})^3},$$
which is called the coefficient of skewness of a random variable $X$.
Another commonly used characteristic is the kurtosis of a random variable $X$, defined as
$$\gamma_2 = \frac{E(X - EX)^4}{(\operatorname{var} X)^2} - 3.$$
The standard normal distribution has third central moment equal to zero and the fourth one equal to 3. Thus, the kurtosis is standardized so that its value for the standard normal distribution is zero. For a general distribution, the kurtosis provides comparison to the normal distribution.
In practice, there are other standardizations of skewness coefficients and kurtosis.
10.2.40. Law of large numbers. Now, we can consider the key tools which connect probability and statistics. We start with the generalization of Bernoulli's theorem about the binomial distribution, discussed at the end of subsection 10.2.32. The random variables $\frac1n X_n$, where $X_n \sim \mathrm{Bi}(n,p)$, can be viewed as the arithmetic means of $n$ independent variables with distribution $A(p)$, and Bernoulli's theorem then says that these means converge in probability to $p$.
Such a proposition holds in general. Independence of the variables is not needed; it suffices that $\operatorname{cov}(X_i, X_j) = 0$ for $i \ne j$, which guarantees that the variances sum up.
The law of large numbers
Proposition. Consider a sequence of pairwise uncorrelated random variables $X_1, X_2, \dots$ which have the same finite expected value $E X_i = \mu$. Moreover, assume the variances are bounded, so that $\operatorname{var} X_i \le C$ for a fixed constant $C$. Then for any $\varepsilon > 0$,
$$\lim_{n\to\infty} P\Big(\Big|\frac1n \sum_{i=1}^n X_i - \mu\Big| < \varepsilon\Big) = 1.$$
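Proof. Using Chebyshev's inequality just as at the end of subsection 10.2.32,
$$P\Big(\Big|\frac1n \sum_{i=1}^n X_i - \mu\Big| \ge \varepsilon\Big) \le \frac{\operatorname{var}\big(\frac1n \sum_{i=1}^n X_i - \mu\big)}{\varepsilon^2} = \frac{\frac{1}{n^2}\sum_{i=1}^n \operatorname{var} X_i}{\varepsilon^2} \le \frac{C}{n\varepsilon^2}.$$
For the coin-toss probability, the normal approximation gives $P \approx 2\Phi(-1.5) = 0.1336$, where the last value was learned from the table.
□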
711
CHAPTER 10. STATISTICS AND PROBABILITY THEORY
10.1.8. The probability that a newborn baby is a boy is 0.515. Find the probability that there are at least the same number of girls as boys among ten thousand babies.
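Solution. Let $X$ denote the number of boys among the ten thousand babies, so that $X \sim \mathrm{Bi}(10000, 0.515)$ with expected value $5150$ and variance $5150 \cdot 0.485$. Then
$$P[X \le 5000] = P\left[\frac{X - 5150}{\sqrt{5150 \cdot 0.485}} \le \frac{-150}{\sqrt{5150 \cdot 0.485}}\right] \approx 0.00135,$$
where the distribution of the standardized variable is approximated by $N(0,1)$.
□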
10.1.9. Using the distribution function of the standard normal distribution, find the probability that we get at least 3100 sixes out of 18000 rolls of a six-sided die.
Solution. We proceed similarly as in the exercises above. The number $X$ of sixes has the binomial distribution $\mathrm{Bi}(18000, 1/6)$. We find the expected value $(1/6) \cdot 18000 = 3000$ as well as the standard deviation $\sqrt{(1/6)(1 - 1/6) \cdot 18000} = 50$. Therefore, the distribution of the variable $\frac{X - 3000}{50}$ can be approximated by the distribution function $\Phi$ of the standard normal distribution:
$$P[X \ge 3100] = P\left[\frac{X - 3000}{50} \ge \frac{3100 - 3000}{50}\right] = P\left[\frac{X - 3000}{50} \ge 2\right] \approx 1 - \Phi(2) = 0.0228.$$
□
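The table lookups in exercises 10.1.8 and 10.1.9 can be reproduced with the error function from the standard library. This is only a sketch: the helper names `phi` and `binomial_tail_normal_approx` are assumptions made for illustration, and no continuity correction is applied, in line with the solutions above.

```python
import math


def phi(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binomial_tail_normal_approx(n, p, k):
    """Approximate P[X >= k] for X ~ Bi(n, p) by the normal distribution."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return 1.0 - phi((k - mean) / sd)


# Exercise 10.1.9: at least 3100 sixes in 18000 rolls.
sixes = binomial_tail_normal_approx(18000, 1 / 6, 3100)

# Exercise 10.1.8: at most 5000 boys among 10000 babies.
girls = phi((5000 - 5150) / math.sqrt(5150 * 0.485))

print(round(sixes, 4), round(girls, 5))
```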
10.1.10. A public opinion agency organizes a survey of preferences of five political parties. How many randomly selected respondents must answer so that the probability that for each party, the survey result differs from the actual preference by no more than 2% is at least 0.95?
Solution. Let $p_i$, $i = 1, \dots, 5$, be the actual relative frequency of voters of the $i$-th political party in the population, and let $X_i$ denote the number of voters of this party among $n$ randomly chosen people. Note that given any five intervals, the events corresponding to $X_i/n$ falling into the corresponding interval may be dependent. If we choose $n$ so that for each $i$, $X_i/n$ falls into the given interval with probability at least $1 - ((1 - 0.95)/5) = 0.99$, then the desired condition is sure to hold even in spite of the dependencies. Thus, writing $X$ and $p$ for $X_i$ and $p_i$, let us look for $n$ such that $P\left[\left|\frac{X}{n} - p\right| < 0.02\right] \ge 0.99$. First of all, we
Thus, the probability $P$ is bounded from below by
$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| < \varepsilon\right) \ge 1 - \frac{C}{n\varepsilon^2},$$
which proves the proposition.
□
Thus, existence and uniform boundedness of the variances suffice for the means of pairwise uncorrelated variables $X_i$ with zero expected value to converge (in probability) to zero.
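The bound from the proof can be compared with a simulation. The sketch below assumes independent Bernoulli variables with $p = 0.3$ (so the variances are bounded by $C = p(1-p)$); the function names and the chosen values of $n$ and $\varepsilon$ are illustrative only.

```python
import random


def chebyshev_lower_bound(n, eps, c):
    """Lower bound 1 - C/(n*eps^2) for P(|mean - mu| < eps) from the proof."""
    return max(0.0, 1.0 - c / (n * eps ** 2))


def sample_mean(n, p, rng):
    """Arithmetic mean of n independent Bernoulli(p) variables."""
    return sum(1 for _ in range(n) if rng.random() < p) / n


rng = random.Random(0)  # fixed seed so that the run is reproducible
p = 0.3
for n in (100, 10_000, 100_000):
    mean = sample_mean(n, p, rng)
    bound = chebyshev_lower_bound(n, 0.01, p * (1 - p))
    print(n, mean, bound)
```

The bound is trivial (zero) for small $n$ and approaches one only slowly, whereas the simulated means are typically much closer to $p$ than it suggests; the central limit theorem gives a sharper description of the fluctuations.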
10.2.41. Central limit theorem. The next goal is more ambitious. In addition to the law of large numbers, the stochastic properties of the fluctuation of the means $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ around the expected value $\mu$ need to be understood. We focus first on the simplest case of sequences of independent and identically distributed random variables $X_i$. Then we formulate a more general version of the theorem and provide only comments on the proofs.
We move to a sequence of normalized random variables $X_i$. Assume $\mathrm{E}X_i = 0$ and $\operatorname{var} X_i = 1$. Assume further that the moment generating function $M_X(t)$ exists and is shared by all the variables $X_i$.
The arithmetic means $\frac{1}{n}\sum_{i=1}^n X_i$ are, of course, random variables with zero expected value, yet their variances are $\frac{n}{n^2} = \frac{1}{n}$. Thus, it is reasonable to renormalize them to
$$S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i,$$
which are again standardized random variables. Their moment generating functions are (see proposition 10.2.37)
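$$M_{S_n}(t) = \mathrm{E}\, e^{\frac{t}{\sqrt{n}}\sum_{i} X_i} = \left(M_X\left(\frac{t}{\sqrt{n}}\right)\right)^n.$$
Since it is assumed that the variables $X_i$ are standardized,
$$M_X\left(\frac{t}{\sqrt{n}}\right) = 1 + 0 \cdot \frac{t}{\sqrt{n}} + 1 \cdot \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right),$$
where again $o(G(n))$ is written for expressions which, when divided by $G(n)$, approach zero as $n \to \infty$, see subsection 6.1.16.
Thus, in the limit,
$$\lim_{n\to\infty} M_{S_n}(t) = \lim_{n\to\infty}\left(1 + \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right)^n = e^{t^2/2}.$$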
This is just the moment generating function of the normal distribution Z ~ N(0,1), see the end of subsection 10.2.35. Thus, the standardized variables Sn asymptotically have the standard normal distribution.
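The limit can be observed numerically for a concrete standardized variable. As an assumed example (not taken from the text), let $X$ take the values $\pm 1$ with probability $1/2$ each, so that $\mathrm{E}X = 0$, $\operatorname{var}X = 1$ and $M_X(t) = \cosh t$.

```python
import math


def mgf_of_normalized_sum(t, n):
    """M_{S_n}(t) = (M_X(t / sqrt(n)))^n for X = +1 or -1 with probability 1/2."""
    return math.cosh(t / math.sqrt(n)) ** n


def mgf_standard_normal(t):
    """Moment generating function e^{t^2/2} of N(0, 1)."""
    return math.exp(t * t / 2)


t = 1.0
for n in (1, 10, 1000, 10 ** 6):
    print(n, mgf_of_normalized_sum(t, n), mgf_standard_normal(t))
```

The printed values approach $e^{1/2}$ as $n$ grows, in agreement with the computation above.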
We have thus proved a special version of the following fundamental theorem. Although the calculation is merely a manipulation of moment generating functions, many special cases were proved in different ways, providing explicit estimates for the speed of convergence, which of course is useful information in practice.
Notice that the following theorem does not require the probability distributions of the variables $X_i$ to coincide!
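rearrange the expression:
$$\begin{aligned}
P\left[\left|\frac{X}{n} - p\right| < 0.02\right] &= P\left[-0.02 < \frac{X}{n} - p < 0.02\right] = P\left[-0.02\, n < X - pn < 0.02\, n\right] \\
&= P\left[\frac{-0.02\, n}{\sqrt{np(1-p)}} < \frac{X - pn}{\sqrt{np(1-p)}} < \frac{0.02\, n}{\sqrt{np(1-p)}}\right] \\
&\approx \Phi\left(\frac{0.02\, n}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{-0.02\, n}{\sqrt{np(1-p)}}\right) = 2\Phi\left(\frac{0.02\, n}{\sqrt{np(1-p)}}\right) - 1,
\end{aligned}$$
where $\Phi$ is the distribution function of the standard normal distribution. Thus, let us solve the inequality
$$2\Phi\left(\frac{0.02\, n}{\sqrt{np(1-p)}}\right) - 1 \ge 0.99, \quad\text{i.e.,}\quad \Phi\left(\frac{0.02\, n}{\sqrt{np(1-p)}}\right) \ge 0.995.$$
Since the distribution function is increasing, the last condition is equivalent to
$$\frac{0.02\, n}{\sqrt{np(1-p)}} \ge \Phi^{-1}(0.995) = 2.576, \qquad \sqrt{n} \ge 50 \cdot 2.576 \cdot \sqrt{p(1-p)}, \qquad n \ge (25 \cdot 2.576)^2 \approx 4147.$$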
Here, we used the fact that the maximum of the function $p(1 - p)$ is $\frac{1}{4}$, and it is reached at $p = \frac{1}{2}$. We can see that if, e.g., $p = 0.1$, then $\sqrt{p(1 - p)} = 0.3$ and the value of the least $n$ is lower. This accords with our expectations: for less popular parties, it suffices to have fewer respondents (if the agency estimates the gain of such a party to be around 2 % without asking anybody, then the wanted precision is almost guaranteed).
□
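The sample size from the solution can be recomputed for any tolerance and confidence. This is a sketch under the same normal approximation; the function name `required_sample_size` and its parameters are illustrative assumptions, and `NormalDist().inv_cdf` from the standard library plays the role of the table of $\Phi^{-1}$.

```python
import math
from statistics import NormalDist


def required_sample_size(eps, confidence, p=0.5):
    """Least n with P[|X/n - p| < eps] >= confidence (normal approximation).

    Solves eps * n / sqrt(n p (1 - p)) >= z with z = Phi^{-1}((1 + confidence) / 2).
    The default p = 0.5 is the worst case, since p (1 - p) <= 1/4.
    """
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil((z / eps) ** 2 * p * (1 - p))


# Five parties, overall probability 0.95, hence 0.99 for each single party.
per_party = 1 - (1 - 0.95) / 5
n_worst_case = required_sample_size(0.02, per_party)
n_small_party = required_sample_size(0.02, per_party, p=0.1)
print(n_worst_case, n_small_party)
```

The same function also covers other settings, for instance a tolerance of 0.05 with confidence 0.9, where the worst case needs a few hundred respondents only.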
10.1.11. Two-sample test. Consider random vectors $Y_1$ and $Y_2$ all of whose components are pairwise independent random variables with normal distribution, and suppose that the components of the vector $Y_i$ have expected value $\mu_i$ and the variance $\sigma^2$ is the same for all the components of both vectors.
Central limit theorem
Theorem. Consider a sequence of independent random variables $X_i$ which have the same expected value $\mathrm{E}X_i = \mu$, variance $\operatorname{var} X_i = \sigma^2 > 0$ and uniformly bounded third absolute moment $\mathrm{E}|X_i|^3 < C$. Then, the distribution of the random variable
$$S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{X_i - \mu}{\sigma}$$
satisfies
$$\lim_{n\to\infty} P(S_n < x) = \Phi(x),$$
where $\Phi$ is the distribution function of the standard normal distribution.
Note that the central limit theorem gives a result on asymptotic behaviour which says that the distribution functions of certain variables approach the standard normal distribution. Such behaviour is called convergence in distribution. This type of convergence is weaker than convergence in probability.
The assumption that all $X_i$ are independent and identically distributed was not fully exploited in the argumentation above. Only the knowledge of $\mathrm{E}X_i = 0$ and $\operatorname{var} X_i = 1$ was used. The assumption of the uniformly bounded third absolute moments of $X_i$ can be used to prove the existence of the moment generating functions. The estimate of $\mathrm{E}|X_i|^3$ can then be used to complete the proof exactly as above.
There are many more general results. We mention at least Lyapunov's central limit theorem, formulated as follows: Consider a sequence of independent random variables $X_i$ with finite expected values $\mu_i$ and variances $\sigma_i^2$. Write
$$s_n^2 = \sum_{i=1}^n \sigma_i^2$$
and assume that for some $\delta > 0$
$$\lim_{n\to\infty} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n \mathrm{E}|X_i - \mu_i|^{2+\delta} = 0.$$
Then $\frac{1}{s_n}\sum_{i=1}^n (X_i - \mu_i)$ converges in distribution to $Z \sim N(0,1)$.
The previous version of the central limit theorem is derived by choosing $\delta = 1$. Then $s_n = \sigma\sqrt{n}$ and the condition of Lyapunov's theorem reads
$$0 \le \lim_{n\to\infty} n^{-3/2}\sigma^{-3}\sum_{i=1}^n \mathrm{E}|X_i|^3 \le C\sigma^{-3}\lim_{n\to\infty} n^{-3/2+1} = 0.$$
10.2.42. De Moivre-Laplace theorem. Historically, the first formulated special case of the central limit theorem was that of variables $Y_n$ with binomial distribution $\mathrm{Bi}(n,p)$. They can be viewed as the sum of $n$ independent variables $X_i$ with Bernoulli distribution $A(p)$, $0 < p < 1$. These variables have moment generating functions and $\mathrm{E}|X_i|^3 = p < 1$.
Thus, the central limit theorem says in this case that the random variables
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{X_i - p}{\sqrt{p(1-p)}} = \frac{X - np}{\sqrt{np(1-p)}}$$
behave asymptotically as the standard normal distribution.
This can be formulated as follows: the random variable $X \sim \mathrm{Bi}(n,p)$ behaves as a random variable with normal distribution $N(np, np(1-p))$ as $n$ increases.
This behaviour is demonstrated in the illustration at the end of 10.2.21.
In practice, approximation of the binomial distribution by the normal distribution is usually considered appropriate if $np(1 - p) > 9$.
We illustrate the result with a concrete example. Suppose it is desired to know what percentage of students like a given course, with an error of at most 5 %. The number of people who like the course among $n$ randomly chosen people should behave as the random variable $X \sim \mathrm{Bi}(n,p)$. Further, suppose the result is desired to be correct with confidence (i.e., probability again) of at least 90 %. Thus,
$$P\left(\left|\frac{1}{n}X - p\right| < 0.05\right) \ge 0.9$$
is desired by choosing a high enough number $n$ of students to ask.
Approximate
$$\begin{aligned}
0.9 \le P\left(\left|\frac{1}{n}X - p\right| < 0.05\right) &= P\left(\frac{-0.05n}{\sqrt{np(1-p)}} < \frac{X - np}{\sqrt{np(1-p)}} < \frac{0.05n}{\sqrt{np(1-p)}}\right) \\
&\approx \Phi\left(\frac{0.05n}{\sqrt{np(1-p)}}\right) - \Phi\left(-\frac{0.05n}{\sqrt{np(1-p)}}\right) = 2\Phi\left(\frac{0.05n}{\sqrt{np(1-p)}}\right) - 1,
\end{aligned}$$
where the symmetry of the density function of the normal distribution is exploited. Thus,
$$\Phi\left(\frac{0.05n}{\sqrt{np(1-p)}}\right) \ge \frac{1}{2}(1 + 0.9) = 0.95$$
is wanted. This leads to the choice (recall the definition of critical values $z(\alpha)$ for a variable $Z$ with standard normal distribution in subsection 10.2.30)
$$\frac{0.05n}{\sqrt{np(1-p)}} \ge z(0.05) = 1.64485.$$
Since $p(1 - p)$ is at most $\frac{1}{4}$, the necessary number of students can be bounded by $n > 270$, independently of $p$.

Use the general linear model to test the hypothesis whether $\mu_1 = \mu_2$.
Solution. We will proceed quite similarly as in subsection 10.3.12 of the theoretical part. This time, we can write both vectors $Y_i$ into one column, and we consider the model
$$\begin{pmatrix} Y_{11} \\ \vdots \\ Y_{1n_1} \\ Y_{21} \\ \vdots \\ Y_{2n_2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \sigma Z.$$
We will work with the arithmetic means $\bar Y_1$ and $\bar Y_2$ of the individual vectors $Y_1$ and $Y_2$. Direct application of the general formula from theory gives the estimate $b$ in the form
$$\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} n_1 + n_2 & n_2 \\ n_2 & n_2 \end{pmatrix}^{-1} \begin{pmatrix} n_1\bar Y_1 + n_2\bar Y_2 \\ n_2\bar Y_2 \end{pmatrix} = \frac{1}{n_1 n_2}\begin{pmatrix} n_2 & -n_2 \\ -n_2 & n_1 + n_2 \end{pmatrix} \begin{pmatrix} n_1\bar Y_1 + n_2\bar Y_2 \\ n_2\bar Y_2 \end{pmatrix} = \begin{pmatrix} \bar Y_1 \\ \bar Y_2 - \bar Y_1 \end{pmatrix},$$
and for the matrix $C = (X^T X)^{-1}$, where $X$ is the 2-column matrix with zeros and ones from our model, we have
$$C = \begin{pmatrix} \frac{1}{n_1} & -\frac{1}{n_1} \\ -\frac{1}{n_1} & \frac{n_1 + n_2}{n_1 n_2} \end{pmatrix}.$$
Thus, we test the hypothesis $\mu_1 = \mu_2$, which means that we test whether $\beta_2 = 0$. For this, it is suitable to use the statistic
$$T = \frac{\bar Y_2 - \bar Y_1}{S}\left(\frac{n_1 n_2}{n_1 + n_2}\right)^{1/2},$$
where the standard deviation $S$ is substituted as
$$S^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(Y_{1i} - \bar Y_1)^2 + \sum_{i=1}^{n_2}(Y_{2i} - \bar Y_2)^2\right).$$
The distribution of this statistic is $t_{n_1+n_2-2}$, so the null hypothesis $\mu_1 = \mu_2$ is rejected at level $\alpha$ (two-sided, with the critical values used as in exercise 10.1.12) if we have
$$|T| \ge t_{n_1+n_2-2}(\alpha/2).$$
□
10.1.12. In JZD$^1$ Tempo, the milk yield of their cows was measured during five days, the results being 15, 14, 13, 16 and 17 hectoliters. In JZD Boj, which had the same number of cows, they performed the same measurement during seven days, the results being 12, 16, 13, 15, 13, 11, 18 hectoliters.
a) Find the 95% confidence interval for the milk yield of JZD Boj's cows, and the 95% confidence interval for the milk yield of JZD Tempo's cows.
b) On the 5% level, test the hypothesis that both farms have cows of the same quality.
JZD —jednotné zemědělské družstvo — an agricultural cooperative farm, created by forced collectivization in 1950s in Czechoslovakia.
Suppose that the milk yield of the cows in each day is given by the normal distribution. Solve these problems assuming that there are no data from previous measurements, and then assuming that the previous measurements showed that the standard deviation was $\sigma = 2$ hl.
Solution. First of all, let us compute the results for the known variance. In order to find the confidence interval, we use the statistic
$$\frac{\bar X - \mu}{\sigma}\sqrt{n},$$
which has standardized normal distribution (see 10.2.21). Then, the confidence interval is (see 10.3.4)
$$\left(\bar X - \frac{\sigma}{\sqrt{n}}\, z(\alpha/2),\ \bar X + \frac{\sigma}{\sqrt{n}}\, z(\alpha/2)\right),$$
where $\alpha = 0.05$. Now, it suffices to substitute the specific values. For JZD Tempo, we thus get the sample mean
$$\bar X_1 = \frac{15 + 14 + 13 + 16 + 17}{5} = 15,$$
and using appropriate software, we can learn that $z(0.025) = 1.96$, which gives the interval
$$\left(15 - \frac{2}{\sqrt{5}} \cdot 1.96,\ 15 + \frac{2}{\sqrt{5}} \cdot 1.96\right) = (13.25; 16.75).$$
For JZD Boj, we get
$$\bar X_2 = \frac{12 + 16 + 13 + 15 + 13 + 11 + 18}{7} = 14,$$
so the 95% confidence interval for the milk yield of their cows is
(12.52; 15.48).
If the variance of the measurements is not known, we use the so-called sample variance $S^2$ for the estimate. In order to find the confidence interval, we use the statistic
$$\frac{\bar X - \mu}{S}\sqrt{n},$$
which has Student's distribution with $n - 1$ degrees of freedom (see also 10.3.4). Then, we can analogously obtain the 95% confidence interval
$$\left(\bar X - \frac{S}{\sqrt{n}}\, t_{n-1}(\alpha/2),\ \bar X + \frac{S}{\sqrt{n}}\, t_{n-1}(\alpha/2)\right).$$
For the values of JZD Tempo, we get the sample variance
$$S^2 = \frac{1}{4}\left(0^2 + (-1)^2 + (-2)^2 + 1^2 + 2^2\right) = \frac{5}{2},$$
i.e., $S = 1.58$. Further, we have $t_4(0.025) = 2.78$, so the 95% confidence interval for JZD Tempo is
(13.03; 16.97).
10.2.43. Important distributions. In the sequel, we return to statistics. It should be of no surprise that we work with characteristics of random vectors similar to the sample mean and variance, as well as relative quotients of such characteristics, etc. We consider several such cases.
Consider a random variable $Z \sim N(0,1)$, and compute the density $f_Y(x)$ of the random variable $Y = Z^2$. Clearly, $f_Y(x) = 0$ for $x \le 0$, while for positive $x$,
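$$F_Y(x) = P(Y < x) = P(-\sqrt{x} < Z < \sqrt{x}) = \int_{-\sqrt{x}}^{\sqrt{x}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = \int_0^x \frac{1}{\sqrt{2\pi}}\, t^{-1/2} e^{-t/2}\, dt.$$
Differentiation leads to
$$f_Y(x) = \frac{d}{dx}F_Y(x) = \frac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2}.$$
This distribution is called $\chi^2$ with one degree of freedom, written $Y \sim \chi^2$.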
We work with sums of such independent variables. All fall into a general class of distributions whose densities are of the form
$f_X(x) = c\, x^{a-1} e^{-bx}$ for $x > 0$, while $f_X(x) = 0$ for non-positive $x$; i.e., the distribution $\chi^2$ corresponds to the choice $a = b = 1/2$. This case is already thoroughly discussed as an example in subsection 10.2.20. Hence such a function is a density for the constant $c = \frac{b^a}{\Gamma(a)}$. Thus, it is the distribution $\Gamma(a, b)$ with density, for positive $x$,
$$f_X(x) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}.$$
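In general, the $k$-th moment of such a variable $X$ is easily computed:
$$\mathrm{E}X^k = \int_0^\infty x^k\, \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}\, dx = \frac{\Gamma(a+k)}{\Gamma(a)\, b^k}\int_0^\infty \frac{b^{a+k}}{\Gamma(a+k)}\, x^{a-1+k} e^{-bx}\, dx = \frac{\Gamma(a+k)}{\Gamma(a)\, b^k},$$
since the integral of the density of $\Gamma(a + k, b)$ in the last expression must be equal to one.
In particular, $\mathrm{E}X = \frac{\Gamma(a+1)}{b\,\Gamma(a)} = \frac{a}{b}$, while
$$\operatorname{var} X = \frac{\Gamma(a+2)}{b^2\,\Gamma(a)} - \frac{a^2}{b^2} = \frac{(a+1)a - a^2}{b^2} = \frac{a}{b^2}.$$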
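Similarly, the moment generating function can be computed for all values $t$, $-b < t < b$:
$$M_X(t) = \int_0^\infty e^{tx}\, \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}\, dx = \frac{b^a}{(b-t)^a}\int_0^\infty \frac{(b-t)^a}{\Gamma(a)}\, x^{a-1} e^{-(b-t)x}\, dx = \left(\frac{b}{b-t}\right)^a.$$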
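The closed formulas for the moments and the moment generating function of $\Gamma(a, b)$ are easy to evaluate with `math.gamma`. The following sketch uses hypothetical helper names and checks them on the $\chi^2$ distribution with one degree of freedom, i.e. $a = b = 1/2$.

```python
import math


def gamma_moment(a, b, k):
    """k-th moment E X^k = Gamma(a + k) / (Gamma(a) b^k) of X ~ Gamma(a, b)."""
    return math.gamma(a + k) / (math.gamma(a) * b ** k)


def gamma_mean_and_variance(a, b):
    """Expected value a / b and variance a / b^2, computed from the moments."""
    m1 = gamma_moment(a, b, 1)
    m2 = gamma_moment(a, b, 2)
    return m1, m2 - m1 ** 2


def gamma_mgf(a, b, t):
    """Moment generating function (b / (b - t))^a, for -b < t < b as in the text."""
    if not -b < t < b:
        raise ValueError("t must satisfy -b < t < b")
    return (b / (b - t)) ** a


# chi^2 with one degree of freedom: expected value 1 and variance 2.
chi2_mean, chi2_var = gamma_mean_and_variance(0.5, 0.5)
print(chi2_mean, chi2_var, gamma_mgf(0.5, 0.5, 0.25))
```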
For JZD Boj, we get the sample variance $S_2^2 = 6$, so the wanted confidence interval is
(11.73; 16.27).
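The four confidence intervals of part a) can be recomputed as follows. This is a sketch only: the function names are assumptions, the normal quantile comes from `statistics.NormalDist`, and the Student quantiles $t_4(0.025) = 2.78$ (as quoted above) and $t_6(0.025) = 2.447$ (a standard table value) are entered by hand because the standard library has no $t$-distribution.

```python
import math
from statistics import NormalDist, mean, variance

tempo = [15, 14, 13, 16, 17]
boj = [12, 16, 13, 15, 13, 11, 18]


def ci_known_sigma(data, sigma, alpha=0.05):
    """Confidence interval for the mean when the standard deviation is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z(alpha/2)
    half_width = sigma / math.sqrt(len(data)) * z
    return mean(data) - half_width, mean(data) + half_width


def ci_unknown_sigma(data, t_critical):
    """Confidence interval using the sample variance (coefficient 1/(n-1))."""
    s = math.sqrt(variance(data))
    half_width = s / math.sqrt(len(data)) * t_critical
    return mean(data) - half_width, mean(data) + half_width


print(ci_known_sigma(tempo, 2.0), ci_known_sigma(boj, 2.0))
print(ci_unknown_sigma(tempo, 2.78), ci_unknown_sigma(boj, 2.447))
```

The intervals based on the sample variance are wider, which reflects the additional uncertainty about $\sigma$.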
b) If we compare the expected values of milk yield in both farms, then this is a comparison of the expected values of two independent choices from the normal distribution. In the case of unknown variances, we further assume that the variance is the same for both farms.
Thus, let us examine the hypothesis assuming the known variances $\sigma_1^2 = \sigma_2^2 = 4$. We use the statistic
$$U = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0,1),$$
where $\mu_1$ and $\mu_2$ are the unknown expected values of milk yield in the examined farms (the second equality holds under the tested hypothesis $\mu_1 = \mu_2$), and $n_1$, $n_2$ are the numbers of measurements. This statistic has, as indicated, the standardized normal distribution. We reject the hypothesis at the 5% level if and only if the absolute value of the statistic $U$ is greater than $z(0.025)$, i.e., if and only if 0 does not lie in the 95% confidence interval for the difference of the expected values of milk yield in both farms. For the specific values, we get
$$U = \frac{15 - 14}{\sqrt{\frac{4}{5} + \frac{4}{7}}} = 0.854.$$
Thus, we have $|U| < z(0.025) = 1.96$, so the hypothesis that the expected values of milk yield are the same in both farms is not rejected at the 5% level. The reached p-value of the test (see 10.3.9) is 39.4%, so we did not get much closer to rejecting the hypothesis (the probability that the absolute value of the examined statistic is less than 0.854, provided the null hypothesis holds, is 60.6%).
If we do not know the variances of the measurements but we know that they must be equal in both farms, we use the statistic
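$$K = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{\bar X_1 - \bar X_2}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2},$$
where
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.$$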
Thus, for the sum of independent variables $Y = X_1 + \dots + X_n$ with distributions $X_i \sim \Gamma(a_i, b)$, the moment generating function (for values $|t| < b$) is obtained:
$$M_Y(t) = \left(\frac{b}{b-t}\right)^{a_1 + \dots + a_n},$$
that is, $Y \sim \Gamma(a_1 + \dots + a_n, b)$. It is essential that all of the gamma distributions share the same value of $b$.
As an immediate corollary, the density of the variable $Y = Z_1^2 + \dots + Z_n^2$ is obtained, where $Z_i \sim N(0,1)$. As just shown, this is the gamma distribution $Y \sim \Gamma(n/2, 1/2)$; hence its density is
$$f_Y(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}.$$
This special case of a gamma distribution is called $\chi^2$ with $n$ degrees of freedom. Usually, it is denoted by $Y \sim \chi^2_n$.
10.2.44. The F-distribution. In statistics, it is often wanted to compare two sample variances, so we need to consider variables which are given as a quotient
$$U = \frac{X/k}{Y/m},$$
where $X \sim \chi^2_k$ and $Y \sim \chi^2_m$.
Suppose $f_X(x)$ and $f_Y(x)$ are the densities of independent random variables $X$ and $Y$. Suppose $f_Y$ is non-zero only for positive values of $x$. Compute the distribution function of the random variable $U = cX/Y$, where $c > 0$ is an arbitrary constant. By Fubini's theorem, the order of integration with respect to the individual variables can be interchanged:
$$\begin{aligned}
F_U(u) = P(X < (u/c)Y) &= \int_0^\infty\left(\int_{-\infty}^{uy/c} f_X(x) f_Y(y)\, dx\right) dy \\
&= \int_0^\infty\left(\int_{-\infty}^{u} \frac{y}{c}\, f_X(ty/c) f_Y(y)\, dt\right) dy \\
&= \int_{-\infty}^{u}\left(\int_0^\infty \frac{y}{c}\, f_X(ty/c) f_Y(y)\, dy\right) dt.
\end{aligned}$$
This expression for $F_U(u)$ shows that the density $f_U$ of the random variable $U$ equals
$$f_U(u) = \frac{1}{c}\int_0^\infty y\, f_X(uy/c) f_Y(y)\, dy.$$
Substitute the densities of the corresponding special gamma distributions for $X \sim \chi^2_k$ and $Y \sim \chi^2_m$. Set $c = m/k$. The random variable $U = \frac{X/k}{Y/m}$ has density $f_U(u)$ equal to
$$\frac{(k/m)^{k/2}\, u^{k/2-1}}{2^{(k+m)/2}\,\Gamma(k/2)\,\Gamma(m/2)}\int_0^\infty y^{(k+m)/2-1} e^{-y(1+ku/m)/2}\, dy.$$
The integrand in the latter integral is, up to the right constant multiple, the density of the distribution of a random variable $Y \sim \Gamma((k + m)/2, (1 + ku/m)/2)$. Hence the multiple can be rescaled (notice $u$ is constant there) in order to get
For the specific values, we get $K = 0.796$, $|K| < t_{10}(0.025) = 2.2281$, so again, the null hypothesis is not rejected. The reached p-value of the test is 44.6%, which is even greater than in the above test. □
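Both test statistics of part b) can be recomputed from the raw data. The variable names below are assumptions for illustration; the p-value is computed only for the statistic $U$, since the standard library offers the normal but not the Student distribution.

```python
import math
from statistics import NormalDist, mean, variance

tempo = [15, 14, 13, 16, 17]
boj = [12, 16, 13, 15, 13, 11, 18]
n1, n2 = len(tempo), len(boj)
difference = mean(tempo) - mean(boj)

# Known variances sigma_1^2 = sigma_2^2 = 4: statistic U ~ N(0, 1).
u_statistic = difference / math.sqrt(4 / n1 + 4 / n2)
p_value_u = 2 * (1 - NormalDist().cdf(abs(u_statistic)))

# Unknown but equal variances: pooled variance S^2 and statistic K ~ t_{n1+n2-2}.
pooled_variance = ((n1 - 1) * variance(tempo) + (n2 - 1) * variance(boj)) / (n1 + n2 - 2)
k_statistic = difference / (math.sqrt(pooled_variance) * math.sqrt(1 / n1 + 1 / n2))

print(u_statistic, p_value_u, pooled_variance, k_statistic)
```

In both cases the statistic stays well below the critical value (1.96 and 2.2281, respectively), so the hypothesis of equal expected values is not rejected.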
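10.1.13. Analysis of variance with one-way classification. For $k \ge 2$ independent samples $Y_i$ of size $n_i$ from normal distributions with equal variance, use a linear model to test the hypothesis that all the expected values of individual samples are equal.
Solution. The technique is quite similar to that of the above exercise. The hypothesis to be tested is equivalent to stating that a submodel holds in which all the components of the random vector $Y$, created by joining the given $k$ vectors $Y_i$, have the same expected value.
Thus, the used model is of the form
$$\begin{pmatrix} Y_{11} \\ \vdots \\ Y_{1n_1} \\ Y_{21} \\ \vdots \\ Y_{k1} \\ \vdots \\ Y_{kn_k} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \sigma Z.$$
We can easily compute estimates for the expected values $\mu_i = \beta_i$ using arithmetic means:
$$\bar Y_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}.$$
Hence we get the estimate $\hat Y_{ij} = \bar Y_i$, so the residual sum of squares is of the form
$$RSS = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \bar Y_i)^2.$$
The estimate of the common expected value in the considered submodel is
$$\bar Y = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n}\sum_{i=1}^{k} n_i \bar Y_i,$$
where $n = n_1 + \dots + n_k$, and the residual sum of squares in this submodel is
$$RSS_0 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \bar Y)^2.$$
In the original model, there are $k$ independent parameters $\mu_i$, while in the submodel, there is a single parameter $\mu$, so the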
the integral to evaluate to one. The density $f_U(u)$ is then expressed as
$$f_U(u) = \frac{\Gamma((k+m)/2)}{\Gamma(k/2)\,\Gamma(m/2)}\left(\frac{k}{m}\right)^{k/2} u^{k/2-1}\left(1 + \frac{ku}{m}\right)^{-(k+m)/2}.$$
This distribution is called the Fisher-Snedecor distribution with k and m degrees of freedom, or F-distribution in short.
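The density can be coded directly from the formula. The sketch below uses an assumed function name `f_density`; as a sanity check, for $k = m = 2$ the formula reduces to $1/(1+u)^2$.

```python
import math


def f_density(u, k, m):
    """Density of the F-distribution with k and m degrees of freedom."""
    if u <= 0:
        return 0.0  # the quotient of two positive variables is positive
    constant = math.gamma((k + m) / 2) / (math.gamma(k / 2) * math.gamma(m / 2))
    return (constant * (k / m) ** (k / 2) * u ** (k / 2 - 1)
            * (1 + k * u / m) ** (-(k + m) / 2))


for u in (0.5, 1.0, 3.0):
    print(u, f_density(u, 2, 2), 1 / (1 + u) ** 2)
```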
10.2.45. The t-distribution. One encounters another useful distribution when examining the quotient of variables $Z \sim N(0,1)$ and $\sqrt{X/n}$. Here $X \sim \chi^2_n$. (We are interested in the quotient of $Z$ and the standard deviation of some sample.)
Compute first the distribution function of $Y = \sqrt{X}$ (note that $X$, and hence $Y$ as well, take only positive values with non-zero probability):
$$F_Y(y) = P(\sqrt{X} < y) = P(X < y^2) = \int_0^{y^2} \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}\, dx = \int_0^{y} \frac{1}{2^{n/2-1}\,\Gamma(n/2)}\, t^{n-1} e^{-t^2/2}\, dt.$$
Hence the density of the random variable $Y$ is
$$f_Y(y) = \frac{1}{2^{n/2-1}\,\Gamma(n/2)}\, y^{n-1} e^{-y^2/2}.$$
The same method can be used as in the previous subsection with the random variable $U = cZ/Y$, setting $c = \sqrt{n}$, $Y = \sqrt{X}$. This leads to the random variable
$$T = \frac{Z}{\sqrt{X/n}}.$$
A computation similar to the one above yields that the density $f_T$ satisfies
$$f_T(t) = \frac{\Gamma((n+1)/2)}{\Gamma(n/2)\sqrt{n\pi}}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}.$$
This is called the Student's t-distribution with n degrees of freedom.
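A direct implementation of this density is sketched below. The function name is an assumption; `math.lgamma` is used instead of `math.gamma` only to avoid overflow for larger $n$. For $n = 1$ the formula gives the Cauchy density $1/(\pi(1+t^2))$, and for growing $n$ the values approach the standard normal density.

```python
import math


def t_density(t, n):
    """Density of Student's t-distribution with n degrees of freedom."""
    log_constant = math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    constant = math.exp(log_constant) / math.sqrt(n * math.pi)
    return constant * (1 + t * t / n) ** (-(n + 1) / 2)


normal_at_zero = 1 / math.sqrt(2 * math.pi)
for n in (1, 5, 100):
    print(n, t_density(0.0, n), normal_at_zero)
```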
10.2.46. Multidimensional normal distribution. Consider a random vector $Z = (Z_1, \dots, Z_n)$ with independent components $Z_i \sim N(0,1)$. Then its covariance matrix is equal to the unit matrix, i.e.,
$$\operatorname{var} Z = I_n.$$
Random vectors are often encountered which are an affine transformation $U = a + BZ$ of such a vector $Z$, where $a$ is an arbitrary constant vector in $\mathbb{R}^m$ and $B$ is an $m$-by-$n$ constant matrix.
As derived in theorems 10.2.29 and 10.2.35, these random vectors have expected value $\mathrm{E}U = a$ and covariance matrix $\operatorname{var} U = V = BB^T$ (since the covariance matrix of $Z$ is the identity matrix). Therefore, this covariance matrix is always positive-semidefinite.
The random vector U is said to have multivariate normal distribution Nm(a, V).
tested statistic is of the form
$$F = \frac{n-k}{k-1} \cdot \frac{RSS_0 - RSS}{RSS},$$
where $RSS_0$ denotes the residual sum of squares of the submodel.
□
J. Linear regression
We already met the linear regression in chapter three, subsection ??. Now, we will try to apply the same principle to problems which are often studied by statisticians.
One standard application of linear regression is fitting a line to given data. Thus, we have a sequence of measurements for which we record the values of two variables between which we anticipate linear dependency. A classical example is the dependency of a son's height on his father's height.
10.J.1. Find the linear regression model for the dependence of $Y$ on $X$, based on the following lists of measured data: $X = [1, 4, 5, 7, 10]$, $Y = [3, 7, 8, 12, 18]$.
Solution. In order to find the parameters of the regression line, use the formulas derived in 10.3.12. Using the method of least squares, we try to minimize the distance of the vector $b_1X + b_0$ from the vector $Y$ with respect to the parameters $b_1$ and $b_0$. This distance, as we know from chapter two, is minimal for the orthogonal projection of the vector $Y$ onto the vector subspace generated by the vectors $(1, \dots, 1)$ and $(x_1, \dots, x_n)$. For parameters $b_0$, $b_1$ of the regression line $Y = b_1X + b_0$, we obtain
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{(1 - 5.4)(3 - 9.6) + \dots + (10 - 5.4)(18 - 9.6)}{(1 - 5.4)^2 + (4 - 5.4)^2 + (5 - 5.4)^2 + (7 - 5.4)^2 + (10 - 5.4)^2} = 1.677.$$
Now, we can easily calculate the coefficient $b_0$:
$$b_0 = \bar y - b_1\bar x = 0.5442.$$
Therefore, the wanted linear dependency is
$$Y = 1.677 \cdot X + 0.5442.$$
Note that in this model, the roles of the variables $X$ and $Y$ can be interchanged. Using the same method, we could have obtained the dependency of $X$ on $Y$:
$$X = 0.5867 \cdot Y - 0.2322.$$
□

For any multivariate normal distribution $N_m(a, V)$, consider again the affine transformation
$$W = c + DU$$
with a vector of constants $c \in \mathbb{R}^k$ and an arbitrary $k$-by-$m$ constant matrix $D$. Direct calculation leads to
$$W = c + D(a + BZ) = (c + Da) + (DB)Z,$$
which is a random vector $W \sim N_k(c + Da, DBB^TD^T)$. Thus, the covariance matrix of the multivariate normal distribution behaves as a quadratic form with respect to affine transformations.
This straightforward idea shows that any linear combination of components of a random vector with the multivariate normal distribution is a random variable with the normal distribution. Similarly, any vector obtained by choosing only some of the components of the vector $U$ is again a random vector with the multivariate normal distribution.
Note that when the random vector $Z \sim N_n(0, I_n)$ is transformed with an orthogonal matrix $Q^T$, then the joint distribution function of the random vector $U = Q^TZ$ can be computed directly. If the transformation is written in coordinates as $t = Q^Tz$, then its inverse is $z = Qt$, and the Jacobian of this transformation is equal to one. Hence (note that $\sum_i z_i^2 = \sum_i t_i^2$; as in chapter 3, write $z < u$ if all components satisfy $z_i < u_i$)
$$\begin{aligned}
F_U(u) = P(U_i < u_i,\ i = 1, \dots, n) &= \int_{Q^Tz < u} (2\pi)^{-n/2} e^{-\sum_i z_i^2/2}\, dz_1 \dots dz_n \\
&= \int_{-\infty}^{u_1}\!\!\cdots\int_{-\infty}^{u_n} (2\pi)^{-n/2} e^{-\sum_i t_i^2/2}\, dt_1 \cdots dt_n \\
&= \left(\int_{-\infty}^{u_1} (2\pi)^{-1/2} e^{-t_1^2/2}\, dt_1\right) \cdots \left(\int_{-\infty}^{u_n} (2\pi)^{-1/2} e^{-t_n^2/2}\, dt_n\right) \\
&= F_{U_1}(u_1) \cdots F_{U_n}(u_n).
\end{aligned}$$
Hence it directly follows that all components of the random vector $U$ are again independent, and $U \sim N_n(0, I_n)$.
3. Mathematical statistics
The data processing in mathematical statistics is based on quite sophisticated mathematics, but the actual methods are also very much dependent on the inputs from the diverse application fields.
Here we restrict attention to a few modest notes about statistical methods. We encourage curious readers to look for more detailed literature (which would reflect the field of application as well).
Remark. Think over why the linear regression model of the dependency of $X$ on $Y$ cannot be obtained by merely expressing $X$ from the linear regression model of the dependency of $Y$ on $X$.
Remark. In many real situations, the dependency of the variables is clearly given, if one of the variables is time, for example.
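The two regression lines of exercise 10.J.1 can be recomputed with a few lines of code, which also illustrates the first remark: the slope of the regression of $X$ on $Y$ is not the reciprocal of the slope of the regression of $Y$ on $X$. The function name below is an assumption made for this sketch.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


xs = [1, 4, 5, 7, 10]
ys = [3, 7, 8, 12, 18]

slope_y_on_x, intercept_y_on_x = least_squares_line(xs, ys)
slope_x_on_y, intercept_x_on_y = least_squares_line(ys, xs)

print(slope_y_on_x, intercept_y_on_x)
print(slope_x_on_y, intercept_x_on_y)
print(1 / slope_y_on_x)  # differs from slope_x_on_y
```

The two fits minimize different distances (vertical versus horizontal deviations), so they coincide only when the data lie exactly on a line.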
10.J.2. An orbital station has measured, at the same instant of five consecutive days, the following velocities of an unknown cosmic object (in km/s): 10, 11.4, 13.1, 15.8, and 18.7. Estimate the object's velocity on the tenth day.
Solution. Here, it is good to notice that the velocity does not change linearly with time (the acceleration is increasing). Thus, we can hypothesize that the object is being attracted to another one with the gravitational force. Then, its velocity would be a quadratic function of time. So let us use the method of least squares to fit a quadratic function (as precisely as possible) to the measured data. The procedure is the same as if we made the linear regression of the vector $v = (v_1, \dots, v_n)$ dependent on $x = (x_1, \dots, x_n)$ and $x^2 = (x_1^2, \dots, x_n^2)$. This method is called quadratic regression. Thus, we are looking for a vector of parameters $b = (b_0, b_1, b_2)$ so that the variable $b_2x^2 + b_1x + b_0$ would estimate $v$. Let us build the matrix $X$ of the values of independent variables:
$$X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{pmatrix},$$
and the vector of parameters $b = (b_0, b_1, b_2)$ can be computed by (1):
$$b = (X^TX)^{-1}X^Tv = (9.26; 0.47; 0.29).$$
Then, the wanted quadratic estimate is
$$v = 0.29x^2 + 0.47x + 9.26,$$
so the estimated velocity on the tenth day is approximately 42.96 km/s (computed with the rounded coefficients). In the model of classic linear regression, we would get
$$v = 2.18x + 7.26,$$
which yields 29.06 km/s for the tenth day. The difference between these estimates is quite large. This illustrates that analysis of the situation is a very important part of statistics.
□
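The normal equations $(X^TX)\,b = X^Tv$ of the quadratic regression can be solved without any external library. The following is a minimal sketch: the helper names are assumptions, and a plain Gaussian elimination with partial pivoting stands in for the matrix inverse used in the solution.

```python
def solve_linear_system(a, rhs):
    """Solve a * sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - s) / m[r][r]
    return sol


def polynomial_least_squares(x, v, degree):
    """Coefficients (b_0, ..., b_degree) minimizing the sum of squared residuals."""
    design = [[xi ** j for j in range(degree + 1)] for xi in x]  # the matrix X
    cols = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(cols)]
           for i in range(cols)]
    xtv = [sum(row[i] * vi for row, vi in zip(design, v)) for i in range(cols)]
    return solve_linear_system(xtx, xtv)


days = [1, 2, 3, 4, 5]
velocities = [10, 11.4, 13.1, 15.8, 18.7]

b0, b1, b2 = polynomial_least_squares(days, velocities, 2)
quadratic_day10 = b0 + b1 * 10 + b2 * 100
l0, l1 = polynomial_least_squares(days, velocities, 1)
linear_day10 = l0 + l1 * 10
print((b0, b1, b2), quadratic_day10, linear_day10)
```

With unrounded coefficients the quadratic prediction for the tenth day comes out near 42.5 km/s; the value 42.96 km/s in the solution results from rounding the coefficients to two decimal places before extrapolating, which shows how sensitive the extrapolation is.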
10.3.1. Introductory ideas. In the descriptive statistics at the beginning of this chapter, we tried to equip the data sets with some characteristics which would carry essential information such as the sample mean, variance, etc.
Mathematical statistics works with some sample of a given data set, trying to describe to what extent the obtained statistics are relevant, or to find or improve an appropriate theoretical model for the behaviour of the entire data set, based on the collected data. The model is then used to either accept/reject a hypothesis about the data set, or to estimate the probability of an event that might happen in the future.
Consider an easy example: construct a wooden coin with heads and tails. Toss it n times, noting that it comes up heads k < n times. From this experiment, attempt to deduce the probability that the coin comes up heads in both of two more tosses.
There are two fundamental approaches to this problem. The first one is classical statistics (or frequentist statistics). Build on the assumption that the individual tosses are independent and have the same probability of coming up heads, which is given by an objectively existing parameter $\theta = p$ (unknown to us so far). Thus, the individual tosses are considered to be the realization of a random variable $X$ with Bernoulli distribution. The probability of getting $k$ heads out of $n$ tosses is given by the binomial distribution, and it is expected that the "best possible" estimate for the parameter $p$ is the ratio $\hat\theta = k/n$. Usually the confidence of such an estimate is also wanted. This can be obtained from knowledge of the total number $n$ of tosses and the asymptotic behaviour of the model as $n$ increases. For instance, if the coin comes up heads 8 times out of 10, then with a certain (mathematically estimated) confidence, it can be stated that the probability of the coin coming up heads in both of the subsequent tosses is $0.8^2 = 0.64$, a number much greater than one half.
The other possibility is quite different. Consider the parameter $\theta$ to be a random variable from some chosen family of distributions, the collected data to be constants, and then try to deduce how to adjust the probability distribution of this random variable $\theta$. For example, suppose a (perhaps fair) coin is created, i.e. the expected value is (close to) 0.5, but the precision of the production ensures this only up to some small $\varepsilon > 0$. The experiment of tossing the coin $n$ times allows the adjustment of the distribution within the preferred class. Thus, we build on some assumptions about the distribution and adjust the prior distribution in view of the experiment. This approach is called Bayesian statistics.
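The difference between the two approaches can be made concrete on the coin example. The sketch below rests on an assumption that is not made in the paragraph above: the prior distribution of $\theta$ is taken to be uniform on $(0,1)$. Under this assumption the posterior after $k$ heads in $n$ tosses is the beta distribution with parameters $k+1$ and $n-k+1$, and the probability of two further heads is its second moment $\mathrm{E}\,\theta^2 = \frac{(k+1)(k+2)}{(n+2)(n+3)}$. The function names are illustrative.

```python
from fractions import Fraction


def plug_in_two_heads(k, n):
    """Frequentist plug-in answer: square of the estimate k / n."""
    return Fraction(k, n) ** 2


def posterior_two_heads_uniform_prior(k, n):
    """Bayesian answer for a uniform prior on theta (an assumption of this sketch).

    The posterior is Beta(k + 1, n - k + 1); the result is its second moment.
    """
    return Fraction((k + 1) * (k + 2), (n + 2) * (n + 3))


frequentist = plug_in_two_heads(8, 10)
bayesian = posterior_two_heads_uniform_prior(8, 10)
print(frequentist, float(frequentist))
print(bayesian, float(bayesian))
```

With only ten tosses the prior still matters, so the Bayesian answer is pulled away from 0.64; for large $n$ with a fixed ratio $k/n$ the two answers approach each other.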
The first approach is based on the purely mathematical abstraction that probabilities are given by the frequencies of event occurrences in data samples which are so large that they can be approximated with infinite models. The central limit theorem can be used to estimate their confidence. From the statistical point of view, the probability is an idealization of the relative frequencies of the cases when an examined result
K. Bayesian data analysis
10.K.1. Consider the Bernoulli process defined by a random variable $X \sim \mathrm{Bi}(n, \theta)$ with binomial distribution, and assume that the parameter $\theta$ is a random variable with uniform distribution on the interval $(0,1)$. We define the success chance in our process as the variable $\gamma = \frac{\theta}{1-\theta}$. What is the density of this variable $\gamma$?
Solution. Intuitively, we can feel that the distribution is not uniform.
Denoting the wanted probability density by $f(s)$, we can use the relation between $\theta$ and $\gamma$ to compute $\theta = \frac{\gamma}{1+\gamma}$. In addition, we can immediately see that the probability density of $\gamma$ is non-zero only for positive values of the variable. Now, we can formulate the statement as the requirement that for every $t \in (0,1)$,
$$(1)\qquad t = P(\theta < t) = P\left(\gamma < \frac{t}{1-t}\right) = \int_0^{\tau} f(s)\, ds,$$
where $\tau = \frac{t}{1-t}$, i.e., $t = \frac{\tau}{1+\tau}$. The upper bound of the integral on the right-hand side is the changing limit $\tau$, so differentiation with respect to it gives the defining formula for $f$:
$$f(s) = \left(\frac{s}{1+s}\right)' = \frac{1}{(1+s)^2}.$$
Indeed, the wanted density gives much higher probability to low values of the chance than to high ones. □
We could see in subsection 10.3.7 that when taking the Bayesian approach with the binomial model of probability distribution of a random variable $X \sim \mathrm{Bi}(n, \theta)$, we are interested in its probability mass function $f_X(k) = \binom{n}{k}\theta^k(1 - \theta)^{n-k}$. This function can be viewed (up to a normalizing constant) as the conditional probability $P(\theta \mid X = k)$ for the uniform a priori probability distribution of the variable $\theta$ on the interval $(0,1)$. Thus, it is just the a posteriori probability distribution of $\theta$ corresponding to the result $X = k$ of the experiment. The following exercise concerns the general class of these probability distributions.
10.K.2. Find the basic characteristics of the so-called beta distribution $\beta(a, b)$ with probability density of the form
$$f_Y(y) = \begin{cases} C\, y^{a-1}(1-y)^{b-1} & y \in (0,1), \\ 0 & \text{otherwise.} \end{cases}$$
Solution. The constant $C$ must be chosen as the multiplicative inverse of the integral $\int_0^1 y^{a-1}(1 - y)^{b-1}\, dy$, which is a function $B(a, b)$, known as the beta function in mathematical analysis and other sciences (e.g. physics). The function gamma, which generalizes the discrete values of factorial,
occurs in many repeated experiments. This seeming advantage/rigor can become a disadvantage as soon as we are interested in the confidence of the data themselves and the suitability of the chosen experiment. The same problem occurs if we want to use frequentist statistics to estimate the probability of one or more outcomes of an experiment that is executed only once.
On the other hand, Bayesian statistics is an example of applying mathematics to "common sense" when we want to adjust our belief in light of new information.
It is interesting that, from the historical point of view, the first approach was the Bayesian one (used, for instance, by Laplace and others as early as the 18th century), which gave way to frequentist statistics in the 20th century. In recent decades, Bayesian statistics has been returning, together with further new approaches.
10.3.2. Random sample of a population. We now describe the first approach of the above subsection. Thus, assume that there is a (huge) basic statistical set of $N$ units, which is called the population, and each of the units has a numerical characteristic, i.e., there is a set of values $(x_1, \dots, x_N)$. From this set, only a sample with values $(X_1, \dots, X_n)$ is available.
In order to avoid the discussion of the actual size of the basic statistical set of $N$ units, assume that the items of the sample are selected one by one and every item is always put back into the population. In addition, assume that every item has the same probability $1/N$ of being chosen. This is a random sample.
The way of realizing the random sample can be viewed as working with a vector $(X_1, \dots, X_n)$ of independent, identically distributed random variables. In particular, they have the same distribution function $F_X(x)$ and moments
$$E X_i = \mu,\qquad \operatorname{var} X_i = \sigma^2.$$
The next step must be a derivation of the characteristics of the sample mean $\bar X$ and the sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)^2.$$
The following theorem explains why the coefficient $\frac{1}{n-1}$ is selected instead of $\frac{1}{n}$, which is the case with $s^2$ in subsection 10.1.6.
Theorem. The sample mean $\bar X$ computed from a random sample of size $n$ whose distribution has finite expected value $\mu$ and finite variance $\sigma^2$ satisfies
$$E\bar X = \mu,\qquad \operatorname{var}\bar X = \frac{1}{n}\sigma^2.$$
The sample variance $S^2$ satisfies
$$E S^2 = \sigma^2.$$
Proof. As derived in subsection 10.2.29,
$$E\bar X = \frac{1}{n}\sum_{i=1}^{n} E X_i = \frac{1}{n}\,n\mu = \mu.$$
CHAPTER 10. STATISTICS AND PROBABILITY THEORY
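emerges in the following calculation:
$$\Gamma(x)\Gamma(y) = \int_0^\infty e^{-t}t^{x-1}\,dt\cdot\int_0^\infty e^{-s}s^{y-1}\,ds = \int_0^\infty\!\!\int_0^\infty e^{-t-s}t^{x-1}s^{y-1}\,dt\,ds =$$
(substitution $t = rq$, $s = r(1-q)$)
$$= \int_{r=0}^{\infty}\int_{q=0}^{1} e^{-r}(rq)^{x-1}\bigl(r(1-q)\bigr)^{y-1}\,r\,dq\,dr =$$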
$$\frac{1}{B(a,b)}\int_0^{\theta} t^{a-1}(1-t)^{b-1}\,dt.$$
When differentiating, we must use the rule for differentiation of integral with variable upper bound. Thus, we get for the
Since the variables $X_i$ are independent, additivity of variance can be used (derived in subsection 10.2.33). The variance behaves as a quadratic form with respect to multiplication by a scalar. Hence
$$\operatorname{var}\bar X = \frac{1}{n^2}\operatorname{var}\sum_{i=1}^{n} X_i = \frac{1}{n^2}\,n\sigma^2 = \frac{1}{n}\sigma^2.$$
$$= \int_{r=0}^{\infty} e^{-r}r^{x+y-1}\,dr\cdot\int_{q=0}^{1} q^{x-1}(1-q)^{y-1}\,dq = \Gamma(x+y)\,B(x,y).$$
Thus, we get the general formula
$$B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)},$$
and it follows from properties of the gamma-function for positive integers $a$, $b$ that
$$B(n-k+1, k+1) = \frac{k!\,(n-k)!}{(n+1)!} = \frac{1}{n+1}\binom{n}{k}^{-1}.$$
We can directly compute that the expected value of the variable $X \sim \beta(a, b)$ with beta-distribution is (applying $\Gamma(z+1) = z\Gamma(z)$)
$$E X = \frac{B(a+1,b)}{B(a,b)} = \frac{a}{a+b}.$$
If $a = b$, then the expected value and median are $\frac12$. We can also directly calculate the variance
$$\operatorname{var} X = E(X - E X)^2 = \frac{ab}{(a+b)^2(a+b+1)}.$$
Thus, for $a = b$, we get $\operatorname{var} X = \frac{1}{4(2a+1)}$, which shows that the variance decreases as $a = b$ increases. For $a = b = 1$, we get the ordinary uniform distribution on the interval $(0,1)$. □
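The formulas of this solution can be checked numerically. The following is only a minimal sketch using the Python standard library; the function names `beta_function`, `beta_mean` and `beta_variance` are chosen here for illustration and do not come from the text.

```python
from math import comb, gamma


def beta_function(a, b):
    """B(a, b) expressed through the gamma function."""
    return gamma(a) * gamma(b) / gamma(a + b)


def beta_mean(a, b):
    """Expected value a / (a + b) of the beta-distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance ab / ((a + b)^2 (a + b + 1)) of the beta-distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Check B(n - k + 1, k + 1) = 1 / ((n + 1) * C(n, k)) for one illustrative pair.
n, k = 7, 3
lhs = beta_function(n - k + 1, k + 1)
rhs = 1 / ((n + 1) * comb(n, k))
print(lhs, rhs)

# For a = b the variance equals 1 / (4 (2a + 1)) and decreases with a.
for a in (1, 2, 5, 15):
    print(a, beta_mean(a, a), beta_variance(a, a), 1 / (4 * (2 * a + 1)))
```

For $a = b = 1$ the printed variance is $1/12$, the variance of the uniform distribution on $(0,1)$.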
10.K.3. In the situation of problem 10.K.1, assume that the parameter $\theta$ of the Bernoulli process is a random variable with probability distribution $\beta(a, b)$. What is the probability distribution of the variable $\gamma = \frac{\theta}{1-\theta}$? What is special about it when $a = b = p$?
Solution. We have already discussed the special case with the uniform distribution $\beta(1, 1)$. Thus, we can continue with the equality (1), where we used the form of this distribution. Now the left-hand side contains, instead of $\theta$, the expression
The formula
$$\sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}(X_i - \bar X)^2 + n(\bar X - \mu)^2$$
can be verified simply by expanding the multiplications. Thus:
$$E s^2 = \frac{1}{n}\sum_{i=1}^{n}E(X_i - \mu)^2 - E(\bar X - \mu)^2 = \frac{1}{n}\sum_{i=1}^{n}\operatorname{var}X_i - \operatorname{var}\bar X = \Bigl(1 - \frac{1}{n}\Bigr)\sigma^2 = \frac{n-1}{n}\sigma^2.$$
That is why the variance $s^2$ is multiplied by the coefficient $\frac{n}{n-1}$, which leads just to the sample variance $S^2$ and its expected value $\sigma^2$. Of course, this multiplication makes sense only if $n \neq 1$. □
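The statement $E S^2 = \sigma^2$ can be checked exactly on a toy population by enumerating all equally likely samples drawn with replacement. The sketch below assumes a hypothetical population of four values and sample size three; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 4, 7]  # hypothetical values x_1, ..., x_N
n = 3                      # sample size

N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N


def s2(sample):
    """Variance of the sample with the coefficient 1/n."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)


# All N**n ordered samples with replacement are equally likely.
samples = list(product(population, repeat=n))
mean_s2 = sum(s2(s) for s in samples) / len(samples)  # E s^2
mean_S2 = mean_s2 * Fraction(n, n - 1)                # E S^2

print(sigma2, mean_s2, mean_S2)
```

The average of $s^2$ over all samples comes out as $\frac{n-1}{n}\sigma^2$, while the average of $S^2$ equals $\sigma^2$, in agreement with the theorem.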
10.3.3. Random sample of the normal distribution. In practice, it is necessary to know not only the numerical characteristics of the sample mean and the variance, but also their full probability distributions. Of course, these can be derived only if the particular probability distribution of $X_i$ is known. As a useful illustration, calculate the result for a random sample of the normal distribution.
It is already verified, as an example on properties of moment generating functions in 10.2.37, that the sum of random variables with the normal distribution results again in the normal distribution. Hence the sample mean must also have the normal distribution, and since both its expected value and variance are known, $\bar X \sim N(\mu, \frac{1}{n}\sigma^2)$.
The probability distribution of the sample variance is more complicated. Here, apply the ideas about multivariate normal distributions from subsection 10.2.37. Consider a vector $Z$ of standardized normal variables
$$Z_i = \frac{X_i - \mu}{\sigma}.$$
The same property holds for the vector $U = Q^T Z$ with any orthogonal matrix $Q$. In addition, $\sum_i U_i^2 = \sum_i Z_i^2$. Choose the matrix $Q$ so that the first component $U_1$ equals the sample mean $\bar Z$, up to a multiple. This means that the first column of $Q$ is chosen as $(\sqrt{n})^{-1}(1, \dots, 1)^T$. Then $U_1^2 = n\bar Z^2$, so we
wanted density that
$$B(a,b)\,f(s) = \Bigl(\frac{s}{s+1}\Bigr)^{a-1}\Bigl(1 - \frac{s}{s+1}\Bigr)^{b-1}\frac{1}{(s+1)^2} = s^{a-1}\Bigl(\frac{1}{s+1}\Bigr)^{a+b}.$$
The picture shows the densities for $a = b = p = 2, 5, 15$.
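can compute:
$$\sum_{i=1}^{n}U_i^2 = \sum_{i=1}^{n}Z_i^2 = \sum_{i=1}^{n}(Z_i - \bar Z)^2 + n\bar Z^2,$$
$$\sum_{i=2}^{n}U_i^2 = \sum_{i=1}^{n}(Z_i - \bar Z)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \bar X)^2 = \frac{n-1}{\sigma^2}S^2.$$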
Therefore, the multiple $\frac{n-1}{\sigma^2}S^2$ of the sample variance is the sum of $n - 1$ squares of standardized normal variables, so the following theorem is proved:
Theorem. Let $(X_1, \dots, X_n)$ be a random sample from the $N(\mu, \sigma^2)$ distribution. Then, $\bar X$ and $S^2$ are independent variables, and
$$\bar X \sim N\bigl(\mu, \tfrac{1}{n}\sigma^2\bigr),\qquad \frac{n-1}{\sigma^2}S^2 \sim \chi^2_{n-1}.$$
Hence, it immediately follows that the standardized sample mean
This reinforces the intuition that equal and not too small values of $a = b = p$ correspond to the most probable value $\theta = \frac12$, so the density of the chance is greatest around one. The higher $p$, the lower the variance of this variable. □
10.K.4. Show that for the Bernoulli experiment described by a random variable $X \sim Bi(n, \theta)$, with the a priori distribution of the random variable $\theta$ being a beta-distribution, the a posteriori distribution is also a beta-distribution, with suitable parameters which depend on the experiment results. What is the a posteriori expected value of $\theta$ (i.e., the Bayesian point estimate of this random variable)?
Solution. As justified in subsection 10.3.7 of the theoretic part, the a posteriori probability density is, up to an appropriate constant, given as the product of the a priori probability density
$$g(\theta) = \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}$$
and the probability of the examined variable $X$ provided the value of $\theta$ occurred. Thus, assuming $k$ successes in the Bernoulli experiment, we get the a posteriori density (the sign $\propto$ used instead of equality denotes "proportional")
$$g(\theta \mid X = k) \propto P(X = k \mid \theta)\,g(\theta) \propto \theta^{k}(1-\theta)^{n-k}\,\theta^{a-1}(1-\theta)^{b-1} = \theta^{a+k-1}(1-\theta)^{b+n-k-1}.$$
$$T = \sqrt{n}\,\frac{\bar X - \mu}{S}$$
has Student's t-distribution with $n - 1$ degrees of freedom.
10.3.4. Point and interval estimates. Now, we have everything needed to estimate the parameter values in the context of frequentist statistics. Here is a simple example. Suppose there are 500 students enrolled in a course, each of whom has a certain degree of satisfaction with the course, expressed as an integer in the range 1 through 10. It may be assumed that the satisfactions $X_i$ of the students are approximated by a random variable with distribution $N(\mu, \sigma^2)$. Further, suppose a detailed earlier survey showed that $\mu = 6$, $\sigma = 2$.
In the current semester, 15 students are asked about their opinion about the course, as rumour has it that the new lecturer might be even worse. The results show that 2 students vote 3, 3 vote 4, 3 vote 5, 5 vote 6, and 2 vote 7. Altogether, the sample mean is $\bar X = 5.133$ and the sample variance is $S^2 = 1.695$.
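The two sample characteristics can be recomputed directly from the votes. This is a small sketch with the Python standard library; the dictionary `votes` simply encodes the counts listed above.

```python
from statistics import mean, variance

# score -> number of students who gave that score
votes = {3: 2, 4: 3, 5: 3, 6: 5, 7: 2}
sample = [score for score, count in votes.items() for _ in range(count)]

n = len(sample)
x_bar = mean(sample)    # sample mean
S2 = variance(sample)   # sample variance with the coefficient 1/(n - 1)
S = S2 ** 0.5           # sample standard deviation

print(n, round(x_bar, 3), round(S2, 3), round(S, 3))
```

Note that `statistics.variance` uses the coefficient $\frac{1}{n-1}$, so it corresponds to $S^2$ and not to $s^2$.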
By assumptions, $\bar X \sim N(\mu, \sigma^2/n)$, so $Z = \sqrt{n}\,\frac{\bar X - \mu}{\sigma} \sim N(0,1)$. In order to express the confidence of the estimate, compute the interval which contains the estimated parameter with an a priori fixed probability $100(1-\alpha)\,\%$. We talk about a confidence level $\alpha$, $0 < \alpha < 1$. Consider $\mu$ to be the unknown parameter, while the variance can be assumed (be it correct or not) to remain unchanged. It follows that
$$1 - \alpha = P\bigl(|Z| < z(\alpha/2)\bigr) = P\Bigl(\Bigl|\sqrt{n}\,\frac{\bar X - \mu}{\sigma}\Bigr| < z(\alpha/2)\Bigr) = P\Bigl(\bar X - \frac{\sigma}{\sqrt{n}}z(\alpha/2) < \mu < \bar X + \frac{\sigma}{\sqrt{n}}z(\alpha/2)\Bigr),$$
and an interval is found whose endpoints are random variables and which contains the estimated parameter $\mu$ with an a priori fixed probability. The middle point of this interval is called the point estimate for the parameter $\mu$; the whole interval is called the interval estimate. We can also say that at the
Thus, we have indeed obtained the density (up to a constant, which we need not evaluate) of the a posteriori distribution for $\theta$, which is the distribution $\beta(a + k, b + n - k)$. Its a posteriori expected value is
$$\hat\theta = \frac{a+k}{a+b+n}.$$
For $n$ and $k$ approaching infinity so that $k/n \to p$, our a posteriori estimate also satisfies $\hat\theta \to p$. Thus, we can see that for large values of $n$ and $k$, the observed fraction of successful experiments outweighs the a priori assumption. On the other hand, for small values, the a priori assumption is very important. □
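The update rule of this solution is short enough to be written down as code. The following sketch uses hypothetical function names (`beta_posterior`, `posterior_mean`) and an illustrative success ratio $k/n = 0.3$ to show how the influence of the a priori parameters fades with growing $n$.

```python
def beta_posterior(a, b, n, k):
    """Parameters of the a posteriori beta-distribution after k successes in n trials."""
    return a + k, b + n - k


def posterior_mean(a, b, n, k):
    """A posteriori expected value (a + k) / (a + b + n)."""
    a_post, b_post = beta_posterior(a, b, n, k)
    return a_post / (a_post + b_post)


# Two different a priori assumptions, the same observed ratio k/n = 0.3.
for n in (10, 100, 10000):
    k = 3 * n // 10
    print(n, posterior_mean(1, 1, n, k), posterior_mean(20, 5, n, k))
```

For small $n$ the two a priori choices give visibly different estimates; for large $n$ both approach the observed fraction 0.3.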
10.K.5. We have data about accident rates for $N = 20$ drivers in the last $n = 10$ years (the $k$-th item corresponds to the number of years when the $k$-th driver had an accident):
0, 0, 2, 0, 0, 2, 2, 0, 6, 4, 3, 1, 1, 1, 0, 0, 5, 1, 1, 0.
We assume that the probabilities $p_j$, $j = 1, \dots, N$, that the $j$-th driver has an accident in a given year are constants.
For each driver, estimate the probability that s/he has an accident in the following year (in order to determine the individual insurance fee, for instance).²
Solution. We introduce random variables $X_{ji}$ with value 0 if the $j$-th driver has no accident in the $i$-th year, and 1 otherwise. The individual years are considered to be independent. Thus, we can assume that the random variables $S_j = \sum_{i=1}^{n} X_{ji}$ that correspond to the number of accidents in all the $n = 10$ years have distribution $Bi(n, p_j)$.
Of course, we could estimate the probabilities for all drivers altogether, i.e., using the arithmetic mean
$$\hat p = \frac{1}{Nn}\sum_{j=1}^{N} S_j.$$
However, the distributions of the variables $S_j$ can hardly be considered homogeneous across the drivers, so such an estimate would be misleading.
On the other hand, the opposite extreme, i.e., a totally independent and individual estimate
$$\hat p_j = \frac{1}{n}S_j$$
is also inappropriate, since we surely do not want to set a zero insurance fee until the first accident happens.
confidence level $\alpha$, the estimated parameter $\mu$ is or is not different from another value $\mu_0$. For instance, substituting the data at the levels $\alpha = 0.05$ and $\alpha = 0.1$, respectively, we obtain the intervals
$$\mu \in (4.121, 6.145),\qquad \mu \in (4.284, 5.983).$$
Considering the confidence level of 5 %, we cannot affirm that the opinions of the students are worse compared to the previous year, because the mentioned interval also contains the value $\mu_0 = 6$. We can conclude this if we take the confidence level of 10 %, since the value $\mu_0 = 6$ no longer lies in the corresponding interval.
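The two intervals can be reproduced with a few lines of code. This is a sketch only: the helper name `z_interval` is an assumption made here, and the critical value $z(\alpha/2)$ is obtained from `statistics.NormalDist` of the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, alpha):
    """Interval estimate for the mean when the standard deviation sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z(alpha/2)
    half_width = sigma / sqrt(n) * z
    return x_bar - half_width, x_bar + half_width


x_bar = 77 / 15  # sample mean of the 15 votes, approximately 5.133
for alpha in (0.05, 0.1):
    low, high = z_interval(x_bar, sigma=2, n=15, alpha=alpha)
    print(alpha, round(low, 3), round(high, 3))
```

The value $\mu_0 = 6$ lies inside the first printed interval and outside the second one.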
On the other hand, if it is assumed that the other (worse) lecturer causes the variance of the answers to change as well (for instance, the students might agree more on the bad assessment), we proceed differently. Instead of the standardized variable $Z$, deal in a similar way with the variable
$$T = \sqrt{n}\,\frac{\bar X - \mu}{S}.$$
²This problem is taken from the contribution M. Friesl, Bayesovské odhady v některých modelech, published in: Analýza dat 2004/n (K. Kupka, ed.), Trilobyte Statistical Software, Pardubice, 2005, pp. 21-33.
As seen, this random variable has probability distribution $T \sim t_{n-1}$, where $n = 15$ in this case. This leads to the interval estimate
$$\bar X - \frac{S}{\sqrt{n}}t_{n-1}(\alpha/2) < \mu < \bar X + \frac{S}{\sqrt{n}}t_{n-1}(\alpha/2).$$
Substitute the data at levels $\alpha = 0.05$ and $\alpha = 0.03$ respectively, to obtain
$$\mu \in (4.412, 5.854),\qquad \mu \in (4.321, 5.945).$$
Therefore, at the confidence level of 3 %, the opinion seems to have become worse. This corresponds to our intuition that the sample deviation $S = 1.302$, which is significantly smaller than $\sigma = 2$ from the previous case, should be essential for our thinking.
10.3.5. Likelihood of estimates. From the mathematical point of view, interval and point estimates are simple and easy to understand. It is much worse with their practical interpretation because it is problematic to verify all assumptions about randomness of the sample. With more complicated cases, we consider problems with the "likelihood" of our estimates.
As mathematicians, we can avoid the practical problem by defining the missing concept. In general, one works with a random sample of size $n$. Implicitly it is assumed that there are independent random variables $X_i$ with the same probability distribution which depends on an unknown parameter $\theta$ (a vector in general).
We are trying to find a sample statistic $T$, i.e., a function of the random variables $X_1, X_2, \dots$ which, in a mathematical sense, estimates the actual value of the parameter $\theta$. $T$ is said to be an unbiased estimator of $\theta$ if and only if $E\,T = \theta$. The expected value $E(T - \theta)$ is called the bias of the estimator $T$.
The asymptotic behaviour of the estimator, that is, what it does as n goes to infinity is often of interest. T = T(n) is
The realistic method is to use the same assumption for the a priori distribution of the probabilities $p_j$ of accident rates of individual drivers. In practice, one often uses a model with the Poisson distribution $Po(\lambda_j)$ for the $j$-th driver, with further assumptions about the distribution of the parameter $\lambda$ among the drivers. We can also assume quite well (and simply) that the distribution is $p_j \sim \beta(a, b)$ with suitable parameters $a$, $b$ which should reflect the cumulated results of all drivers. Thus, let us go this way.
We know from the above exercise that the a posteriori probability distribution will be $(p_j \mid S_j = k) \sim \beta(a + k, b + n - k)$, so the corresponding expected value will be
$$p_j^{b} = \frac{a+k}{a+b+n}.$$
Let us compare this estimate to the common estimate $\hat p$ mentioned above and the individual estimate $\hat p_j$. We introduce the values $p_0 = \frac{a}{a+b}$, i.e., the expected value of the a priori common distribution for all drivers, and $n_0 = a + b$. We get
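said to be a consistent estimator of the parameter $\theta$ if and only if $T(n)$ converges in probability to $\theta$, i.e., for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P\bigl(|T(n) - \theta| < \varepsilon\bigr) = 1.$$
Chebyshev's inequality immediately yields
$$P\bigl(|T(n) - E\,T(n)| < \varepsilon\bigr) \ge 1 - \frac{\operatorname{var}T(n)}{\varepsilon^2}.$$
Assuming $\lim_{n\to\infty} E\,T(n) = \theta$, then, for sufficiently large values of $n$,
$$P\bigl(|T(n) - \theta| < 2\varepsilon\bigr) \ge P\bigl(|T(n) - E\,T(n)| < \varepsilon\bigr) \ge 1 - \frac{\operatorname{var}T(n)}{\varepsilon^2}.$$
A useful proposition is thus proved:
Theorem. Assume that $\lim_{n\to\infty} E\,T(n) = \theta$ and $\lim_{n\to\infty} \operatorname{var}T(n) = 0$. Then, $T(n)$ is a consistent estimator of $\theta$.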
As a simple example, we can illustrate this theorem on variance:
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2 = \frac{n-1}{n}S^2.$$
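Since it is known from subsection 10.3.2 that $S^2$ is an unbiased estimator, it follows that $\hat\sigma^2$ is not. However,
$$\lim_{n\to\infty} E\,\hat\sigma^2 = \sigma^2,$$
and it can be calculated that
$$\lim_{n\to\infty}\operatorname{var}\hat\sigma^2 = \lim_{n\to\infty}\operatorname{var}S^2 = \lim_{n\to\infty}\frac{2\sigma^4}{n-1} = 0.$$
Therefore, the statistic $s^2 = \hat\sigma^2$ is a consistent estimator of the variance.

$$p_j^{b} = \frac{a+k}{a+b+n} = \frac{(a+b)\,a}{(a+b+n)(a+b)} + \frac{n\,k}{(a+b+n)\,n} = \frac{n_0}{n_0+n}\,p_0 + \frac{n}{n_0+n}\,\hat p_j,$$
which is a linear combination of the expected value $p_0$ and the individual estimate $\hat p_j$.
Thus, it only remains to reasonably estimate the unknown parameters $a$, $b$. We know that
$$E X_{ji} = E\,E(X_{ji} \mid p) = E\,p = p_0 = \frac{a}{a+b},$$
$$\frac{E\operatorname{var}(X_{ji} \mid p)}{\operatorname{var}E(X_{ji} \mid p)} = \frac{E\bigl(p(1-p)\bigr)}{\operatorname{var}p} = a + b = n_0,$$
and the left-hand variables can be estimated directly:
$$E X_{ji} \approx \frac{1}{N}\sum_{j=1}^{N}\hat p_j = \hat p,$$
$$E\operatorname{var}(X_{ji} \mid p) \approx \frac{1}{N}\sum_{j=1}^{N}\frac{n}{n-1}\,\hat p_j(1-\hat p_j),$$
$$\operatorname{var}E(X_{ji} \mid p) \approx s_p^2 - \frac{1}{Nn}\sum_{j=1}^{N}\frac{n}{n-1}\,\hat p_j(1-\hat p_j),$$
where $s_p^2$ denotes the sample variance between individual estimates (you can verify that the subtraction of the right-most expression guarantees that the last estimate is unbiased).
Since for the mentioned data, we get $n_0 \approx 3.8643$ and $p_0 \approx 0.1450$, the Bayesian estimate of the individual probability of accidents is
$$p_j^{b} = 0.154 \cdot 0.145 + 0.846 \cdot \hat p_j.$$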
It is apparent that there may be more unbiased estimators for a given parameter. For instance, it is already shown that the arithmetic mean $\bar X$ is an unbiased estimator of the expected value $\theta$ of random variables $X_i$. The value $X_1$ is, of course, an unbiased estimator of $\theta$ as well. We wish to find the best estimator $T$ in the class of considered statistics, which are unbiased or consistent. Consider as best the one whose variance is as small as possible. Recall that the variance of a vector statistic $T$ is given by the corresponding covariance matrix, which is, in case of independent components, a diagonal matrix with the individual variances of the components on the diagonal. We have already defined inequalities between positive-definite matrices.
10.3.6. Maximum likelihood. Assume that the density function of the components of the sample is given by a function $f(x, \theta)$ which depends on an unknown parameter $\theta$ (a vector in general). By the assumed independence, the joint density of the vector $(X_1, \dots, X_n)$ is equal to the product
$$f(x_1, \dots, x_n, \theta) = f(x_1, \theta)\cdots f(x_n, \theta),$$
which is called the likelihood function.
We are interested in the value $\hat\theta$ which maximizes the likelihood function on the set of all admissible values of the parameter. In the discrete case, this means choosing the parameter for which the obtained sample has the greatest probability.
Thus, it is a combination of the confident estimate $\hat p = 0.145$ of the collective probability $p_0$ with the individual (frequentist) estimate $\hat p_j$, which is measured from a small number $n = 10$ of observations of one driver. □
L. Processing of multidimensional data
Sometimes, we need to process multidimensional data: for each of n objects, we determine p characteristics. For instance, we can examine marks of several students in various subjects.
10.L.1. In his experiments, J. G. Mendel examined 10 pea plants, and for each of them he counted the numbers of yellow and green seeds. The results of the experiment are summarized in the following table:
It follows from the genetic models that the probability of occurrence of the yellow seed should be 0.75 (and 0.25 for green seed). At the asymptotic significance level 0.05, test the hypothesis that the results of Mendel's experiments are in accordance with the model.
Solution. We test the hypothesis with Pearson's chi-squared test. We use the statistic
$$K = \sum_{j=1}^{r}\frac{(n_j - np_j)^2}{np_j},$$
where $r$ is the number of sorting intervals (measurements; we have $r = 10$), $n_j$ is the actually measured frequency in the chosen sorting interval (we will count the number of yellow seeds), $n$ is the total number of seeds of the corresponding plant, and $p_j$ is the expected frequency (by the assumed distribution), in our case, $p_j = 0.75$, $j = 1, \dots, 10$. If the results of the experiment were really distributed as assumed in our model, we would have $K \approx \chi^2(r - 1 - p)$, where $p$ is the number of estimated parameters in the assumed probability distribution. In our case, it is especially simple, since our model does not have any unknown parameters, so we have $p = 0$ (the parameters may occur if, e.g., we assume that the probability distribution in our experiment is normal but with unknown variance and expected value; then we would have $p = 2$). Thus, $K \approx \chi^2(9)$. The statistic is recommended to be used if the expected frequency of the characteristic in each of the sorting intervals is at least 5.
We usually work with the log-likelihood function
$$\ell(x_1, \dots, x_n, \theta) = \ln f(x_1, \dots, x_n, \theta) = \sum_{i=1}^{n}\ln f(x_i, \theta).$$
Since the $\ln$ function is strictly increasing, maximization of the log-likelihood function is equivalent to maximization of the original likelihood function. If, for some input, it happens that $f(x_1, \dots, x_n, \theta) = 0$, we set $\ell(x_1, \dots, x_n, \theta) = -\infty$.
In the case of discrete random variables, use the same definition with the probability mass function instead of the density, i.e.,
$$\ell(x_1, \dots, x_n, \theta) = \sum_{i=1}^{n}\ln\bigl(P(X_i = x_i \mid \theta)\bigr).$$
We can illustrate the principle on a random sample from the normal distribution $N(\mu, \sigma^2)$ with size $n$. The unknown parameters are $\mu$ or $\sigma$, or both. The considered density is
$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
plant number	1	2	3	4	5	6	7	8	9	10
yellow seeds	25	32	14	70	24	20	32	44	50	44
green seeds	11	7	5	27	13	6	13	9	14	18
total seeds	36	39	19	97	37	26	45	53	64	62
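Hence the log-likelihood function is
$$\ell(x, \mu, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
The maximum can be found using differentiation (note that $\sigma^2$ is treated as a symbol for a variable):
$$\frac{\partial\ell}{\partial\mu} = \frac{1}{2\sigma^2}\sum_{i=1}^{n}2(x_i - \mu) = \frac{n}{\sigma^2}(\bar X - \mu),$$
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{2(\sigma^2)^2}\Bigl(-n\sigma^2 + \sum_{i=1}^{n}(x_i - \mu)^2\Bigr).$$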
Thus, the only critical point is given by $\mu = \bar X$ and $\sigma^2 = s^2$. Substitute these values into the matrix of second derivatives, to obtain the Hessian of $\ell$:
$$\begin{pmatrix} -\dfrac{n}{s^2} & 0\\[2mm] 0 & -\dfrac{n}{2(s^2)^2}\end{pmatrix}.$$
This matrix is negative definite, so the critical point is a local maximum, and since there is only one critical point, it must be the global maximum (think about the details of this argument!).
Thus it is verified that the sample mean $\bar X$ and the variance $s^2$ are the maximum likelihood estimates for $\mu$ and $\sigma^2$, as already used.
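The result can be illustrated numerically on the survey data from subsection 10.3.4. In this sketch, the names `log_likelihood` and `normal_mle` are illustrative choices; the code evaluates the log-likelihood at the critical point and at a few nearby points.

```python
from math import log, pi


def log_likelihood(xs, mu, sigma2):
    """Log-likelihood of a normal sample for given mu and sigma^2."""
    n = len(xs)
    return -n / 2 * log(2 * pi * sigma2) - sum((x - mu) ** 2 for x in xs) / (2 * sigma2)


def normal_mle(xs):
    """Critical point of the log-likelihood: the mean and the variance with coefficient 1/n."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat


# Votes of the 15 students from subsection 10.3.4.
sample = [3] * 2 + [4] * 3 + [5] * 3 + [6] * 5 + [7] * 2
mu_hat, sigma2_hat = normal_mle(sample)
best = log_likelihood(sample, mu_hat, sigma2_hat)

# Moving away from the critical point in any direction lowers the log-likelihood.
for d_mu in (-0.1, 0.1):
    for d_s2 in (-0.1, 0.1):
        print(d_mu, d_s2, log_likelihood(sample, mu_hat + d_mu, sigma2_hat + d_s2) < best)
```

Note that the maximizing variance is $s^2$ (coefficient $\frac1n$), not the unbiased sample variance $S^2$.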
10.3.7. Bayesian estimates. We return to the example from subsection 10.3.4, now from the point of view of Bayesian statistics. This totally reverses the approach: the collected data $X_1, \dots, X_{15}$ (i.e., the points which express how much each student is satisfied, using the scale 1 through 10) are treated as constants. On the other hand, the estimated parameter $\mu$ (the expected value of the points of satisfaction) is viewed as the random variable whose distribution we wish to estimate. Let us look at the general principle first (and come back to this example soon).
Let us write the data into a table:
$j$	$n_j$	$p_j$	$np_j$	$(n_j - np_j)^2/(np_j)$
1	25	0.75	27	0.148148
2	32	0.75	29.25	0.258547
⋮	⋮	⋮	⋮	⋮
10	44	0.75	46.5	0.134409
The value of the statistic $K$ for the given data is
$$K = 0.148148 + 0.258547 + \dots + 0.134409 = 1.797495.$$
This value is less than $\chi^2_{0.95}(9) = 16.9$, so we do not reject the null hypothesis at level 0.05 (i.e., we do not refute the known genetic model).
□
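The value of the statistic can be recomputed from the table of seed counts. The following sketch follows the solution above (only the yellow seeds are counted); the critical value 16.9 is taken from the text, not computed.

```python
yellow = [25, 32, 14, 70, 24, 20, 32, 44, 50, 44]
green = [11, 7, 5, 27, 13, 6, 13, 9, 14, 18]
p_yellow = 0.75  # probability of a yellow seed in the genetic model

terms = []
for y, g in zip(yellow, green):
    expected = (y + g) * p_yellow          # expected number of yellow seeds
    terms.append((y - expected) ** 2 / expected)

K = sum(terms)
critical_value = 16.9                      # 0.95-quantile of chi^2(9), as quoted in the text
reject = K > critical_value

print(round(K, 6), reject)
```

The printed value of $K$ agrees with the one in the solution, and the hypothesis is not rejected.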
For this purpose, let us interpret Bayes' formula for conditional probability on the level of probability mass functions or probability densities, in the following way: If a vector $(X, \Theta)$ has joint density $f(x, \theta)$, then the conditional probability of the component $\Theta$, given $X = x$, is defined as the density
$$g(\theta \mid x) = \frac{f(x, \theta)}{f(x)} = \frac{f(x \mid \theta)\,g(\theta)}{f(x)},$$
where $f(x)$ and $g(\theta)$ are the marginal probability densities. (In the above example, $x$ is a 15-dimensional vector coming from the multidimensional normal distribution, while $\theta = \mu$ is a scalar.)
Thus, given the a priori probability density $g(\theta)$ of the estimated parameter $\theta$ and the probability density $f(x \mid \theta)$ (in the above example, $\theta = \mu$ is the expected value parameter of the distribution), the formula can be used to compute the a posteriori probability density $g(\theta \mid x)$, based on the collected data. Indeed, we do not need to know $f(x)$, for the following reason: $f(x)$ is a constant independent of $\theta$, and thus the proper density is obtained from $f(x \mid \theta)g(\theta)$ by multiplying by a uniquely given constant in the end. Thus, during the computation, it is sufficient to be precise "up to a constant multiple". For this purpose, use the notation $Q \propto R$, meaning that there is a constant $C$ such that the expressions $Q$ and $R$ satisfy $Q = CR$.
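We shall illustrate this procedure on a more explicit example. In order to be as close as possible to the ideas from subsection 10.3.4, work with normal distributions $N(\mu, \sigma^2)$. Suppose that the satisfaction of individual students in particular lectures is a random variable $X \sim N(\theta, \sigma^2)$, while the parameter $\theta$ reached by the particular lecturers is a random variable $\theta \sim N(a, b^2)$.
Compute (up to a constant multiple, ignoring all multiplicative components which do not include any $\theta$),
$$g(\theta \mid x) \propto f(x \mid \theta)\,g(\theta) \propto \exp\Bigl(-\frac{(x-\theta)^2}{2\sigma^2} - \frac{(\theta-a)^2}{2b^2}\Bigr)$$
$$\propto \exp\Bigl(-\frac12\Bigl(\frac{1}{\sigma^2} + \frac{1}{b^2}\Bigr)\theta^2 + \Bigl(\frac{x}{\sigma^2} + \frac{a}{b^2}\Bigr)\theta\Bigr)$$
$$\propto \exp\Bigl(-\frac{b^2+\sigma^2}{2\sigma^2b^2}\Bigl(\theta - \frac{b^2x + \sigma^2a}{b^2+\sigma^2}\Bigr)^2\Bigr).$$
This proves already that the distribution for $\theta$ is
$$\theta \sim N\Bigl(\frac{b^2}{b^2+\sigma^2}\,x + \frac{\sigma^2}{b^2+\sigma^2}\,a,\ \frac{b^2\sigma^2}{b^2+\sigma^2}\Bigr).$$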
This result can be interpreted so that if the parameters $a$, $b$, $\sigma$ are known from long-run evaluation of surveys and the opinion of another student is learned, then the a priori opinion about the parameters for an individual lecture can be adjusted.
In the resulting estimate, the expected value is given by the weighted average of the found value $x$ and the a priori assumed expected value $a$, in dependence on the standard deviations $\sigma$ and $b$.
10.3.8. Interpretation in Bayesian statistics. We follow the ideas from the previous subsection, compared to the frequentist interpretation from 10.3.4. It may seem odd that a single query can influence an opinion so much. For $\sigma \to 0$, the relevance of a single opinion keeps increasing, and this corresponds to a 100% relevance of $x$ in the case $\sigma = 0$. This is in accordance with the interpretation that Bayesian statistics is the probability extension of the standard discrete mathematical logic. If the variance $\sigma^2$ is close to zero, then it is almost certain that the opinion of any student precisely describes the opinion of the whole population.
In subsection 10.3.4, we worked with the sample mean $\bar X$ of the collected data. This can be used in the previous calculation, since the mean also has a normal distribution. The expected value is the same, and the only difference is that $\sigma^2/n$ is substituted instead of $\sigma^2$. To facilitate the notation, define the constant
$$c_n = \frac{nb^2}{nb^2 + \sigma^2}.$$
The a posteriori estimate for $\theta$ based on the found sample mean $\bar X$ has the distribution with parameters
$$\theta \sim N\bigl(c_n\bar X + (1 - c_n)a,\ c_n\sigma^2/n\bigr).$$
As could be expected, for increasing $n$, the expected value of the distribution for $\theta$ approaches the sample mean, and its variance approaches zero. In other words, the higher the value of $n$, the closer the Bayesian estimate is to the point estimate from the frequentist point of view.
A contribution of the Bayesian approach is that if the estimated distribution is used, questions of the kind: "What is the probability that the new lecturer is worse than the old one?" can be answered. Use the same data as in 10.3.4 and supplement the necessary a priori data. Assume that the lecturers are assessed quite well (otherwise, they would probably not be teaching at the university at all). For concreteness, select the a priori distribution with parameters $a = 7.5$, $b = 2.5$, and the standard deviation $\sigma = 2$. Continue with $n = 15$ and the sample mean of 5.133. Substitute this data, to get the a posteriori estimate for the distribution
$$\theta \sim N(5.230, 0.256).$$
We are interested in $P(\theta < 6)$. This is computed by evaluating the distribution function of the corresponding normal distribution for the input value 6 (Excel is capable of this, too). The answer is approximately 93.6 %. This is similar to the material in subsection 10.3.4, where the known variance is assumed constant.
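The computation can also be carried out with the Python standard library instead of a spreadsheet. This is a sketch under the assumptions of the example above; the function name `normal_posterior` is chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(x_bar, n, sigma, a, b):
    """A posteriori mean and variance of theta for the a priori distribution N(a, b^2)."""
    c_n = n * b ** 2 / (n * b ** 2 + sigma ** 2)
    post_mean = c_n * x_bar + (1 - c_n) * a
    post_var = c_n * sigma ** 2 / n
    return post_mean, post_var


post_mean, post_var = normal_posterior(x_bar=5.133, n=15, sigma=2, a=7.5, b=2.5)

# Probability that the parameter of the new lecturer is below 6.
p_worse = NormalDist(post_mean, sqrt(post_var)).cdf(6)

print(round(post_mean, 3), round(post_var, 3), round(p_worse, 3))
```

Changing the arguments `a` and `b` shows how strongly the answer depends on the a priori assumption.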
Note the influence of the a priori assumption about the distribution of the parameter $\theta$ for all lecturers. To a certain extent, this reflects a faith that the lecturers are rather good. If a statistician has a reason for assuming that the actual expected value $a$ for a specific lecturer is shifted, say $a = 6$ as in the survey about the previous lecturer (this can be caused,
for example, by the fact that the lecture is hard and unpopular), then the probability of his actual parameter being less than 6 would be approximately 95.0 %. (If the expected value is considered to be significantly worse only when below 5.5, then the value would be only approximately 75 %.) When substituting $a = 5$, the value is already 95.8 %. The variance $b^2$ is also important. For instance, the a priori estimate $a = 6$, $b = 3.5$ leads to probability 95.2 %.
In the above discussion, another very important point is touched on - sensitivity analysis. It would seem desirable that a small change of the a priori assumption has only a small effect on the a posteriori result. It appears that this is so in this example; however, we omit further discussion here.
The same model with exponential distributions is used in practice when judging the relevance of the output of an IQ test of an individual person. It can also be used for another similar exam where it is expected that the normal distribution approximates well the probability distribution of the results. In both cases, there is an a priori assumption about the group to which he/she should belong. Other good examples (with different distributions) are practical problems from the insurance industry, where it is useful to estimate the parameters so that both the effects of the experiment upon an individual item and the overall expectations over the population are included.
10.3.9. Notes on hypothesis testing. We return to deciding whether a given event does or does not occur in the context of frequentist statistics. We build on the approach from interval estimates, as presented above.
Thus, consider a random vector $X = (X_1, \dots, X_n)$ (the result of a random sample), whose joint distribution function is $F_X(x)$. A hypothesis is an arbitrary statement about the distribution which is determined by this distribution function. Usually, one formulates two hypotheses, denoted $H_0$ and $H_A$. The former is traditionally called the null hypothesis, and the latter is called the alternative hypothesis. The result of the test is then a decision, based on a concrete realization of the random vector $X$ (a test), whether the hypothesis $H_0$ is to be rejected or not in favor of the hypothesis $H_A$.
During this process, two types of errors may occur. A type I error occurs when $H_0$ is rejected even though it is true. A type II error occurs when $H_0$ is not rejected although it is false. The decision procedure of a frequentist statistician consists of selecting the critical region $W$, i.e., the set of test results for which the hypothesis is rejected. The size of the critical region is chosen so that a true hypothesis is rejected with probability not greater than $\alpha$. This means that a fixed bound for the probability of the type I error is required: the significance level $\alpha$. The most common choices are $\alpha = 0.05$ or $\alpha = 0.01$. It is also useful in practice to determine the least possible significance level $p$ for which the hypothesis is rejected; this is the p-value of the test.
It remains to find a reasonable procedure for choosing the critical range. This should hopefully be done so that the
type II error occurs as rarely as possible. Usually, it is convenient to consider the likelihood function f(x,6), defined for a random vector X in subsection 10.3.6. For the sake of simplicity, assume there is a one-dimensional parameter 6, and formulate the null hypothesis as X being given by the function f(x,80), while the alternative hypothesis is given by the distribution f(x, 6i) for fixed distinct values 80 and 81. Ideas about rejecting or accepting the hypotheses suggest that when substituting the values of a specific test into the likelihood function, the hypothesis can be accepted if f(x, 80) is much greater than f(x, 6i).
This suggests considering, for each constant $c > 0$, the critical range
$$W_c = \{x;\ f(x,\theta_1) > c\,f(x,\theta_0)\}.$$
Having chosen the significance level, choose $c$ so that
$$\int_{W_c} f(x,\theta_0) = \alpha.$$
This guarantees that for the test result $x \in W_c$, when $H_0$ is valid, the type I error occurs with at most the prescribed probability. This can also be guaranteed by other critical ranges $W$ which also satisfy
$$\int_{W} f(x,\theta_0) = \alpha.$$
On the other hand, type II errors are also of interest. That is, it is desired to maximize the probability of $H_A$ over the critical range. Thus, consider the difference
$$D = \int_{W_c} f(x,\theta_1) - \int_{W} f(x,\theta_1)$$
for arbitrary $W$ as above. The regions over which integration is carried out can be divided into the common part $W \cap W_c$ and the remaining set differences. The contributions of the common part cancel, and there remains
$$D = \int_{W_c \setminus W} f(x,\theta_1) - \int_{W \setminus W_c} f(x,\theta_1).$$
By the definition of the critical range $W_c$, we have $f(x,\theta_1) > c\,f(x,\theta_0)$ on $W_c \setminus W$ and $f(x,\theta_1) \le c\,f(x,\theta_0)$ on $W \setminus W_c$. Hence (again, put back the same integrals over the common part)
$$D \ge c\int_{W_c \setminus W} f(x,\theta_0) - c\int_{W \setminus W_c} f(x,\theta_0) = c\int_{W_c} f(x,\theta_0) - c\int_{W} f(x,\theta_0) = c\alpha - c\alpha = 0.$$
Thus is proved an important statement, the Neyman-Pearson lemma:
Proposition. Under the above assumptions, Wc is the optimal critical range which minimizes occurrence of the type II error at a given significance level.
10.3.10. Example. The interval estimate, as illustrated on an example in subsection 10.3.4, is a special case of hypothesis testing, where $H_0$ has the form "the expected value of the satisfaction with the course remained $\mu_0$", while $H_A$ says that it is equal to a different value $\mu_1$. The general procedure mentioned above leads in this case to the critical range given by
$$|Z| = \left|\frac{\bar X - \mu_0}{\sigma}\sqrt{n}\right| \ge z(\alpha/2).$$
Note that in the definition of the critical range, the actual value $\mu_1$ is not essential. In the context of classical probability, the decision at a given level $\alpha$ whether or not there is a change to the expected value $\mu$ is thus formalized.
To test only whether the satisfaction has decreased, assume beforehand that $\mu_1 < \mu_0$. We analyze this case thoroughly. The critical range from the Neyman-Pearson lemma is determined by the inequality
$$\frac{f(x,\mu_1,\sigma^2)}{f(x,\mu_0,\sigma^2)} > c.$$
Take logarithms and rearrange to obtain
$$2\bar x(\mu_1 - \mu_0) - (\mu_1^2 - \mu_0^2) > \frac{2\sigma^2}{n}\ln c.$$
Since $\mu_1 < \mu_0$, it follows that
$$\bar x < \frac{\mu_1 + \mu_0}{2} + \frac{\sigma^2}{n(\mu_1 - \mu_0)}\ln c = y.$$
For a given level $\alpha$, the constant $c$, and thereby the decisive parameter $y$, are determined so that, under the assumption that $H_0$ is true,
$$\alpha = P(\bar X < y) = P\left(\frac{\bar X - \mu_0}{\sigma}\sqrt{n} < \frac{y - \mu_0}{\sigma}\sqrt{n}\right).$$
Assuming that $H_0$ is true,
$$Z = \frac{\bar X - \mu_0}{\sigma}\sqrt{n} \sim N(0,1),$$
so the requirement means choosing $Z < -z(\alpha)$, which determines uniquely the optimal $W_c$.
Note that this critical region is independent of the chosen value $\mu_1$, and the actual value of $y$ did not have to be expressed at all. It was only essential to assume that $\mu_1 < \mu_0$. In the illustrative example from subsection 10.3.4, $H_0 : \mu = 6$, and the alternative hypothesis is $H_A : \mu < 6$. The variance is $\sigma^2 = 4$. The test with $n = 15$ yielded $\bar x = 5.133$. Substitute this to get the value $z = \frac{5.133 - 6}{2}\sqrt{15} = -1.678$, while $-z(0.05) = -1.645$.
Therefore, reject the hypothesis at the level of 5 %, deducing that the students' opinions are really worse.
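As a quick numerical check of this example, here is a minimal Python sketch that recomputes the test statistic and compares it with the critical value. The function name `one_sided_z_test` is our own (it does not come from the text), and the standard normal quantile is taken from `statistics.NormalDist` in the Python standard library. The rounded mean 5.133 is used, so the last digit of the statistic may differ slightly from the value quoted above.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_z_test(sample_mean, mu0, sigma, n, alpha):
    """Test H0: mu = mu0 against HA: mu < mu0 with known sigma.

    Returns (z, critical, p_value, reject), where critical = -z(alpha).
    """
    z = (sample_mean - mu0) / sigma * sqrt(n)
    critical = NormalDist().inv_cdf(alpha)  # lower alpha-quantile of N(0,1)
    p_value = NormalDist().cdf(z)           # least level at which H0 is rejected
    return z, critical, p_value, z < critical


# Values from the example: H0: mu = 6, sigma^2 = 4, n = 15, sample mean 5.133.
z, critical, p_value, reject = one_sided_z_test(5.133, 6.0, 2.0, 15, 0.05)
print(round(z, 3), round(critical, 3), round(p_value, 4), reject)
```

The returned `p_value` is the p-value mentioned earlier: the hypothesis is rejected exactly at those levels that are at least as large as it.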
If, for the critical range, the union of the critical ranges for the cases $\mu_1 < \mu_0$ and $\mu_1 > \mu_0$ is chosen, the same results as for the interval estimate are obtained, as mentioned above.
We remark that in the Bayesian approach, it is also possible to accept or reject hypotheses in direct connection to the a posteriori probability of events, as was, to a certain extent, indicated in subsection 10.3.8, where our specific example is interpreted.
10.3.11. Linear models. As is usual in the analysis of mathematical problems, either we deal with linear dependencies and objects, or we discuss their linearizations. In statistics, many methods belong to the linear models, too. We consider a quite general scheme of this type.
Consider a random vector $Y = (Y_1, \dots, Y_n)^T$ and suppose
$$Y = X\beta + \sigma Z,$$
where $X = (x_{ij})$ is a constant real-valued matrix with $n$ rows and $k < n$ columns, whose rank is $k$, $\beta$ is an unknown constant vector of $k$ parameters of the model, $Z$ is a random vector whose $n$ components are independent with distribution $N(0,1)$, and $\sigma > 0$ is an unknown positive parameter of the model. This is a linear model with full rank.
In practice, the variables $x_{ij}$ are often known. The problem is to estimate or predict the value of $Y$. For instance, $x_{ij}$ can express the grade in maths of the $i$-th student in the $j$-th semester ($j = 1, 2, 3$), and we want to know how this student will fare in the fourth semester. For this purpose, the vector $\beta$ needs to be known. This can be estimated using complete observations, that is, from the knowledge of $Y$ (from the results of past years, for example).
In order to estimate the vector $\beta$, the least squares method can often be used. This means looking for the estimate $b \in \mathbb{R}^k$ for which the vector $\hat Y = Xb$ attains the minimum of the squared length of the vector $Y - X\beta$ over all $\beta \in \mathbb{R}^k$.
This is a simple problem from linear algebra: we look for the orthogonal projection of the vector $Y$ onto the subspace $\operatorname{span} X \subset \mathbb{R}^n$ generated by the columns of the matrix $X$. This amounts to minimizing the function
$$\|Y - X\beta\|^2 = \sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{k} x_{ij}\beta_j\Big)^2.$$
Choose an arbitrary orthonormal basis of the vector subspace $\operatorname{span} X$ and write it into the columns of the matrix $P$. For any choice of basis, the orthogonal projection is realized as multiplication by the matrix $PP^T$. In the subspace $\operatorname{span} X$, the mapping given by this matrix is the identity. That is,
$$\hat Y = PP^TY = PP^T(X\beta + \sigma Z) = X\beta + \sigma PP^TZ.$$
The matrix $PP^T$ is positive-semidefinite. Extend the basis consisting of the columns of $P$ to an orthonormal basis of the whole space $\mathbb{R}^n$. In other words, create a matrix $Q = (P\ R)$ by writing the newly added basis vectors into the matrix $R$ with $n - k$ columns and $n$ rows. Denote by $V = P^TZ$ and $U = R^TZ$ the random vectors with $k$ and $n - k$ components, respectively. They are orthogonal, and their sum in $\mathbb{R}^n$ is the vector $(V^T\ U^T)^T = Q^TZ$.
Clearly (see subsection 10.2.46), both vectors $V$ and $U$ have multivariate normal distribution with zero expected value and identity covariance matrix. The random vector $Y$ is decomposed into the sum of a constant vector $X\beta$ and two orthogonal projections
$$Y = X\beta + \sigma PV + \sigma RU,$$
and the desired orthogonal projection $\hat Y$ is the sum of the first and second summands. In subsection 10.2.46, the distribution of such random vectors is also derived.
The size of $\|Y - \hat Y\|^2$ is called the residual sum of squares, sometimes denoted by RSS. Also, the residual variance is defined as
$$S^2 = \frac{\|Y - Xb\|^2}{n - k}.$$
Recall that $\hat Y = Xb$ and that $X^TX$ is invertible, since the full rank of $X$ is assumed. Thus $b = (X^TX)^{-1}X^T\hat Y$ can be computed. At the same time, $X^T(Y - \hat Y) = \sigma X^T(RU) = 0$, since the columns of $X$ and $R$ are mutually orthogonal. Therefore,
$$(1)\qquad b = (X^TX)^{-1}X^TY.$$
The chosen matrix $P$ can be used with advantage. Since its columns generate the same subspace as the columns of $X$, there is a square matrix $T$ such that $X = PT$ (its columns are the coefficients of the linear combinations which express the columns of $X$ in the basis $P$). Substitute (using the fact that $P^TP$ is the identity matrix and $T$ is invertible):
$$b = (T^TP^TPT)^{-1}T^TP^TY = T^{-1}(T^T)^{-1}T^TP^T(PT\beta + \sigma Z) = \beta + \sigma T^{-1}V.$$
Thus the main properties of the linear model are proved:
Theorem. Consider a linear model $Y = X\beta + \sigma Z$.
(1) For the estimate $\hat Y$,
$$\hat Y = X\beta + \sigma PV, \qquad \hat Y \sim N(X\beta, \sigma^2PP^T).$$
(2) The residue and the normed square of the residue size have distributions
$$Y - \hat Y \sim N(0, \sigma^2RR^T), \qquad \|Y - \hat Y\|^2/\sigma^2 \sim \chi^2_{n-k}.$$
(3) The random vector $b = \beta + \sigma T^{-1}V$ has distribution
$$b \sim N(\beta, \sigma^2(X^TX)^{-1}).$$
(4) The residual variance satisfies $(n - k)S^2/\sigma^2 \sim \chi^2_{n-k}$.
(5) The expected value of the residual variance is $E\,S^2 = \sigma^2$.
(6) The variables $b$ and $S^2$ are independent.
Proof. Both the shape and the distribution of $\hat Y$ have already been determined. It is clear that $Y - \hat Y = \sigma RU$, which verifies the first part of the second proposition. Further,
$$\|Y - \hat Y\|^2/\sigma^2 = \|RU\|^2 = \|U\|^2,$$
where the last equality follows from the fact that in the construction, $U$ is the vector of coordinates of the projection of $Z$ onto the orthogonal complement of $\operatorname{span} X$, and $RU$ is this projection. The squared size of a vector is exactly the sum of squares of its coordinates in any orthonormal basis.
Therefore, the random variable $\|Y - \hat Y\|^2/\sigma^2$ is the sum of $n - k$ squares of independent random variables with distribution $N(0,1)$, so it has the distribution $\chi^2_{n-k}$, which proves the rest of (2).
The next proposition follows directly from the definitions and calculations; it suffices to determine the covariance matrix of $b$. From the general properties, it is the matrix $\sigma^2T^{-1}(T^T)^{-1}$, and $T^{-1}(T^T)^{-1}$ is the same as $(X^TX)^{-1} = ((PT)^T(PT))^{-1}$.
The proposition (4) is a reformulation of the information in (2). The next proposition follows from the fact that the expected value of the $\chi^2$ distribution equals the number of degrees of freedom.
Finally, independence of the variables $b$ and $S^2$ is a consequence of the fact that the former variable is a function of the vector $V$, while the latter one is a function of the vector $U$. These vectors are independent since they are two complementary parts of an orthogonal transformation of the vector $Z$. □
In practice, one sometimes tests the hypothesis that fewer parameters are sufficient to estimate the expected value. A random vector $Y$ is said to satisfy a submodel if and only if both $Y = X\beta + \sigma Z$ and
$$Y = X^0\beta^0 + \sigma Z,$$
where $X^0$ has only $q < k$ columns. It is assumed that the columns of $X^0$ generate a subspace in $\operatorname{span} X$, i.e., all are linear combinations of the columns of $X$.
Repeat the above construction, choosing the matrix $P$ so that the first $q$ vectors of $P$ generate $\operatorname{span} X^0$. The matrix $P$ is then of the form $(P^0\ P^1)$, and the vector $V$ decomposes similarly:
$$V = \begin{pmatrix} V^0 \\ V^1 \end{pmatrix} = \begin{pmatrix} (P^0)^TZ \\ (P^1)^TZ \end{pmatrix}.$$
This yields a finer decomposition of the vectors and their sizes and the corresponding residues:
$$\hat Y^0 = P^0(P^0)^TY = X^0\beta^0 + \sigma P^0V^0,$$
$$Y - \hat Y^0 = \sigma P^1V^1 + \sigma RU,$$
$$\|Y - \hat Y^0\|^2 = \sigma^2\|V^1\|^2 + \sigma^2\|U\|^2,$$
$$(\mathrm{RSS}^0 - \mathrm{RSS})/\sigma^2 = \|V^1\|^2.$$
Therefore, the normed difference of the residues has distribution $\chi^2_{k-q}$. It follows immediately that the statistic $F$ given as the relative difference of the residues has the Fisher-Snedecor distribution:
$$F = \frac{(\mathrm{RSS}^0 - \mathrm{RSS})/(k - q)}{\mathrm{RSS}/(n - k)} \sim F_{k-q,\,n-k}.$$
In practice, the parameter $\sigma$ is seldom known, and so the estimate $S^2$ is used. Instead of the individual components $b_j \sim N(\beta_j, \sigma^2c_{jj})$ of the random vector $b$, where $c_{jj}$ are the diagonal entries of the matrix $C = (X^TX)^{-1}$, work with the statistics
$$T_j = \frac{b_j - \beta_j}{S\sqrt{c_{jj}}} \sim t_{n-k}.$$
Of course, these variables need not be independent.
If the full rank of the matrix $X$ is not assumed, a pseudoinverse matrix can be used instead of $C = (X^TX)^{-1}$.
10.3.12. Examples of tests. As an illustration, we mention some examples of application of linear models in the simplest types of tests. The most trivial case is when there is only one sample. Here the test is whether or not the only parameter $\beta$ equals a given value $\beta_0$.
For this case, choose the matrix $X$ as a single column consisting of ones. Then, the expression
$$Y = X\beta + \sigma Z$$
indicates that the individual components of $Y$ are independent variables $Y_i \sim N(\beta, \sigma^2)$. It is a random sample of size $n$ from the normal distribution. The general estimates are
$$b = (X^TX)^{-1}X^TY = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar Y,$$
$$S^2 = \frac{1}{n-1}\|Y - X\bar Y\|^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2,$$
which are exactly the sample mean and the sample variance used before.
In this context, the statistic
$$T = \frac{\bar Y - \beta_0}{S}\sqrt{n}$$
may also be of interest.
Testing the hypothesis $\beta = \beta_0$ is called the one-sample t-test. The hypothesis is rejected at level $\alpha$ if $|T| > t_{n-1}(\alpha)$.
There is another simple application of the general model, which is called the paired t-test. It is appropriate for cases when pairs of random vectors $W_1 = (W_{i1})$ and $W_2 = (W_{i2})$ are tested. The differences $Y_i = W_{i1} - W_{i2}$ of their components have distribution $N(\beta, \sigma^2)$. In addition, the variables $Y_i$ need to be independent (which does not mean that the individual pairs $W_{i1}$ and $W_{i2}$ have to be independent!). In the context of our illustrative example from 10.3.4, we can imagine the assessment of two lecturers by the same student.
Test the hypothesis that for every $i$, $E\,W_{i1} = E\,W_{i2}$. Thus, use the statistic
$$T = \frac{\bar W_1 - \bar W_2}{S}\sqrt{n}.$$
Finally, we consider an example with more parameters. It is a classical case of the regression line.
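Assume that the variables $Y_i$, $i = 1, \dots, n$, have distribution $N(\beta_0 + \beta_1x_i, \sigma^2)$, where $x_i$ are given constants. Examine the best approximation
$$\hat Y_i = b_0 + b_1x_i,$$
and the matrix $X$ of the corresponding linear model is
$$X^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}.$$
Substitute into the general formulae, and compute the estimate
$$\begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} n & n\bar x \\ n\bar x & \sum_{i=1}^{n}x_i^2 \end{pmatrix}^{-1}\begin{pmatrix} n\bar Y \\ \sum_{i=1}^{n}x_iY_i \end{pmatrix}.$$
It follows that
$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}.$$
Finally, compute $b_0 = \bar Y - b_1\bar x$. From the calculations,
$$\operatorname{var} b_1 = \sigma^2\Big/\sum_{i=1}^{n}(x_i - \bar x)^2.$$
In order to test the hypothesis that the expected value of the variable $Y$ does not depend on $x$, that is, whether or not $H_0$ of the form $\beta_1 = 0$ holds, use the statistic
$$T = \frac{b_1}{S}\Big(\sum_{i=1}^{n}(x_i - \bar x)^2\Big)^{1/2} \sim t_{n-2}.$$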
The statistical analysis of multiple regression is similar. There are several sets of values $x_{ji}$ used to evaluate the statistical relevance of the approximation
$$\hat Y_i = b_0 + b_1x_{1i} + \cdots + b_kx_{ki}.$$
The individual statistics $T_j$ allow for a t-test of dependence of the regression on the individual parameters. Software packages often provide a parameter which expresses how well the values $Y_i$ are approximated. It is called the coefficient of determination:
$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^{n}(Y_i - \bar Y)^2}.$$
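The formulae for the regression line can be tried out directly. The following is a minimal sketch in plain Python, assuming a single explanatory variable; the function name `regression_line` and the sample data are hypothetical and serve only as an illustration. It returns the estimates $b_0$, $b_1$, the residual variance $S^2$ (with $k = 2$ parameters), the statistic $T$ for testing $\beta_1 = 0$, and the coefficient of determination $R^2$.

```python
from math import sqrt


def regression_line(xs, ys):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1, s2, t, r2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares and residual variance (n - k with k = 2).
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)
    # Statistic for H0: beta_1 = 0; infinite if the fit is exact.
    t = b1 / sqrt(s2) * sqrt(sxx) if s2 > 0 else float("inf")
    r2 = 1 - rss / sum((y - y_bar) ** 2 for y in ys)
    return b0, b1, s2, t, r2


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, s2, t, r2 = regression_line(xs, ys)
print(b0, b1, s2, t, r2)
```

To decide the test, the value of `t` would be compared with a quantile of the $t_{n-2}$ distribution, which is not available in the Python standard library and is therefore left out of this sketch.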
10.3.13. In practice, problems are often met where the distributions of the statistical data sets are either completely unknown, or errors are assumed in the model, together with deviations with nonzero expected value and a non-normal distribution. In these cases, application of classical frequentist statistics is very hard or even totally impossible.
There are approaches which work directly with the sample set. From it, they derive statistics for point or interval estimates or for probability calculations about them, including the evaluation of standard errors.
One of the pioneering articles on this topic is the brief work of Bradley Efron of Stanford University, published in 1981: Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods.⁴ The keywords of this article are: Balanced repeated replications; Bootstrap; Delta method; Half-sampling; Jackknife; Infinitesimal jackknife; Influence function.
The bootstrap method uses software to create, from a given data sample, new data samples of the same size (drawn with replacement). The desired statistic (sample mean, variance, etc.) is then evaluated for each of them. After a great number of executions of this procedure, a data set is obtained which is considered a relevant approximation of the probability distribution of the examined statistic. The characteristics of this data set are considered a good approximation of the characteristics of the examined statistic for point or interval estimates, analysis of variance, etc. There is not enough space here for a more detailed analysis of these techniques, which are the foundation of non-parametric methods in contemporary statistical software.
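The resampling procedure just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `bootstrap_standard_error`, the number of replications, the fixed seed, and the data are all hypothetical, and only the standard library is used.

```python
import random
from statistics import mean, stdev


def bootstrap_standard_error(data, statistic, replications=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling `data`.

    Each replication draws len(data) items from data with replacement
    and evaluates the statistic on the resample.
    """
    rng = random.Random(seed)  # fixed seed, so the run is reproducible
    n = len(data)
    values = []
    for _ in range(replications):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        values.append(statistic(resample))
    # The spread of the replicated statistics approximates the standard error.
    return stdev(values), values


# Hypothetical sample of 15 marks.
data = [5, 6, 4, 7, 5, 3, 6, 5, 4, 6, 7, 5, 4, 5, 5]
se, values = bootstrap_standard_error(data, mean)
print(se, stdev(data) / len(data) ** 0.5)
```

For the sample mean, the bootstrap estimate can be compared with the classical value $S/\sqrt{n}$ printed alongside it; for statistics without such a closed formula, the list `values` itself serves as the approximate distribution.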
⁴ Biometrika (1981), 68, 3, pp. 589-99.
CHAPTER 11
Number theory
God created the integers, all else is the work of man. Leopold Kronecker
A. Basic properties of divisibility
Divisibility of natural numbers. Let us recall the basic properties of divisibility, whose proof follows directly from the definition: the integer 0 is divisible by every integer; the only integer that is divisible by 0 is 0; every integer a satisfies a | a; every triple of integers a,b,c satisfies the following four implications:
In this chapter, we will deal with problems concerning integers, mainly divisibility and solving equations whose domain is the set of integers (or natural numbers). In this chapter, unlike in the other parts of this book, we will not consider zero to be a natural number, as is usual in this field of mathematics. Although the natural numbers and the integers are, from a certain point of view, the simplest mathematical structures, examination of their properties has yielded a good deal of tough problems for generations of mathematicians. These are often problems which can be formulated quite easily, yet many of them remain unsolved so far.
We will introduce the most popular of them:
• twin primes - the problem is to decide whether there are infinitely many primes $p$ such that $p + 2$ is also a prime,¹
• Sophie Germain primes - the problem is to decide whether there are infinitely many primes p such that 2p + 1 is also a prime,
• existence of an odd perfect integer - i.e., an odd integer the sum of whose divisors equals twice the integer,
• Goldbach's conjecture - the problem is to decide whether every even integer greater than 2 can be expressed as the sum of two primes,
• a jewel among the problems of number theory: Fermat's Last Theorem - the problem is to decide whether there are natural numbers $n, x, y, z$ such that $n > 2$ and $x^n + y^n = z^n$. Pierre de Fermat formulated this problem as early as 1637; many generations put much effort into this question, and it was solved (using results from various fields of mathematics) by Andrew Wiles in 1995.
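$$a \mid b \wedge b \mid c \implies a \mid c,$$
$$a \mid b \wedge a \mid c \implies a \mid b + c \wedge a \mid b - c,$$
$$c \ne 0 \implies (a \mid b \iff ac \mid bc),$$
$$a \mid b \wedge b > 0 \implies a \le b.$$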
The mere knowledge of these basic rules allows us to solve many problems.
11.A.1. Determine the natural numbers $n$ for which the integer $n^3 + 1$ is divisible by the integer $n - 1$.
1. Fundamental concepts
11.1.1. Divisibility. Recall that we say that an integer $a$ divides an integer $b$ (or that $b$ is divisible by $a$, also that $b$ is a multiple of $a$) iff there exists an integer $c$ satisfying $a \cdot c = b$. We write this as $a \mid b$. The concept of divisibility can be defined (and its properties examined) much more generally - more can be found in 12.2.5.
¹ This question still belongs to open problems - in 2013, Yitang Zhang published a proof of a promising proposition: for some $n < 7 \cdot 10^7$, there are infinitely many pairs of primes which differ by $n$. See Y. Zhang, Bounded gaps between primes, Annals of Mathematics, 2013.
CHAPTER 11. ELEMENTARY NUMBER THEORY
Solution. We have $n^3 - 1 = (n - 1)(n^2 + n + 1)$, so the integer $n^3 - 1$ is divisible by the integer $n - 1$ for any $n$. If $n - 1$ is to divide $n^3 + 1$ as well, it must also divide the difference $(n^3 + 1) - (n^3 - 1) = 2$ (see the second property of divisibility). Since $n \in \mathbb{N}$, we have $n - 1 \ge 0$. Now, $n - 1 \mid 2$ implies that $n - 1 = 1$ or $n - 1 = 2$, whence $n = 2$ or $n = 3$. The wanted property is thus possessed only by the natural numbers 2 and 3. □
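The result of 11.A.1 can be cross-checked by brute force. The following short Python sketch (the function name `solutions` and the search bound are our own choices) simply tests every candidate up to a limit; it is a sanity check, not a replacement for the argument above.

```python
def solutions(limit):
    """Natural numbers 2 <= n <= limit with (n - 1) | (n^3 + 1).

    n = 1 is skipped: n - 1 = 0 divides only 0, and n^3 + 1 = 2.
    """
    return [n for n in range(2, limit + 1) if (n**3 + 1) % (n - 1) == 0]


print(solutions(10_000))  # [2, 3]
```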
11.A.2. Prove that for any $a \in \mathbb{Z}$, the following holds:
i) $a^2$ leaves remainder 0 or 1 when divided by 4;
ii) $a^2$ leaves remainder 0, 1, or 4 when divided by 8;
iii) $a^4$ leaves remainder 0 or 1 when divided by 16.
Solution.
• It follows from the Euclidean division theorem that every integer $a$ can be written uniquely in either the form $a = 2k$ or $a = 2k + 1$. Squaring this leads to $a^2 = 4k^2$ or $a^2 = 4(k^2 + k) + 1$, which is what we wanted to prove.
• Making use of the above result, we immediately obtain the statement for the (even) integers of the form $a = 2k$. For odd integers $a$, we arrived above at $a^2 = 4k(k + 1) + 1$; we get the proposition easily if we realize that $k(k + 1)$ is surely even.
• Again, we utilize the results of the previous parts, i.e. $a^2 = 4\ell$ or $a^2 = 8\ell + 1$. Squaring these equalities once again, we get $a^4 = (a^2)^2 = 16\ell^2$ for $a$ even, and $a^4 = (a^2)^2 = (8\ell + 1)^2 = 64\ell^2 + 16\ell + 1 = 16(4\ell^2 + \ell) + 1$ for $a$ odd. □
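The three statements of 11.A.2 are easy to test numerically. In the following Python sketch, the helper `residues` (a name of our own) collects all remainders of $a^{\text{power}}$ modulo a given modulus over a symmetric range of integers; the built-in three-argument `pow` returns a non-negative remainder for a positive modulus, also for negative $a$.

```python
def residues(power, modulus, bound=1000):
    """Sorted list of remainders of a**power mod modulus for |a| <= bound."""
    return sorted({pow(a, power, modulus) for a in range(-bound, bound + 1)})


print(residues(2, 4))   # [0, 1]
print(residues(2, 8))   # [0, 1, 4]
print(residues(4, 16))  # [0, 1]
```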
11.A.3. Prove that if integers $a, b \in \mathbb{Z}$ leave remainder 1 when divided by an $m \in \mathbb{N}$, then so does their product $ab$.
Solution. By the Euclidean division theorem, there are $s, t \in \mathbb{Z}$ such that $a = sm + 1$, $b = tm + 1$. Multiplying these equalities leads to the expression
$$ab = (sm + 1)(tm + 1) = (stm + s + t)m + 1,$$
where $stm + s + t$ is the quotient, so the remainder of $ab$ when divided by $m$ is equal to 1. □
It follows from the Euclidean division theorem that the greatest common divisor of any pair of integers $a, b$ exists, is unique, and can be computed efficiently by the Euclidean algorithm. At the same time, the coefficients in Bezout's identity can be determined this way (such integers $k, l$ that $ka + lb = (a, b)$). It can also be easily proved straight from
One of the most important properties of the integers, of which we will often take advantage, is the unique Euclidean division (division with remainder).
Theorem. For any integers $a \in \mathbb{Z}$, $m \in \mathbb{N}$, there exists a unique pair of integers $q \in \mathbb{Z}$, $r \in \{0, 1, \dots, m - 1\}$ satisfying $a = qm + r$.
Proof. First, we will prove the existence of the integers $q, r$. Let us fix a natural number $m$ and prove the statement for any $a \in \mathbb{Z}$. First, we assume that $a$ is non-negative and prove the existence of the integers $q, r$ by induction on $a$:
If $0 \le a < m$, we can choose $q = 0$, $r = a$, and the equality $a = qm + r$ holds trivially.
Now, suppose that $a \ge m$ and the existence of the integers $q, r$ has been proved for all $a' \in \{0, 1, 2, \dots, a - 1\}$. In particular, for $a' = a - m \ge 0$, there are $q', r'$ such that $a' = q'm + r'$ and $r' \in \{0, 1, \dots, m - 1\}$. Therefore, if we select $q = q' + 1$, $r = r'$, we obtain $a = a' + m = (q' + 1)m + r' = qm + r$, which is what we wanted to prove.
Now, if $a$ is negative, then we have proved that for the positive integer $-a$, there are $q' \in \mathbb{Z}$, $r' \in \{0, 1, \dots, m - 1\}$ such that $-a = q'm + r'$. If $r' = 0$, we set $r = 0$, $q = -q'$; otherwise (i.e., $r' > 0$), we put $r = m - r'$, $q = -q' - 1$. In either case, we get $a = q \cdot m + r$. Therefore, the integers $q, r$ with the wanted properties exist for every $a \in \mathbb{Z}$, $m \in \mathbb{N}$.
Now, we will prove the uniqueness. Suppose that there are integers $q_1, q_2 \in \mathbb{Z}$ and $r_1, r_2 \in \{0, 1, \dots, m - 1\}$ which satisfy $a = q_1m + r_1 = q_2m + r_2$. Simple rearrangement yields $r_1 - r_2 = (q_2 - q_1)m$, so $m \mid r_1 - r_2$. However, we have $0 \le r_1 < m$ and $0 \le r_2 < m$, whence it follows that $-m < r_1 - r_2 < m$. Therefore, $r_1 - r_2 = 0$, and $(q_2 - q_1)m = 0$, hence $q_1 = q_2$, $r_1 = r_2$. □
The integers $q$ and $r$ from the theorem are respectively called the quotient and remainder of the division of $a$ by $m$ with remainder. The choice of this terminology seems more intuitive if we rearrange the equality $a = mq + r$ into the form
$$\frac{a}{m} = q + \frac{r}{m}, \quad\text{where}\quad 0 \le \frac{r}{m} < 1.$$
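The existence part of the proof of the division theorem is constructive, and it can be followed step by step in code. The Python sketch below (the name `euclidean_division` is ours) mirrors the proof: repeated subtraction of $m$ for non-negative $a$, and the reduction to $-a$ for negative $a$. It is meant to illustrate the argument; in practice, Python's built-in `divmod(a, m)` returns the same pair for $m > 0$.

```python
def euclidean_division(a, m):
    """Return (q, r) with a == q*m + r and 0 <= r < m, for an integer m >= 1."""
    if m <= 0:
        raise ValueError("m must be a natural number")
    if a >= 0:
        # Induction step of the proof: pass from a to a - m, increasing q by 1.
        q, r = 0, a
        while r >= m:
            q, r = q + 1, r - m
        return q, r
    # Negative a: divide -a first, then adjust as in the proof.
    q1, r1 = euclidean_division(-a, m)
    if r1 == 0:
        return -q1, 0
    return -q1 - 1, m - r1


print(euclidean_division(-7, 3))  # (-3, 2), since -7 = (-3)*3 + 2
```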
11.1.2. Greatest common divisor. One of the most needed tools of computational number theory is the algorithm for computing the greatest common divisor. Since it is a relatively fast procedure, as we are going to show, it is used very often in modern algorithms as well.
Greatest common divisor
Consider integers $a, b$. An integer $m$ satisfying both $m \mid a$ and $m \mid b$ is called a common divisor of $a$ and $b$. A common divisor $m \ge 0$ of $a$ and $b$ which is divisible by every common divisor of the integers $a, b$ is called the greatest common divisor of $a$ and $b$, and it is denoted by $(a, b)$ (or $\gcd(a, b)$ for the sake of clarity).
The concept of the least common multiple is defined dually and denoted by $[a, b]$ (or $\operatorname{lcm}(a, b)$).
the properties of divisibility that integer linear combinations of integers a, b are exactly the multiples of their greatest common divisor.
11.A.4. Find the greatest common divisor of the integers a = 10175, b = 2277 and determine the corresponding coefficients in Bezout's identity.
Solution. We will invoke the Euclidean algorithm:
$$\begin{aligned}
10175 &= 4 \cdot 2277 + 1067,\\
2277 &= 2 \cdot 1067 + 143,\\
1067 &= 7 \cdot 143 + 66,\\
143 &= 2 \cdot 66 + 11,\\
66 &= 6 \cdot 11 + 0.
\end{aligned}$$
Therefore, 11 is the greatest common divisor. We will express this integer from the particular equalities, resulting in a linear combination of the integers a,b:
$$\begin{aligned}
11 &= 143 - 2 \cdot 66\\
&= 143 - 2 \cdot (1067 - 7 \cdot 143)\\
&= -2 \cdot 1067 + 15 \cdot 143\\
&= -2 \cdot 1067 + 15 \cdot (2277 - 2 \cdot 1067)\\
&= 15 \cdot 2277 - 32 \cdot 1067\\
&= 15 \cdot 2277 - 32 \cdot (10175 - 4 \cdot 2277)\\
&= -32 \cdot 10175 + 143 \cdot 2277.
\end{aligned}$$
The wanted expression in the form of Bezout's identity is thus
$$11 = (-32) \cdot 10175 + 143 \cdot 2277.$$
□
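The back-substitution carried out by hand in 11.A.4 can be automated by carrying the Bezout coefficients along with the remainders. The following Python sketch shows one common way to do this (the iterative extended Euclidean algorithm); the function name `extended_gcd` is our own, and non-negative inputs are assumed.

```python
def extended_gcd(a, b):
    """Return (g, k, l) with g = gcd(a, b) = k*a + l*b, for non-negative a, b."""
    # Invariant: old_r == old_k*a + old_l*b and r == k*a + l*b.
    old_r, r = a, b
    old_k, k = 1, 0
    old_l, l = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_k, k = k, old_k - q * k
        old_l, l = l, old_l - q * l
    return old_r, old_k, old_l


g, k, l = extended_gcd(10175, 2277)
print(g, k, l)  # 11 -32 143
```

The coefficients are not unique; this procedure returns the same pair as the manual back-substitution above.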
11.A.5. Find the greatest common divisor of the integers $2^{49} - 1$ and $2^{35} - 1$, and determine the corresponding coefficients in Bezout's identity.
Solution. Again, we use the Euclidean algorithm. We get:
$$\begin{aligned}
2^{49} - 1 &= 2^{14}(2^{35} - 1) + 2^{14} - 1,\\
2^{35} - 1 &= (2^{21} + 2^{7})(2^{14} - 1) + 2^{7} - 1,\\
2^{14} - 1 &= (2^{7} + 1)(2^{7} - 1).
\end{aligned}$$
The wanted greatest common divisor is thus $2^7 - 1 = 127$. Let us notice that $7 = (49, 35)$ - see also the following exercise 11.A.6. Reversing this procedure, we find the coefficients $k, \ell$ in Bezout's identity
It follows straight from the definition that for any $a, b \in \mathbb{Z}$, we have $(a, b) = (b, a)$, $[a, b] = [b, a]$, $(a, 1) = 1$, $[a, 1] = |a|$, $(a, 0) = |a|$, $[a, 0] = 0$.
So far, we have not shown that for every pair of integers $a, b$, their greatest common divisor and least common multiple exist. However, if we assume they exist, then they are unique because every pair of non-negative integers $k, l$ satisfies (directly from the definition) that $k \mid l$ and $l \mid k$ imply $k = l$. However, in the general case of divisibility in integral domains, the situation is more complicated - see 12.2.9. Even in the case of the so-called Euclidean domains,² which guarantee the existence of greatest common divisors, the result is determined uniquely only up to multiplication by a unit (an invertible element) - in the case of the integers, the result would be determined uniquely up to sign; the uniqueness was thus guaranteed by the condition that the greatest common divisor be non-negative.
Theorem (Euclidean algorithm). Let $a_1, a_2$ be positive integers. For every $n \ge 3$ such that $a_{n-1} \ne 0$, let $a_n$ denote the remainder of the division of $a_{n-2}$ by $a_{n-1}$. Then, after a finite number of steps, we arrive at $a_k = 0$, and it holds that $a_{k-1} = (a_1, a_2)$.
Proof. By the Euclidean division, we have $a_2 > a_3 > a_4 > \dots$. Since these are non-negative integers, this decreasing sequence cannot be infinite, so we get $a_k = 0$ after a finite number of steps, where $a_{k-1} \ne 0$. From the definition of the integers $a_n$, it follows that there are integers $q_1, q_2, \dots, q_{k-2}$ such that
$$\begin{aligned}
a_1 &= q_1 \cdot a_2 + a_3,\\
a_2 &= q_2 \cdot a_3 + a_4,\\
&\ \,\vdots\\
a_{k-3} &= q_{k-3} \cdot a_{k-2} + a_{k-1},\\
a_{k-2} &= q_{k-2} \cdot a_{k-1}.
\end{aligned}$$
It follows from the last equality that $a_{k-1} \mid a_{k-2}$. Further, $a_{k-1} \mid a_{k-3}$, ..., $a_{k-1} \mid a_2$, $a_{k-1} \mid a_1$. Therefore, $a_{k-1}$ is a common divisor of the integers $a_1, a_2$.
On the other hand, any common divisor of the given integers $a_1, a_2$ divides the integer $a_3 = a_1 - q_1a_2$ as well, hence it also divides $a_4 = a_2 - q_2a_3$, $a_5$, ..., and especially $a_{k-1} = a_{k-3} - q_{k-3}a_{k-2}$. We have thus proved that $a_{k-1}$ is the greatest common divisor of the integers $a_1, a_2$. □
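The sequence $a_1, a_2, a_3, \dots$ from the theorem can be generated literally as stated. The Python sketch below (the name `remainder_sequence` is ours) returns the whole sequence up to and including the first zero, so that the last non-zero entry is the greatest common divisor; `math.gcd` from the standard library is used only for comparison.

```python
from math import gcd


def remainder_sequence(a1, a2):
    """Return [a_1, a_2, ..., a_k] with a_n = a_(n-2) mod a_(n-1) and a_k = 0."""
    seq = [a1, a2]
    while seq[-1] != 0:
        seq.append(seq[-2] % seq[-1])
    return seq


seq = remainder_sequence(10175, 2277)
print(seq)                      # [10175, 2277, 1067, 143, 66, 11, 0]
print(seq[-2], gcd(10175, 2277))  # 11 11
```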
It follows from the previous statement and the fact that $(a, b) = (a, -b) = (-a, b) = (-a, -b)$ holds for any $a, b \in \mathbb{Z}$ that every pair of integers has a greatest common divisor.
Moreover, the Euclidean algorithm provides another interesting statement, which is often used.
11.1.3. Theorem (Bezout). For every pair of integers $a, b$, there exist integers $k, l$ such that $(a, b) = ka + lb$.
² Wikipedia, Euclidean domain, http://en.wikipedia.org/wiki/Euclidean_domain (as of July 29, 2017).
$2^7 - 1 = k(2^{49} - 1) + \ell(2^{35} - 1)$:
$$\begin{aligned}
2^7 - 1 &= (2^{35} - 1) - (2^{21} + 2^7)(2^{14} - 1)\\
&= (2^{35} - 1) - (2^{21} + 2^7)\big((2^{49} - 1) - 2^{14}(2^{35} - 1)\big)\\
&= (2^{35} + 2^{21} + 1)(2^{35} - 1) - (2^{21} + 2^7)(2^{49} - 1).
\end{aligned}$$
Therefore, $k = -(2^{21} + 2^7)$, $\ell = 2^{35} + 2^{21} + 1$. Let us bear in mind that these coefficients are never determined uniquely. □
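Since Python integers have arbitrary precision, the coefficients found in 11.A.5 can be verified directly. This is only a numerical confirmation of the computation above; the variable names are ours.

```python
from math import gcd

a, b = 2**49 - 1, 2**35 - 1
k = -(2**21 + 2**7)
l = 2**35 + 2**21 + 1

print(gcd(a, b))      # 127
print(k * a + l * b)  # 127
```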
11.A.6. Now, let us try to generalize the result of the previous exercise, i.e., prove that for any $a, m, n \in \mathbb{N}$, $a \ne 1$, we have $(a^m - 1, a^n - 1) = a^{(m,n)} - 1$.
Solution. This statement follows easily from the fact that any pair of natural numbers $k, \ell$ satisfies $a^k - 1 \mid a^\ell - 1$ if and only if $k \mid \ell$. This can be proved by dividing the integer $\ell$ by the integer $k$ with remainder, i.e., we set $\ell = kq + r$, where $q, r \in \mathbb{N}_0$, $r < k$, and consider that
$$a^{kq+r} - 1 = (a^k - 1)\big(a^{k(q-1)+r} + a^{k(q-2)+r} + \cdots + a^r\big) + a^r - 1$$
is the division of the integer $a^{kq+r} - 1$ by the integer $a^k - 1$ with remainder (clearly, we have $a^r - 1 < a^k - 1$). Hence we can easily see that the remainder $r$ is zero if and only if the remainder $a^r - 1$ is zero, which is what we wanted to prove.
□
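The identity of 11.A.6 lends itself to a quick exhaustive check for small parameters. The Python sketch below (the function name `identity_holds` and the ranges are our own choices) compares both sides using `math.gcd`.

```python
from math import gcd


def identity_holds(a_max, exp_max):
    """Check gcd(a^m - 1, a^n - 1) == a^gcd(m, n) - 1 for 2 <= a <= a_max
    and 1 <= m, n <= exp_max."""
    return all(
        gcd(a**m - 1, a**n - 1) == a ** gcd(m, n) - 1
        for a in range(2, a_max + 1)
        for m in range(1, exp_max + 1)
        for n in range(1, exp_max + 1)
    )


print(identity_holds(10, 30))  # True
```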
11.A.7. Prove that for all $n \in \mathbb{N}$, $25 \mid 4^{2n+1} - 10n - 4$.
Solution. This statement can be proved in several ways (the easiest one is in terms of congruences, which will be introduced a bit later). Here, we will prove it by induction and then as a consequence of the binomial theorem.
The proposition is clearly true for $n = 0$ (even though the problem does not ask about the situation for $n = 0$, we can surely prove the desired property for $n \in \mathbb{N}_0$, thereby simplifying the first step of the induction). As for the second step: assuming $25 \mid 4^{2n-1} - 10(n - 1) - 4$, we also have
$$25 \mid 16\big(4^{2n-1} - 10(n - 1) - 4\big) = 4^{2n+1} - 10n - 4 - 150n + 100,$$
whence it easily follows that the desired proposition $25 \mid 4^{2n+1} - 10n - 4$ holds.
The second proof uses the binomial theorem. By that,
$$(5 - 1)^{2n+1} = \sum_{k=0}^{2n+1}\binom{2n+1}{k}5^{2n+1-k}(-1)^k,$$
where all of the terms of the sum, except the last two, are clearly multiples of 25, i.e., the only part of the sum which is not divisible by 25 is $\binom{2n+1}{2n}5^1(-1)^{2n} + 5^0(-1)^{2n+1} = 10n + 4$. In other words, $4^{2n+1}$ leaves the same remainder when divided
Proof. Surely it suffices to prove the theorem for $a, b \in \mathbb{N}$. Let us notice that if it is possible to express integers $r, s$ in the form $r = r_1a + r_2b$, $s = s_1a + s_2b$, where $r_1, r_2, s_1, s_2 \in \mathbb{Z}$, then we can also express
$$r + s = (r_1 + s_1)a + (r_2 + s_2)b$$
in this way, as well as
$$c \cdot r = (c \cdot r_1)a + (c \cdot r_2)b$$
for any $c \in \mathbb{Z}$, and thus also any integer linear combination of the numbers $r$ and $s$ arising in the process of the Euclidean algorithm. It follows from the Euclidean algorithm (for $a_1 = a$, $a_2 = b$) that we can also thus express $a_3 = a_1 - q_1a_2$, $a_4 = a_2 - q_2a_3$, hence the integer $a_{k-1} = a_{k-3} - q_{k-3}a_{k-2}$, too, which is $(a_1, a_2)$.
Let us emphasize that the wanted numbers $k, l$ are not determined uniquely. □
Remark. The computation of the greatest common divisor using the Euclidean algorithm is quite fast even for relatively large integers. In our example, we try this out with integers $A, B$, each of which is the product of two 101-digit primes. Let us notice that the computation of the greatest common divisor of even such huge integers took an immeasurably small amount of time. A noticeable amount of time is taken by the computation of the greatest common divisor in the second example, where the input consists of two integers having more than a million digits. An example in the system SAGE:
    sage: p = next_prime(5*10^100)
    sage: q = next_prime(3*10^100)
    sage: r = next_prime(10^100)
    sage: A = p*q; B = q*r
    sage: time G = gcd(A, B); print G
    Time: CPU 0.00 s, Wall: 0.00 s
    300000000000000000000000000000000000\
    000000000000000000000000000000000000\
    00000000000000000000000000223
    sage: time G = gcd(A^10000+1, B^10000+1)
    Time: CPU 2.47 s, Wall: 2.48 s
The Euclidean algorithm and Bezout's identity are the fundamental results of elementary number theory and form one of the pillars of the algorithms used in algebra and number theory.
11.1.4. Least common multiple. We have ignored the properties of the least common multiple so far. However, we will see that thanks to the following proposition, they can be derived from the properties of the greatest common divisor.
Lemma. For every pair of integers $a, b$, their least common multiple $[a, b]$ exists and it holds that $(a, b) \cdot [a, b] = |a \cdot b|$.
by 25 as the integer $10n + 4$, which is equivalent to what we were to prove. □
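The divisibility claim of 11.A.7 can also be confirmed numerically for many values of $n$. The following Python sketch is merely a check of the statement proved above; the function name `divisible_by_25` is ours.

```python
def divisible_by_25(n):
    """True iff 25 divides 4^(2n+1) - 10n - 4."""
    return (4 ** (2 * n + 1) - 10 * n - 4) % 25 == 0


print(all(divisible_by_25(n) for n in range(0, 500)))  # True
```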
Prime numbers. The theoretical part contains Euclid's proof of the infinitude of primes and deals in detail with the distribution of primes in the set of natural numbers (in some cases, however, we were forced to leave the mentioned theorems unproved). Now, we will give several exercises on this topic.
11.A.8. For any natural number $n \geq 3$, there is at least one prime between the integers $n$ and $n!$.
Solution. Let $p$ denote an arbitrary prime dividing the integer $n! - 1$ (by the Fundamental theorem of arithmetic (11.2.2), there is such a prime since $n! - 1 > 1$). If we had $p \leq n$, then $p$ would have to divide $n!$ as well, so it could not divide $n! - 1$. Therefore, $n < p$. Since $p \mid (n! - 1)$, we have $p \leq n! - 1$, hence $p < n!$. Our prime $p$ thus satisfies the conditions of the problem. □
The result of this exercise also implies the infinitude of primes (it suffices to consider the sequence $a_0 = 3$, $a_{n+1} = a_n!$ for $n \in \mathbb{N}$). However, this statement is very weak (compared to reality) since the constructed sequence contains only a "tiny" subset of the primes.
On the other hand, we are able to construct an arbitrarily long sequence of consecutive composite numbers, as shown by the following exercise.
11.A.9. Prove that for any natural number $n$, there exist $n$ consecutive natural numbers none of which is prime.
Solution. Let us examine the integers $(n + 1)! + 2, (n + 1)! + 3, \dots, (n + 1)! + (n + 1)$. For any $k \in \{2, 3, \dots, n + 1\}$, we have $k \mid (n + 1)!$, so $k \mid (n + 1)! + k$ as well; since $(n + 1)! + k > k > 1$, the integer $(n + 1)! + k$ cannot be a prime. Therefore, there are no primes among these $n$ natural numbers. □
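The construction from 11.A.9 can be tried out numerically. The following Python sketch builds the run $(n+1)! + 2, \dots, (n+1)! + (n+1)$ and checks by trial division that it contains no prime. The helper names `is_prime` and `composite_run` are our own, and trial division is used only because it is simple; it is far too slow for large $n$.

```python
from math import factorial


def is_prime(m):
    """Naive primality test by trial division (adequate for small inputs)."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def composite_run(n):
    """Return the n consecutive integers (n+1)! + 2, ..., (n+1)! + (n+1)."""
    base = factorial(n + 1)
    return [base + k for k in range(2, n + 2)]


run = composite_run(5)
print(run, any(is_prime(x) for x in run))
```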
11.A.10.
i) Prove that if natural numbers $m, n$ are coprime, then so are $m^2 + mn + n^2$ and $m^2 - mn + n^2$.
ii) Prove that if odd natural numbers $m, n$ are coprime, then so are $m + 2n$ and $m^2 + 4n^2$.
Proof. The statement is trivially true if either of the integers $a, b$ is zero. Furthermore, we can assume that both these (non-zero from now on) integers are positive since their signs have no effect on the formula in question. We are going to show that $q = a \cdot b/(a, b)$ is the least common multiple of the integers $a, b$, which will finish the proof.
Since $(a, b)$ is a common divisor of $a, b$, both $a/(a, b)$ and $b/(a, b)$ are integers, hence
$$q = \frac{ab}{(a, b)} = \frac{a}{(a, b)} \cdot b = \frac{b}{(a, b)} \cdot a$$
is a common multiple of $a, b$. By Bezout's identity, there are integers $k, l$ such that $(a, b) = ka + lb$. Let us suppose that $n \in \mathbb{Z}$ is an arbitrary common multiple of the integers $a, b$. We want to show that it is divisible by $q$. We thus have $n/a, n/b \in \mathbb{Z}$, hence the number
$$\frac{n}{b} \cdot k + \frac{n}{a} \cdot l = \frac{n(ka + lb)}{ab} = \frac{n(a, b)}{ab} = \frac{n}{q}$$
is an integer as well. However, this means that $q \mid n$, which is what we wanted to prove. □
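The lemma gives a practical recipe: the least common multiple can be computed from the greatest common divisor. A minimal Python sketch follows; the function name `lcm` is our own (recent Python versions also ship `math.lcm`), and the convention $[a, 0] = 0$ is an assumption matching the trivial case mentioned in the proof.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple via (a, b) * [a, b] = |a * b|."""
    if a == 0 or b == 0:
        return 0                      # trivial case of the lemma
    return abs(a * b) // gcd(a, b)    # exact division, since (a, b) divides a*b


print(lcm(4, 6), lcm(-4, 6), gcd(4, 6) * lcm(4, 6))
```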
11.1.5. Coprime integers. Analogously to the case of two integers, we can also define the greatest common divisor and least common multiple of more than two integers, and it can be easily proved that
$$(a_1, \dots, a_n) = ((a_1, \dots, a_{n-1}), a_n), \qquad [a_1, \dots, a_n] = [[a_1, \dots, a_{n-1}], a_n].$$
Integers $a_1, a_2, \dots, a_n \in \mathbb{Z}$ are said to be coprime (also relatively prime) iff $(a_1, a_2, \dots, a_n) = 1$. They are said to be pairwise coprime (pairwise relatively prime) iff we have $(a_i, a_j) = 1$ for every pair of indices $i, j$ satisfying $1 \leq i < j \leq n$.
Remark. Let us realize that the concepts coprime and pairwise coprime are different. For example, we have $(6, 10, 15) = 1$; however, no two of the three integers 6, 10, 15 are coprime.
Lemma. For any natural numbers $a, b, c$ we have
(1) $(ac, bc) = (a, b) \cdot c$;
(2) if $a \mid bc$ and $(a, b) = 1$, then $a \mid c$;
(3) $d = (a, b)$ if and only if there are $k, l \in \mathbb{N}$ such that $a = dk$, $b = dl$, and $(k, l) = 1$.
Proof. (1) Since $(a, b)$ is a common divisor of the integers $a, b$, $(a, b) \cdot c$ is a common divisor of the integers $ac, bc$, hence $(a, b) \cdot c \mid (ac, bc)$. From Bezout's identity, we obtain $k, l \in \mathbb{Z}$ such that $(a, b) = ka + lb$. Since $(ac, bc)$ is a common divisor of the integers $ac, bc$, it divides the integer $kac + lbc = (a, b) \cdot c$ as well. We have thus proved that $(a, b) \cdot c$ and $(ac, bc)$ are natural numbers which divide each other, hence they are equal.
(2) Let us suppose that $(a, b) = 1$ and $a \mid bc$. From Bezout's identity again, we get $k, l \in \mathbb{Z}$ such that $ka + lb = 1$,
Solution.
i) To reach a contradiction, suppose that there is a prime $p$ which divides both of the integers $m^2 + mn + n^2$ and $m^2 - mn + n^2$. Then, it divides their difference $2mn$ as well, whence $p = 2$ or $p$ divides one of the integers $m, n$. If $p = 2$, then $m^2 + mn + n^2$ is even, so the integers $m$ and $n$ must be even as well, which contradicts that they are coprime. If $p$ divides $m$ as well as $m^2 + mn + n^2$, then it also divides $n^2$, whence, by Euclid's theorem (11.2.1), it divides $n$ as well. However, this also contradicts that $m, n$ are coprime. The case of $p \mid n$ is analogous.
ii) Just like in the above exercise, let us suppose that there is a prime $p$ which divides $m + 2n$ as well as $m^2 + 4n^2$. Then, it must also divide $(m^2 + 4n^2) - (m + 2n)(m - 2n) = 8n^2$, and since $p \neq 2$ (if $m + 2n$ were even, then so would $m$ be), we necessarily have $p \mid n$. However, since $p$ divides $m + 2n$ as well, we get $p \mid m$, which is a contradiction. □
B. Congruences
In this paragraph, we will see in practice how wielding basic operations with congruences can streamline our reasoning about various problems. We would be able to solve them without congruences, using only the basic properties of divisibility. However, with the help of congruences, our proofs will often be much shorter and clearer.
11.B.1. Show that for any $a, b \in \mathbb{Z}$, $m \in \mathbb{N}$, the following conditions are equivalent:
i) $a \equiv b \pmod{m}$,
ii) $a = b + mt$ for an appropriate $t \in \mathbb{Z}$,
iii) $m \mid a - b$.
Solution. (i) $\Rightarrow$ (iii) If $a = q_1 m + r$, $b = q_2 m + r$, then $a - b = (q_1 - q_2)m$.
(iii) $\Rightarrow$ (ii) If $m \mid a - b$, then there is a $t \in \mathbb{Z}$ such that $m \cdot t = a - b$, i.e., $a = b + mt$.
(ii) $\Rightarrow$ (i) If $a = b + mt$, then, expressing $b = mq + r$, it follows that $a = m(q + t) + r$. Therefore, $a$ and $b$ share the same remainder $r$ upon division by $m$, i.e., $a \equiv b \pmod{m}$.
□
11.B.2. Fundamental properties of congruences. It follows directly from the definition that the congruence modulo $m$ is an equivalence relation.
whence it follows that $c = c(ka + lb) = kca + lbc$. Since $a \mid bc$, it follows that $c$ as well is a multiple of $a$.
(3) Let $d = (a, b)$, then there are $q_1, q_2 \in \mathbb{N}$ such that $a = dq_1$, $b = dq_2$. Then, by part (1), we have $d = (a, b) = (dq_1, dq_2) = d \cdot (q_1, q_2)$, so $(q_1, q_2) = 1$. On the other hand, if $a = dq_1$, $b = dq_2$, and $(q_1, q_2) = 1$, then $(a, b) = (dq_1, dq_2) = d(q_1, q_2) = d \cdot 1 = d$ (again invoking part (1) of this lemma). □
2. Primes
The concept of a prime is one of the most important in elementary number theory. Its importance is given mainly by the unique factorization theorem, which is a strong as well as efficient tool for solving miscellaneous problems from number theory.
Prime
Every natural number $n \geq 2$ has at least two positive divisors: 1 and itself. If there are no other divisors, it is called a prime (number). Otherwise (i.e., if there exist other divisors), we talk about a composite (number).
In the subsequent paragraphs, we will usually denote primes by the letter $p$. The first few primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, ... (in particular, the number 1 is considered to be neither prime nor composite as it is a unit in the ring of the integers). As we will prove shortly, there are infinitely many primes. However, we have rather limited computational resources when it comes to determining whether a given number is prime: The number $2^{77\,232\,917} - 1$, which is the greatest known prime as of 2017, has only 23 249 425 digits, so its decimal representation would fit into many a prehistoric data storage device. Printing it as a book would, however, (assuming 60 rows on a page and 80 digits in a row) take 4 844 pages.
Now, let us introduce a theorem which gives a necessary and sufficient condition for being prime and is thus a fundamental ingredient for the proof of the unique factorization theorem.
11.2.1. Theorem (Euclid's on primes). An integer $p \geq 2$ is a prime if and only if the following holds: for every pair of integers $a, b$, $p \mid ab$ implies that either $p \mid a$ or $p \mid b$ (or both).
Proof. "$\Rightarrow$" Suppose that $p$ is a prime and $p \mid ab$, where $a, b \in \mathbb{Z}$. Since $(p, a)$ is a positive divisor of $p$, we have either $(p, a) = p$ or $(p, a) = 1$. In the former case, we get $p \mid a$; in the latter, $p \mid b$ by part (2) of the previous lemma.
"$\Leftarrow$" If $p$ is not a prime, it has a positive divisor distinct from both 1 and $p$. Let us denote it by $a$. However, then we have $b = p/a \in \mathbb{N}$ and $p = ab$, hence $1 < a < p$, $1 < b < p$. We have thus found integers $a, b$ such that $p \mid ab$, while $p$ divides neither $a$ nor $b$. □
Now, we will prove further properties of congruences.
Properties of congruences
i) Congruences with respect to the same modulus can be added. An arbitrary multiple of the modulus can be added to either side.
ii) Congruences with respect to the same modulus can be multiplied.
iii) Both sides of a congruence can be raised to the power of the same natural number.
iv) Both sides of a congruence can be divided by their common divisor provided it is coprime to the modulus. Both sides of a congruence together with the modulus can be divided by a positive divisor common to all three of them.
v) If a congruence is valid with respect to a modulus $m$, it is also valid with respect to any modulus $d$ which divides $m$.
vi) If either side of a congruence and the modulus are divisible by an integer, then this integer must divide the other side of the congruence as well.
vii) If a congruence is valid with respect to moduli $m_1, \dots, m_k$, it is also valid with respect to their least common multiple $[m_1, \dots, m_k]$.
Solution.
i) If $a \equiv b \pmod{m}$ and $c \equiv d \pmod{m}$, by the previous lemma, there are integers $s, t$ such that $a = b + ms$, $c = d + mt$. However, then we have $a + c = b + d + m(s + t)$, and, by the lemma again, $a + c \equiv b + d \pmod{m}$.
Adding a congruence $a \equiv b \pmod{m}$ to $mk \equiv 0 \pmod{m}$, which is clearly valid, leads to $a + mk \equiv b \pmod{m}$.
ii) If $a \equiv b \pmod{m}$ and $c \equiv d \pmod{m}$, there are integers $s, t$ such that $a = b + ms$, $c = d + mt$. Then,
$$ac = (b + ms)(d + mt) = bd + m(bt + ds + mst),$$
whence we get $ac \equiv bd \pmod{m}$.
iii) Let $a \equiv b \pmod{m}$ and $n$ be a natural number. Since
$$a^n - b^n = (a - b)(a^{n-1} + a^{n-2}b + \dots + b^{n-1}),$$
it follows that $a^n \equiv b^n \pmod{m}$ as well.
iv) Suppose that $a \equiv b \pmod{m}$, $a = a_1 \cdot d$, $b = b_1 \cdot d$, and $(m, d) = 1$. By the lemma, the difference $a - b = (a_1 - b_1) \cdot d$ is divisible by the integer $m$, and since
11.2.2. Fundamental theorem of arithmetic.
Theorem. Every natural number $n$ can be expressed as the product of primes, and this expression is unique up to the order of the factors. (If $n$ is a prime, then it is the "product" of a single prime $n$; if $n = 1$, it is the empty product (the product of the empty set of primes).3)
Remark. As mentioned in part 12.2.5, it is possible to define divisibility in a similar way in an arbitrary integral domain. In some integral domains (e.g. $\mathbb{Q}$), there are no elements with the prime property (we call them irreducible). Other integral domains have such elements, yet they do not satisfy the unique factorization theorem. It is quite similar with the generalization of the aforementioned Euclid's theorem on primes: the elements which satisfy $p \mid ab \Rightarrow (p \mid a \text{ or } p \mid b)$ are always irreducible, but the converse is not generally true. Let us mention at least one example of an ambiguous factorization; in $\mathbb{Z}(\sqrt{-5})$, we have4 $6 = 2 \cdot 3 = (1 + \sqrt{-5}) \cdot (1 - \sqrt{-5})$. However, it needs a longer discussion to verify that all of the mentioned factors are really irreducible in $\mathbb{Z}(\sqrt{-5})$.
Proof of the Fundamental theorem of arithmetic. First, we prove by complete induction on $n$ that every natural number $n$ can be expressed as a product of primes. We have already discussed the validity of this statement for $n = 1$.
Now, let us suppose that $n \geq 2$ and we have already proved that all natural numbers less than $n$ can be factored to primes. If $n$ is a prime, the statement is clearly true. If $n$ is not a prime, then it has a divisor $d$, $1 < d < n$. Denoting $e = n/d$, we also have $1 < e < n$. From the induction hypothesis, we get that both $d$ and $e$ can be expressed as products of primes, so their product $d \cdot e = n$ can also be expressed in this way.
Now, let us have an equality of products $p_1 \cdot p_2 \cdots p_s = q_1 \cdot q_2 \cdots q_t$, where $p_i, q_j$ are primes for all $i \in \{1, \dots, s\}$, $j \in \{1, \dots, t\}$, and, further, let $p_1 \leq p_2 \leq \dots \leq p_s$, $q_1 \leq q_2 \leq \dots \leq q_t$ and $1 \leq s \leq t$. We will prove by induction on $s$ that $s = t$ and $p_1 = q_1, \dots, p_s = q_s$.
If $s = 1$, then $p_1 = q_1 \cdots q_t$ is a prime. If we had $t > 1$, the integer $p_1$ would have a divisor $q_1$ such that $1 < q_1 < p_1$ (since $q_2 q_3 \cdots q_t > 1$), which is impossible. Therefore, we must have $t = 1$ and $p_1 = q_1$.
Now, let us suppose that $s \geq 2$ and the proposition holds for $s - 1$. It follows from the equality $p_1 \cdot p_2 \cdots p_s = q_1 \cdot q_2 \cdots q_t$ that $p_s$ divides the product $q_1 \cdots q_t$, which is, by Euclid's theorem, possible only if $p_s$ divides $q_j$ for an appropriate $j \in \{1, 2, \dots, t\}$. Since $q_j$ is a prime, it follows that $p_s = q_j$ (since $p_s > 1$). It can be proved analogously that
3Using the terminology of ring theory, it is a unit of the ring of the integers, which is usually assumed to be the product of the empty set of elements of the ring.
4The symbol $\mathbb{Z}(\sqrt{-5})$ denotes the integers extended by a root of the equation $x^2 = -5$, which is defined similarly to how we obtained the complex numbers by adjoining the number $\sqrt{-1}$ to the reals.
$(m, d) = 1$, the integer $a_1 - b_1$ is also divisible by $m$ (by lemma 11.1.5). Hence it follows that $a_1 \equiv b_1 \pmod{m}$. Further, if $ad \equiv bd \pmod{md}$, i.e., $md \mid ad - bd$, we get directly from the definition of divisibility that $m \mid a - b$.
v) If $a \equiv b \pmod{m}$, then $a - b$ is a multiple of $m$, and hence a multiple of any divisor $d$ of $m$, so $a \equiv b \pmod{d}$.
vi) Suppose that $a \equiv b \pmod{m}$, $b = b_1 d$, $m = m_1 d$. Then there is an integer $t$ such that $a = b + mt = b_1 d + m_1 d t = (b_1 + m_1 t)d$, hence $d \mid a$.
vii) If $a \equiv b \pmod{m_1}$, $a \equiv b \pmod{m_2}$, ..., $a \equiv b \pmod{m_k}$, then the difference $a - b$ is a common multiple of the integers $m_1, m_2, \dots, m_k$, and so it is divisible by their least common multiple $[m_1, m_2, \dots, m_k]$, whence it follows that $a \equiv b \pmod{[m_1, \dots, m_k]}$.
□
Remark. We have already used some properties of congruences without explicitly mentioning it: the result of the exercise 11.A.3 can now be reformulated as "if $a \equiv 1 \pmod{m}$, $b \equiv 1 \pmod{m}$, then also $ab \equiv 1 \pmod{m}$", which is a special case of item (ii) of the previous theorem.
It is not by chance because any statement which uses congruences can be reformulated in terms of divisibility. The usefulness of congruences thus lies not in the strength to solve more problems than without them, but rather in being a very convenient way of writing which simplifies both expressions and reasonings.
11.B.3.
i) Find the remainder of the integer $7^{30}$ when divided by 50.
ii) Find the last two digits of the decimal representation of the integer $7^{30}$.
Solution.
i) Since $7^2 = 49 \equiv -1 \pmod{50}$, using the properties of congruences mentioned above, we get $7^{30} \equiv (-1)^{15} = -1 \pmod{50}$, so the remainder of $7^{30}$ upon division by 50 is 49.
ii) Our task is actually to determine the remainder of $7^{30}$ upon division by 100. We know from the above that the integer $7^{30}$ leaves remainder 49 when divided by 50, so the last two digits are either 49 or 99. In particular, we
$q_t = p_i$ for an appropriate $i \in \{1, 2, \dots, s\}$. Hence,
$$q_t = p_i \leq p_s = q_j \leq q_t,$$
so $p_s = q_t$. Dividing both sides of the original equality by this integer, we obtain $p_1 \cdot p_2 \cdots p_{s-1} = q_1 \cdot q_2 \cdots q_{t-1}$, and from the induction hypothesis, we get $s - 1 = t - 1$, $p_1 = q_1, \dots, p_{s-1} = q_{s-1}$. Altogether, we have $s = t$ and $p_1 = q_1, \dots, p_{s-1} = q_{s-1}, p_s = q_s$. This proves the uniqueness, and thus the entire theorem as well. □
11.2.3. Practical notes. As we will show, it is very complicated to decide for certain whether a given large integer is a prime (on the other hand, for most composite numbers, it is really easy to prove that they are indeed composite; see part 11.5.4). Nevertheless, Indian mathematicians5 managed to prove in 2002 that there is an algorithm running in polynomial time with respect to the input (i.e., the number of digits of the integer in question) which decides whether the integer is a prime.
We are unable to produce such an algorithm for prime factorization (and it is widely believed that it is impossible although no one has been able to prove this so far). The fastest generally applicable factorization algorithm, the so-called general number field sieve, runs in sub-exponential time.
In 1994, Peter Shor invented an algorithm which factors an integer $N$ in cubic time (i.e., it runs in $O(\log^3 N)$) on a quantum computer. However, this algorithm requires computers with a sufficient number of quantum bits. We can see how difficult this is from the fact that in 2001, IBM managed to use a quantum computer to factor the integer 15, and in 2012, another record was achieved by factoring the integer 143 (in fact using another approach, the so-called adiabatic quantum computation).
We can find more evidence about the difficulty of this problem in the call made in 1991 by RSA Security.6 If anyone manages to factor the integers labeled by the number of digits they have as RSA-100, RSA-704, RSA-768, RSA-2048, they could receive respectively 1,000,..., 30,000, 50,000, 200,000 dollars (the integer RSA-100 was factored by Arjen Lenstra that very year; the integer RSA-704 was factored in 2012; many others have not been factored yet).
Thanks to the unique factorization theorem, we are able to (provided we know this factorization) easily answer the following questions concerning the number or sum of the divisors of a given integer. Just that easily, we can get the (intuitively well-known) procedure of computation of the greatest common divisor of two integers from their prime factorizations.
5M. Agrawal, N. Kayal, N. Saxena. PRIMES is in P. Annals of Mathematics 160 (2): 781-793. 2004.
6See http://www.rsasecurity.com/rsalabs/node.asp?id=2093.
already know that $7^{30} \equiv -1 \pmod{25}$, and we can easily calculate that $7^{30} \equiv (-1)^{30} = 1 \pmod{4}$. Since $(4, 25) = 1$, the wanted pair of digits is 49 (it leaves the desired remainder upon division by both 25 and 4). □
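Both parts of 11.B.3 can be double-checked with modular exponentiation. The sketch below relies only on Python's built-in three-argument `pow`, which computes a power modulo a given integer; the variable names are our own.

```python
# Remainder of 7^30 upon division by 50 (part i) and by 100 (part ii).
remainder_mod_50 = pow(7, 30, 50)
last_two_digits = pow(7, 30, 100)

print(remainder_mod_50, last_two_digits)
```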
11.B.4. Prove that for any $n \in \mathbb{N}$, the integer $37^{n+2} + 16^{n+1} + 23^n$ is divisible by 7.
Solution. We have $37 \equiv 16 \equiv 23 \equiv 2 \pmod{7}$, so by the basic properties of congruences,
$$37^{n+2} + 16^{n+1} + 23^n \equiv 2^{n+2} + 2^{n+1} + 2^n = 2^n(4 + 2 + 1) \equiv 0 \pmod{7}.$$ □
11.B.5. Prove that the integer $n = (835^5 + 6)^{18} - 1$ is divisible by 112.
Solution. We factor $112 = 7 \cdot 16$. Since $(7, 16) = 1$, it suffices to show that $7 \mid n$ and $16 \mid n$. We have $835 \equiv 2 \pmod{7}$, so
$$n \equiv (2^5 + 6)^{18} - 1 = 38^{18} - 1 \equiv 3^{18} - 1 = 27^6 - 1 \equiv (-1)^6 - 1 = 0 \pmod{7},$$
hence $7 \mid n$. Similarly, $835 \equiv 3 \pmod{16}$, so
$$n \equiv (3^5 + 6)^{18} - 1 = (3 \cdot 81 + 6)^{18} - 1 \equiv (3 \cdot 1 + 6)^{18} - 1 = 9^{18} - 1 = 81^9 - 1 \equiv 1^9 - 1 = 0 \pmod{16},$$
hence $16 \mid n$. Altogether, $112 \mid n$, which was to be proved.
□
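Since Python works with arbitrarily large integers, the claims of 11.B.4 and 11.B.5 can also be confirmed by brute force. This is only a sanity check, not a replacement for the proofs; in particular, 11.B.4 is tested merely on a finite range of exponents, chosen arbitrarily by us.

```python
# 11.B.5: the integer n = (835^5 + 6)^18 - 1 should be divisible by 112.
n = (835 ** 5 + 6) ** 18 - 1
divisible_by_112 = (n % 112 == 0)

# 11.B.4: test 37^(k+2) + 16^(k+1) + 23^k for k = 1, ..., 100 only.
all_divisible_by_7 = all(
    (37 ** (k + 2) + 16 ** (k + 1) + 23 ** k) % 7 == 0 for k in range(1, 101)
)

print(divisible_by_112, all_divisible_by_7)
```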
11.B.6. Prove that the following relations hold for any prime $p$:
i) If $k \in \{1, \dots, p - 1\}$, then $p \mid \binom{p}{k}$.
ii) If $a, b \in \mathbb{Z}$, then $a^p + b^p \equiv (a + b)^p \pmod{p}$.
Solution.
i) Since the binomial coefficient satisfies
$$\binom{p}{k} = \frac{p(p - 1) \cdots (p - k + 1)}{k!},$$
which is an integer, we know that $k!$ divides the product $p(p - 1) \cdots (p - k + 1)$. However, since the integer $k!$ is coprime to the prime $p$, we thus get that $k! \mid (p - 1) \cdots (p - k + 1)$, whence it follows that $p \mid \binom{p}{k}$.
Proposition. Every positive divisor of an integer $a = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$ is of the form $p_1^{\beta_1} \cdots p_k^{\beta_k}$, where $\beta_1, \dots, \beta_k \in \mathbb{N}_0$ and $\beta_1 \leq \alpha_1$, $\beta_2 \leq \alpha_2$, ..., $\beta_k \leq \alpha_k$. Therefore, the integer $a$ has exactly
$$\tau(a) = (\alpha_1 + 1)(\alpha_2 + 1) \cdots (\alpha_k + 1)$$
positive divisors, which sum up to
$$\sigma(a) = \frac{p_1^{\alpha_1 + 1} - 1}{p_1 - 1} \cdots \frac{p_k^{\alpha_k + 1} - 1}{p_k - 1}.$$
Let $p_1, \dots, p_k$ be pairwise distinct primes and $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k$ be non-negative integers. Denoting $\gamma_i = \min\{\alpha_i, \beta_i\}$, $\delta_i = \max\{\alpha_i, \beta_i\}$ for every $i = 1, 2, \dots, k$, we have
$$(p_1^{\alpha_1} \cdots p_k^{\alpha_k},\; p_1^{\beta_1} \cdots p_k^{\beta_k}) = p_1^{\gamma_1} \cdots p_k^{\gamma_k},$$
$$[p_1^{\alpha_1} \cdots p_k^{\alpha_k},\; p_1^{\beta_1} \cdots p_k^{\beta_k}] = p_1^{\delta_1} \cdots p_k^{\delta_k}.$$
Proof. The proofs of all of the mentioned propositions are a simple consequence of the first statement, which describes the factorizations of all of an integer's divisors. To find the number of positive divisors, we can use elementary combinatorics (the product rule) to get $\tau(a) = (\alpha_1 + 1)(\alpha_2 + 1) \cdots (\alpha_k + 1)$. We can see that the formula for the sum of the divisors holds if we rewrite it in the form
$$(1 + p_1 + \dots + p_1^{\alpha_1}) \cdots (1 + p_k + \dots + p_k^{\alpha_k}),$$
realizing that every pair of parentheses contains the sum of a finite geometric series. The other statements follow directly from the definition. □
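The formulas for $\tau$ and $\sigma$ translate directly into code once a factorization is available. The following Python sketch obtains the factorization by trial division, which is our own simplifying assumption and is practical only for small integers; the function names are likewise illustrative.

```python
def factorize(a):
    """Return the prime factorization of a >= 1 as {prime: exponent}."""
    factors = {}
    p = 2
    while p * p <= a:
        while a % p == 0:
            factors[p] = factors.get(p, 0) + 1
            a //= p
        p += 1
    if a > 1:                       # whatever remains is a single prime factor
        factors[a] = factors.get(a, 0) + 1
    return factors


def tau(a):
    """Number of positive divisors: product of (alpha_i + 1)."""
    result = 1
    for alpha in factorize(a).values():
        result *= alpha + 1
    return result


def sigma(a):
    """Sum of positive divisors: product of (p^(alpha+1) - 1) / (p - 1)."""
    result = 1
    for p, alpha in factorize(a).items():
        result *= (p ** (alpha + 1) - 1) // (p - 1)
    return result


print(factorize(72), tau(72), sigma(72))
```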
11.2.4. Perfect numbers and their relation to primes.
The sum of all positive divisors of an integer is connected to the so-called perfect numbers. We say that an integer $a$ is perfect iff $\sigma(a) = 2a$, i.e., iff the sum of all positive divisors of $a$, excluding $a$ itself, equals $a$.
Perfect numbers include, for instance, 6 = 1 + 2 + 3, 28 = 1 + 2 + 4 + 7 + 14, 496, and 8128 (this exhausts all perfect numbers less than 10 000).
It can be shown that even perfect numbers are in a tight relation with the so-called Mersenne primes since the following holds:
Proposition. A natural number $a$ is an even perfect number if and only if it is of the form $a = 2^{q-1}(2^q - 1)$, where $2^q - 1$ is a prime.
Proof. If $a = 2^{q-1}(2^q - 1)$, where $p = 2^q - 1$ is a prime, then the previous statement yields
$$\sigma(a) = \frac{2^q - 1}{2 - 1} \cdot (p + 1) = (2^q - 1) \cdot 2^q = 2a.$$
Such an integer $a$ is thus a perfect number.
For the opposite direction, consider any even perfect number $a$, and let us write
$$a = 2^k \cdot m, \quad \text{where } m, k \in \mathbb{N} \text{ and } 2 \nmid m.$$
ii) The binomial theorem implies that
$$(a + b)^p = a^p + \binom{p}{1} a^{p-1} b + \dots + \binom{p}{p-1} a b^{p-1} + b^p.$$
Thanks to the previous item, we have $\binom{p}{k} \equiv 0 \pmod{p}$ for any $k \in \{1, \dots, p - 1\}$, whence the statement follows easily. □
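Both statements of 11.B.6 are easy to test for small values. The sketch below uses `math.comb` (available from Python 3.8) for the binomial coefficients; the function names and the tested range of $a, b$ are our own choices. It also illustrates that the statements can fail for a composite modulus.

```python
from math import comb


def binomials_divisible(p):
    """Check that p divides C(p, k) for every k in {1, ..., p-1}."""
    return all(comb(p, k) % p == 0 for k in range(1, p))


def power_of_sum_splits(p, bound=10):
    """Check (a + b)^p = a^p + b^p modulo p for all |a|, |b| <= bound."""
    return all(
        (a + b) ** p % p == (a ** p + b ** p) % p
        for a in range(-bound, bound + 1)
        for b in range(-bound, bound + 1)
    )


print(binomials_divisible(7), power_of_sum_splits(7))
print(binomials_divisible(4), power_of_sum_splits(4))
```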
11.B.7. Prove that for any natural numbers $m, n$ and any integers $a, b$ such that $a \equiv b \pmod{m^n}$, it is true that
$$a^m \equiv b^m \pmod{m^{n+1}}.$$
Solution. Since clearly $m \mid m^n$, we get that the congruence $a \equiv b \pmod{m}$ holds, invoking property (v) of 11.B.2. Therefore, considering the algebraic identity
$$a^m - b^m = (a - b)(a^{m-1} + a^{m-2}b + \dots + ab^{m-2} + b^{m-1}),$$
all the summands in the second pair of parentheses are congruent to $a^{m-1}$ modulo $m$, so
$$a^{m-1} + a^{m-2}b + \dots + b^{m-1} \equiv m \cdot a^{m-1} \equiv 0 \pmod{m}.$$
Since $m^n$ divides $a - b$, and the sum $a^{m-1} + a^{m-2}b + \dots + b^{m-1}$ is divisible by $m$, we get that $m^{n+1}$ must divide their product, which means that $a^m \equiv b^m \pmod{m^{n+1}}$. □
11.B.8. Using the result of the previous exercise (see also 11.A.2), prove that:
i) integers $a$ which are not divisible by 3 satisfy $a^3 \equiv \pm 1 \pmod{9}$,
ii) odd integers $a$ satisfy $a^4 \equiv 1 \pmod{16}$.
Solution.
i) Cubing the congruence $a \equiv \pm 1 \pmod{3}$ (and, again, raising the exponent of the modulus), we get $a^3 \equiv \pm 1 \pmod{3^2}$.
ii) This statement was proved already in the third part of exercise 11.A.2. Now, we will present another proof. Thanks to part (ii) of the mentioned exercise, we know that every odd integer $a$ satisfies $a^2 \equiv 1 \pmod{2^3}$. Squaring this (and recalling the above exercise) leads to $a^4 \equiv 1^2 \pmod{2^4}$. □
11.B.9. Divisibility rules. We can surely recall the basic rules of divisibility (at least by the numbers 2, 3, 4, 5, 6, 9 and 10) in terms of the decimal representation of a given integer. However, how can these rules be proved and can they be extended to other divisors as well?
Since the function $\sigma$ is multiplicative (see 11.3.1), we have
$$\sigma(a) = \sigma(2^k) \cdot \sigma(m) = (2^{k+1} - 1) \cdot \sigma(m).$$
However, it follows from $a$ being perfect that $\sigma(a) = 2a$, whence
$$2^{k+1} \cdot m = (2^{k+1} - 1) \cdot \sigma(m).$$
Since $2^{k+1} - 1$ is odd, we must have $2^{k+1} - 1 \mid m$, so we can write $m = (2^{k+1} - 1) \cdot n$ for an appropriate $n \in \mathbb{N}$.
Rearranging leads to $2^{k+1} \cdot n = \sigma(m)$. Both $m$ and $n$ divide $m$ (and since $\frac{m}{n} = 2^{k+1} - 1 > 1$, these integers are different), hence
$$2^{k+1} \cdot n = \sigma(m) \geq m + n = 2^{k+1} \cdot n,$$
and so $\sigma(m) = m + n$. However, this means that $m$ is a prime with the sole divisors $m$ and $n = 1$, whence $a = 2^k \cdot (2^{k+1} - 1)$, where $2^{k+1} - 1 = m$ is a prime. □
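The proposition can be illustrated by generating even perfect numbers from small Mersenne primes and confirming the defining property $\sigma(a) = 2a$ directly. In this Python sketch, the helpers `is_prime` and `sum_of_divisors` are deliberately naive and of our own making; they are meant for small inputs only.

```python
def is_prime(m):
    """Naive primality test by trial division."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def sum_of_divisors(a):
    """Sum of all positive divisors of a, computed straight from the definition."""
    return sum(d for d in range(1, a + 1) if a % d == 0)


def even_perfect_numbers(max_q):
    """Numbers 2^(q-1) * (2^q - 1) for 2 <= q <= max_q with 2^q - 1 prime."""
    return [2 ** (q - 1) * (2 ** q - 1)
            for q in range(2, max_q + 1) if is_prime(2 ** q - 1)]


perfect = even_perfect_numbers(7)
print(perfect, [sum_of_divisors(a) == 2 * a for a in perfect])
```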
Remark. On the other hand, people have been unsuccessful in describing odd perfect numbers; we do not even know whether there exists an odd perfect number.
Mersenne primes are those of the form $2^k - 1$. It should not go unnoticed that Mersenne primes are easily recognizable among all primes: for Mersenne numbers (i.e., numbers of this form, without the primality requirement), there is a simple and fast procedure to verify whether they are primes. It is thus not by chance that the largest known primes are usually of the form $2^k - 1$.
Later, we will show how to efficiently verify whether a given Mersenne number is prime (see the Lucas-Lehmer test in part 11.5.9).
Although it may seem a weird and practically useless business to look for primes as great as possible, it pushes the borders of our knowledge of mathematics forward and refines the methods used (as well as hardware). Moreover, the discoverers often benefit from this (the Electronic Frontier Foundation issued the EFF Cooperative Computing Awards for finding a prime having at least $10^6$, $10^7$, $10^8$, and $10^9$ digits; rewards of 50 and 100 thousand dollars, respectively, for the first and second categories were paid in 2000 and 2009, respectively, to the GIMPS project in both cases. Apparently, it will take a while before the other prizes are awarded.)
11.2.5. Prime distribution.
There are two facts about the distribution of prime numbers. The first is that, [they are] the most arbitrary and ornery objects studied by mathematicians: they grow like weeds among the natural numbers, seeming to obey no other law than that of chance, and nobody can predict where the next one will sprout. The second fact is even more astonishing, for it states just the opposite: that the prime numbers exhibit stunning regularity, that there are laws governing their behavior, and that they obey these laws with almost military precision.
Don Zagier
We already know that we can restrict ourselves to divisibility by powers of primes (for instance, divisibility by 6 can be tested using divisibility by 2 and 3).
The rule for divisibility by 9 says that a given integer is divisible by 9 if and only if its digit sum is. We will prove this as a consequence of a much stronger statement: every integer is congruent to its digit sum modulo 9 (in particular, it is congruent to zero if and only if its digit sum is). And this is trivial to prove: The digit sum of an integer $n = a_k 10^k + a_{k-1} 10^{k-1} + \dots + a_1 10 + a_0$ is equal to $S(n) = a_k + a_{k-1} + \dots + a_0$, and since $10^\ell \equiv 1^\ell = 1 \pmod{9}$ for any $\ell \in \mathbb{N}_0$, we get
$$n = a_k 10^k + \dots + a_0 \equiv a_k + \dots + a_0 = S(n) \pmod{9}.$$
This derivation is valid also if we replace 9 with 3.
The rule for divisibility by 11, which we have not mentioned yet, works similarly. Here, we have $10^\ell \equiv (-1)^\ell \pmod{11}$, so we get
$$n = a_k 10^k + \dots + a_0 \equiv a_k(-1)^k + \dots + a_1(-1) + a_0 = (a_0 + a_2 + \cdots) - (a_1 + a_3 + \cdots) \pmod{11}.$$
Therefore, an integer is divisible by 11 if and only if the difference of the sum of the digits at even places and the sum of the digits at odd places is.
There is a nice trick for the divisors 7 and 13: We have $1001 = 7 \cdot 11 \cdot 13$; an integer $n = 1000a + b$ thus satisfies $n \equiv -a + b \pmod{m}$, where $m$ is any of the numbers 7, 11, 13. Therefore, 2015 is divisible by 13 since 015 - 2 = 13. Similarly, 2016 is divisible by 7 as 016 - 2 = 14 is a multiple of 7. We could justify that 2013 is a multiple of 11 in the same way, but the aforementioned criterion $11 \mid (3 + 0) - (1 + 2)$ is perhaps more elegant.
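The three reductions derived above (digit sum modulo 9, alternating digit sum modulo 11, and splitting off the last three digits modulo 7, 11 and 13) can be written as short functions and compared with the integer itself. The function names in this Python sketch are our own; the sketch assumes non-negative integers.

```python
def digit_sum(n):
    """S(n): the sum of the decimal digits of n; n is congruent to S(n) mod 9."""
    return sum(int(c) for c in str(n))


def alternating_sum(n):
    """(a_0 + a_2 + ...) - (a_1 + a_3 + ...); congruent to n mod 11."""
    digits = [int(c) for c in reversed(str(n))]   # a_0, a_1, a_2, ...
    return sum(digits[0::2]) - sum(digits[1::2])


def block_reduction(n):
    """For n = 1000*a + b return b - a; congruent to n mod 7, 11 and 13."""
    a, b = divmod(n, 1000)
    return b - a


print(digit_sum(2016), alternating_sum(2013), block_reduction(2015), block_reduction(2016))
```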
Using divisibility for error detection. Let us note that divisibility by eleven is often used for making decimal codes which allow us to detect a single-digit error. If we make such a mistake when copying an integer which is divisible by eleven, then the resulting integer is surely not (see the aforementioned criterion of divisibility by eleven). More details can be found in chapter 12.4.1 about coding. For instance, the national identification numbers in the Czech Republic and Slovakia contain a check digit which completes the code into an integer divisible by eleven.
Similarly, the numbers of bank accounts managed by Czech banks must comply with a similar (only a bit
Now, we will try to answer the following questions: Are there infinitely many primes? Are there infinitely many primes in every (or at least one) arithmetic sequence? How are the primes distributed among the natural numbers?
The fundamental theorem, which we must not omit in this connection, is what Euclid around 300 BC was aware of:
11.2.6. Theorem (Euclid). There are infinitely many primes among the natural numbers.
Proof. Suppose that there are only finitely many, and let them be denoted by $p_1, p_2, \dots, p_n$. Set $N = p_1 \cdot p_2 \cdots p_n + 1$. This integer, being greater than 1, is either a prime or it is divisible by a prime different from $p_1, \dots, p_n$ (since the primes $p_1, \dots, p_n$ divide the integer $N - 1$), which is a contradiction. □
Now, we will mention a rather strong statement, whose proof is very laborious (that is why we do not present it), yet it can be done by elementary means7.
Theorem (Chebyshev's, Bertrand's postulate). For every integer n > 1, there is at least one prime p satisfying n < p < 2n.
Primes are distributed quite uniformly in the sense that in any "reasonable" arithmetic sequence (i.e. such that its terms are coprime), there are infinitely many of them.
For instance, considering the remainders upon division by 4, there are infinitely many primes with remainder 1 as well as infinitely many primes with remainder 3 (of course, there is no prime with remainder 0 and only one prime with remainder 2). The situation is analogous for remainders upon division by other integers, as explained by the following theorem, whose proof is very difficult.
11.2.7. Theorem (Dirichlet's on primes). If $a, m$ are coprime natural numbers, there are infinitely many natural numbers $k$ such that $mk + a$ is a prime. In other words, there are infinitely many primes among the integers $1 \cdot m + a$, $2 \cdot m + a$, $3 \cdot m + a$, ....
We can at least mention a proof of a special case of this theorem, which is a modification of Euclid's proof of the infinitude of primes.
Proposition. There are infinitely many primes of the form $4k + 3$, where $k \in \mathbb{N}_0$.
Proof. Suppose the contrary, i.e., there are only finitely many primes of this form, and let them be denoted by $p_1 = 3$, $p_2 = 7$, $p_3 = 11$, ..., $p_n$. Further, let us set $N = 4p_2 \cdot p_3 \cdots p_n + 3$. Factoring $N$, the product must contain (according to the result of exercise 11.A.3) at least one prime $p$ of the form $4k + 3$. If not, $N$ would be a product of only primes of the form $4k + 1$, so $N$ as well would give remainder 1 upon
7See Wikipedia, Proof of Bertrand's postulate, http://en.wikipedia.org/wiki/Proof_of_Bertrand's_postulate (as of July 29, 2017) or see M. Aigner, G. Ziegler, Proofs from THE BOOK, Springer, 2009.
more complicated) procedure. Both the transformed 6-digit prefix $a_5 a_4 a_3 a_2 a_1 a_0$ and the 10-digit account number $b_9 b_8 \dots b_1 b_0$ must satisfy the following condition on divisibility by eleven (here, we mention only the one for the number without the prefix):
$$0 \equiv b_9 2^9 + b_8 2^8 + b_7 2^7 + \dots + b_3 2^3 + b_2 2^2 + b_1 2^1 + b_0 2^0 \equiv -5b_9 + 3b_8 - 4b_7 - 2b_6 - b_5 + 5b_4 - 3b_3 + 4b_2 + 2b_1 + b_0 \pmod{11}.$$
This condition can be briefly described by saying that the account number, read as if it were written in binary (though with decimal digits), is to be divisible by eleven.
11.B.10. Verify that the account number of the Masaryk university, 85636621/0100, is built correctly.
Solution. We will test the condition of divisibility by eleven:
$$-5b_9 + 3b_8 - 4b_7 - 2b_6 - b_5 + 5b_4 - 3b_3 + 4b_2 + 2b_1 + b_0 = -4 \cdot 8 - 2 \cdot 5 - 1 \cdot 6 + 5 \cdot 3 - 3 \cdot 6 + 4 \cdot 6 + 2 \cdot 2 + 1 \cdot 1 = -22 \equiv 0 \pmod{11}.$$ □
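The check from 11.B.10 is mechanical enough to be sketched in a few lines of Python. The function below covers only the condition stated above for the account number without the prefix; its name and the padding of shorter numbers with leading zeros are our own assumptions, and it makes no claim to reproduce the complete rules used by the banks.

```python
def account_number_ok(number):
    """Test the weighted-sum condition modulo 11 for an account number
    (given as a string of at most 10 decimal digits, without the prefix)."""
    digits = [int(c) for c in number.zfill(10)]            # b_9, b_8, ..., b_0
    weighted = sum(b * 2 ** (9 - i) for i, b in enumerate(digits))
    return weighted % 11 == 0


print(account_number_ok("85636621"))   # the account number from 11.B.10
```

Changing any single digit of a valid number breaks the condition, since each weight $2^i$ is coprime to 11 and two digits differ by less than 11.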
Euler's totient function. The totient function $\varphi$ assigns to a natural number $m$ the number of natural numbers which are less than or equal to $m$ and coprime to $m$, which can be written as
$\varphi(m) = |\{a \in \mathbb{N} \mid 0 < a \le m, (a, m) = 1\}|$. However, to be able to evaluate it efficiently, one needs to know the factorization of the input integer $m$ to primes. In such a case, for $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, we have
$$\varphi(m) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1}.$$
In particular, we know that $\varphi(p^\alpha) = (p - 1) \cdot p^{\alpha - 1}$ and that $\varphi(m \cdot n) = \varphi(m) \cdot \varphi(n)$ holds whenever $m$, $n$ are coprime.
11.B.11.   Calculate p(72).
Solution. $72 = 2^3 \cdot 3^2 \implies \varphi(72) = 72 \cdot (1 - \frac{1}{2}) \cdot (1 - \frac{1}{3}) = 24$; alternatively, $\varphi(72) = \varphi(8) \cdot \varphi(9) = 4 \cdot 6 = 24$. □
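The product formula translates directly into code. This is a small sketch only, assuming trial division is good enough for factoring (it is for small inputs); the helper names `factorize` and `totient` are illustrative.

```python
def factorize(m: int) -> dict:
    """Prime factorization of m by trial division: {prime: exponent}."""
    factors = {}
    p = 2
    while p * p <= m:
        while m % p == 0:
            factors[p] = factors.get(p, 0) + 1
            m //= p
        p += 1
    if m > 1:  # the remaining cofactor is a prime
        factors[m] = factors.get(m, 0) + 1
    return factors


def totient(m: int) -> int:
    """phi(m) = product of (p - 1) * p^(alpha - 1) over prime powers of m."""
    result = 1
    for p, alpha in factorize(m).items():
        result *= (p - 1) * p ** (alpha - 1)
    return result


print(totient(72))
```

For $m = 1$ the factorization is empty and the empty product gives $\varphi(1) = 1$, in agreement with the definition.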
11.B.12.
i) Determine all natural numbers $n$ for which $\varphi(n)$ is odd.
ii) Prove that $\forall n \in \mathbb{N}: \varphi(4n + 2) = \varphi(2n + 1)$.
Solution.
i) We clearly have $\varphi(1) = \varphi(2) = 1$. Every integer $n \ge 3$ is either divisible by an odd prime $p$ (then $\varphi(n)$ is divisible by $p - 1$, which is an even integer) or $n$ is a (higher-than-first) power of two (and then $\varphi(2^\alpha) = 2^{\alpha - 1}$ is even
division by 4, which is not true. However, none of the primes $p_1, p_2, \ldots, p_n$ can play the role of the mentioned $p$: if we had $p_i \mid N$ for some $i \in \{2, \ldots, n\}$, then we would get $p_i \mid 3$. Similarly, $3 \nmid N$, and we thus get a contradiction with the assumption of finitely many primes of the given form. □
An analogous elementary proof can be used for primes of the form $3k + 2$ or $6k + 5$; however, it will not work for primes of the form $3k + 1$ or $4k + 1$ (think this out well!). We will be able to remedy this in the latter case in part 11.4.11 about quadratic congruences.
From the propositions mentioned in this chapter, one can roughly imagine how "dense" the primes appear among the natural numbers. It is more accurately described (although "only" asymptotically) by the following, very important theorem, which was proved independently by J. Hadamard and Ch. J. de la Vallée-Poussin in 1896.
11.2.8. Theorem (Prime Number Theorem). Let $\pi(x)$ denote the number of primes less than or equal to a number $x \in \mathbb{R}$. Then
$$\pi(x) \sim \frac{x}{\ln x},$$
i.e., the quotient of the functions $\pi(x)$ and $x / \ln x$ approaches 1 for $x \to \infty$.
The following table illustrates how good the asymptotic estimate $\pi(x) \sim x / \ln(x)$ is in several concrete instances:
$x$	$\pi(x)$	$x/\ln(x)$	relative error
100	25	21.71	0.13
1000	168	144.76	0.13
10000	1229	1085.73	0.11
100000	9592	8685.88	0.09
500000	41538	38102.89	0.08
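The rows of the table can be recomputed with a sieve of Eratosthenes. The sketch below is one possible way to do it, not the method used by the authors; `prime_count` is a name introduced here, and the relative error is taken as $(\pi(x) - x/\ln x)/\pi(x)$, which is an assumption about how the last column was obtained.

```python
import math


def prime_count(x: int) -> int:
    """Number of primes <= x (for x >= 1), via the sieve of Eratosthenes."""
    sieve = bytearray([1]) * (x + 1)
    sieve[0:2] = b"\x00\x00"  # 0 and 1 are not primes
    for p in range(2, math.isqrt(x) + 1):
        if sieve[p]:
            # cross out the multiples of p starting from p*p
            sieve[p * p :: p] = bytearray(len(range(p * p, x + 1, p)))
    return sum(sieve)


for x in (100, 1000, 10000, 100000, 500000):
    pi_x = prime_count(x)
    estimate = x / math.log(x)
    print(x, pi_x, round(estimate, 2), round((pi_x - estimate) / pi_x, 2))
```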
The density of primes among the natural numbers is also partially described by the following result by Euler.
Proposition. Let $P$ denote the set of all primes. Then
$$\sum_{p \in P} \frac{1}{p} = \infty.$$
Remark. On the other hand, $\sum_{n \in \mathbb{N}} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$, which means that the primes are distributed more "densely" in $\mathbb{N}$ than squares.
Proof. Let $n$ be an arbitrary natural number and $p_1, \ldots, p_{\pi(n)}$ all primes less than or equal to $n$. Let us set
$$A(n) = \prod_{i=1}^{\pi(n)} \left(1 - \frac{1}{p_i}\right)^{-1}.$$
as well). Altogether, we have found out that $\varphi(n)$ is odd only for $n = 1, 2$.
ii) The integer $2n + 1$ is odd, so $(2, 2n + 1) = 1$, and hence
$$\varphi(4n + 2) = \varphi(2 \cdot (2n + 1)) = \varphi(2) \cdot \varphi(2n + 1) = \varphi(2n + 1).$$ □
11.B.13.   Find all natural numbers $m$ for which:
i) $\varphi(m) = 30$, ii) $\varphi(m) = 34$, iii) $\varphi(m) = 20$, iv) $\varphi(m) = \frac{m}{3}$.
Solution. In the first three cases, we are looking for the preimages of a given integer $a$ under $\varphi$, written in the form $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, and we proceed as follows:
• Since $\varphi(m) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1} = a$, every prime $p$ which divides $m$ must satisfy
$p - 1 \mid a$.
• Similarly, every prime $p$ whose higher power divides $m$ must divide $a$. More exactly, if $p^\alpha \mid m$, we must even have
$p^{\alpha - 1} \mid a$.
• This procedure results in a finite set of candidates for $m$, which can be examined one by one, sometimes also using the fact that any prime dividing $a$ must occur in the factorization of $\varphi(m)$ as a divisor of some $p - 1$ or in a prime power $p^{\alpha - 1}$.
Now, let us solve problems i)-iii):
i) Every prime $p$ from the factorization of $m$ must satisfy $p - 1 \mid 30$, so $p - 1 \in \{1, 2, 3, 5, 6, 10, 15, 30\}$, which is satisfied by the primes $p \in \{2, 3, 7, 11, 31\}$, and only 2 and 3 of them can divide $m$ in a higher power than the first. Therefore,
$$m = 2^\alpha 3^\beta 7^\gamma 11^\delta 31^\varepsilon,$$
where $\alpha, \beta \in \{0, 1, 2\}$, $\gamma, \delta, \varepsilon \in \{0, 1\}$. The analysis of the possibilities can be further simplified if we realize that $\varphi(3) = 2$, $\varphi(3^2) = \varphi(7) = 6$, $\varphi(11) = 10$ all divide 30 with a quotient which is an odd integer greater than 1. Therefore, if we had, for instance, $m = 7 \cdot m_1$, where $7 \nmid m_1$, then we would also have $\varphi(m_1) = 5$, which is impossible, as we know from the previous exercise.
We thus get $\beta = \gamma = \delta = 0$ and $m = 2^\alpha \cdot 31^\varepsilon$, whence we can easily obtain the solution $m \in \{31, 62\}$.
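The particular factors can be perceived as sums of geometric series, hence
$$A(n) = \prod_{i=1}^{\pi(n)} \left(\sum_{\alpha_i = 0}^{\infty} p_i^{-\alpha_i}\right) = \sum \frac{1}{p_1^{\alpha_1} \cdots p_{\pi(n)}^{\alpha_{\pi(n)}}},$$
where we sum over all $\pi(n)$-tuples of non-negative integers $(\alpha_1, \ldots, \alpha_{\pi(n)})$. Since every integer not exceeding $n$ factors to only primes in the set $\{p_1, \ldots, p_{\pi(n)}\}$, all of them are included in this sum. Therefore, $A(n) > 1 + \frac{1}{2} + \cdots + \frac{1}{n}$, and since the harmonic series is divergent (see 5.H.3), we also have $\lim_{n \to \infty} A(n) = \infty$.
Taking into account the expansion of the function $\ln(1 + x)$ to a power series (see 6.C.1), we further get
$$\ln A(n) = -\sum_{i=1}^{\pi(n)} \ln\left(1 - p_i^{-1}\right) = \sum_{i=1}^{\pi(n)} \sum_{m=1}^{\infty} (m p_i^m)^{-1} = \sum_{i=1}^{\pi(n)} p_i^{-1} + \sum_{i=1}^{\pi(n)} \sum_{m=2}^{\infty} (m p_i^m)^{-1}.$$
Since the inner sum can be bounded from above by
$$\sum_{m=2}^{\infty} (m p_i^m)^{-1} < \sum_{m=2}^{\infty} p_i^{-m} = (1 - p_i^{-1})^{-1} p_i^{-2} \le 2 p_i^{-2},$$
the divergent sequence $\ln A(n)$ can also be bounded from above:
$$\ln A(n) < \sum_{i=1}^{\pi(n)} p_i^{-1} + 2 \sum_{i=1}^{\pi(n)} p_i^{-2}.$$
The second sum apparently converges (since the series $\sum_{n=1}^{\infty} n^{-2}$ does), so the first sum $\sum_{i=1}^{\pi(n)} p_i^{-1}$ must diverge for $n \to \infty$, which is what we wanted to prove. □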
3. Congruences and basic theorems
This concept was introduced by C. F. Gauss in 1801 in his book Disquisitiones Arithmeticae. Although it is a simple one, its contribution to number theory lies mainly in allowing some arguments (even quite complicated ones) to be written much more compactly and transparently.
Congruence
If integers $a$, $b$ give the same remainder $r$ (where $0 \le r < m$) when divided by a natural number $m$, we say that they are congruent modulo $m$ and write
$$a \equiv b \pmod{m}.$$
Otherwise, we say that the integers $a$, $b$ are not congruent modulo $m$, writing
$$a \not\equiv b \pmod{m}.$$
Whenever it is apparent that we are working with congruence relations, we usually omit the symbol mod and write just $a \equiv b \ (m)$.
ii) Similarly to above, only primes $p \in \{2, 3\}$ can divide $m$, and the prime 3 can divide $m$ only in the first power. However, since $\frac{34}{\varphi(3)} = 17$ is an odd integer greater than 1, the prime 3 cannot divide $m$ at all. The remaining possibility, $m = 2^\alpha$, leads to $34 = 2^{\alpha - 1}$, which is also impossible. Therefore, there is no such number $m$.
iii) Now, every prime $p$ dividing $m$ must satisfy $p - 1 \mid 20$, so $p - 1 \in \{1, 2, 4, 5, 10, 20\}$, which is satisfied by the primes $p \in \{2, 3, 5, 11\}$, and only 2 and 5 of those can divide $m$ in a higher power. We thus have
$$m = 2^\alpha 3^\beta 5^\gamma 11^\delta,$$
where $\alpha \in \{0, 1, 2, 3\}$, $\gamma \in \{0, 1, 2\}$, $\beta, \delta \in \{0, 1\}$.
First, consider $\delta = 1$. Then $\varphi(2^\alpha 3^\beta 5^\gamma) = 2$, whence we easily get that $\gamma = 0$ and $(\alpha, \beta) \in \{(2, 0), (1, 1), (0, 1)\}$, which gives three solutions: $m \in \{44, 66, 33\}$.
Further, let $\delta = 0$. If $\gamma = 2$, then $\varphi(2^\alpha 3^\beta) = 1$, whence $(\alpha, \beta) \in \{(1, 0), (0, 0)\}$. We thus obtain two more solutions: $m \in \{50, 25\}$.
If $\gamma = 1$, then we get $\varphi(2^\alpha 3^\beta) = 5$, similarly to the above item. This is an odd integer greater than 1, so we get no solutions in this case. This is also the case for $\gamma = 0$ since the equation $\varphi(2^\alpha 3^\beta) = 20$ has no solution, either.
Altogether, there are five satisfactory values $m \in \{25, 33, 44, 50, 66\}$.
iv) This problem is of a different kind than the previous ones, so we must approach it differently. The relation $\varphi(m) = \frac{m}{3}$ implies that $m$ must be a multiple of three (since the left-hand side of the equation is an integer). We will thus be looking for the solution in the form $m = 3^\alpha \cdot n$, where $3 \nmid n$, $\alpha \ge 1$. Then $\varphi(m) = 2 \cdot 3^{\alpha - 1} \cdot \varphi(n) = \frac{m}{3} = 3^{\alpha - 1} \cdot n$. Reducing this leads to $2\varphi(n) = n$ or, equivalently, $\varphi(n) = \frac{n}{2}$. Here, we must have $2 \mid n$, and writing $n = 2^\beta \cdot k$, where $(k, 6) = 1$, $\beta \ge 1$, we get $\varphi(k) = k$, which is apparently satisfied only by $k = 1$.
To summarize, the problem is satisfied by those natural numbers which are of the form $2^\alpha 3^\beta$, where $\alpha, \beta \ge 1$. □
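The answers to parts i) to iii) can be cross-checked by exhaustive search. The sketch below is not the method of the solution; it relies on the known bound $\varphi(m) \ge \sqrt{m/2}$, which implies that every $m$ with $\varphi(m) = a$ satisfies $m \le 2a^2$. The function names are illustrative.

```python
from math import gcd


def totient(m: int) -> int:
    """phi(m) computed straight from the definition (slow but simple)."""
    return sum(1 for a in range(1, m + 1) if gcd(a, m) == 1)


def totient_preimages(a: int) -> list:
    """All m with phi(m) = a; the search stops at 2*a*a (see the bound above)."""
    return [m for m in range(1, 2 * a * a + 1) if totient(m) == a]


for a in (30, 34, 20):
    print(a, totient_preimages(a))
```

The search is quadratic in the bound, so it is meant only for small values of $a$ such as the ones in this exercise.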
11.B.14.   Find all two-digit numbers $n$ for which $9 \mid \varphi(n)$.
Solution. Let us consider the factorization of the number $n$ to primes. If $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, then $\varphi(n) = (p_1 - 1) p_1^{\alpha_1 - 1} \cdots (p_k - 1) p_k^{\alpha_k - 1}$. And if we want to have $9 \mid \varphi(n)$, then at least one of the following conditions must hold:
Remark. We have already used some properties of congruences without explicitly mentioning it: for instance, the result of the exercise 11.A.3 can be reformulated as "if $a \equiv 1 \pmod m$, $b \equiv 1 \pmod m$, then also $ab \equiv 1 \pmod m$", which is a special case of item (2) of the previous theorem.
It is not by chance because any statement which uses congruences can be reformulated in terms of divisibility. The usefulness of congruences thus lies not in the strength to solve more problems than without them, but rather in being a very convenient way of writing which simplifies both expressions and reasonings.
11.3.1. Möbius inversion formula and Euler function.
Here, an arithmetic function means any function whose domain is the set of natural numbers.
Definition. Let a natural number $n$ be factored to primes: $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$. The value of the Möbius function $\mu(n)$ is defined to be 0 if $\alpha_i > 1$ for some $i$, and $(-1)^k$ otherwise. Further, we define $\mu(1) = 1$ (in accordance with the convention that 1 factors to the product of zero primes).
Example. $\mu(4) = \mu(2^2) = 0$, $\mu(6) = \mu(2 \cdot 3) = (-1)^2 = 1$, $\mu(2) = \mu(3) = -1$.
Now, we prove several important properties of the Möbius function, especially the so-called Möbius inversion formula.
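Lemma. For all $n \in \mathbb{N} \setminus \{1\}$, it holds that $\sum_{d \mid n} \mu(d) = 0$.
Proof. Writing $n$ as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, all divisors $d$ of $n$ are of the form $d = p_1^{\beta_1} \cdots p_k^{\beta_k}$, where $0 \le \beta_i \le \alpha_i$ for all $i \in \{1, \ldots, k\}$. Therefore,
$$\sum_{d \mid n} \mu(d) = \sum_{\substack{(\beta_1, \ldots, \beta_k) \in (\mathbb{N} \cup \{0\})^k \\ 0 \le \beta_i \le \alpha_i}} \mu(p_1^{\beta_1} \cdots p_k^{\beta_k}) = \sum_{(\beta_1, \ldots, \beta_k) \in \{0, 1\}^k} \mu(p_1^{\beta_1} \cdots p_k^{\beta_k})$$
$$= \binom{k}{0} + \binom{k}{1} \cdot (-1) + \binom{k}{2} \cdot (-1)^2 + \cdots + \binom{k}{k} \cdot (-1)^k = (1 + (-1))^k = 0.$$
In the third equality, we used a combinatorial reasoning: the summand $\binom{k}{l} (-1)^l$ gives the contribution of the divisors $d = p_1^{\beta_1} \cdots p_k^{\beta_k}$ with the property that exactly $l$ of the exponents $\beta_1, \ldots, \beta_k$ are equal to one; there are $\binom{k}{l}$ of them, and each satisfies $\mu(p_1^{\beta_1} \cdots p_k^{\beta_k}) = (-1)^l$. □
There is another concept which is tightly connected to the Möbius function, the so-called Dirichlet product (also Dirichlet convolution).
Definition. Let $f$, $g$ be arithmetic functions. Their Dirichlet product is defined as follows:
$$(f \circ g)(n) = \sum_{d \mid n} f(d) \cdot g\left(\frac{n}{d}\right) = \sum_{d_1 d_2 = n} f(d_1) \cdot g(d_2).$$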
i) $p_i \equiv 1 \pmod 9$ for some $i \in \{1, \ldots, k\}$,
ii) $p_i = 3$ and $\alpha_i \ge 3$ for some $i \in \{1, \ldots, k\}$,
iii) $p_i = 3$, $\alpha_i = 2$, and $p_j \equiv 1 \pmod 3$ for some distinct $i, j \in \{1, \ldots, k\}$,
iv) $p_i \equiv 1 \pmod 3$ and $p_j \equiv 1 \pmod 3$ for some distinct $i, j \in \{1, \ldots, k\}$.
If we restrict our attention (as the statement of the problem asks) to numbers $n < 100$, then the condition
i) is satisfied by the primes 19, 37, and 73 (together with their multiples 38, 57, 76, 95, and 74),
ii) is satisfied by $3^3 = 27$, $3^4 = 81$ (together with the multiple 54),
iii) is matched by the number $3^2 \cdot 7 = 63$,
iv) is matched by the number $7 \cdot 13 = 91$. □
11.B.15. Fermat's (little) theorem. Now, we will prove Fermat's little theorem 11.3.2 in two more ways: by mathematical induction, and then by combinatorial means. The theorem states that for any integer $a$ and any prime $p$ which does not divide $a$, it holds that $a^{p-1} \equiv 1 \pmod p$.
Solution. First, we prove (by induction on $a$) that an apparently equivalent statement, $a^p \equiv a \pmod p$, holds for any $a \in \mathbb{Z}$ and prime $p$. For $a = 1$, there is nothing to prove. Further, let us assume that the proposition holds for $a$ and prove its validity for $a + 1$. It follows from the induction hypothesis and the exercise 11.B.6 that
$$(a + 1)^p \equiv a^p + 1^p \equiv a + 1 \pmod p,$$
which is what we were to prove.
The statement holds trivially for $a = 0$ as well as in the case $a < 0$, $p = 2$. The validity for $a < 0$ and $p$ odd can be obtained easily from the above: since $-a$ is a positive integer, we get $-a^p = (-a)^p \equiv -a \pmod p$, whence $a^p \equiv a \pmod p$.
The combinatorial proof is a somewhat "cunning" one: Similarly to problems using Burnside's lemma (see exercise 12.G.1), we are to determine how many necklaces can be created by wiring a given number of beads, of which there is a given number of types. Having $a$ types of beads, there are clearly $a^p$ necklaces of length $p$, $a$ of which consist of a single bead type. From now on, we will be interested only in the other ones, of which there are thus $a^p - a$. Apparently, each necklace is transformed into itself by rotating by $p$ beads. In general, a necklace can be transformed into itself by rotating
Lemma. The Dirichlet product is associative.
Proof.
$$((f \circ g) \circ h)(n) = \sum_{d_1 d_2 d_3 = n} f(d_1) g(d_2) h(d_3) = (f \circ (g \circ h))(n).$$
□
Example. Let us define two auxiliary functions $i$ and $I$ by $i(1) = 1$, $i(n) = 0$ for all $n > 1$, and $I(n) = 1$ for all $n \in \mathbb{N}$. Then every arithmetic function $f$ satisfies:
$$f \circ i = i \circ f = f \quad \text{and} \quad (I \circ f)(n) = (f \circ I)(n) = \sum_{d \mid n} f(d).$$
Further, $I \circ \mu = \mu \circ I = i$, since
$$(I \circ \mu)(n) = \sum_{d \mid n} I\left(\frac{n}{d}\right) \mu(d) = \sum_{d \mid n} \mu(d) = 0 \quad \text{for all } n > 1$$
by the lemma after the definition of the Möbius function (the statement is clearly true for $n = 1$).
Theorem (Möbius inversion formula). Let an arithmetic function $F$ be defined in terms of an arithmetic function $f$ by $F(n) = \sum_{d \mid n} f(d)$. Then the function $f$ can be expressed as
$$f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) \cdot F(d).$$
Proof. The relation $F(n) = \sum_{d \mid n} f(d)$ can be rewritten as $F = f \circ I$. Therefore, $F \circ \mu = (f \circ I) \circ \mu = f \circ (I \circ \mu) = f \circ i = f$, which is the statement of the theorem. □
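The identities of this paragraph are easy to test numerically. The following sketch implements the Möbius function and the Dirichlet product from their definitions; the names `one` and `unit` stand for the two auxiliary functions of the example (constant 1, and the neutral element of the product), and the test function `f` is an arbitrary choice made here for illustration.

```python
def divisors(n: int) -> list:
    return [d for d in range(1, n + 1) if n % d == 0]


def mobius(n: int) -> int:
    """0 if a square of a prime divides n, otherwise (-1)^(number of primes)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:  # p^2 divided the original n
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result


def dirichlet(f, g):
    """Dirichlet product: n -> sum of f(d) * g(n/d) over divisors d of n."""
    return lambda n: sum(f(d) * g(n // d) for d in divisors(n))


one = lambda n: 1                      # constant function 1
unit = lambda n: 1 if n == 1 else 0    # neutral element of the product
f = lambda n: n * n + 1                # arbitrary arithmetic function

F = dirichlet(f, one)                  # F(n) = sum of f(d) over d | n
recovered = dirichlet(F, mobius)       # Moebius inversion should give f back
print([recovered(n) for n in range(1, 8)], [f(n) for n in range(1, 8)])
```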
Definition. A multiplicative function on the natural numbers is an arithmetic function $f$ which, for all pairs of coprime natural numbers $a$, $b$, satisfies
$$f(a \cdot b) = f(a) \cdot f(b).$$
Example. Multiplicative functions include, for instance, $\sigma(n)$, $\tau(n)$, $\mu(n)$ (this can be verified easily from the definition) or, as we will prove shortly, the so-called (Euler) totient function $\varphi(n)$.
Euler's totient function $\varphi$
For a natural number $n$, we define the value of Euler's totient function $\varphi$ as
$$\varphi(n) = |\{a \in \mathbb{N} \mid 0 < a \le n, (a, n) = 1\}|.$$
Example. $\varphi(1) = 1$, $\varphi(5) = 4$, $\varphi(6) = 2$. If $p$ is a prime, then clearly $\varphi(p) = p - 1$ (all natural numbers less than $p$ are coprime to it).
Now, we are going to prove several important properties of the function $\varphi$.
Lemma. Let $n \in \mathbb{N}$. Then $\sum_{d \mid n} \varphi(d) = n$.
by another number of beads, but this number can never be coprime to $p$ (for instance, considering $p = 8$ and the necklace ABABABAB, rotations by 2, 4, or 6 beads leave it unchanged). However, if $p$ is a prime, it follows that all $p$ rotations of such a necklace are different. Therefore, if we do not distinguish necklaces which differ in rotation only (i.e., in the position of the "knot"), there are exactly
$$\frac{a^p - a}{p}$$
of them, which in particular means that $p \mid a^p - a$.
As an example, let us consider the case $a = 2$, $p = 5$, i.e., necklaces of length 5, consisting of 2 bead types (A, B). There are $2^5 = 32$ necklaces in total, 2 of which consist of a single bead type (AAAAA, BBBBB). Leaving them aside and ignoring the position of the knot, there remain $\frac{32 - 2}{5} = 6$ necklaces which differ not merely in rotation, namely ABBBB, AABBB, AAABB, AAAAB, ABABB, AABAB.
□
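The counting argument can be replayed by brute force. The sketch below enumerates all strings of length $p$ over $a$ bead types, discards the constant ones, and identifies strings that differ only by rotation; the function name `rotation_classes` and the choice of the lexicographically smallest rotation as a representative are conventions made here, not part of the proof.

```python
from itertools import product


def rotation_classes(a: int, p: int) -> set:
    """Non-constant necklaces of length p over a bead types, up to rotation."""
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:a]
    classes = set()
    for beads in product(alphabet, repeat=p):
        word = "".join(beads)
        if len(set(word)) == 1:
            continue  # skip necklaces made of a single bead type
        rotations = [word[k:] + word[:k] for k in range(p)]
        classes.add(min(rotations))  # canonical representative of the class
    return classes


necklaces = rotation_classes(2, 5)
print(sorted(necklaces), len(necklaces), (2**5 - 2) // 5)
```

For a prime $p$ the number of classes agrees with $(a^p - a)/p$; for composite $p$ it need not, which is exactly where the primality assumption enters.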
Euler's theorem and orders of integers modulo m. Thanks to Euler's theorem, it is guaranteed that every integer $a$ which is coprime to $m$ has an order, i.e. the least natural number $n$ such that $a^n \equiv 1 \pmod m$. The most interesting ones are those integers $a$ whose order equals $\varphi(m)$; they are called the primitive roots modulo $m$.
11.B.16.   Determine the order of 2 modulo 7.
Solution. The order of 2 modulo 7 is equal to 3 as
$2^1 = 2 \not\equiv 1 \pmod 7$, $2^2 = 4 \not\equiv 1 \pmod 7$, $2^3 = 8 \equiv 1 \pmod 7$. □
11.B.17.   Determine the last two digits of the number $7^{2013}$.
Solution. We can easily see that the order of 7 modulo 100 is equal to 4: by simple calculations, we have $7^2 = 49$ and
$49^2 = (50 - 1)^2 = 50^2 - 2 \cdot 50 + 1 \equiv 1 \pmod{100}$. Therefore, it suffices to determine the remainder $r$ of the integer 2013 when divided by 4, since $7^{2013} \equiv 7^r \pmod{100}$. Apparently, we have $r = 1$, so the wanted last two digits are 07.
□
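The order of an integer can be found by repeated multiplication, exactly as in the two solutions above. This is a minimal sketch; `multiplicative_order` is a name introduced here, and the loop terminates only because Euler's theorem guarantees that some power is congruent to 1 when $(a, m) = 1$.

```python
from math import gcd


def multiplicative_order(a: int, m: int) -> int:
    """Least n >= 1 with a^n congruent to 1 modulo m; requires gcd(a, m) = 1."""
    if gcd(a, m) != 1:
        raise ValueError("a must be coprime to m")
    n, power = 1, a % m
    while power != 1 % m:  # 1 % m also covers the trivial modulus m = 1
        power = power * a % m
        n += 1
    return n


print(multiplicative_order(2, 7))      # order of 2 modulo 7
print(multiplicative_order(7, 100))    # order of 7 modulo 100
print(pow(7, 2013 % 4, 100), pow(7, 2013, 100))  # both give the last two digits
```

Python's built-in three-argument `pow` performs modular exponentiation directly, so the reduction of the exponent modulo the order is needed only for computations by hand.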
Now, we mention several statements about the properties of the order of an integer modulo m.
11.B.18. Let $m \in \mathbb{N}$, $a, b \in \mathbb{Z}$, $(a, m) = (b, m) = 1$. Prove that if $a \equiv b \pmod m$, then the integers $a$, $b$ share the same order modulo $m$.
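Proof. Let us consider the $n$ fractions
$$\frac{1}{n}, \frac{2}{n}, \frac{3}{n}, \ldots, \frac{n-1}{n}, \frac{n}{n}.$$
Reducing them to lowest terms and grouping them together by the denominators, we get just the statement in question. □
Theorem. Let $n \in \mathbb{N}$ factor to primes as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$. Then,
$$\varphi(n) = n \cdot \left(1 - \frac{1}{p_1}\right) \cdots \left(1 - \frac{1}{p_k}\right).$$
Proof. Invoking the previous lemma and the Möbius inversion formula, we get
$$\varphi(n) = \sum_{d \mid n} \mu(d) \cdot \frac{n}{d} = n - \frac{n}{p_1} - \cdots - \frac{n}{p_k} + \frac{n}{p_1 p_2} + \cdots + (-1)^k \frac{n}{p_1 \cdots p_k} = n \cdot \left(1 - \frac{1}{p_1}\right) \cdots \left(1 - \frac{1}{p_k}\right).$$
□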
Remark. The previous result can also be obtained without using the Möbius inversion formula by the inclusion-exclusion principle, determining the number of integers not coprime to $n$ in a given interval.
It follows directly from this theorem that the totient function is a multiplicative arithmetic function.
Corollary. Let $a, b \in \mathbb{N}$, $(a, b) = 1$. Then
$$\varphi(a \cdot b) = \varphi(a) \cdot \varphi(b).$$
Remark. This statement can also be derived independently, using the fact that $(n, ab) = 1 \iff (n, a) = 1 \wedge (n, b) = 1$. Together with the easy result
$$\varphi(p^\alpha) = p^\alpha - p^{\alpha - 1} = (p - 1) \cdot p^{\alpha - 1},$$
one can deduce the formula for the computation of $\varphi$ in a third way.
11.3.2. Fermat's little theorem, Euler's theorem. These theorems belong to the most important results of elementary number theory, and they will often be applied in further theoretical as well as practical problems.
Theorem (Fermat's little). Let $a$ be an integer and $p$ a prime, $p \nmid a$. Then,
$$a^{p-1} \equiv 1 \pmod p.$$
Proof. The statement will follow as a simple consequence of Euler's theorem (and together with this one, it is a consequence of more general Lagrange's theorem 12.3.10). However, it can be proved directly (by mathematical induction or a combinatorial means, as mentioned in exercise 11.B.15). □
Solution. Raising the congruence $a \equiv b \pmod m$ to the $n$-th power leads to $a^n \equiv b^n \pmod m$, so $a^n \equiv 1 \pmod m \iff b^n \equiv 1 \pmod m$. □
11.B.19. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. If the order of $a$ modulo $m$ is $r \cdot s$ (where $r, s \in \mathbb{N}$), prove that the order of the integer $a^r$ modulo $m$ is $s$.
Solution. Since none of the integers $a, a^2, a^3, \ldots, a^{rs-1}$ is congruent to 1 modulo $m$, nor is any of the integers $a^r, a^{2r}, a^{3r}, \ldots, a^{(s-1)r}$. On the other hand, we have $(a^r)^s \equiv 1 \pmod m$, so the order of $a^r$ modulo $m$ equals $s$. □
11.B.20. Show that the converse of the previous statement need not be true in general.
Solution. Indeed, even if the order of an integer $a^r$ modulo $m$ is $s$, the order of $a$ modulo $m$ may not be $r \cdot s$. For instance, for $m = 13$ and the integers $a = 3$, $b = -4$, we have $a^2 = 9$, $a^3 = 27 \equiv 1 \pmod{13}$, so the order of $a$ modulo 13 is 3. Similarly, $b^2 = 16 \not\equiv 1 \pmod{13}$, $b^3 = -64 \equiv 1 \pmod{13}$, so the order of $b$ modulo 13 is 3, too. Thus $b^2 = (-4)^2 = 16 \equiv 3 = a \pmod{13}$ has the same order (3) as $a$, yet the integer $b$ does not have order $2 \cdot 3$. □
11.B.21. Determine the last digit of the numbers
i) 35 ,
ii) 373737,
iii) 121314. o
11.B.22.
i) Determine the remainder of the integer $2^{50} + 3^{50} + 4^{50}$ when divided by 17.
ii) Determine the remainder of the integer $2^{181} + 3^{181} + 5^{181}$ when divided by 37.
Solution.
i) By Fermat's theorem, we have $2^{16} \equiv 3^{16} \equiv 4^{16} \equiv 1 \pmod{17}$. Since $50 \equiv 2 \pmod{16}$, we get
$$2^{50} + 3^{50} + 4^{50} \equiv 2^2 + 3^2 + 4^2 = 29 \equiv 12 \pmod{17}.$$
ii) Similarly, $2^{36} \equiv 3^{36} \equiv 5^{36} \equiv 1 \pmod{37}$, and since $181 \equiv 1 \pmod{36}$, we get $2^{181} + 3^{181} + 5^{181} \equiv 2 + 3 + 5 = 10 \pmod{37}$. □
11.B.23. It holds for all odd $n \in \mathbb{N}$ that $n \mid 2^{n!} - 1$. Prove this! ○
Sometimes, Fermat's little theorem is presented in the following form, which is apparently equivalent to the original statement.
Corollary. Let a be an integer and p a prime. Then,
$$a^p \equiv a \pmod p.$$
Before formulating and proving Euler's theorem, we introduce a few useful concepts.
Residue systems
A complete residue system modulo $m$ is an arbitrary $m$-tuple of integers which are pairwise incongruent modulo $m$ (the most commonly used $m$-tuple is $0, 1, \ldots, m - 1$ or, for odd $m$, its "symmetric" variation $-\frac{m-1}{2}, \ldots, -1, 0, 1, \ldots, \frac{m-1}{2}$).
A reduced residue system modulo $m$ is an arbitrary $\varphi(m)$-tuple of integers which are pairwise incongruent modulo $m$ and coprime to $m$.
Lemma. Let $x_1, x_2, \ldots, x_{\varphi(m)}$ form a reduced residue system modulo $m$. If $a \in \mathbb{Z}$, $(a, m) = 1$, then the integers $a \cdot x_1, \ldots, a \cdot x_{\varphi(m)}$ also form a reduced residue system modulo $m$.
Proof. Since $(a, m) = 1$ and $(x_i, m) = 1$, we have $(a \cdot x_i, m) = 1$. Further, if we had $a \cdot x_i \equiv a \cdot x_j \pmod m$ for some distinct indices $i$, $j$, dividing both sides of the congruence by the integer $a$ (which is coprime to $m$) would lead to $x_i \equiv x_j \pmod m$, meaning that the original $\varphi(m)$-tuple was not a reduced residue system, either. □
Theorem (Euler). Let $a \in \mathbb{Z}$, $m \in \mathbb{N}$, $(a, m) = 1$. Then,
$$a^{\varphi(m)} \equiv 1 \pmod m.$$
Proof. Let $x_1, x_2, \ldots, x_{\varphi(m)}$ be an arbitrary reduced residue system modulo $m$. By the previous lemma, $a \cdot x_1, \ldots, a \cdot x_{\varphi(m)}$ is also a reduced residue system modulo $m$. Therefore, for every $i \in \{1, 2, \ldots, \varphi(m)\}$, there is a unique $j \in \{1, 2, \ldots, \varphi(m)\}$ such that $a \cdot x_i \equiv x_j \pmod m$. Multiplying these congruences leads to $(a \cdot x_1) \cdot (a \cdot x_2) \cdots (a \cdot x_{\varphi(m)}) \equiv x_1 \cdot x_2 \cdots x_{\varphi(m)} \pmod m$, which can be rearranged to
$$a^{\varphi(m)} \cdot x_1 \cdot x_2 \cdots x_{\varphi(m)} \equiv x_1 \cdot x_2 \cdots x_{\varphi(m)} \pmod m.$$
Dividing this by the integer $x_1 \cdot x_2 \cdots x_{\varphi(m)}$ (which is coprime to $m$) already gives the wanted statement. □
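The key step of the proof, that multiplication by $a$ merely permutes a reduced residue system, can be observed on a concrete modulus. The values $m = 20$ and $a = 7$ below are arbitrary choices for illustration, as is the helper name `reduced_residue_system`.

```python
from math import gcd


def reduced_residue_system(m: int) -> list:
    """The standard reduced residue system: 1 <= x <= m with gcd(x, m) = 1."""
    return [x for x in range(1, m + 1) if gcd(x, m) == 1]


m, a = 20, 7                                  # any pair with gcd(a, m) = 1 works
xs = reduced_residue_system(m)
permuted = sorted(a * x % m for x in xs)      # a*x_1, ..., a*x_phi(m) reduced mod m

print(xs)
print(permuted)                               # same residues, only reordered
print(pow(a, len(xs), m))                     # a^phi(m) mod m
```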
Remark. As we have already mentioned, Euler's theorem is a consequence of Lagrange's theorem (see 12.3.10) applied to the group $(\mathbb{Z}_m^\times, \cdot)$. This proof of Euler's theorem utilized the fact that multiplying by an integer $a$ which is coprime to $m$ is, in algebraic words, a bijection of the group $(\mathbb{Z}_m^\times, \cdot)$ onto itself.
There is an important concept which is tightly connected to Euler's totient function and Euler's theorem: the so-called order of an integer modulo m - once again, it is nothing else than the order of the corresponding element in the group of invertible residue classes modulo m:
11.B.24.
i) Determine the last digit of the number 79
ii) Determine the remainder of the number $15^{14^{13}}$ when divided by 11.
Solution.
i) The order of 7 modulo 100 is equal to 4 by exercise 11.B.17, so it suffices to find the remainder of the (extremely large) exponent upon division by 4. Since $9 \equiv 1 \pmod 4$, the entire exponent leaves remainder 1 as well. Therefore, the wanted digit is $7^1 = 7$.
ii) The order of $15 \equiv 4 \pmod{11}$ is 5 (which can be found by direct computation or from the fact that 2 is a primitive root modulo 11 (see also 11.B.28); then, theorem 11.3.4 yields that the order of $4 = 2^2$ is $\frac{10}{(10, 2)} = 5$). It is thus sufficient to determine the remainder of the exponent modulo 5. We have
$$14^{13} \equiv (-1)^{13} = -1 \equiv 4 \pmod 5,$$
so the wanted remainder is $4^4 = 2^8 = 256 \equiv 6 - 5 + 2 = 3 \pmod{11}$. (We could also have proceeded as follows: $4^4 \equiv 4^{-1} \equiv 3 \pmod{11}$.) □
11.B.25.   Determine the last two digits of the decimal expansion of the number $14^{14^{14}}$.
Solution. We are interested in the remainder of the number $a = 14^{14^{14}}$ upon division by 100. However, since $(14, 100) > 1$, we cannot consider the order of 14 modulo 100. Instead, we can factor the modulus to coprime integers: $100 = 4 \cdot 25$. Apparently, $4 \mid a$, so it remains to find the remainder of $a$ modulo 25. By Euler's theorem, we have $14^{\varphi(25)} = 14^{20} \equiv 1 \pmod{25}$,
so we are interested in the remainder of $14^{14}$ upon division by $20 = 4 \cdot 5$. Again, we clearly have $4 \mid 14^{14}$, and further $14^{14} \equiv (-1)^{14} = 1 \pmod 5$, so
$$14^{14} \equiv 16 \pmod{20}.$$
Altogether,
$$14^{14^{14}} \equiv 14^{16} \pmod{25}.$$
We can simplify the computation to come a lot if we realize that
$7^2 \equiv -1 \pmod{25}$ and $2^5 \equiv 7 \pmod{25}$.
Order of an integer
Let $a \in \mathbb{Z}$, $m \in \mathbb{N}$, where $(a, m) = 1$. The order of $a$ modulo $m$ is the least natural number $n$ satisfying
$$a^n \equiv 1 \pmod m.$$
It follows from Euler's theorem that the order of an integer is well-defined: the order of any integer coprime to the modulus is surely not greater than $\varphi(m)$. As we will see later, the integers whose order is exactly $\varphi(m)$ are of great interest. They are called primitive roots modulo $m$ and play an important role in solving binomial congruences, among others. This concept is just another name for a generator of the group $(\mathbb{Z}_m^\times, \cdot)$.
Some of the very basic results on the order are demonstrated in 11.B.18, and a complete description of the dependency of the order upon the exponent is given by the subsequent two theorems.
11.3.3. Theorem. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. Let $r$ denote the order of $a$ modulo $m$. Then, for any $t, s \in \mathbb{N}_0$, we have
$$a^t \equiv a^s \pmod m \iff t \equiv s \pmod r.$$
Proof. Without loss of generality, we can assume that $t \ge s$. Dividing the integer $t - s$ by $r$ with remainder, we get $t - s = q \cdot r + z$, where $q, z \in \mathbb{N}_0$, $0 \le z < r$.
"$\Leftarrow$" Since $t \equiv s \pmod r$, we have $z = 0$, hence $a^{t-s} = a^{qr} = (a^r)^q \equiv 1^q = 1 \pmod m$. Multiplying both sides of the congruence by the integer $a^s$ leads to the wanted statement.
"$\Rightarrow$" It follows from $a^t \equiv a^s \pmod m$ that $a^s \cdot a^{qr+z} \equiv a^s \pmod m$. Since $a^r \equiv 1 \pmod m$, we also have $a^{qr+z} \equiv a^z \pmod m$. Altogether, after dividing both sides of the first congruence by the integer $a^s$ (which is coprime to the modulus), we get $a^z \equiv 1 \pmod m$. Since $z < r$, it follows from the definition of the order that $z = 0$, hence $r \mid t - s$. □
The above theorem and Euler's theorem apparently lead to the following corollary (whose second part is only a reformulation of Lagrange's theorem 12.3.10 for our situation):
Corollary. Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$, and $r$ be the order of $a$ modulo $m$.
(1) For any $n \in \mathbb{N} \cup \{0\}$, it holds that
$$a^n \equiv 1 \pmod m \iff r \mid n.$$
(2) $r \mid \varphi(m)$.
The following theorem is a generalization of the result in 11.B.19.
11.3.4. Theorem. Let $m, n \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. If the order of $a$ modulo $m$ is $r$, then the order of $a^n$ modulo $m$ is
$$\frac{r}{(n, r)}.$$
Then,
$$14^{14^{14}} \equiv 2^{16} \cdot 7^{16} = (2^5)^3 \cdot 2 \cdot 7^{16} \equiv 7^3 \cdot 2 \cdot 7^{16} = 2 \cdot 7^{19} \equiv 2 \cdot (-1)^9 \cdot 7 = -14 \equiv 11 \pmod{25}.$$
We are thus looking for a non-negative integer which is less than 100, is a multiple of 4, and leaves remainder 11 when divided by 25; the only such number is clearly 36. □
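The whole computation can be verified in two ways: directly, using Python's built-in modular exponentiation, and by replaying the steps of the solution (reduce the exponent modulo $\varphi(25) = 20$, compute the residue modulo 25, then combine with divisibility by 4). The variable names below are illustrative only.

```python
# Direct check: 14**14 is an ordinary (large) integer, pow reduces modulo 100.
direct = pow(14, 14**14, 100)

# Replaying the solution step by step.
exponent_mod_20 = pow(14, 14, 20)               # exponent reduced modulo phi(25)
residue_mod_25 = pow(14, exponent_mod_20, 25)   # 14^(14^14) modulo 25
candidates = [x for x in range(100)
              if x % 4 == 0 and x % 25 == residue_mod_25]

print(direct, exponent_mod_20, residue_mod_25, candidates)
```

By the Chinese remainder theorem the list `candidates` contains exactly one number, because 4 and 25 are coprime.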
11.B.26.   Determine the last three digits of the number $12^{10^{11}}$.
Solution. Similarly to the above exercise, we will examine the remainders upon division by the coprime integers 125 and 8. We know that $(12, 125) = 1$ and $\varphi(125) = 100$, so
$$12^{10^{11}} = \left(12^{100}\right)^{10^9} \equiv 1^{10^9} = 1 \pmod{125}.$$
Since $4 \mid 12$, the number $12^{10^{11}}$ is divisible even by $4^{10^{11}}$, so it is divisible by mere 8 as well, hence $12^{10^{11}} \equiv 0 \pmod 8$. The Chinese remainder theorem states that exactly one of the integers $0, 1, \ldots, 999$ leaves remainder 1 upon division by 125 and is divisible by 8. This integer is 376 (it can be found, for instance, by going through the multiples of 125 increased by 1 and examining their divisibility by 8). Therefore, the last three digits of the number $12^{10^{11}}$ are 376. □
11.B.27.   Find all natural numbers $n$ for which the integer $5^n - 4^n - 3^n$ is divisible by eleven.
Solution. The orders of all of the numbers 3, 4, and 5 modulo 11 are equal to five, so it suffices to examine $n \in \{0, 1, 2, 3, 4\}$. It can be seen from the following table
$n$		0	1	2	3	4
$5^n$	mod 11	1	5	3	4	9
$4^n$	mod 11	1	4	5	9	3
$3^n$	mod 11	1	3	9	5	4
that only the case $n \equiv 2 \pmod 5$ yields $3 - 5 - 9 \equiv 0 \pmod{11}$.
The problem is thus satisfied by exactly those natural numbers $n$ which satisfy $n \equiv 2 \pmod 5$. □
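A short brute-force check of this conclusion over an initial range of exponents; the upper bound 55 is an arbitrary choice, since the residues repeat with period 5.

```python
# All n in 1..55 for which 11 divides 5^n - 4^n - 3^n.
solutions = [n for n in range(1, 56) if (5**n - 4**n - 3**n) % 11 == 0]
print(solutions)

# Rows of the table: residues of 5^n, 4^n, 3^n modulo 11 for n = 0..4.
table = {base: [pow(base, n, 11) for n in range(5)] for base in (5, 4, 3)}
print(table)
```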
11.B.28. Primitive roots. Show that there are no primitive roots modulo 8, and find any primitive root modulo 11.
Solution. Apparently, even integers cannot be primitive roots modulo 8, so it remains to examine odd ones. We can easily calculate that $3^2 \equiv 5^2 \equiv 7^2 \equiv 1 \pmod 8$, but $\varphi(8) = 4 > 2$.
We will verify that 2 is a primitive root modulo 11. The order of 2 divides $\varphi(11) = 10$, so it suffices to show that $2^2 \not\equiv 1$
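Proof. Since $\frac{n \cdot r}{(n, r)} = [r, n]$, which is clearly a multiple of $r$, we have
$$(a^n)^{\frac{r}{(n, r)}} = a^{[r, n]} \equiv 1 \pmod m$$
(the last statement follows from the above corollary, because $r \mid [r, n]$). On the other hand, if $k \in \mathbb{N}$ is such that $(a^n)^k = a^{n \cdot k} \equiv 1 \pmod m$, we get $r \mid n \cdot k$ (since $r$ is the order of $a$). Further, we know that $\frac{r}{(n, r)} \mid \frac{n}{(n, r)} \cdot k$, whence (thanks to $\frac{r}{(n, r)}$ and $\frac{n}{(n, r)}$ being coprime) $\frac{r}{(n, r)} \mid k$. Therefore, $\frac{r}{(n, r)}$ is the order of the integer $a^n$ modulo $m$.
□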
The last statement of this series connects the orders of two integers to the order of their product:
Lemma. Let $m \in \mathbb{N}$, $a, b \in \mathbb{Z}$, $(a, m) = (b, m) = 1$. If $a$ has order $r$ and $b$ has order $s$ modulo $m$, where $(r, s) = 1$, then the integer $a \cdot b$ has order $r \cdot s$ modulo $m$.
Proof. Let $\delta$ denote the order of $a \cdot b$. Then $(ab)^\delta \equiv 1 \pmod m$. Raising both sides of this congruence to the $r$-th power leads to $a^{r\delta} b^{r\delta} \equiv 1 \pmod m$. Since $r$ is the order of $a$, we have $a^r \equiv 1 \pmod m$, i.e., $b^{r\delta} \equiv 1 \pmod m$, and so $s \mid r\delta$. From $r$ being coprime to $s$, we get $s \mid \delta$. Analogously, we can get $r \mid \delta$, so (again utilizing that $r$, $s$ are coprime) $r \cdot s \mid \delta$. On the other hand, we clearly have $(ab)^{rs} \equiv 1 \pmod m$, hence $\delta \mid rs$. Altogether, $\delta = rs$. □
11.3.5. Primitive roots. Among the integers coprime to a modulus $m$ (i.e., the elements of a reduced residue system modulo $m$), the most important ones are those whose order is equal to $\varphi(m)$. Step-by-step exponentiation of such a number yields all possible elements of a reduced residue system (or integers congruent to them). Therefore, in various problems, we can work with powers of a given integer instead of considering arbitrary elements of a reduced residue system modulo $m$, and this is often much simpler (see, for instance, the proof of the theorem 11.4.10 about binomial coefficients).
Primitive root
Let $m \in \mathbb{N}$. An integer $g$, $(g, m) = 1$, is said to be a primitive root modulo $m$ iff its order modulo $m$ equals $\varphi(m)$.
Lemma. If $g$ is a primitive root modulo $m$, then for every integer $a$ such that $(a, m) = 1$, there is a unique $x_a \in \mathbb{Z}$, $0 \le x_a < \varphi(m)$, with the property that $g^{x_a} \equiv a \pmod m$.
The mapping $a \mapsto x_a$ is called the discrete logarithm or index of the integer $a$ (with respect to a given modulus $m$ and a fixed primitive root $g$), and it is a bijection between the sets $\{a \in \mathbb{Z}; (a, m) = 1, 0 < a < m\}$ and $\{x \in \mathbb{Z}; 0 \le x < \varphi(m)\}$.
Proof. Suppose that it holds for $x, y \in \mathbb{Z}$, $0 \le x, y < \varphi(m)$, that $g^x \equiv g^y \pmod m$. From the properties of the order, we get $x \equiv y \pmod{\varphi(m)}$, i.e., $x = y$, so the mapping $x \mapsto g^x$ is injective. Since it is a mapping between two finite sets which have the same number of elements, it must be surjective as well. □
$\pmod{11}$ and $2^5 = 32 \equiv -1 \not\equiv 1 \pmod{11}$. Therefore, the
order of 2 modulo 11 is indeed 10.
□
11.B.29.   We will now determine (with the help of some theorems from the theoretical part) primitive roots modulo 41, $41^2$, and $2 \cdot 41^2$.
Solution. Since $\varphi(41) = 40 = 2^3 \cdot 5$, it holds that an integer $g$ coprime to 41 is a primitive root modulo 41 if and only if
$$g^{20} \not\equiv 1 \pmod{41} \ \wedge \ g^8 \not\equiv 1 \pmod{41}.$$
Now, we will go through the potential primitive roots in ascending order:
$g = 2$:   $2^8 = 2^5 \cdot 2^3 \equiv -9 \cdot 8 \equiv 10 \pmod{41}$,
$2^{20} = (2^5)^4 \equiv (-9)^4 = 81^2 \equiv (-1)^2 = 1 \pmod{41}$,
$g = 3$:   $3^8 = (3^4)^2 \equiv (-1)^2 = 1 \pmod{41}$,
$g = 4$:   the order of $4 = 2^2$ always divides the order of 2,
$g = 5$:   $5^8 = (5^2)^4 \equiv (-2^4)^4 = 2^{16} = (2^8)^2 \equiv 10^2 \equiv 18 \pmod{41}$,
$5^{20} = (5^2)^{10} \equiv (-2^4)^{10} = 2^{40} = (2^{20})^2 \equiv 1 \pmod{41}$,
$g = 6$:   $6^8 = 2^8 \cdot 3^8 \equiv 10 \cdot 1 = 10 \pmod{41}$,
$6^{20} = 2^{20} \cdot 3^{20} = 2^{20} \cdot (3^8)^2 \cdot 3^4 \equiv 1 \cdot 1 \cdot (-1) = -1 \pmod{41}$.
We have thus proved that 6 is the least positive primitive root modulo 41 (if we were interested in other primitive roots modulo 41 as well, we would get them as the powers of 6 with exponent taking on values from the range 1 to 40 which are coprime to 40. There are exactly $\varphi(40) = \varphi(2^3 \cdot 5) = 16$ of them, and the resulting remainders modulo 41 are $\pm 6, \pm 7, \pm 11, \pm 12, \pm 13, \pm 15, \pm 17, \pm 19$).
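The criterion used above ($g$ is a primitive root modulo a prime $p$ iff $g^{(p-1)/q} \not\equiv 1$ for every prime $q$ dividing $p - 1$) is easy to automate. In the sketch below the prime divisors of $p - 1 = 40$ are passed in by hand as `(2, 5)`; the function name `is_primitive_root` is an illustrative choice.

```python
def is_primitive_root(g: int, p: int, prime_divisors_of_phi) -> bool:
    """Test g modulo a prime p: g^((p-1)/q) must differ from 1 for each prime q | p-1."""
    phi = p - 1
    return all(pow(g, phi // q, p) != 1 for q in prime_divisors_of_phi)


p = 41
roots = [g for g in range(1, p) if is_primitive_root(g, p, (2, 5))]
print(roots[0], len(roots))
print(roots)

# The same set obtained as powers of 6 with exponents coprime to 40.
from math import gcd
powers_of_6 = sorted(pow(6, e, p) for e in range(1, 41) if gcd(e, 40) == 1)
print(powers_of_6 == roots)
```

For $q = 2$ and $q = 5$ the exponents $(p-1)/q$ are 20 and 8, which are exactly the two powers tested by hand in the solution.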
Now, if we prove that $6^{40} \not\equiv 1 \pmod{41^2}$, we will know that 6 is a primitive root modulo any power of 41 (if we had "bad luck" and found out that $6^{40} \equiv 1 \pmod{41^2}$, then a primitive root modulo $41^2$ would be $47 = 6 + 41$). To avoid manipulating huge numbers when verifying the condition, we will use several tricks (the so-called residue number system).
First of all, we calculate the remainder of $6^8$ upon division by $41^2$; this problem can be further reduced to computing
If there are primitive roots at all for a natural number $m$, then there are exactly $\varphi(\varphi(m))$ of them among the integers $1, 2, \dots, m$: If $g$ is a primitive root and $a \in \{1, 2, \dots, \varphi(m)\}$ is arbitrary, then the order of $g^a$ is $\varphi(m)/(a, \varphi(m))$ (by theorem 11.3.4), which is equal to $\varphi(m)$ if and only if $(a, \varphi(m)) = 1$, and there are exactly $\varphi(\varphi(m))$ such integers in the set $\{1, 2, \dots, \varphi(m)\}$.
Now, we are about to show that primitive roots exist for a sufficient amount of moduli m.
11.3.6. Theorem (Existence of primitive roots). Let $m \in \mathbb{N}$, $m > 1$. The modulus $m$ has primitive roots if and only if at least one of the following conditions holds:
• $m = 2$ or $m = 4$,
• $m$ is a power of an odd prime,
• $m$ is twice a power of an odd prime.
The proof of this theorem will be done in several steps. We can easily see that 1 is a primitive root modulo 2 and 3 is a primitive root modulo 4. Further, we will show that primitive roots exist modulo any odd prime (in algebraic words, this is another proof of the fact that the group $(\mathbb{Z}_m^\times, \cdot)$ of invertible residue classes modulo a prime $m$ is cyclic; see also 12.3.8).
Proposition. Let p be an odd prime. Then there are primitive roots modulo p.
Proof. Let $r_1, r_2, \dots, r_{p-1}$ be the orders of the integers $1, 2, \dots, p-1$ modulo $p$. Let $\delta = [r_1, r_2, \dots, r_{p-1}]$ be the least common multiple of these orders. We will show that there is an integer of order $\delta$ among $1, 2, \dots, p-1$ and that $\delta = p - 1$.
Let $\delta = q_1^{\alpha_1} \cdots q_k^{\alpha_k}$ be the factorization of $\delta$ into primes. For every $s \in \{1, \dots, k\}$, there is a $c \in \{1, \dots, p-1\}$ such that $q_s^{\alpha_s} \mid r_c$ (otherwise, there would be a common multiple of the integers $r_1, r_2, \dots, r_{p-1}$ less than $\delta$). Therefore, there exists an integer $b$ such that $r_c = b \cdot q_s^{\alpha_s}$. Since $c$ has order $r_c$, the order of the integer $g_s := c^b$ is equal to $q_s^{\alpha_s}$ (by theorem 11.3.4 on orders of powers).
Reasoning analogously for every $s \in \{1, \dots, k\}$, we get integers $g_1, \dots, g_k$, and we can set $g := g_1 \cdots g_k$. From the properties of the order of a product, we get that the order of $g$ is equal to the product of the orders of the integers $g_1, \dots, g_k$, i.e. to $q_1^{\alpha_1} \cdots q_k^{\alpha_k} = \delta$.
Now, we prove that $\delta = p - 1$. Since the orders of the integers $1, 2, \dots, p-1$ divide $\delta$, we get the congruence $x^\delta \equiv 1 \pmod p$ for any $x \in \{1, 2, \dots, p-1\}$. By theorem 11.4.8, there are at most $\delta$ solutions to a congruence of degree $\delta$ modulo a prime $p$ (in algebraic words, we are actually looking for roots of a polynomial over a field, and there cannot be more of them than the degree of the polynomial, as we will see in part 12.2.4). On the other hand, we have already shown that this congruence has $p - 1$ solutions, so necessarily $\delta \ge p - 1$. Still, $\delta$ is (being the order of $g$) a divisor of $p - 1$, whence we finally get the wanted equality $\delta = p - 1$. □
the remainders of the integers $2^8$ and $3^8$:
$$2^8 = 256 = 6 \cdot 41 + 10 \pmod{41^2},$$
$$3^8 = (3^4)^2 = (2 \cdot 41 - 1)^2 \equiv -4 \cdot 41 + 1 \pmod{41^2}.$$
Then,
$$6^8 = 2^8 \cdot 3^8 \equiv (6 \cdot 41 + 10)(-4 \cdot 41 + 1) \equiv -34 \cdot 41 + 10 \equiv 7 \cdot 41 + 10 \pmod{41^2}$$
and
$$6^{40} = (6^8)^5 \equiv (7 \cdot 41 + 10)^5 \equiv 10^5 + 5 \cdot 7 \cdot 41 \cdot 10^4 = 10^4 (10 + 35 \cdot 41) \equiv (-2 \cdot 41 - 4)(-6 \cdot 41 + 10) \equiv 4 \cdot 41 - 40 = 124 \not\equiv 1 \pmod{41^2}.$$
In the calculation, we made use of the fact that $10^4 = 6 \cdot 41^2 - 86$, i.e., $10^4 \equiv -2 \cdot 41 - 4 \pmod{41^2}$.
Therefore, 6 is a primitive root modulo $41^2$, and since it is an even integer, we can see that $1687 = 6 + 41^2$ is a primitive root modulo $2 \cdot 41^2$ (while the least positive primitive root modulo $2 \cdot 41^2$ is the integer 7). □
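The hand computation can be cross-checked numerically. The sketch below is illustrative only: `multiplicative_order` is an assumed helper that finds the order by brute force, which is feasible here because the moduli are small.

```python
from math import gcd

def multiplicative_order(a, m):
    """Smallest n >= 1 with a**n == 1 (mod m); requires gcd(a, m) == 1."""
    if gcd(a, m) != 1:
        raise ValueError("a must be coprime to m")
    n, power = 1, a % m
    while power != 1:
        power = (power * a) % m
        n += 1
    return n

M = 41 ** 2
residue = pow(6, 40, M)                            # remainder computed by hand above
order_mod_M = multiplicative_order(6, M)           # phi(41**2) = 40 * 41 = 1640
order_mod_2M = multiplicative_order(1687, 2 * M)   # phi(2 * 41**2) = 1640 as well
least_mod_2M = next(g for g in range(1, 2 * M)
                    if gcd(g, 2 * M) == 1
                    and multiplicative_order(g, 2 * M) == 1640)
```

An element is a primitive root exactly when its order equals $\varphi$ of the modulus, so the three order computations confirm the three claims of the solution.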
Möbius inversion formula and irreducible polynomials.
In the theoretical part, we prove the properties of Euler's totient function using the so-called Möbius inversion formula.
The standard form of this formula connects the expression of an arithmetic function $F$ of natural numbers in terms of a function $f$ in the form
$$F(n) = \sum_{d \mid n} f(d)$$
to the inverse expression of the function $f$ in terms of the function $F$ in the form
$$f(n) = \sum_{d \mid n} \mu\!\left(\tfrac{n}{d}\right) F(d).$$
The function value $\mu(n)$ depends on the prime factorization of the input value $n$ as follows:
• if a square of a prime divides $n$, then $\mu(n) = 0$;
• otherwise, we set $\mu(n) = (-1)^k$, where $k$ is the number of primes which divide $n$.
This formula can be generalized in many ways, especially in cases where the functions $F$ and $f$ map natural numbers into an abelian group $(G, \cdot)$. In this case, the formula
Now, we show that there are primitive roots modulo powers of odd primes. First, we prove two helping lemmas.
Lemma. Let $p$ be an odd prime, $\ell \ge 2$ arbitrary. Then, it holds for any $a \in \mathbb{Z}$ that
$$(1 + ap)^{p^{\ell-2}} \equiv 1 + a p^{\ell-1} \pmod{p^\ell}.$$
Proof. This will follow easily from the binomial theorem using mathematical induction on $\ell$.
I. The statement is clearly true for $\ell = 2$.
II. Let the statement be true for $\ell$, and let us prove it for $\ell + 1$. Invoking exercise 11.B.7 and raising the statement for $\ell$ to the $p$-th power, we obtain
$$(1 + ap)^{p^{\ell-1}} \equiv \left(1 + a p^{\ell-1}\right)^p \pmod{p^{\ell+1}}.$$
It follows from the binomial theorem that
$$\left(1 + a p^{\ell-1}\right)^p = 1 + p \cdot a \cdot p^{\ell-1} + \sum_{k=2}^{p} \binom{p}{k} a^k p^{(\ell-1)k},$$
and since we have $p \mid \binom{p}{k}$ for $1 < k < p$ (by exercise 11.B.6), it suffices to show that $p^{\ell+1} \mid p^{1+(\ell-1)k}$, which is equivalent to $1 \le (k-1)(\ell-1)$. Thanks to the assumption $\ell \ge 2$ (and $p \ge 3$), we get that $p^{\ell+1} \mid p^{(\ell-1)p}$ for $k = p$ as well. □
Lemma. Let $p$ be an odd prime, $\ell \ge 2$ arbitrary. Then, it holds for any integer $a$ satisfying $p \nmid a$ that the order of $1 + ap$ modulo $p^\ell$ equals $p^{\ell-1}$.
Proof. By the previous lemma, we have
$$(1 + ap)^{p^{\ell-1}} \equiv 1 + a p^{\ell} \pmod{p^{\ell+1}},$$
and considering this congruence modulo $p^\ell$, we get $(1 + ap)^{p^{\ell-1}} \equiv 1 \pmod{p^\ell}$. At the same time, it follows directly from the previous lemma and $p$ not being a divisor of $a$ that $(1 + ap)^{p^{\ell-2}} \not\equiv 1 \pmod{p^\ell}$, which gives the wanted proposition. □
Proposition. Let $p$ be an odd prime. Then, for every $\ell \in \mathbb{N}$, there is a primitive root modulo $p^\ell$.
Proof. Let $g$ be a primitive root modulo $p$. We will show that if $g^{p-1} \not\equiv 1 \pmod{p^2}$, then $g$ is a primitive root even modulo $p^\ell$ for any $\ell \in \mathbb{N}$. (If we had $g^{p-1} \equiv 1 \pmod{p^2}$, then $(g + p)^{p-1} \equiv 1 + (p-1) g^{p-2} p \not\equiv 1 \pmod{p^2}$, so we could choose $g + p$ for the original primitive root instead of the congruent integer $g$.)
Let $g$ satisfy $g^{p-1} \not\equiv 1 \pmod{p^2}$. Then, there is an $a \in \mathbb{Z}$, $p \nmid a$, such that $g^{p-1} = 1 + p \cdot a$. We will show that the order of $g$ modulo $p^\ell$ is $\varphi(p^\ell) = (p-1) p^{\ell-1}$. Let $n$ be the least natural number which satisfies $g^n \equiv 1 \pmod{p^\ell}$. By the previous lemma, the order of $g^{p-1} = 1 + p \cdot a$ modulo $p^\ell$ is $p^{\ell-1}$. However, then it follows from the corollary of 11.3.3 that
$$\left(g^{p-1}\right)^n = \left(g^n\right)^{p-1} \equiv 1 \pmod{p^\ell} \implies p^{\ell-1} \mid n.$$
At the same time, the congruence $g^n \equiv 1 \pmod p$ implies that $p - 1 \mid n$. From $p - 1$ and $p^{\ell-1}$ being coprime, we get
(considering the operation in $G$ to be multiplicative) gets the form
$$f(n) = \prod_{d \mid n} F(d)^{\mu(n/d)}.$$
Now, we will demonstrate the use of the Möbius inversion formula on a more complex example from the theory of finite fields. Let us consider a $p$-element field $\mathbb{F}_p$ (i.e., the ring of residue classes modulo a prime $p$) and examine the number $N_d$ of monic irreducible polynomials of a given degree $d$ over this field. Let $S_d(x)$ denote the product of all such polynomials. Now, we borrow a (not very hard) theorem from the theory of finite fields which states that for all $n \in \mathbb{N}$, we have
$$x^{p^n} - x = \prod_{d \mid n} S_d(x).$$
Comparing the degrees of the polynomials on both sides yields
$$p^n = \sum_{d \mid n} d N_d,$$
whence we get, by applying the standard Möbius inversion formula, that
$$N_n = \frac{1}{n} \sum_{d \mid n} \mu\!\left(\tfrac{n}{d}\right) p^d.$$
In particular, we can see that for any $n \in \mathbb{N}$, it holds that $N_n = \frac{1}{n}\left(p^n - \cdots + \mu(n) p\right) \neq 0$ since the expression in the parentheses is a sum of distinct powers of $p$ multiplied by coefficients ±1, so it cannot be equal to 0. Therefore, there exist irreducible polynomials over $\mathbb{F}_p$ of an arbitrary degree $n$, so there are finite fields $\mathbb{F}_{p^n}$ (having $p^n$ elements) for any prime $p$ and natural number $n$ (in chapter 12, we will show that such a field can be constructed as the quotient ring $\mathbb{F}_p[x]/(f)$ of the ring of polynomials over $\mathbb{F}_p$ modulo the ideal generated by an irreducible polynomial $f \in \mathbb{F}_p[x]$ of degree $n$, whose existence has just been proved).
11.B.30. Determine the number of irreducible polynomials over $\mathbb{Z}_2$ of degree 5 and the number of monic irreducible polynomials over $\mathbb{Z}_3$ of degree 4.
Solution. By the formula we have proved, the number of (monic) irreducible polynomials over $\mathbb{Z}_2$ of degree 5 is equal to
$$N_5 = \frac{1}{5} \sum_{d \mid 5} \mu\!\left(\tfrac{5}{d}\right) 2^d = \frac{1}{5}\left(\mu(1) \cdot 2^5 + \mu(5) \cdot 2\right) = 6.$$
The number of monic irreducible polynomials over $\mathbb{Z}_3$ of degree four is then
that $(p-1) p^{\ell-1} \mid n$. Therefore, $n = \varphi(p^\ell)$, and $g$ is thus a primitive root modulo $p^\ell$. □
Proposition. Let $p$ be an odd prime and $g$ a primitive root modulo $p^\ell$ for $\ell \in \mathbb{N}$. Then the odd one of the integers $g$, $g + p^\ell$ is a primitive root modulo $2p^\ell$.
Proof. Let $c$ be an odd natural number. Then, for every $n \in \mathbb{N}$, we have $c^n \equiv 1 \pmod{p^\ell}$ if and only if $c^n \equiv 1 \pmod{2p^\ell}$. Since $\varphi(2p^\ell) = \varphi(p^\ell)$, every odd primitive root modulo $p^\ell$ is also a primitive root modulo $2p^\ell$. □
The subsequent proposition describes the case of powers of two. We will use similar helping lemmas as in the case of odd primes.
Lemma. Let $\ell \in \mathbb{N}$, $\ell \ge 3$. Then $5^{2^{\ell-3}} \equiv 1 + 2^{\ell-1} \pmod{2^\ell}$.
Proof. Similarly to the case of an odd prime $p$ above. □
Lemma. Let $\ell \in \mathbb{N}$, $\ell \ge 3$. Then the order of the integer 5 modulo $2^\ell$ is $2^{\ell-2}$.
Proof. Easily from the above lemma. □
Proposition. Let $\ell \in \mathbb{N}$. There are primitive roots modulo $2^\ell$ if and only if $\ell \le 2$.
Proof. Let $\ell \ge 3$. Then the set
$$S = \{(-1)^a \cdot 5^b;\ a \in \{0, 1\},\ 0 \le b < 2^{\ell-2},\ b \in \mathbb{Z}\}$$
forms a reduced residue system modulo $2^\ell$: it has $\varphi(2^\ell)$ elements, and it can be easily verified that they are pairwise incongruent modulo $2^\ell$.
At the same time (utilizing the previous lemma), the order of every element of $S$ apparently divides $2^{\ell-2}$. Therefore, this reduced system cannot (and nor can any other) contain an element of order $\varphi(2^\ell) = 2^{\ell-1}$. □
The last piece to the jigsaw puzzle of propositions which collectively prove theorem 11.3.6 is the statement about nonexistence of primitive roots for composite numbers which are neither a power of prime nor twice such.
Proposition. Let $m \in \mathbb{N}$ be divisible by at least two primes, and let it not be twice a power of an odd prime. Then, there are no primitive roots modulo $m$.
Proof. Let $m$ factor into primes as $2^{\alpha} p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $\alpha \in \mathbb{N}_0$, $\alpha_i \in \mathbb{N}$, $2 \nmid p_i$, and $k \ge 2$ or both $k \ge 1$ and $\alpha \ge 2$. Denoting $\delta = [\varphi(2^{\alpha}), \varphi(p_1^{\alpha_1}), \dots, \varphi(p_k^{\alpha_k})]$, we can easily see that $\delta < \varphi(2^{\alpha}) \cdot \varphi(p_1^{\alpha_1}) \cdots \varphi(p_k^{\alpha_k}) = \varphi(m)$ and that for any $a \in \mathbb{Z}$, $(a, m) = 1$, we have $a^{\delta} \equiv 1 \pmod m$. Therefore, there are no primitive roots modulo $m$. □
In general, it is computationally very hard to find a primitive root for a given modulus. The following theorem describes a necessary and sufficient condition for the examined integer to be a primitive root.
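$$N_4 = \frac{1}{4} \sum_{d \mid 4} \mu\!\left(\tfrac{4}{d}\right) 3^d = \frac{1}{4}\left(\mu(1) \cdot 3^4 + \mu(2) \cdot 3^2 + \mu(4) \cdot 3^1\right) = \frac{1}{4}(81 - 9) = 18.$$ □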
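The counting formula $N_n = \frac{1}{n} \sum_{d \mid n} \mu(n/d)\, p^d$ is easy to evaluate mechanically. The following is a minimal sketch; the function names are assumptions introduced for illustration.

```python
def mobius(n):
    """Moebius function: 0 if a square of a prime divides n, else (-1)**k."""
    result, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:      # d**2 divided the original n
                return 0
            result = -result
        d += 1
    if n > 1:                   # one remaining prime factor
        result = -result
    return result

def count_monic_irreducible(p, n):
    """N_n = (1/n) * sum over d | n of mu(n/d) * p**d."""
    total = sum(mobius(n // d) * p ** d for d in range(1, n + 1) if n % d == 0)
    return total // n
```

Calling `count_monic_irreducible(2, 5)` and `count_monic_irreducible(3, 4)` reproduces the two counts obtained in exercise 11.B.30.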
C. Solving congruences
Linear congruences. The following exercise illustrates that the procedure we mentioned in the proof of theorem 11.4.3 about solvability of linear congruences (which invokes Euler's theorem) is usually not the most efficient one - we can utilize both Bezout's theorem and equivalent modifications of the given congruence.
11.C.1. Solve the congruence
$$39x \equiv 41 \pmod{47}.$$
Solution.
11.3.7. Theorem. Let $m$ be such an integer that there are primitive roots modulo $m$. Let us write $\varphi(m) = q_1^{\alpha_1} \cdots q_k^{\alpha_k}$,
i) First, we use Euler's theorem. Since $(39, 47) = 1$, we have $39^{46} \equiv 1 \pmod{47}$. Multiplying the congruence by $39^{45}$ gives
$$39^{45} \cdot 39x \equiv 39^{45} \cdot 41 \pmod{47},$$
whence it already follows that
$$x \equiv 39^{45} \cdot 41 \pmod{47}.$$
To complete the solution, it remains to calculate the remainder of $39^{45} \cdot 41$ when divided by 47, which is left as an exercise to the kind reader, leading to the result $x \equiv 36 \pmod{47}$.
ii) Another option is to make use of Bezout's theorem. The Euclidean algorithm applied to the pair $(39, 47)$ yields
$$47 = 1 \cdot 39 + 8,\quad 39 = 4 \cdot 8 + 7,\quad 8 = 1 \cdot 7 + 1.$$
In the other direction, this leads to
$$1 = 8 - 7 = 8 - (39 - 4 \cdot 8) = 5 \cdot 8 - 39 = 5 \cdot (47 - 39) - 39 = 5 \cdot 47 - 6 \cdot 39.$$
where $q_1, \dots, q_k$ are primes and $\alpha_1, \dots, \alpha_k \in \mathbb{N}$. Then, for every $g \in \mathbb{Z}$, $(g, m) = 1$, it holds that $g$ is a primitive root modulo $m$ if and only if none of the following congruences holds:
$$g^{\varphi(m)/q_1} \equiv 1 \pmod m,\quad \dots,\quad g^{\varphi(m)/q_k} \equiv 1 \pmod m.$$
Proof. If any of the congruences were true, it would mean that the order of $g$ is less than $\varphi(m)$.
On the other hand, if $g$ fails to be a primitive root, then there is a $d \in \mathbb{N}$, $d \mid \varphi(m)$, where $d < \varphi(m)$ and $g^d \equiv 1 \pmod m$. If $u = \varphi(m)/d > 1$, then there must be an $i \in \{1, \dots, k\}$ such that $q_i \mid u$. However, then we get
$$g^{\varphi(m)/q_i} = g^{d \cdot u/q_i} \equiv 1 \pmod m.$$
□
4. Solving congruences and systems of them
This part will be devoted to the analog of solving equations in a numerical domain. We will actually be solving equations (and systems of equations) in the ring of residue classes $(\mathbb{Z}_m, +, \cdot)$; we will, however, talk about solving congruences modulo $m$ and write it in the more transparent way as usual.
Congruence in one variable
Let $m \in \mathbb{N}$, $f(x), g(x) \in \mathbb{Z}[x]$. The notation
$$f(x) \equiv g(x) \pmod m$$
is called a congruence in variable $x$, and it is understood to be the problem of finding the set of solutions, i.e., the set of all integers $c$ for which $f(c) \equiv g(c) \pmod m$.
Two congruences (in one variable) are called equivalent iff they have the same set of solutions.
The mentioned congruence is equivalent to the congruence
$$f(x) - g(x) \equiv 0 \pmod m.$$
The only method which always leads to a solution is trying out all possible values (however, this would, of course, often take too much time). This procedure is formalized by the following proposition.
11.4.1. Proposition. Let $m \in \mathbb{N}$, $f(x) \in \mathbb{Z}[x]$. Then, it holds for every $a, b \in \mathbb{Z}$ that
$$a \equiv b \pmod m \implies f(a) \equiv f(b) \pmod m.$$
Proof. Let $f(x) = c_n x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0$, where $c_0, c_1, \dots, c_n \in \mathbb{Z}$. Since $a \equiv b \pmod m$, the congruence $c_i a^i \equiv c_i b^i \pmod m$ holds for every $i = 0, 1, \dots, n$. Adding up these congruences for $i = 0, 1, 2, \dots, n$ leads to
$$c_n a^n + \cdots + c_1 a + c_0 \equiv c_n b^n + \cdots + c_1 b + c_0 \pmod m,$$
i.e., $f(a) \equiv f(b) \pmod m$. □
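Considering this equality modulo 47 and remembering that we are solving the congruence $41 \equiv x \cdot 39$, we obtain
$$1 \equiv -6 \cdot 39 \pmod{47}, \qquad / \cdot 41$$
$$41 \equiv 41 \cdot (-6) \cdot 39 \pmod{47},$$
$$x \equiv 41 \cdot (-6) \pmod{47},$$
$$x \equiv -246 \pmod{47},$$
$$x \equiv 36 \pmod{47}.$$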
Let us notice that this procedure is usually used in the corresponding software tools: it is efficient and can be easily made into an algorithm. It was also important that 41 (the number we multiplied the congruence with) and the modulus 47 are coprime.
iii) Concerning paper-and-pencil calculations, the most efficient procedure (yet one not easily generalizable into an algorithm) is to gradually modify the congruence so that the set of solutions remains unchanged:
$$39x \equiv 41 \pmod{47},$$
$$-8x \equiv -6 \pmod{47}, \qquad / : (-2)$$
$$4x \equiv 3 \pmod{47},$$
$$4x \equiv -44 \pmod{47}, \qquad / : 4$$
$$x \equiv -11 \pmod{47},$$
$$x \equiv 36 \pmod{47}.$$
□
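The first two approaches of this exercise can be cross-checked in a few lines. This is only an illustrative sketch; it relies on the built-in three-argument `pow`, which accepts the exponent `-1` for modular inverses from Python 3.8 on.

```python
a, b, m = 39, 41, 47

# i) Euler's theorem: phi(47) = 46, so 39**45 is the inverse of 39 modulo 47.
x_euler = (pow(a, 45, m) * b) % m

# ii) Bezout's identity 1 = 5*47 - 6*39 gives the inverse -6 of 39 modulo 47.
assert 5 * 47 - 6 * 39 == 1
x_bezout = (-6 * b) % m

# Built-in modular inverse (Python 3.8+), used here only as a cross-check.
x_builtin = (pow(a, -1, m) * b) % m
```

All three values agree with the remainder 36 found above.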
Systems of congruences. In order to solve systems of (not only linear) congruences, we will often utilize the Chinese remainder theorem, which guarantees that the solution exists and is unique modulo the product of the moduli, provided the moduli of the particular congruences are pairwise coprime.
11.C.2. Chinese remainder theorem. If $m_1, m_2, \dots, m_r$ are pairwise coprime natural numbers and $a_1, a_2, \dots, a_r$ any integers, then the system of congruences
$$x \equiv a_1 \pmod{m_1},\quad x \equiv a_2 \pmod{m_2},\quad \dots,\quad x \equiv a_r \pmod{m_r}$$
Corollary. The set of solutions of an arbitrary congruence modulo m is a union of residue classes modulo m.
Definition. The number of solutions of a congruence in one variable modulo m is the number of residue classes modulo m containing the solutions of the congruence.
Example. The concept number of solutions of a congruence, which we have just defined, is a bit counterintuitive in that it depends on the modulus of the congruence. Therefore, equivalent congruences (sharing the same integers as solutions) can have different numbers of solutions.
(1) The congruence $2x \equiv 3 \pmod 3$ has exactly one solution (modulo 3).
(2) The congruence $10x \equiv 15 \pmod{15}$ has five solutions (modulo 15).
(3) The congruences from (1) and (2) are equivalent.
11.4.2. Linear congruence in one variable. Just like in the case of ordinary equations, the easiest congruences are the linear ones, for which we are able not only to decide whether they have a solution, but also to find it efficiently (provided they have some). The procedure is described by the following theorem and its proof.
11.4.3. Theorem. Let $m \in \mathbb{N}$, $a, b \in \mathbb{Z}$, and $d = (a, m)$. Then the congruence (in variable $x$)
$$ax \equiv b \pmod m$$
has a solution if and only if $d \mid b$.
If $d \mid b$, then this congruence has exactly $d$ solutions (modulo $m$).
Proof. First, we prove that the mentioned condition is necessary. If an integer $c$ is a solution of this congruence, then we must have $m \mid a \cdot c - b$. Since $d = (a, m)$, we get $d \mid m$ and $d \mid a \cdot c - b$, so $d \mid a \cdot c - (a \cdot c - b) = b$.
Now, we will prove that if $d \mid b$, then the given congruence has exactly $d$ solutions modulo $m$. Let $a_1, b_1 \in \mathbb{Z}$ and $m_1 \in \mathbb{N}$ so that $a = d \cdot a_1$, $b = d \cdot b_1$, and $m = d \cdot m_1$. The congruence we are trying to solve is thus equivalent to the congruence
$$a_1 \cdot x \equiv b_1 \pmod{m_1},$$
where $(a_1, m_1) = 1$. This congruence can be multiplied by the integer $a_1^{\varphi(m_1) - 1}$, which, by Euler's theorem, leads to
$$x \equiv b_1 \cdot a_1^{\varphi(m_1) - 1} \pmod{m_1}.$$
This congruence has a unique solution modulo $m_1$, thus it has $d = m/m_1$ solutions modulo $m$. □
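The theorem translates directly into a small solver. The sketch below is illustrative: the function names are assumptions, and the inverse of $a_1$ modulo $m_1$ is obtained from the extended Euclidean algorithm rather than from the power $a_1^{\varphi(m_1)-1}$ used in the proof.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) = s*a + t*b."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t

def solve_linear_congruence(a, b, m):
    """All solutions x in range(m) of a*x == b (mod m), as a sorted list."""
    d, s, _ = extended_gcd(a % m, m)
    if b % d != 0:
        return []                      # no solution unless d | b
    m1 = m // d
    x0 = (s * (b // d)) % m1           # unique solution modulo m/d
    return [x0 + k * m1 for k in range(d)]
```

The returned list has either no elements or exactly $d = (a, m)$ elements, in line with the statement of the theorem.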
Using the theorem about solutions of linear congruences, we can, among others, prove Wilson's theorem - an important theorem which gives a necessary (and sufficient) condition for an integer to be a prime. Such conditions are extremely useful in computational number theory, where one needs to efficiently determine whether a given large integer is a prime. Unfortunately, it is not known now how fast modular factorial
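in variable $x$ has a unique solution in the set $\{1, 2, \dots, m_1 m_2 \cdots m_r\}$. Prove this theorem and describe the solution explicitly.
Solution. Let us denote $M := m_1 m_2 \cdots m_r$ and $n_i = M/m_i$ for every $i$, $1 \le i \le r$. Then, for any $i$, $m_i$ is coprime to $n_i$, so there is an integer $b_i \in \{1, \dots, m_i - 1\}$ such that $b_i n_i \equiv 1 \pmod{m_i}$. Note that $b_i n_i$ is divisible by all the numbers $m_j$, $1 \le j \le r$, $i \ne j$. Therefore, the wanted solution of the system is the integer
$$x = a_1 b_1 n_1 + a_2 b_2 n_2 + \cdots + a_r b_r n_r,$$
reduced modulo $M$ into the given set. The solution is unique there, since the difference of any two solutions is divisible by each $m_i$, and hence by $M$. □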
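The explicit formula $x = \sum a_i b_i n_i$ can be sketched as follows. The function name `crt` is an assumption; the inverses $b_i$ are computed with the built-in `pow(n_i, -1, m_i)`, available from Python 3.8 on.

```python
from math import prod

def crt(residues, moduli):
    """Solve x == a_i (mod m_i) for pairwise coprime m_i via x = sum a_i*b_i*n_i."""
    M = prod(moduli)
    x = 0
    for a_i, m_i in zip(residues, moduli):
        n_i = M // m_i
        b_i = pow(n_i, -1, m_i)     # b_i * n_i == 1 (mod m_i)
        x += a_i * b_i * n_i
    return x % M, M
```

The function returns the least non-negative solution together with the product $M$ of the moduli. It assumes pairwise coprime moduli and does not check this.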
11.C.3. Solve the following system of congruences:
$$x \equiv 1 \pmod{10},\quad x \equiv 5 \pmod{18},\quad x \equiv -4 \pmod{25}.$$
Solution. The integers $x$ which satisfy the first congruence are those of the form $x = 1 + 10t$, where $t \in \mathbb{Z}$ may be arbitrary. We will substitute this expression into the second congruence and then solve it (as a congruence in variable $t$):
$$1 + 10t \equiv 5 \pmod{18},$$
$$10t \equiv 4 \pmod{18},$$
$$5t \equiv 2 \pmod 9,$$
$$5t \equiv 20 \pmod 9,$$
$$t \equiv 4 \pmod 9,$$
or $t = 4 + 9s$, where $s \in \mathbb{Z}$ is arbitrary. The first two congruences are thus satisfied by exactly those integers $x$ which are of the form $x = 1 + 10t = 1 + 10(4 + 9s) = 41 + 90s$.
Once again, this can be substituted into the third congruence and then solved:
$$41 + 90s \equiv -4 \pmod{25},$$
$$90s \equiv 5 \pmod{25},$$
$$18s \equiv 1 \pmod 5,$$
$$3s \equiv 6 \pmod 5,$$
$$s \equiv 2 \pmod 5,$$
or $s = 2 + 5r$, where $r \in \mathbb{Z}$. Altogether, $x = 41 + 90s = 41 + 90(2 + 5r) = 221 + 450r$.
Therefore, the system is satisfied by those integers $x$ with $x \equiv 221 \pmod{450}$. □
of a large integer can be computed. That is why Wilson's theorem is not used for this purpose in practice.
Theorem (Wilson). A natural number $n > 1$ is a prime if and only if
$$(n - 1)! \equiv -1 \pmod n.$$
Proof. First, we prove that every composite number $n > 4$ satisfies $n \mid (n-1)!$, i.e., $(n-1)! \equiv 0 \pmod n$. Let $1 < d < n$ be a non-trivial divisor of $n$. If $d \ne n/d$, then the inequality $1 < d, n/d \le n - 1$ implies what we need: $n = d \cdot n/d \mid (n-1)!$. If $d = n/d$, i.e., $n = d^2$, then we have $d > 2$ (since $n > 4$) and $n \mid (d \cdot 2d) \mid (n-1)!$. For $n = 4$, we easily get $(4-1)! \equiv 2 \not\equiv -1 \pmod 4$.
Now, let $p$ be an odd prime (for $p = 2$, we have $1! = 1 \equiv -1 \pmod 2$). The integers in the set $\{2, 3, \dots, p-2\}$ can be grouped into pairs of integers which are mutually inverse modulo $p$, i.e., pairs of integers whose product is congruent to 1. By the previous theorem, for every integer $a$ of this set, there is a unique solution of the congruence $a \cdot x \equiv 1 \pmod p$. Since $a \not\equiv 0, 1, p-1$, it is apparent that the solution $c$ of the congruence also satisfies $c \not\equiv 0, 1, -1 \pmod p$. The integer $a$ cannot be paired with itself, either: If so, i.e., $a \cdot a \equiv 1 \pmod p$, we would (thanks to $p \mid a^2 - 1 = (a+1)(a-1)$) get the congruence $a \equiv \pm 1 \pmod p$. The product of the integers of the mentioned set thus consists of products of $(p-3)/2$ pairs (whose product is always congruent to 1 modulo $p$). Therefore, we have
$$(p - 1)! \equiv 1^{(p-3)/2} \cdot (p - 1) \equiv -1 \pmod p.$$ □
11.4.4. Systems of linear congruences. Having a system of linear congruences in the same variable, we can decide whether each of them is solvable by the previous theorem. If at least one of the congruences does not have a solution, neither does the whole system. On the other hand, if each of the congruences is solvable, we can rearrange it into the form $x \equiv c_i \pmod{m_i}$.
We thus get a system of congruences
$$x \equiv c_1 \pmod{m_1},\quad \dots,\quad x \equiv c_k \pmod{m_k}.$$
Apparently, it suffices to solve the case k = 2 since the solutions of a system of more congruences can be obtained by repeatedly applying the procedure for a system of two congruences.
Proposition. Let $c_1, c_2$ be integers and $m_1, m_2$ be natural numbers. Let us denote $d = (m_1, m_2)$. The system of two congruences
$$x \equiv c_1 \pmod{m_1},\quad x \equiv c_2 \pmod{m_2}$$
has no solution if $c_1 \not\equiv c_2 \pmod d$. On the other hand, if $c_1 \equiv c_2 \pmod d$, then there is an integer $c$ such that $x \in \mathbb{Z}$
11.C.4. Solve the system
$$x \equiv 7 \pmod{27},\quad x \equiv -3 \pmod{11}.$$
Solution. (a) Using the Euclidean algorithm, we can find the coefficients in Bézout's identity: $1 = 5 \cdot 11 - 2 \cdot 27$. Hence, $[11]_{27}^{-1} = [5]_{27}$ and $[27]_{11}^{-1} = [-2]_{11}$. Therefore, the solution is $x = 7 \cdot 11 \cdot 5 - 3 \cdot 27 \cdot (-2) = 547 \equiv 250 \pmod{297}$.
(b) Using step-by-step substitution, we get $x = 11t - 3$ from the second congruence. Substituting this into the first one leads to $11t \equiv 10 \pmod{27}$. Multiplying this by 5 yields $55t \equiv 50$, i.e., $t \equiv -4 \pmod{27}$. Altogether, $x = 11 \cdot 27 \cdot s - 4 \cdot 11 - 3 = 297s - 47$ for $s \in \mathbb{Z}$, i.e., $x \equiv -47 \pmod{297}$. □
11.C.5. A group of thirteen pirates managed to steal a chest full of gold coins (there were around two thousand of them). The pirates tried to divide them evenly among themselves, but ten coins were left over. They started to fight for the remaining coins, and one of the pirates was fatally stabbed during the combat. So, they tried to divide the coins evenly once again, and now three coins were left. Another pirate died in a subsequent battle for the three coins. The remaining pirates tried to divide the coins evenly for the third time, now successfully. How many coins were there in the chest?
Solution. The problem leads to the following system of congruences:
$$x \equiv 10 \pmod{13},\quad x \equiv 3 \pmod{12},\quad x \equiv 0 \pmod{11}.$$
Its solution is $x \equiv 231 \pmod{11 \cdot 12 \cdot 13}$. Since the number $x$ of coins is to be around 2000 and $x \equiv 231 \pmod{1716}$, we can easily settle that there were exactly $231 + 1716 = 1947$ coins. □
11.C.6. When gymnasts made groups of eight people, three were left over. When they formed circles, each consisting of seventeen people, seven remained; and when they grouped into pyramids (each of them contains $21 = 4^2 + 2^2 + 1$ gymnasts), two of them were incomplete (missing a person "on the top"). How many gymnasts were there, provided there were at least 2000 and at most 4000?
satisfies the system if and only if it satisfies the congruence
$$x \equiv c \pmod{[m_1, m_2]}.$$
Proof. If the given system is to have a solution $x \in \mathbb{Z}$, we must have $x \equiv c_1 \pmod d$, $x \equiv c_2 \pmod d$, and thus $c_1 \equiv c_2 \pmod d$ as well. Hence it follows that the system cannot have a solution when $c_1 \not\equiv c_2 \pmod d$.
From now on, suppose that $c_1 \equiv c_2 \pmod d$. The first congruence of the system is satisfied by those integers $x$ which are of the form $x = c_1 + t m_1$, where $t \in \mathbb{Z}$ is arbitrary. Such an integer $x$ satisfies the second congruence of the system if and only if $c_1 + t m_1 \equiv c_2 \pmod{m_2}$, i.e., $t m_1 \equiv c_2 - c_1 \pmod{m_2}$. By the theorem about solutions of linear congruences, this congruence (in variable $t$) is solvable since $d = (m_1, m_2)$ divides $c_2 - c_1$, and $t$ satisfies this congruence if and only if
$$t \equiv \frac{c_2 - c_1}{d} \cdot \left(\frac{m_1}{d}\right)^{\varphi(m_2/d) - 1} \pmod{\frac{m_2}{d}},$$
i.e., if and only if
$$x = c_1 + t m_1 = c_1 + (c_2 - c_1) \cdot \left(\frac{m_1}{d}\right)^{\varphi(m_2/d)} + r \cdot \frac{m_1 m_2}{d} = c + r \cdot [m_1, m_2],$$
where $r \in \mathbb{Z}$ is arbitrary and $c = c_1 + (c_2 - c_1) \cdot (m_1/d)^{\varphi(m_2/d)}$, as $m_1 m_2$ equals $d \cdot [m_1, m_2]$. We have thus found such an integer $c$ that every $x \in \mathbb{Z}$ satisfies the system if and only if $x \equiv c \pmod{[m_1, m_2]}$, as wanted. □
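The constructive proof can be sketched as a function merging two congruences into one. The name `merge_congruences` is an assumption, and the inverse of $m_1/d$ modulo $m_2/d$ is taken with the built-in `pow(..., -1, ...)` (Python 3.8+) instead of the Euler power from the proof.

```python
from math import gcd

def merge_congruences(c1, m1, c2, m2):
    """Combine x == c1 (mod m1), x == c2 (mod m2) into x == c (mod lcm(m1, m2)).

    Returns (c, lcm), or None when c1 and c2 differ modulo d = gcd(m1, m2).
    """
    d = gcd(m1, m2)
    if (c2 - c1) % d != 0:
        return None
    lcm = m1 // d * m2
    # Solve t * (m1/d) == (c2 - c1)/d (mod m2/d).
    t = ((c2 - c1) // d * pow(m1 // d, -1, m2 // d)) % (m2 // d)
    return (c1 + t * m1) % lcm, lcm
```

Applying the function repeatedly reduces a system of several congruences to a single one, even when the moduli are not pairwise coprime.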
We can notice that the proof of this proposition is constructive, i.e., it yields a formula for finding the integer $c$. The proposition thus gives us a procedure for expressing the condition that an integer $x$ satisfies a given system by a single congruence. This new congruence is then of the same form as the original ones. Therefore, we can apply this procedure to a system of more congruences: first, we create a single congruence from the first and second congruences of the system (satisfied by exactly those integers $x$ which satisfy the original two); then, we create another congruence from the new one and the third one of the original system, and so on. Each step reduces the number of congruences by one; after a finite number of steps, we thus arrive at a single congruence which describes all solutions of the given system.
It follows from the procedure we have just described that a system of congruences with pairwise coprime moduli always has a solution, and this solution is unique modulo the product of the moduli.
Theorem (Chinese remainder theorem). Let $m_1, \dots, m_k \in \mathbb{N}$ be pairwise coprime, $a_1, \dots, a_k \in \mathbb{Z}$. Then, the system
$$x \equiv a_1 \pmod{m_1},\quad \dots,\quad x \equiv a_k \pmod{m_k}$$
has a unique solution modulo $m_1 \cdot m_2 \cdots m_k$.
Solution. We solve the following system of linear congruences in the standard way:
$$c \equiv 3 \pmod 8,\quad c \equiv 7 \pmod{17},\quad c \equiv -2 \pmod{21},$$
leading to the solution $c \equiv 1027 \pmod{2856}$, which, together with the additional information, implies that there were exactly 3883 gymnasts. □
11.C.7. Find which of the following (systems of) linear congruences has a solution.
i) $x \equiv 1 \pmod 3$, $x \equiv -1 \pmod 9$;
ii) $8x \equiv 1 \pmod{12345678910111213}$;
iii) $x \equiv 3 \pmod{29}$, $x \equiv 5 \pmod{47}$. ○
The Chinese remainder theorem can also be used "in the opposite direction", i.e. to simplify a linear congruence provided we are able to express the modulus as a product of pairwise coprime factors.
11.C.8. Solve the congruence $23\,941x \equiv 915 \pmod{3564}$.
Solution. Let us factor $3564 = 2^2 \cdot 3^4 \cdot 11$. Since none of the integers 2, 3, 11 divides 23 941, we have $(23\,941, 3564) = 1$, so the congruence has a solution. Since $\varphi(3564) = 2 \cdot (3^3 \cdot 2) \cdot 10 = 1080$, the solution is of the form $x \equiv 915 \cdot 23\,941^{1079} \pmod{3564}$. However, it would take much effort to simplify the right-hand integer to a more explicit form. Therefore, we will try to solve the congruence in a different way: we will build an equivalent system of congruences which are easier to solve than the original one.
We know that an integer $x$ is a solution of the given congruence if and only if it is a solution of the system
$$23\,941x \equiv 915 \pmod{2^2},\quad 23\,941x \equiv 915 \pmod{3^4},\quad 23\,941x \equiv 915 \pmod{11}.$$
Solving these congruences separately, we get the following, equivalent system:
$$x \equiv 3 \pmod 4,\quad x \equiv -3 \pmod{81},\quad x \equiv -4 \pmod{11}.$$
Remark. The unusual name of this theorem comes from Chinese mathematician Sun Tzu of the 4th century. In his text, he asked for an integer which leaves remainder 2 when divided by 3, leaves remainder 3 when divided by 5, and again remainder 2 when divided by 7.
The answer is rumored to be hidden in the following song:
[Song: Sunzi Ge (Chinese text)]
Proof. It is a simple consequence of the previous proposition about the form of the solution of a system of two congruences. However, this result can also be proved directly, as shown in exercise 11.C.2. □
Let us emphasize that this is quite a strong theorem (which is actually valid in much more general algebraic structures), which allows us to guarantee that for any remainders with respect to given (pairwise coprime) moduli, there exists an integer with the given remainders.
11.4.5. Higher-order congruences. Now, let us get back to the more general case of congruences
$$f(x) \equiv 0 \pmod m,$$
where $f(x)$ is a polynomial with integer coefficients and $m \in \mathbb{N}$. So far, we have only one method at our disposal, which is tedious, yet universal: to try all possible remainders modulo $m$. When solving such a congruence, it is sufficient to find out for which integers $a$, $0 \le a < m$, it holds that $f(a) \equiv 0 \pmod m$. The disadvantage of this method is its complexity, which increases as $m$ does. If $m$ is composite, i.e., $m = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $p_1, \dots, p_k$ are distinct primes, and $k > 1$, we can replace the original congruence by the system of congruences
$$f(x) \equiv 0 \pmod{p_1^{\alpha_1}},\quad \dots,\quad f(x) \equiv 0 \pmod{p_k^{\alpha_k}},$$
which has the same set of solutions. However, we can solve the congruences separately. The advantage of this method is that the moduli of the congruences of the system are less than the modulus of the original congruence.
Example. Consider the congruence
$$x^3 - 2x + 11 \equiv 0 \pmod{105}.$$
If we were to try out all possibilities, we would have to compute the value of $f(x) = x^3 - 2x + 11$ for the 105 values $f(0), f(1), \dots, f(104)$. Therefore, it is better to factor $105 = 3 \cdot 5 \cdot 7$
Now, the procedure for finding a solution of a system of congruences yields $x \equiv -1137 \pmod{3564}$, which is the solution of the original congruence as well. □
11.C.9. Solve the congruence $3446x \equiv 8642 \pmod{208}$.
Solution. We have $208 = 2^4 \cdot 13$ and $(3446, 208) = 2 \mid 8642$. Therefore, the congruence has two solutions modulo 208 and it is equivalent to the system
$$3x \equiv 1 \pmod 8,\quad x \equiv 10 \pmod{13}.$$
The solutions of this system are $x \equiv 75$ and $x \equiv 179 \pmod{208}$. □
11.C.10. Prove that the sequence $(2^n - 3)_{n=1}^{\infty}$ contains infinitely many multiples of 5 as well as infinitely many multiples of 13, yet there is no multiple of 65 in it. ○
Residue number system. When calculating with large integers, it is often more advantageous to work not with their decimal or binary expansions, but rather with their representation in a so-called residue number system, which allows for easy parallelization of computations with large integers. Such a system is given by a $k$-tuple of (usually pairwise coprime) moduli, and each non-negative integer which is less than their product is then uniquely representable as a $k$-tuple of remainders (whose values do not exceed the moduli).
11.C.11. The quintuple of moduli 3, 5, 7, 11, 13 can serve to uniquely represent integers which are less than their product (i.e. less than 15015) and to perform standard arithmetic operations efficiently (and in a distributed manner if desired). Now, we will determine the representation of the integers 1234 and 5678 in this residue number system and we will determine their sum and product.
Solution. Calculating the remainders of the given integers upon division by the particular moduli, we get their RNS representations, which can be written as the tuples (1, 4, 2, 2, 12) and (2, 3, 1, 2, 10).
The sum is computed componentwise (reducing the results modulo the appropriate number), leading to the tuple (0, 2, 3, 4, 9). Using the Chinese remainder theorem, this tuple can then be transformed back to the integer 6912. The product is computed analogously, yielding the corresponding tuple (2, 2, 2, 4, 3), which can be transformed back to 9662 (by the Chinese remainder theorem again). This is indeed congruent to $1234 \cdot 5678$ modulo 15015. □
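The computation of this exercise can be retraced with a small residue-number-system sketch. All function names are assumptions made for illustration; the conversion back uses the Chinese remainder theorem with inverses from the built-in `pow` (Python 3.8+).

```python
from math import prod

MODULI = (3, 5, 7, 11, 13)

def to_rns(n):
    """Tuple of remainders of n modulo each modulus."""
    return tuple(n % m for m in MODULI)

def rns_add(u, v):
    """Componentwise sum, reduced modulo the respective modulus."""
    return tuple((a + b) % m for a, b, m in zip(u, v, MODULI))

def rns_mul(u, v):
    """Componentwise product, reduced modulo the respective modulus."""
    return tuple((a * b) % m for a, b, m in zip(u, v, MODULI))

def from_rns(u):
    """Recover the integer in range(15015) by the Chinese remainder theorem."""
    M = prod(MODULI)
    return sum(a * (M // m) * pow(M // m, -1, m) for a, m in zip(u, MODULI)) % M
```

Each component is processed independently of the others, which is what makes the representation suitable for parallel computation.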
and solve the congruences $f(x) \equiv 0$ for moduli 3, 5, and 7. We evaluate the polynomial $f(x)$ at convenient integers:
x       -3   -2   -1    0    1    2    3
f(x)   -10    7   12   11   10   15   32
The congruence $f(x) \equiv 0 \pmod 3$ thus has the solution $x \equiv -1 \pmod 3$ (only the first one of the integers 12, 11, 10 is a multiple of 3); the congruence $f(x) \equiv 0 \pmod 5$ has the solutions $x \equiv 1$ and $x \equiv 2 \pmod 5$; finally, the solution of the congruence $f(x) \equiv 0 \pmod 7$ is $x \equiv -2 \pmod 7$.
It remains to solve two systems of congruences:
$x \equiv -1 \pmod 3$, $x \equiv 1 \pmod 5$, $x \equiv -2 \pmod 7$
and
$x \equiv -1 \pmod 3$, $x \equiv 2 \pmod 5$, $x \equiv -2 \pmod 7$.
Solving these systems, we find that the solutions of the given congruence $f(x) \equiv 0 \pmod{105}$ are exactly those integers $x$ which satisfy $x \equiv 26 \pmod{105}$ or $x \equiv 47 \pmod{105}$.
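The procedure (solve modulo each prime factor, then combine by the Chinese remainder theorem) can be sketched in Python. The polynomial itself is not restated in this passage; the sketch assumes $f(x) = x^3 - 2x + 11$, which reproduces all seven tabulated values. The helper names `roots_mod` and `crt` are our own.

```python
from itertools import product

def f(x):
    # Assumed polynomial: it matches the tabulated values f(-3), ..., f(3)
    return x**3 - 2*x + 11

def roots_mod(m):
    """All residues x in 0..m-1 with f(x) divisible by m (brute force)."""
    return [x for x in range(m) if f(x) % m == 0]

def crt(residues, moduli):
    """Combine x ≡ r_i (mod m_i) for pairwise coprime moduli, one modulus at a time."""
    x, M = 0, 1
    for r, m in zip(residues, moduli):
        t = ((r - x) * pow(M, -1, m)) % m  # choose t so that x + M*t ≡ r (mod m)
        x += M * t
        M *= m
    return x

moduli = (3, 5, 7)
# One system of congruences for every combination of roots modulo 3, 5 and 7
solutions = sorted(crt(c, moduli) for c in product(*(roots_mod(m) for m in moduli)))
print(solutions)
```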
It is not always possible to replace the congruence with a system of congruences modulo primes, as in the above example: if the original modulus is a multiple of a higher power of a prime, then we cannot "get rid" of this power. However, even such a congruence modulo a power of a prime need not be solved by examining all possibilities. There is a more efficient tool, which is described by the following theorem.
11.4.6. Theorem (Hensel's lemma). Let $p$ be a prime, $f(x) \in \mathbb{Z}[x]$, $a \in \mathbb{Z}$ such that $p \mid f(a)$, $p \nmid f'(a)$. Then, for every $n \in \mathbb{N}$, the system
$x \equiv a \pmod{p}$,
$f(x) \equiv 0 \pmod{p^n}$
has a unique solution modulo $p^n$.
Proof. We will proceed by induction on $n$. In the case of $n = 1$, the congruence $f(x) \equiv 0 \pmod{p^1}$ is only another formulation of the assumption that the integer $a$ satisfies $p \mid f(a)$. Further, let $n > 1$ and suppose the proposition is true for $n - 1$. If $x$ satisfies the system for $n$, then it does so for $n - 1$ as well. Denoting one of the solutions of the system for $n - 1$ as $c_{n-1}$, we can look for the solution of the system for $n$ in the form
$x = c_{n-1} + k \cdot p^{n-1}$, where $k \in \mathbb{Z}$.
We need to find out for which $k$ we have $f(c_{n-1} + k \cdot p^{n-1}) \equiv 0 \pmod{p^n}$. We know that $p^{n-1} \mid f(c_{n-1} + k \cdot p^{n-1})$. Now, we use the binomial theorem for $f(x) = a_m x^m + \cdots + a_1 x + a_0$, where $a_0, \ldots, a_m \in \mathbb{Z}$. We have
$(c_{n-1} + k \cdot p^{n-1})^i \equiv c_{n-1}^i + i \cdot c_{n-1}^{i-1} \cdot k p^{n-1} \pmod{p^n}$, hence
$f(c_{n-1} + k \cdot p^{n-1}) \equiv f(c_{n-1}) + k \cdot p^{n-1} f'(c_{n-1}) \pmod{p^n}$.
CHAPTER 11. ELEMENTARY NUMBER THEORY
Therefore,
$f(c_{n-1} + k \cdot p^{n-1}) \equiv 0 \pmod{p^n}$ if and only if $0 \equiv \frac{f(c_{n-1})}{p^{n-1}} + k \cdot f'(c_{n-1}) \pmod{p}$.
11.C.12. In practice, the residue number system is often a triple $2^n - 1$, $2^n$, $2^n + 1$ (why are these integers always pairwise coprime?), which can uniquely cover integers of at most $3n$ bits (more precisely, those less than the product $2^n (2^{2n} - 1)$).
Consider the case $n = 3$ and determine the representation of the integer 118 in this residue number system.
Solution. We can directly calculate that $118 \equiv 6 \pmod 7$, $118 \equiv 6 \pmod 8$, and $118 \equiv 1 \pmod 9$. The wanted representation is thus given by the triple $(6, 6, 1)$.
In practice, however, it is very important that the RNS representation can be efficiently transformed to binary and vice versa. In our concrete case, the remainder of $118 = (1110110)_2$ when divided by $2^3$ can be found easily: it is given by the last three bits, $(110)_2 = 6$. Computing the remainder upon division by $2^3 + 1 = 9$ or $2^3 - 1 = 7$ is not any more complicated. We can see (splitting the examined integer into three groups of $n$ bits each) that
$(1110110)_2 \equiv (001)_2 + (110)_2 + (110)_2 \equiv 6 \pmod{2^3 - 1}$,
$(1110110)_2 \equiv (001)_2 - (110)_2 + (110)_2 \equiv 1 \pmod{2^3 + 1}$.
A thoughtful reader has surely noticed the similarity with the criteria for divisibility by 9 and 11, which were discussed in paragraph 11.B.9. □
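The bit-splitting trick can be sketched as follows. The function name `rns_from_bits` is hypothetical, and the final reductions of the small chunk sums still use the ordinary `%` operator for brevity; a hardware implementation would avoid even that.

```python
def rns_from_bits(x, n):
    """Remainders of x >= 0 modulo 2**n - 1, 2**n and 2**n + 1, built from n-bit chunks."""
    mask = (1 << n) - 1
    chunks = []
    while x:
        chunks.append(x & mask)  # lowest n bits
        x >>= n
    low = chunks[0] if chunks else 0  # remainder modulo 2**n is the last n bits
    plain = sum(chunks)               # because 2**n ≡ 1 (mod 2**n - 1)
    # because 2**n ≡ -1 (mod 2**n + 1), the chunks enter with alternating signs
    alternating = sum(c if i % 2 == 0 else -c for i, c in enumerate(chunks))
    return plain % mask, low, alternating % (mask + 2)

print(rns_from_bits(118, 3))
```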
11.C.13. Higher-order congruences. Using the procedure of theorem 11.4.6, solve the congruence
$x^4 + 7x + 4 \equiv 0 \pmod{27}$.
Solution. First, we will solve this congruence modulo 3 (by substitution, for instance); we can easily find that the solution is $x \equiv 1 \pmod 3$. Now, writing the solution in the form $x = 1 + 3t$, where $t \in \mathbb{Z}$, we will solve the congruence modulo 9:
$x^4 + 7x + 4 \equiv 0 \pmod 9$,
$(1 + 3t)^4 + 7(1 + 3t) + 4 \equiv 0 \pmod 9$,
$1 + 4 \cdot 3t + 7 + 7 \cdot 3t + 4 \equiv 0 \pmod 9$,
$33t \equiv -12 \pmod 9$,
$11t \equiv -4 \pmod 3$,
$t \equiv 1 \pmod 3$.
Since $c_{n-1} \equiv a \pmod p$, we get $f'(c_{n-1}) \equiv f'(a) \not\equiv 0 \pmod p$, so $(f'(c_{n-1}), p) = 1$. By the theorem about the solutions of linear congruences, we can hence see that there is (modulo $p$) a unique solution $k$ of this congruence, and since $c_{n-1}$ was, by the induction hypothesis, the only solution modulo $p^{n-1}$, the integer $c_{n-1} + k \cdot p^{n-1}$ is the only solution of the given system modulo $p^n$. □
Example. Consider the congruence
$3x^2 + 4 \equiv 0 \pmod{49}$.
The congruence can be equivalently transformed (by solving the linear congruence $3y \equiv 1 \pmod{49}$ and multiplying both sides of the congruence by the integer $y = 33$) to the form $x^2 \equiv 15 \pmod{7^2}$. Then, we proceed as in the constructive proof of Hensel's lemma.
First, we solve the congruence $x^2 \equiv 15 \equiv 1 \pmod 7$, which has at most 2 solutions, and those are $x \equiv \pm 1 \pmod 7$. These solutions can be expressed in the form $x = \pm 1 + 7t$, where $t \in \mathbb{Z}$, and substituted into the congruence modulo 49, whence we get the solution $x \equiv \pm 8 \pmod{49}$. (If we were interested solely in the number of solutions, we would not even have to finish the calculation: it follows straight from Hensel's lemma that every solution modulo 7 gives a unique solution modulo 49, because for $f(x) = x^2 - 15$ we have $7 \nmid f'(\pm 1)$.)
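The constructive proof of Hensel's lemma translates directly into a lifting loop. The following Python sketch uses our own function name `hensel_lift` and takes the polynomial and its derivative as plain Python functions; it is meant as an illustration of the induction step, not as a general-purpose solver.

```python
def hensel_lift(f, df, a, p, n):
    """Lift a root a of f modulo p (with p not dividing df(a)) to the unique root modulo p**n."""
    assert f(a) % p == 0 and df(a) % p != 0
    c, pk = a % p, p  # invariant: f(c) ≡ 0 (mod pk), where pk = p**k
    for _ in range(n - 1):
        # Induction step: solve f(c)/pk + k*df(c) ≡ 0 (mod p) for k
        k = (-(f(c) // pk) * pow(df(c), -1, p)) % p
        c += k * pk
        pk *= p
    return c

# The example 3x^2 + 4 ≡ 0 (mod 49): lift both roots modulo 7
f = lambda x: 3 * x * x + 4
df = lambda x: 6 * x
roots = sorted(hensel_lift(f, df, a, 7, 2) for a in range(7) if f(a) % 7 == 0)
print(roots)
```

The two lifted roots are the residues $\pm 8$ modulo 49 written as non-negative representatives. The same function handles a polynomial such as $x^4 + 7x + 4$ with modulus $27$ when called with `p=3, n=3`.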
11.4.7. Congruences modulo a prime. The solution of general higher-order congruences has thus been reduced to the solution of congruences modulo a prime. As we will see, this is where the stumbling block is, since no universal procedure (much) more efficient than trying out all possibilities is known. We can at least mention several statements describing the solvability and number of solutions of such congruences. We will then prove some detailed results for some special cases in further paragraphs.
Theorem. Let $p$ be a prime, $f(x) \in \mathbb{Z}[x]$. Every congruence $f(x) \equiv 0 \pmod p$ is equivalent to a congruence of degree at most $p - 1$.
Proof. Since it holds for any $a \in \mathbb{Z}$ that $p \mid a^p - a$ (a simple consequence of Fermat's little theorem), the congruence $x^p - x \equiv 0 \pmod p$ is satisfied by all integers. Dividing the polynomial $f(x)$ by $x^p - x$ with remainder, we get
$f(x) = q(x) \cdot (x^p - x) + r(x)$
for suitable $q(x), r(x) \in \mathbb{Z}[x]$, where the degree of $r(x)$ is less than that of the divisor, i.e. $p$. We thus get that the congruence $r(x) \equiv 0 \pmod p$ is equivalent to the congruence $f(x) \equiv 0 \pmod p$, yet it is of degree at most $p - 1$. □
Writing $t = 1 + 3s$, where $s \in \mathbb{Z}$, we get $x = 4 + 9s$, and substituting this leads to
$(4 + 9s)^4 + 7(4 + 9s) + 4 \equiv 0 \pmod{27}$,
$4^4 + 4 \cdot 4^3 \cdot 9s + 28 + 63s + 4 \equiv 0 \pmod{27}$,
$256 \cdot 9s + 63s \equiv -288 \pmod{27}$,
$256s + 7s \equiv -32 \pmod 3$,
$2s \equiv 1 \pmod 3$,
$s \equiv 2 \pmod 3$.
Altogether, we get the solution in the form $x = 4 + 9s = 4 + 9(2 + 3r) = 22 + 27r$, where $r \in \mathbb{Z}$, i.e., $x \equiv 22 \pmod{27}$. □
11.C.14. Knowing a primitive root modulo 41 from exercise 11.B.29, solve the congruence
$7x^{17} \equiv 11 \pmod{41}$.
Solution. Multiplying the congruence by 6, we get an equivalent congruence $42x^{17} \equiv 66$, i.e., $x^{17} \equiv 25 \pmod{41}$. Since 6 is a primitive root modulo 41, the substitution $x = 6^t$ leads to the congruence $6^{17t} \equiv 25 \equiv 6^4 \pmod{41}$, which is equivalent to $17t \equiv 4 \pmod{40}$, and this holds if and only if $t \equiv 12 \pmod{40}$. Therefore, the congruence is satisfied by exactly those integers $x$ with $x \equiv 6^{12} \equiv 4 \pmod{41}$. □
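For a small modulus, the primitive-root substitution can be mimicked with a table of discrete logarithms. This is a minimal sketch assuming, as in the solution, that 6 is a primitive root modulo 41; the table is built by brute force, which is feasible only for small primes.

```python
p, g = 41, 6
# Table of discrete logarithms: log[g**t mod p] = t for t = 0, ..., p-2
log = {pow(g, t, p): t for t in range(p - 1)}

rhs = (11 * pow(7, -1, p)) % p                  # x**17 ≡ 11 * 7^(-1) (mod 41)
t = (log[rhs] * pow(17, -1, p - 1)) % (p - 1)   # solve 17*t ≡ log(rhs) (mod 40)
x = pow(g, t, p)
print(rhs, t, x)
```

The inverse of 17 modulo 40 exists because $(17, 40) = 1$, which is also the reason why the solution is unique.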
11.C.15. Solve the congruence $x^5 + 1 \equiv 0 \pmod{11}$.
Solution. Since $(5, \varphi(11)) = 5$ and
$(-1)^{\varphi(11)/5} \equiv 1 \pmod{11}$,
the congruence
$x^5 \equiv -1 \pmod{11}$
has five solutions. There are several ways to find them. We can either try all (ten) candidates or transform the problem to a linear congruence using the primitive-root trick. Since $2^{10/2} \equiv -1 \not\equiv 1 \pmod{11}$ and $2^{10/5} \equiv 4 \not\equiv 1 \pmod{11}$, 2 is a primitive root modulo 11 (see also exercise 11.B.28), and the substitution $x = 2^y$ then transforms the congruence to
$2^{5y} \equiv 2^5 \pmod{11}$,
which is equivalent to the linear congruence
$5y \equiv 5 \pmod{10}$, i.e., $y \equiv 1 \pmod 2$.
11.4.8. Theorem. Let $p$ be a prime, $f(x) \in \mathbb{Z}[x]$. If the congruence $f(x) \equiv 0 \pmod p$ has more than $\deg(f)$ solutions, then each of the coefficients of the polynomial $f$ is a multiple of $p$.
Proof. In algebraic words, we are actually interested in the number of roots of a non-zero polynomial over the finite field $\mathbb{Z}_p$, and by 12.2.4, there are at most $\deg(f)$ of them. □
11.4.9. Binomial congruences. This part will be devoted to solving special types of higher-order polynomial congruences, the so-called binomial congruences. They are an analog of the binomial equations, where the polynomial $f(x)$ is $x^n - a$. It can easily be shown that we can restrict ourselves to the condition that $a$ be coprime to the modulus of the congruence; otherwise, we can always equivalently transform the congruence into this form or decide that it has no solution.
Quadratic and power residues
Let $m \in \mathbb{N}$, $a \in \mathbb{Z}$, $(a, m) = 1$. The integer $a$ is said to be an $n$-th power residue modulo $m$, or a residue of degree $n$ modulo $m$, iff the congruence
$x^n \equiv a \pmod m$
is solvable. Otherwise, we call $a$ an $n$-th power nonresidue modulo $m$, or a nonresidue of degree $n$ modulo $m$.
For $n = 2, 3, 4$, we use the adjectives quadratic, cubic, and quartic residue (or nonresidue) modulo $m$.
Now, we will show how to solve binomial congruences modulo $m$ if there are primitive roots modulo $m$ (in particular, when the modulus is an odd prime or its power).
11.4.10. Theorem. Let $m \in \mathbb{N}$ be such that there are primitive roots modulo $m$. Further, let $a \in \mathbb{Z}$, $(a, m) = 1$. Then, the congruence $x^n \equiv a \pmod m$ is solvable (i.e., $a$ is an $n$-th power residue modulo $m$) if and only if $a^{\varphi(m)/d} \equiv 1 \pmod m$, where $d = (n, \varphi(m))$. And if so, it has exactly $d$ solutions.
Proof. Let $g$ be a primitive root modulo $m$. Then, for any $x$ coprime to $m$, there is a unique integer $y$ (its discrete logarithm) with the property $0 \le y < \varphi(m)$ such that $x \equiv g^y \pmod m$. Similarly, for a given $a$, there is a unique $b \in \mathbb{Z}$, $0 \le b < \varphi(m)$, such that $a \equiv g^b \pmod m$. After this substitution, the binomial congruence in question is thus equivalent to the congruence $(g^y)^n \equiv g^b \pmod m$ and, invoking theorem 11.3.3, to the linear congruence $n \cdot y \equiv b \pmod{\varphi(m)}$ as well.
However, this congruence
$n \cdot y \equiv b \pmod{\varphi(m)}$
is solvable if and only if $d = (n, \varphi(m)) \mid b$ (and if so, it has $d$ solutions).
This congruence is satisfied by $y \in \{-3, -1, 1, 3, 5\}$; the original congruence is thus (substituting $x \equiv 2^y \pmod{11}$) satisfied by $x \in \{-1, 2, -3, -4, -5\}$ modulo 11. □
11.C.16. Solve the congruence
$x^3 - 3x + 5 \equiv 0 \pmod{105}$.
Solution. Since the modulus can be written as $105 = 3 \cdot 5 \cdot 7$, where the factors are pairwise coprime, the congruence in question is equivalent to the following system:
$x^3 - 3x + 5 \equiv 0 \pmod 3$,
$x^3 - 3x + 5 \equiv 0 \pmod 5$,
$x^3 - 3x + 5 \equiv 0 \pmod 7$.
Clearly, the first congruence is equivalent to $x^3 \equiv 1 \pmod 3$, and that one is equivalent to $x \equiv 1 \pmod 3$, as it follows from Fermat's little theorem that $x^3 \equiv x \pmod 3$ holds for all integers $x$.
The second congruence is equivalent to $x(x^2 - 3) \equiv 0 \pmod 5$, which is satisfied iff $x \equiv 0 \pmod 5$ or $x^2 \equiv 3 \pmod 5$. However, since 3 is a quadratic nonresidue modulo 5 (the Legendre symbol $(3/5)$ is equal to $-1$), we get that $x \equiv 0 \pmod 5$ is the only solution of the second congruence of the system.
The third congruence can be transformed to the form $x^3 - 3x - 2 \equiv 0 \pmod 7$, which is satisfied iff $x \equiv -1 \pmod 7$ or $x \equiv 2 \pmod 7$ (since the left-hand side factors as $x^3 - 3x - 2 = (x - 2)(x + 1)^2$). Of course, this can also be found out by examining all possibilities modulo 7. Altogether, there are two solutions of the original congruence modulo 105: $x \equiv 55$ and $x \equiv 100 \pmod{105}$. □
11.C.17. Determine the number of solutions of the congruence
$x^5 \equiv 534 \pmod{23^2}$.
Solution. The given congruence is equivalent to $x^5 \equiv 5 \pmod{23^2}$, and since we have $(5, \varphi(23)) = 1$, it follows from the theorem on solvability of binomial congruences that the congruence has a unique solution if considered modulo 23. Furthermore, this solution is surely not a multiple of 23. Therefore, considering the polynomial whose roots we are
It remains to prove that $d \mid b$ if and only if $a^{\varphi(m)/d} \equiv 1 \pmod m$. However, the congruence
$1 \equiv a^{\varphi(m)/d} \equiv g^{b\varphi(m)/d} \pmod m$
is true if and only if $\varphi(m) \mid \frac{b\varphi(m)}{d}$, which happens if and only if $d \mid b$. □
Corollary. If the assumptions of the above theorem hold and, moreover, $(n, \varphi(m)) = 1$, the congruence $x^n \equiv a \pmod m$ always has a unique solution. In other words, exponentiation to the $n$-th power (where $n$ is coprime to $\varphi(m)$) is a bijection on the set of invertible residue classes modulo $m$ (it is even an automorphism of the group $(\mathbb{Z}_m^{\times}, \cdot)$).
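The criterion of theorem 11.4.10 is easy to test numerically for small moduli. In the following Python sketch, the function names are our own, Euler's function is computed by direct counting, and the prediction is only meaningful when $m$ has primitive roots and $(a, m) = 1$.

```python
from math import gcd

def phi(m):
    """Euler's totient by direct counting (fine for small m)."""
    return sum(1 for x in range(1, m + 1) if gcd(x, m) == 1)

def count_solutions(n, a, m):
    """Number of x in 0..m-1 with x**n ≡ a (mod m), by brute force."""
    return sum(1 for x in range(m) if pow(x, n, m) == a % m)

def predicted(n, a, m):
    """Number of solutions according to theorem 11.4.10."""
    d = gcd(n, phi(m))
    return d if pow(a, phi(m) // d, m) == 1 else 0

m = 11
for n in range(1, 11):
    for a in range(1, m):
        assert count_solutions(n, a, m) == predicted(n, a, m)
print(predicted(5, -1, 11))
```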
11.4.11. Quadratic congruences and the Legendre symbol. Now, our task is to find an efficient condition determining whether a quadratic congruence
$ax^2 + bx + c \equiv 0 \pmod m$
is solvable (and if so, how many solutions it has). It can easily be seen from the presented theory that if we want to decide whether this congruence is solvable, it suffices to decide this for the (binomial) congruence
$x^2 \equiv a \pmod p$,
where $p$ is an odd prime and $a$ is an integer coprime to it. A congruence modulo a composite $m$ can be decomposed to an equivalent system of congruences modulo the particular factors of the integer $m$, which are powers of primes. Such congruences can be transformed to quadratic congruences with prime modulus using the procedure described in Hensel's lemma 11.4.6. Normalizing this congruence and completing the square then results in the aforementioned form.
To decide the solvability of a congruence, we can, of course, use theorem 11.4.10 about the solvability of binomial congruences. Its application is, however, often limited by time resources; we will thus try to find a criterion which will be computationally easier in (not only) the quadratic case.
Example. Let us determine the number of solutions of the congruence $x^2 \equiv 219 \pmod{383}$.
Since 383 is a prime and $(2, \varphi(383)) = 2$, it follows from theorem 11.4.10 that the given congruence is solvable (and it has 2 solutions) if and only if $219^{\varphi(383)/2} = 219^{191} \equiv 1 \pmod{383}$. It is not easy to verify this condition without some computational power (though this can still be calculated on a "piece of paper"). However, we will show that this condition can be verified much more easily using the properties of the so-called Legendre symbol.
looking for, its derivative $(x^5 - 5)' = 5x^4$ does not evaluate to a multiple of 23 at the wanted solution, either. Invoking Hensel's lemma, we can summarize that the original congruence has a unique solution (without having to describe it explicitly). □
11.C.18. Give an example of a polynomial congruence whose degree is less than the number of its solutions.
Solution. Taking into account theorem 11.4.8, we must use either a modulus which is composite or a polynomial all of whose coefficients are multiples of the modulus.
As an example of a congruence of the first kind, we can put
$x^2 \equiv 1 \pmod 8$,
which is a quadratic congruence with four solutions 1, 3, 5, 7.
The case of a prime modulus can be exemplified by the quadratic congruence $10x^2 - 15 \equiv 0 \pmod 5$, which has five solutions. □
11.C.19. Other types of congruences. Prove that for any natural number $n$, the integer
$111 + 2^{2^{2^{2n-1}}}$ is divisible by 127.
Solution. We are to prove that the congruence
$2^{2^{2^{2n-1}}} \equiv -111 \pmod{127}$
is satisfied for every $n \in \mathbb{N}$. This congruence is equivalent to
$2^{2^{2^{2n-1}}} \equiv 2^{2^2} \pmod{127}$.
Since $2^7 = 128 \equiv 1 \pmod{127}$, the order of 2 modulo 127 equals 7, so the congruence to be proved is (by 11.3.3) equivalent to
$2^{2^{2n-1}} \equiv 2^2 \pmod 7$.
Similarly, the order of 2 modulo 7 is 3, which leads to the (again equivalent) congruence
$2^{2n-1} \equiv 2 \pmod 3$, i.e.,
$(-1)^{2n-1} \equiv -1 \pmod 3$,
and this is apparently true (we could also have proceeded likewise: the order of 2 modulo 3 is 2, and so on). This proves the statement. □
Legendre symbol
Let $p$ be an odd prime and $a$ an integer. The Legendre symbol $\left(\frac{a}{p}\right)$ is defined by
$\left(\frac{a}{p}\right) = 1$ if $p \nmid a$ and $a$ is a quadratic residue modulo $p$,
$\left(\frac{a}{p}\right) = 0$ if $p \mid a$,
$\left(\frac{a}{p}\right) = -1$ if $a$ is a quadratic nonresidue modulo $p$.
The Legendre symbol is also often written as $(a/p)$ and usually read as "a on p".
Example. Since the congruence $x^2 \equiv 1 \pmod p$ is solvable for an arbitrary odd prime $p$, we have $(1/p) = 1$. Further, $(-1/5) = (4/5) = 1$, because the congruence $x^2 \equiv -1 \pmod 5$ is equivalent to the congruence $x^2 \equiv 4 \pmod 5$, whose solutions are $x \equiv \pm 2 \pmod 5$.
The statement of the following lemma will be very often used when evaluating the Legendre symbol in practice.
11.4.12. Lemma. Let $p$ be an odd prime, $a, b \in \mathbb{Z}$ arbitrary. Then:
(1) $(a/p) \equiv a^{\frac{p-1}{2}} \pmod p$.
(2) $(ab/p) = (a/p) \cdot (b/p)$.
(3) If $a \equiv b \pmod p$, then $(a/p) = (b/p)$.
Proof. (1) The statement is clear for $p \mid a$; if $a$ is a quadratic residue modulo $p$, then the statement follows from theorem 11.4.10 about the solvability of binomial congruences, which claims (in this case, we have $(\varphi(p), 2) = 2$) that the necessary and sufficient condition for the congruence $x^2 \equiv a \pmod p$ to be solvable is that
$a^{\frac{p-1}{2}} \equiv 1 \pmod p$.
The same theorem implies for the case of a quadratic nonresidue as well that we have $a^{\frac{p-1}{2}} \not\equiv 1 \pmod p$. However, then (since we have $p \mid a^{p-1} - 1 = (a^{\frac{p-1}{2}} - 1)(a^{\frac{p-1}{2}} + 1)$ by Fermat's theorem), necessarily $p \mid a^{\frac{p-1}{2}} + 1$, i.e., $a^{\frac{p-1}{2}} \equiv -1 \pmod p$.
(2) From (1), we have
$(ab/p) \equiv (ab)^{\frac{p-1}{2}} = a^{\frac{p-1}{2}} b^{\frac{p-1}{2}} \equiv (a/p) \cdot (b/p) \pmod p$.
However, since the values of the Legendre symbol belong to the set $\{-1, 0, 1\}$ and $p > 2$, this congruence immediately implies that the left and right sides are equal.
(3) Apparent from the definition.
□
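Part (1) of the lemma gives an immediate way to evaluate the Legendre symbol by fast modular exponentiation. The following Python sketch (the function name `legendre` is our own) assumes that the argument `p` really is an odd prime; it does not check this.

```python
def legendre(a, p):
    """Legendre symbol (a/p) for an odd prime p, computed as a**((p-1)/2) mod p."""
    r = pow(a, (p - 1) // 2, p)     # r is one of 0, 1, p-1
    return -1 if r == p - 1 else r

p = 11
# Quadratic residues modulo 11, found by squaring all nonzero residues
residues = sorted({x * x % p for x in range(1, p)})
print(residues)
print([legendre(a, p) for a in range(1, p)])
print(legendre(219, 383))
```

The last line evaluates the symbol $(219/383)$ from the example in 11.4.11 by a single call of the built-in three-argument `pow`.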
Corollary. (1) Any reduced residue system modulo $p$ contains the same number of quadratic residues as nonresidues.
(2) The product of two quadratic residues as well as the product of two quadratic nonresidues is a residue; the product of a residue and a nonresidue is a nonresidue.
(3) $(-1/p) = (-1)^{\frac{p-1}{2}}$, i.e., the congruence $x^2 \equiv -1 \pmod p$ is solvable if and only if $p \equiv 1 \pmod 4$.
11.C.20. Determine which natural numbers $n$ satisfy that the integer $n \cdot 2^n + 1$ is divisible by seven.
Solution. We are looking for the solution of the congruence
$n \cdot 2^n \equiv -1 \pmod 7$.
We should be aware of the fact that we cannot use theorem 11.4.1, since $n \cdot 2^n$ is not a polynomial in the variable $n$, so it is not guaranteed (and it is even not true) that the expression will yield the same remainder modulo 7 when evaluated at integers which are congruent modulo 7.
On the other hand, we can notice that the order of 2 modulo 7 is equal to 3, so we can split the problem into three cases according to the remainder of $n$ when divided by 3.
For $n \equiv 0 \pmod 3$, we have $2^n \equiv 1 \pmod 7$, so the congruence in question is equivalent to $n \equiv -1 \pmod 7$. Combining the conditions $n \equiv 0 \pmod 3$ and $n \equiv -1 \pmod 7$ by the Chinese remainder theorem leads to the solution $n \equiv 6 \pmod{21}$.
Now, for $n \equiv 1 \pmod 3$, we have $2^n \equiv 2 \pmod 7$, so the examined congruence is of the form $2n \equiv -1 \pmod 7$, which is equivalent to $n \equiv 3 \pmod 7$. The conditions $n \equiv 1 \pmod 3$ and $n \equiv 3 \pmod 7$ are satisfied iff $n \equiv 10 \pmod{21}$.
Finally, for $n \equiv 2 \pmod 3$, we have $2^n \equiv 4 \pmod 7$, and the solution of the congruence $4n \equiv -1 \pmod 7$ is $n \equiv 5 \pmod 7$. Altogether, $n \equiv 5 \pmod{21}$.
The problem is satisfied by exactly those natural numbers $n$ with $n \equiv 5, 6, 10 \pmod{21}$. □
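The result can be cross-checked by brute force. The following short Python sketch tests all $n$ below an arbitrarily chosen bound of 200 and collects the residues modulo 21 for which $n \cdot 2^n + 1$ is divisible by 7.

```python
# The power 2**n is reduced modulo 7 by the three-argument pow to keep numbers small
hits = [n for n in range(1, 200) if (n * pow(2, n, 7) + 1) % 7 == 0]
classes = sorted({n % 21 for n in hits})
print(classes)
```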
11.C.21. Prove that for any natural number $n$, the integer $2n^4 + n^3 + 50$ is divisible by 6 if and only if the integer $2 \cdot 4^n + 3^n + 50$ is divisible by 13.
Solution. The expression $f(n) = 2n^4 + n^3 + 50$ is a polynomial in the variable $n$, so in this case, we can make use of theorem 11.4.1, i.e., it suffices to go through all possible remainders modulo 6. Since the order of 4 modulo 13 is equal to 6 and the order of 3 modulo 13 equals 3, it is enough (by 11.3.3) to examine the remainder of $n$ upon division by 6 in the latter case as well. In the former case, we calculate
n             0   1   2   3   4   5
f(n) mod 6    2   5   0   5   2   3
Proof. (1) Considering the elements of a reduced residue system modulo $p$ (we can take, for instance, the set $\{-\frac{p-1}{2}, \ldots, -1, 1, \ldots, \frac{p-1}{2}\}$), the quadratic residues are exactly those integers which are congruent to one of $(\pm 1)^2, \ldots, (\pm\frac{p-1}{2})^2$. Thus there are exactly $\frac{p-1}{2}$ quadratic residues, so there are $p - 1 - \frac{p-1}{2} = \frac{p-1}{2}$ of the other ones (the quadratic nonresidues).
(2) This follows immediately from part (2) of the previous lemma.
(3) It follows from part (1) of the lemma that $(-1/p) \equiv (-1)^{\frac{p-1}{2}} \pmod p$; both sides, however, take on the values $\pm 1$, so they must be equal.
□
These basic statements about the values of the Legendre symbol are already sufficient for proving the theorem on the infinitude of primes of the form $4k + 1$ (see paragraph 11.2.5).
Proposition. There are infinitely many primes of the form $4k + 1$.
Proof. We will proceed by contradiction. Suppose that $p_1, p_2, \ldots, p_\ell$ is the enumeration of all primes of the form $4k + 1$, and consider the integer $N = (2p_1 \cdots p_\ell)^2 + 1$. This integer is of the form $4k + 1$ as well. The assumption that $N$ is a prime would lead to an immediate contradiction, since $N$ is surely greater than any of the integers $p_1, p_2, \ldots, p_\ell$. Therefore, from now on, let us suppose that $N$ is composite. Then, there must exist a prime $p$ which divides $N$. Apparently, none of the primes $2, p_1, p_2, \ldots, p_\ell$ divides $N$, so we will be finished if we prove that $p$ is also of the form $4k + 1$. It follows from the congruence $(2p_1 \cdots p_\ell)^2 \equiv -1 \pmod p$ that $(-1/p) = 1$, and this is true (by the previous corollary) if and only if $p \equiv 1 \pmod 4$. Altogether, we have reached a contradiction (a prime $p$ not belonging to the original list of all primes of the form $4k + 1$) in the case of composite $N$ as well, which proves that there are infinitely many such primes. □
The most important theorem which allows us to efficiently compute the value of the Legendre symbol (and thus determine the solvability of a quadratic congruence) is the so-called law of quadratic reciprocity.
Law of quadratic reciprocity
11.4.13. Theorem. Let $p, q$ be odd primes. Then,
(1) $(-1/p) = (-1)^{\frac{p-1}{2}}$,
(2) $(2/p) = (-1)^{\frac{p^2-1}{8}}$,
(3) $(q/p) = (p/q) \cdot (-1)^{\frac{p-1}{2} \cdot \frac{q-1}{2}}$.
Therefore, the congruence $f(n) \equiv 0 \pmod 6$ is satisfied by exactly those natural numbers $n$ which satisfy $n \equiv 2 \pmod 6$.
In the latter case, we gradually compute that
n                          0    1    2    3    4    5
4^n mod 13                 1    4    3   -1   -4   -3
3^n mod 13                 1    3    9    1    3    9
2·4^n + 3^n - 2 mod 13     1    9    0   -3   -7    1
Just like in the former case, the congruence $2 \cdot 4^n + 3^n + 50 \equiv 0 \pmod{13}$ is satisfied if and only if $n \equiv 2 \pmod 6$. □
11.C.22. Another proof of Wilson's theorem. Prove that if p is a prime, then
$(p - 1)! \equiv -1 \pmod p$.
Solution. The statement is clearly true for $p = 2$, so we can consider only odd primes $p$ from now on. By Fermat's little theorem, the congruence
$(x - 1)(x - 2) \cdots (x - (p - 1)) - (x^{p-1} - 1) \equiv 0 \pmod p$
is satisfied by any integer $x$ which is not divisible by $p$; i.e., there are $p - 1$ solutions. However, its degree is equal to $p - 2$ (which is less than the number of solutions). It follows from 11.4.8 that all of the coefficients of the left-hand polynomial are multiples of $p$. In particular, this applies to the absolute term, which equals $(p - 1)! + 1$. This proves Wilson's theorem.
□
11.C.23. Solve the congruence
$x^2 \equiv 18 \pmod{63}$.
Solution. Since $(18, 63) = 9$, it must be that $9 \mid x^2$, i.e., $3 \mid x$. Setting $x = 3x_1$, $x_1 \in \mathbb{Z}$, we get an equivalent congruence $x_1^2 \equiv 2 \pmod 7$, in which the modulus is already coprime to the integer on the right-hand side. It follows from theorem 11.4.8 that this congruence has at most 2 solutions, and those are clearly $x_1 \equiv \pm 3 \pmod 7$, i.e., $x_1 \equiv \pm 3, \pm 10, \pm 17, \pm 24, \pm 31, \pm 38, \pm 45, \pm 52, \pm 59 \pmod{63}$. The solution of the original congruence is thus $x \equiv 3x_1 \pmod{63}$, i.e., $x \equiv \pm 9, \pm 12, \pm 30 \pmod{63}$. □
The theorem is put this way mainly because we can calculate the value $(a/p)$ for any integer $a$ using these three formulae and the basic rules for the Legendre symbol.
Many proofs can be found in the literature (see the footnote). However, many of them (especially the shorter ones) usually make use of deeper knowledge from algebraic number theory. We will present an elementary proof of this theorem here.
Let $S$ denote the reduced residue system of the least residues (in absolute value) modulo $p$, i.e.,
$S = \{-\frac{p-1}{2}, -\frac{p-3}{2}, \ldots, -1, 1, \ldots, \frac{p-3}{2}, \frac{p-1}{2}\}$.
Further, for $a \in \mathbb{Z}$, $p \nmid a$, let $\mu_p(a)$ denote the number of negative least residues (in absolute value) of the integers
$1 \cdot a, 2 \cdot a, \ldots, \frac{p-1}{2} \cdot a$,
i.e., we decide for each of these integers to which integer from the set $S$ it is congruent and count the number of the negative ones. If it is clear from context which values $a, p$ we mean, we will usually omit the parameters and write only $\mu$ instead of $\mu_p(a)$.
Example. We determine $\mu_p(a)$ for the prime $p = 11$ and the integer $a = 3$.
Now, the reduced residue system we are interested in is $S = \{-5, \ldots, -1, 1, \ldots, 5\}$, and for $a = 3$, we calculate
$1 \cdot 3 \equiv 3 \pmod{11}$,
$2 \cdot 3 \equiv -5 \pmod{11}$,
$3 \cdot 3 \equiv -2 \pmod{11}$,
$4 \cdot 3 \equiv 1 \pmod{11}$,
$5 \cdot 3 \equiv 4 \pmod{11}$,
whence $\mu_{11}(3) = 2$.
We will show in the following statement that this integer is tightly connected to the Legendre symbol: the value of the symbol $(3/11)$ can be determined in terms of the $\mu$ function as $(-1)^{\mu_{11}(3)} = (-1)^2 = 1$.
Lemma (Gauss). If $p$ is an odd prime, $a \in \mathbb{Z}$, $p \nmid a$, then the value of the Legendre symbol satisfies
$(a/p) = (-1)^{\mu_p(a)}$.
Proof. For each integer $i \in \{1, 2, \ldots, \frac{p-1}{2}\}$, we set a value $m_i \in \{1, 2, \ldots, \frac{p-1}{2}\}$ so that $i \cdot a \equiv \pm m_i \pmod p$. We can easily see that if $k, l \in \{1, 2, \ldots, \frac{p-1}{2}\}$ are different, then the values $m_k, m_l$ are also different (the equality $m_k = m_l$ would imply that $k \cdot a \equiv \pm l \cdot a \pmod p$, and hence $k \equiv \pm l \pmod p$, which cannot be satisfied unless $k = l$).
In 2000, F. Lemmermeyer stated 233 proofs - see F. Lemmermeyer, Reciprocity laws. From Euler to Eisenstein, Springer. 2000
11.C.24. Solve the congruence
$x^3 \equiv 3 \pmod{18}$.
Solution. Since $(3, 18) = 3$, we must have $3 \mid x$. Making the substitution $x = 3 \cdot x_1$, similarly to the above exercise, we get the congruence
$27x_1^3 \equiv 3 \pmod{18}$,
which has no solution since $(27, 18) \nmid 3$.
□
11.C.25. Quadratic congruences. First of all, we introduce several problems which illustrate that the Jacobi symbol has properties similar to the Legendre one, which relieves us of the necessity to factor the integers that appear when working with the Legendre symbol.
Prove that all odd positive numbers $b, b'$ and all integers $a, a_1, a_2$ satisfy (the symbols used here are always the Jacobi ones):
i) if $a_1 \equiv a_2 \pmod b$, then $(a_1/b) = (a_2/b)$,
ii) $(a_1 a_2 / b) = (a_1/b) \cdot (a_2/b)$,
iii) $(a / bb') = (a/b) \cdot (a/b')$.
Solution. All of the results can be proved directly from the definition of the Jacobi symbol and the multiplicativity of the Legendre symbol. □
11.C.26. Prove that if $a, b$ are odd natural numbers, then
i) $\frac{ab-1}{2} \equiv \frac{a-1}{2} + \frac{b-1}{2} \pmod 2$,
ii) $\frac{a^2b^2-1}{8} \equiv \frac{a^2-1}{8} + \frac{b^2-1}{8} \pmod 2$.
Solution.
i) Since the integer $(a - 1)(b - 1) = (ab - 1) - (a - 1) - (b - 1)$ is a multiple of 4, we get $(ab - 1) \equiv (a - 1) + (b - 1) \pmod 4$, which gives what we want when divided by two.
ii) Similarly to above, $(a^2 - 1)(b^2 - 1) = (a^2b^2 - 1) - (a^2 - 1) - (b^2 - 1)$ is a multiple of 16. Therefore, $(a^2b^2 - 1) \equiv (a^2 - 1) + (b^2 - 1) \pmod{16}$, which gives the wanted statement when divided by eight (see also exercise 11.A.2). □
11.C.27. Prove that if $a_1, \ldots, a_k$ are odd natural numbers, then
i) $\frac{\prod_{\ell=1}^{k} a_\ell - 1}{2} \equiv \sum_{\ell=1}^{k} \frac{a_\ell - 1}{2} \pmod 2$,
ii) $\frac{\prod_{\ell=1}^{k} a_\ell^2 - 1}{8} \equiv \sum_{\ell=1}^{k} \frac{a_\ell^2 - 1}{8} \pmod 2$.
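Therefore, the sets $\{1, 2, \ldots, \frac{p-1}{2}\}$ and $\{m_1, m_2, \ldots, m_{\frac{p-1}{2}}\}$ coincide, which is also illustrated by the above example. Multiplying the congruences
$1 \cdot a \equiv \pm m_1 \pmod p$,
$2 \cdot a \equiv \pm m_2 \pmod p$,
$\ldots$
$\frac{p-1}{2} \cdot a \equiv \pm m_{\frac{p-1}{2}} \pmod p$
leads to
$\left(\frac{p-1}{2}\right)! \cdot a^{\frac{p-1}{2}} \equiv (-1)^{\mu} \cdot \left(\frac{p-1}{2}\right)! \pmod p$,
since there are exactly $\mu$ negative values on the right-hand sides of the congruences. Dividing both sides by the integer $\left(\frac{p-1}{2}\right)!$, we get the wanted statement, making use of lemma 11.4.12, whence $(a/p) \equiv a^{\frac{p-1}{2}} \pmod p$.
□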
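Gauss's lemma can be checked on small cases with a few lines of Python. The function names `mu` and `legendre_gauss` are our own; a residue counts as negative when its least non-negative representative exceeds $\frac{p-1}{2}$.

```python
def mu(a, p):
    """Number of negative least absolute residues among a, 2a, ..., ((p-1)/2)*a modulo p."""
    half = (p - 1) // 2
    # (k*a) % p lies in 1..p-1; values above half correspond to negative least residues
    return sum(1 for k in range(1, half + 1) if (k * a) % p > half)

def legendre_gauss(a, p):
    """Legendre symbol (a/p) via Gauss's lemma, for an odd prime p not dividing a."""
    return (-1) ** mu(a, p)

print(mu(3, 11), legendre_gauss(3, 11))
```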
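Now, with the help of Gauss's lemma, we will prove the law of quadratic reciprocity.
Proof of the law of quadratic reciprocity. The first part has been already proven; for the rest, we first derive a lemma which will be utilized in the proof of both of the remaining parts.
Let $a \in \mathbb{Z}$, $p \nmid a$, $k \in \mathbb{N}$ and let $[x]$ and $\langle x \rangle$ denote the integer part (i.e. floor) and the fractional part, respectively, of a real number $x$. Then,
$\left[\frac{2ak}{p}\right] = \left[2\left[\frac{ak}{p}\right] + 2\left\langle\frac{ak}{p}\right\rangle\right] = 2\left[\frac{ak}{p}\right] + \left[2\left\langle\frac{ak}{p}\right\rangle\right]$.
This expression is odd if and only if $\left\langle\frac{ak}{p}\right\rangle > \frac{1}{2}$, which is iff the least residue (in absolute value) of the integer $ak$ modulo $p$ is negative (a watchful reader should notice the return from the calculations of (ostensibly) irrelevant expressions back to the conditions close to the Legendre symbol). The integer $\mu_p(a)$ thus has the same parity (is congruent to, modulo 2) as $\sum_{k=1}^{\frac{p-1}{2}} \left[\frac{2ak}{p}\right]$, whence (thanks to Gauss's lemma) we get that
$(a/p) = (-1)^{\mu_p(a)} = (-1)^{\sum_{k=1}^{\frac{p-1}{2}} \left[\frac{2ak}{p}\right]}$.
Furthermore, if $a$ is odd, then $a + p$ is even and we get
$(2a/p) = \left(\frac{2a + 2p}{p}\right) = \left(\frac{4 \cdot \frac{a+p}{2}}{p}\right) = \left(\frac{\frac{a+p}{2}}{p}\right) = (-1)^{\sum_{k=1}^{\frac{p-1}{2}} \left[\frac{(a+p)k}{p}\right]} = (-1)^{\sum_{k=1}^{\frac{p-1}{2}} \left[\frac{ak}{p}\right]} \cdot (-1)^{\sum_{k=1}^{\frac{p-1}{2}} k}$.
Since the sum of the arithmetic series $\sum_{k=1}^{\frac{p-1}{2}} k$ is $\frac{1}{2} \cdot \frac{p-1}{2} \cdot \frac{p+1}{2} = \frac{p^2-1}{8}$, we get (for $a$ odd) the relation
$(2/p) \cdot (a/p) = (-1)^{\sum_{k=1}^{\frac{p-1}{2}} \left[\frac{ak}{p}\right]} \cdot (-1)^{\frac{p^2-1}{8}}$,
which, for $a = 1$, gives the wanted statement of item 2.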
Solution. In light of the previous exercise, both statements can be proved easily by mathematical induction. □
11.C.28. Prove the law of quadratic reciprocity for the Jacobi symbol, i.e., prove that if $a, b$ are odd natural numbers, then
i) $(-1/a) = (-1)^{\frac{a-1}{2}}$,
ii) $(2/a) = (-1)^{\frac{a^2-1}{8}}$,
iii) $(a/b) = (b/a) \cdot (-1)^{\frac{a-1}{2} \cdot \frac{b-1}{2}}$.
Solution. Let (just like in the definition of the Jacobi symbol) $a$ factor to (odd) primes as $p_1 p_2 \cdots p_k$.
i) The properties of the Legendre symbol and the aforementioned statement imply that
$(-1/a) = (-1/p_1) \cdot (-1/p_2) \cdots (-1/p_k) = (-1)^{\frac{p_1-1}{2}} \cdots (-1)^{\frac{p_k-1}{2}} = (-1)^{\sum_{\ell=1}^{k} \frac{p_\ell-1}{2}} = (-1)^{\frac{\prod_{\ell=1}^{k} p_\ell - 1}{2}} = (-1)^{\frac{a-1}{2}}$.
ii) Analogously to above.
iii) Further, let $b$ factor to (odd) primes as $q_1 q_2 \cdots q_\ell$. If we have $p_i = q_j$ for some $i$ and $j$, then the symbols on both sides of the equality are equal to zero. Otherwise, the law of quadratic reciprocity for the Legendre symbol implies that for all pairs $(p_i, q_j)$, we have
$(p_i/q_j) = (q_j/p_i) \cdot (-1)^{\frac{p_i-1}{2} \cdot \frac{q_j-1}{2}}$.
Therefore,
$(a/b) = \prod_{i=1}^{k} \prod_{j=1}^{\ell} (p_i/q_j) = \prod_{i=1}^{k} \prod_{j=1}^{\ell} (q_j/p_i) \cdot (-1)^{\frac{p_i-1}{2} \cdot \frac{q_j-1}{2}} = (-1)^{\sum_{i=1}^{k} \frac{p_i-1}{2} \cdot \sum_{j=1}^{\ell} \frac{q_j-1}{2}} \cdot \prod_{i=1}^{k} \prod_{j=1}^{\ell} (q_j/p_i) = (-1)^{\frac{a-1}{2} \cdot \frac{b-1}{2}} \cdot (b/a)$.
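By part 2, which we have already proved, and the previous equality, we now get for odd integers $a$ that
$(a/p) = (-1)^{\sum_{k=1}^{\frac{p-1}{2}} \left[\frac{ak}{p}\right]}$.   (1)
Now, let us consider, for given primes $p \ne q$, the set
$T = \{q \cdot x;\ x \in \mathbb{Z},\ 1 \le x \le (p - 1)/2\} \times \{p \cdot y;\ y \in \mathbb{Z},\ 1 \le y \le (q - 1)/2\}$.
We apparently have $|T| = \frac{p-1}{2} \cdot \frac{q-1}{2}$. We will show that we also have
$(-1)^{|T|} = (q/p) \cdot (p/q)$,
which will be sufficient thanks to the above.
Since the equality $qx = py$ can happen for no pair of $x, y$ from the permissible domain, the set $T$ can be partitioned to (disjoint) subsets $T_1$ and $T_2$ so that $T_1 = T \cap \{(u, v);\ u, v \in \mathbb{Z},\ u < v\}$, $T_2 = T \setminus T_1$. Clearly, $|T_1|$ is the number of pairs $(qx, py)$ for which $x < \frac{p}{q} y$. Since $\frac{p}{q} y \le \frac{p}{q} \cdot \frac{q-1}{2} < \frac{p}{2}$, we have $\left[\frac{p}{q} y\right] \le \frac{p-1}{2}$. For a fixed $y$, in $T_1$, there are thus exactly those pairs $(qx, py)$ for which $1 \le x \le \left[\frac{p}{q} y\right]$; hence $|T_1| = \sum_{y=1}^{(q-1)/2} \left[\frac{p}{q} y\right]$. Analogously, $|T_2| = \sum_{x=1}^{(p-1)/2} \left[\frac{q}{p} x\right]$.
By (1), we thus have $(p/q) = (-1)^{|T_1|}$ and $(q/p) = (-1)^{|T_2|}$, which finishes the proof of the law of quadratic reciprocity.
□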
Corollary. Let $p, q$ be odd primes.
(1) $-1$ is a quadratic residue for primes $p$ which satisfy $p \equiv 1 \pmod 4$ and it is a quadratic nonresidue for primes $p$ satisfying $p \equiv 3 \pmod 4$.
(2) 2 is a quadratic residue for primes $p$ which satisfy $p \equiv \pm 1 \pmod 8$ and it is a quadratic nonresidue for primes $p$ satisfying $p \equiv \pm 3 \pmod 8$.
(3) If $p \equiv 1 \pmod 4$ or $q \equiv 1 \pmod 4$, then $(p/q) = (q/p)$; for other odd $p, q$, we have $(p/q) = -(q/p)$.
Proof. (1) The integer $\frac{p-1}{2}$ is even iff $4 \mid p - 1$.
(2) We need to know for which odd primes $p$ the exponent $\frac{p^2-1}{8}$ is even. Odd primes are congruent to $\pm 1$ or $\pm 3$ modulo 8, so we have (by 11.B.7) that either $p^2 \equiv 1 \pmod{16}$ or $p^2 \equiv 9 \pmod{16}$; the exponent is even in the former case and odd in the latter.
(3) This is clear from the law of quadratic reciprocity.
□
We utilized the result of part (i) of the previous exercise in the calculations. □

Example. Let us calculate the value (79/101) using the properties of the Legendre symbol.
$$\begin{aligned}
\left(\frac{79}{101}\right) &= \left(\frac{101}{79}\right) = \left(\frac{22}{79}\right) && \text{since 101 is congruent to 1 modulo 4}\\
&= \left(\frac{2}{79}\right)\cdot\left(\frac{11}{79}\right) = \left(\frac{11}{79}\right) && \text{since 79 is congruent to } -1 \text{ modulo 8}\\
&= (-1)\cdot\left(\frac{79}{11}\right) = (-1)\cdot\left(\frac{2}{11}\right) && \text{since } 11 \equiv 79 \equiv 3 \pmod 4\\
&= (-1)\cdot(-1) = 1 && \text{since } 11 \equiv 3 \pmod 8.
\end{aligned}$$

The Legendre symbol (as we saw in the example above) allows us to only use the law of quadratic reciprocity for primes, so it forces us to factor integers to primes, which is a very hard operation from the computational point of view. This can be mended by extending the definition of the Legendre symbol to the so-called Jacobi symbol with similar properties.

Definition. Let $a \in \mathbb{Z}$, $b \in \mathbb{N}$, $2 \nmid b$. Let b factor as $b = p_1 p_2 \cdots p_k$ to (odd) primes (here, we exceptionally do not group the same primes to a power of the prime, rather we write each one explicitly; e.g. $135 = 3\cdot 3\cdot 3\cdot 5$). The symbol
$$\left(\frac{a}{b}\right) = \left(\frac{a}{p_1}\right)\cdot\left(\frac{a}{p_2}\right)\cdots\left(\frac{a}{p_k}\right)$$
is called the Jacobi symbol.

Applications of the Legendre and Jacobi symbols.
The primary motivation for introducing the Jacobi symbol was the necessity to evaluate the Legendre symbol (and thus to decide the solvability of quadratic congruences) without having to factor integers to primes. We will illustrate this calculation on an example.

11.C.29. Decide whether the congruence $x^2 \equiv 219 \pmod{383}$ is solvable.
Solution. Since 383 is a prime, the congruence is solvable if and only if the Legendre symbol satisfies (219/383) = 1.
$$\begin{aligned}
\left(\frac{219}{383}\right) &= -\left(\frac{383}{219}\right) && \text{(Jacobi) both 383 and 219 leave remainder 3 upon division by 4}\\
&= -\left(\frac{164}{219}\right) = -\left(\frac{41}{219}\right) && 164 = 2^2\cdot 41\\
&= -\left(\frac{219}{41}\right) && \text{(Jacobi) 41 leaves remainder 1 upon division by 4}\\
&= -\left(\frac{14}{41}\right) = -\left(\frac{2}{41}\right)\cdot\left(\frac{7}{41}\right) &&\\
&= -\left(\frac{7}{41}\right) && \text{41 leaves remainder 1 upon division by 8}\\
&= -\left(\frac{41}{7}\right) && \text{41 leaves remainder 1 upon division by 4}\\
&= -\left(\frac{-1}{7}\right) = 1 && \text{7 leaves remainder 3 upon division by 4.}
\end{aligned}$$
Hence the congruence is solvable.
□
11.C.30. Find all integers which satisfy the congruence
$$x^2 \equiv 7 \pmod{43}.$$
Solution. The Legendre symbol evaluates to
$$\left(\frac{7}{43}\right) = -\left(\frac{43}{7}\right) = -\left(\frac{1}{7}\right) = -1.$$
Hence it follows that 7 is a quadratic nonresidue modulo 43, so there is no solution of the given congruence. □
11.C.31. Find all integers a for which the congruence
$$x^2 \equiv a \pmod{43}$$
is solvable.
Solution. This exercise follows up on the previous one, where we saw that the integer 7 does not satisfy the condition. We could test all the remainders modulo 43 this way, but there is a simpler
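We will show below that the Jacobi symbol has similar properties as the Legendre one. However, there is a substantial difference: it is not generally true that (a/b) = 1 implies that the congruence $x^2 \equiv a \pmod b$ is solvable.
Example. We have
$$\left(\frac{2}{15}\right) = \left(\frac{2}{3}\right)\cdot\left(\frac{2}{5}\right) = (-1)\cdot(-1) = 1,$$
but the congruence
$$x^2 \equiv 2 \pmod{15}$$
has no solution (the congruence $x^2 \equiv 2$ is solvable neither modulo 3 nor modulo 5).
Theorem (Law of quadratic reciprocity for the Jacobi symbol). Let $a, b \in \mathbb{N}$ be odd integers. Then,
(1) $\left(\frac{-1}{a}\right) = (-1)^{\frac{a-1}{2}}$,
(2) $\left(\frac{2}{a}\right) = (-1)^{\frac{a^2-1}{8}}$,
(3) $\left(\frac{a}{b}\right) = \left(\frac{b}{a}\right)\cdot(-1)^{\frac{a-1}{2}\cdot\frac{b-1}{2}}$.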
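The three rules of the theorem are enough to evaluate a Jacobi symbol without factoring anything except powers of 2. The following Python function is a minimal sketch of that idea; the name `jacobi` and the loop organisation are our own choices for illustration, not something prescribed by the text.

```python
def jacobi(a: int, b: int) -> int:
    """Jacobi symbol (a/b) for an odd positive b, computed without factoring b."""
    if b <= 0 or b % 2 == 0:
        raise ValueError("b must be a positive odd integer")
    a %= b
    sign = 1
    while a != 0:
        # Rule (2): each factor 2 of a contributes (2/b) = (-1)^((b^2-1)/8),
        # which is -1 exactly when b is congruent to 3 or 5 modulo 8.
        while a % 2 == 0:
            a //= 2
            if b % 8 in (3, 5):
                sign = -sign
        # Rule (3): swapping a and b flips the sign iff both are 3 modulo 4.
        if a % 4 == 3 and b % 4 == 3:
            sign = -sign
        a, b = b % a, a
    # Here a == 0; the symbol is non-zero only if the original pair was coprime.
    return sign if b == 1 else 0
```

For instance, `jacobi(2, 15)` returns 1 although $x^2 \equiv 2 \pmod{15}$ has no solution, in agreement with the example above.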
method. The congruence is surely solvable if a is a multiple of 43 (then, it has a unique solution); and if not, a must be a quadratic residue modulo 43. The quadratic residues can be most simply enumerated by calculating the squares of all elements of a reduced residue system modulo 43.
The quadratic residues are thus the integers congruent to $(\pm 1)^2, (\pm 2)^2, (\pm 3)^2, \dots, (\pm 21)^2$ modulo 43, so the problem is satisfied by the multiples of 43 and by exactly those integers a which are congruent to any one of 1, 4, 6, 9, 10, 11, 13, 14, 15, 16, 17, 21, 23, 24, 25, 31, 35, 36, 38, 40, 41. □
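The enumeration above is easy to reproduce mechanically. The sketch below (the helper name `quadratic_residues` is ours, and p is assumed to be an odd prime) squares the representatives $1, \dots, (p-1)/2$ and collects the remainders.

```python
def quadratic_residues(p: int) -> list[int]:
    """Non-zero quadratic residues modulo an odd prime p, in increasing order."""
    # It suffices to square 1, ..., (p-1)/2 since (-x)^2 = x^2.
    return sorted({(x * x) % p for x in range(1, (p - 1) // 2 + 1)})


print(quadratic_residues(43))
```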
11.C.32. Derive by straight calculation from Gauss's lemma that
$$\left(\frac{-1}{p}\right) = (-1)^{\frac{p-1}{2}} \quad\text{and}\quad \left(\frac{2}{p}\right) = (-1)^{\frac{p^2-1}{8}}.$$
Solution. To evaluate (-1/p) in the former case, we should realize that $\mu$ tells the number of negative least (in absolute value) remainders of integers in the set
$$\{-1, -2, \dots, -\tfrac{p-1}{2}\}.$$
However, those are exactly the desired remainders and they are all negative; hence we have $\mu = \frac{p-1}{2}$ and $(-1/p) = (-1)^{\frac{p-1}{2}}$.
In the latter case, we need to express the number of negative least (in absolute value) remainders of integers in the set
$$\{1\cdot 2,\ 2\cdot 2,\ 3\cdot 2,\ \dots,\ \tfrac{p-1}{2}\cdot 2\}.$$
For any $k \in \{1, 2, \dots, \frac{p-1}{2}\}$, the integer 2k leaves a negative remainder if and only if $2k > \frac{p-1}{2}$, i.e., iff $k > \frac{p-1}{4}$. Now, it remains to determine the number of such integers k.
If $p \equiv 1 \pmod 4$, then this number is equal to $\frac{p-1}{2} - \frac{p-1}{4} = \frac{p-1}{4}$, so
$$\left(\frac{2}{p}\right) = (-1)^{\mu} = (-1)^{\frac{p-1}{4}} = (-1)^{\frac{p-1}{4}\cdot\frac{p+1}{2}} = (-1)^{\frac{p^2-1}{8}}$$
since $\frac{p+1}{2}$ is odd in this case.
Similarly, for $p \equiv 3 \pmod 4$, the number of such integers k equals $\frac{p-1}{2} - \frac{p-3}{4} = \frac{p+1}{4}$, so
$$\left(\frac{2}{p}\right) = (-1)^{\frac{p+1}{4}} = (-1)^{\frac{p+1}{4}\cdot\frac{p-1}{2}} = (-1)^{\frac{p^2-1}{8}}$$
since $\frac{p-1}{2}$ is odd in this case as well.
□
Proof. The proof is simple, utilizing the law of quadratic reciprocity for the Legendre symbol. See exercise 11.C.28. □
There is another application of the law of quadratic reciprocity in a certain sense - we can consider the question: For which primes is a given integer a a quadratic residue? We are already able to answer this question for a = 2, for example. The first step in answering this question is to do so for primes since the answer for composite values of a depends on the factorization of the integer a.
Theorem. Let q be an odd prime.
• If $q \equiv 1 \pmod 4$, then q is a quadratic residue modulo those primes p which satisfy $p \equiv r \pmod q$, where r is a quadratic residue modulo q.
• If $q \equiv 3 \pmod 4$, then q is a quadratic residue modulo those primes p which satisfy $p \equiv \pm b^2 \pmod{4q}$, where b is odd and coprime to q.
Proof. The first statement follows trivially from the law of quadratic reciprocity. Let us consider $q \equiv 3 \pmod 4$, i.e., $(q/p) = (-1)^{\frac{p-1}{2}}(p/q)$. First of all, let $p \equiv +b^2 \pmod{4q}$, where b is odd, and hence $b^2 \equiv 1 \pmod 4$. Then, $p \equiv b^2 \equiv 1 \pmod 4$ and $p \equiv b^2 \pmod q$. Therefore, $(-1)^{\frac{p-1}{2}} = 1$ and (p/q) = 1, whence (q/p) = 1. Now, if $p \equiv -b^2 \pmod{4q}$, then we similarly get that $p \equiv -b^2 \equiv 3 \pmod 4$ and $p \equiv -b^2 \pmod q$. Therefore, $(-1)^{\frac{p-1}{2}} = -1$ and $(p/q) = (-1/q)(b^2/q) = -1$, whence we get again that (q/p) = 1.
For the opposite way, suppose that (q/p) = 1. There are two possibilities - either $(-1)^{\frac{p-1}{2}} = 1$ and (p/q) = 1, or $(-1)^{\frac{p-1}{2}} = -1$ and (p/q) = -1. In the former case, we have $p \equiv 1 \pmod 4$ and there is a b such that $p \equiv b^2 \pmod q$. Further, we can assume without loss of generality that b is odd (if not, we could have taken b + q instead). However, then we get $b^2 \equiv 1 \equiv p \pmod 4$, and altogether $p \equiv b^2 \pmod{4q}$.
In the latter case, we have $p \equiv 3 \pmod 4$ and $(-p/q) = (-1/q)(p/q) = (-1)(-1) = 1$. Therefore, there is a b (which can also be chosen so that it is odd) such that $-p \equiv b^2 \pmod q$. We thus get $-b^2 \equiv 3 \equiv p \pmod 4$, and altogether $p \equiv -b^2 \pmod{4q}$. □
5. Applications - calculation with large integers, cryptography
11.5.1. Computation aspects of number theory. In many practical problems which utilize the results of number theory, it is necessary to execute one or more of the following computations fast:
• common arithmetic operations (sum, product, modulo) on integers;
• to determine the remainder of a (natural) n-th power of an integer a when divided by a given m;
• to determine the multiplicative inverse of an integer a modulo $m \in \mathbb{N}$;
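11.C.33. Determine whether the congruence $x^2 \equiv 38 \pmod{165}$ is solvable.
Solution. The Jacobi symbol is equal to
$$\left(\frac{38}{165}\right) = \left(\frac{2}{165}\right)\cdot\left(\frac{19}{165}\right) = \left(\frac{2}{3}\right)\left(\frac{2}{5}\right)\left(\frac{2}{11}\right)\cdot\left(\frac{19}{3}\right)\left(\frac{19}{5}\right)\left(\frac{19}{11}\right) = (-1)^3\left(\frac{1}{3}\right)\cdot\left(\frac{-1}{5}\right)\cdot\left(\frac{2}{11}\right)^3 = 1.$$
This is not enough to rule out the existence of a solution. However, if we split the congruence to a system of congruences according to the factors of the modulus, we obtain
$$x^2 \equiv -1 \pmod 3, \quad x^2 \equiv 3 \pmod 5, \quad x^2 \equiv 5 \pmod{11},$$
whence we can easily see that the first and second congruences have no solution. Therefore, neither does the original one. In particular,
$$\left(\frac{-1}{3}\right) = -1 \quad\text{and}\quad \left(\frac{3}{5}\right) = -1.$$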
11.C.34. Solve the congruence $x^2 - 23 \equiv 0 \pmod{77}$.
Solution. Factoring the modulus, we get the system
$$x^2 - 1 \equiv 0 \pmod{11}, \quad x^2 - 2 \equiv 0 \pmod 7.$$
Clearly, 1 is a quadratic residue modulo 11, so the first congruence of the system has (exactly) two solutions: $x \equiv \pm 1 \pmod{11}$. Further, (2/7) = (9/7) = 1, and it should not take much effort to notice the solution: $x \equiv \pm 3 \pmod 7$.
We have thus obtained four simple systems of two linear congruences each. Solving them, we will get that the original congruence has the following four solutions: $x \equiv 10, 32, 45$ or $67 \pmod{77}$. □
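The four systems can also be solved mechanically. The following sketch finds the square roots modulo 11 and modulo 7 by brute force and glues them together with the Chinese remainder theorem; the helper `crt_pair` is a hypothetical name introduced here, and the three-argument `pow` with exponent -1 assumes Python 3.8 or newer.

```python
from itertools import product


def crt_pair(r1: int, m1: int, r2: int, m2: int) -> int:
    """The x modulo m1*m2 with x = r1 (mod m1) and x = r2 (mod m2); m1, m2 coprime."""
    inv = pow(m1, -1, m2)  # inverse of m1 modulo m2
    return (r1 + m1 * ((r2 - r1) * inv % m2)) % (m1 * m2)


# Square roots of 23 modulo each prime factor of 77, found by trying all remainders.
roots_11 = [x for x in range(11) if (x * x - 23) % 11 == 0]
roots_7 = [x for x in range(7) if (x * x - 23) % 7 == 0]

# Every pair of roots gives one solution modulo 77.
solutions = sorted(crt_pair(r, 11, s, 7) for r, s in product(roots_11, roots_7))
print(solutions)
```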
11.C.35. Find all primes p such that the integer below is a quadratic residue modulo p: i) 3, ii) -3, iii) 6.
Solution.
i) We are looking for primes $p \neq 3$ such that $x^2 \equiv 3 \pmod p$ is solvable. Since p = 2 satisfies the above, we will consider only odd primes $p \neq 3$ from now on. For $p \equiv 1 \pmod 4$, it follows from the law of quadratic reciprocity that 1 = (3/p) = (p/3), which occurs if and only if $p \equiv 1 \pmod 3$. On the other hand, if $p \equiv -1 \pmod 4$, then 1 = (3/p) = -(p/3), which holds for $p \equiv -1 \pmod 3$. Putting the conditions of both cases together, we arrive at $p \equiv \pm 1 \pmod{12}$, which, together with p = 2, completes the set of all primes satisfying the given condition.
• to determine the greatest common divisor of two integers (and the coefficients of corresponding Bezout's identity);
• to decide whether a given integer is a prime or composite number.
• to factor a given integer to primes.
Basic arithmetic operations are usually executed on large integers in the same way as we were taught at primary school, i.e., we add in linear time and multiply and divide with remainder in quadratic time. Multiplication, which is a base for many other operations, can be performed asymptotically more efficiently by algorithms of the divide and conquer type - for instance, the Karatsuba algorithm (1960), running in time $O(n^{\log_2 3})$, or the Schönhage-Strassen algorithm (1971), which runs in $O(n \log n \log\log n)$ and uses Fast Fourier Transforms - see also 7.2.5. Although the latter is asymptotically much better, in practice it becomes advantageous only for integers of at least ten thousand digits (it is thus used, for example, when looking for large primes in the GIMPS project).
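To make the divide and conquer idea concrete, here is a minimal Python sketch of Karatsuba multiplication for non-negative integers. It is an illustration only: the splitting by bit length and the base-case threshold 10 are our own assumptions, and Python's built-in integers already multiply efficiently.

```python
def karatsuba(x: int, y: int) -> int:
    """Product of two non-negative integers using three recursive multiplications."""
    if x < 10 or y < 10:
        return x * y  # small operands: multiply directly
    n = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << n) - 1
    x_hi, x_lo = x >> n, x & mask  # x = x_hi * 2^n + x_lo
    y_hi, y_lo = y >> n, y & mask  # y = y_hi * 2^n + y_lo
    high = karatsuba(x_hi, y_hi)
    low = karatsuba(x_lo, y_lo)
    # The middle coefficient costs only one extra multiplication.
    middle = karatsuba(x_hi + x_lo, y_hi + y_lo) - high - low
    return (high << (2 * n)) + (middle << n) + low
```

Replacing four half-size products by three is what lowers the exponent from 2 to $\log_2 3$.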
11.5.2. Greatest common divisor and modular inverses.
As we have already shown, the computation of the solution of the congruence $a\cdot x \equiv 1 \pmod m$ in variable x can be easily reduced (thanks to Bezout's identity) to the computation of the greatest common divisor of the integers a and m and looking for the coefficients k, l in Bezout's identity $k\cdot a + l\cdot m = 1$ (the integer k is then the wanted inverse of a modulo m).
function extended_gcd(a, m)
    if m == 0:
        return (1, 0)
    else
        (q, r) := divide(a, m)
        (k, l) := extended_gcd(m, r)
        return (l, k - q*l)
A thorough analysis⁹ shows that computing the greatest common divisor in this way has quadratic time complexity.
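The recursion can be turned into runnable Python almost verbatim. In the sketch below, `extended_gcd` additionally returns the greatest common divisor itself, and `mod_inverse` is a hypothetical wrapper name added for illustration.

```python
def extended_gcd(a: int, m: int) -> tuple[int, int, int]:
    """Return (g, k, l) with g = gcd(a, m) and k*a + l*m == g, for a, m >= 0."""
    if m == 0:
        return (a, 1, 0)
    q, r = divmod(a, m)            # a = q*m + r
    g, k, l = extended_gcd(m, r)   # k*m + l*r == g
    return (g, l, k - q * l)       # l*a + (k - q*l)*m == g


def mod_inverse(a: int, m: int) -> int:
    """Inverse of a modulo m; raises ValueError if a and m are not coprime."""
    g, k, _ = extended_gcd(a % m, m)
    if g != 1:
        raise ValueError("a is not invertible modulo m")
    return k % m


print(mod_inverse(183, 379))
```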
11.5.3. Modular exponentiation. The algorithm for modular exponentiation is based on the idea that when computing, for instance, $2^{64} \pmod{1000}$, one need not calculate $2^{64}$ and then divide it with remainder by 1000, but that it is better to multiply the 2's gradually and reduce the temporary result modulo 1000 whenever it exceeds this value. More importantly, there is no need to perform such a huge number of multiplications: in this case, 63 naive multiplications can be replaced with six squarings, as
$$2^{64} = (((((2^2)^2)^2)^2)^2)^2.$$
⁹ See, for example, D. Knuth, Art of Computer Programming, Volume 2: Seminumerical Algorithms, Addison-Wesley 1997 or Wikipedia, Euclidean algorithm, http://en.wikipedia.org/wiki/Euclidean_algorithm (as of July 29, 2017).
ii) The condition 1 = (-3/p) = (-1/p)(3/p) is satisfied if either (-1/p) = (3/p) = 1 or (-1/p) = (3/p) = -1. In the former case (using the result of the previous item), this means that $p \equiv 1 \pmod 4$ and $p \equiv \pm 1 \pmod{12}$. In the latter case, we must have $p \equiv -1 \pmod 4$ and $p \equiv \pm 5 \pmod{12}$ at the same time - we can take, for instance, the set $\{-5, -1, 1, 5\}$ for a reduced residue system modulo 12, and since (3/p) = 1 for $p \equiv \pm 1 \pmod{12}$, we surely have (3/p) = -1 whenever $p \equiv \pm 5 \pmod{12}$. We have thus obtained four systems of two congruences each. Two of them have no solution, and the remaining two are satisfied by $p \equiv 1 \pmod{12}$ and $p \equiv -5 \pmod{12}$, respectively.
iii) In this case, (6/p) = (2/p)(3/p) and once again, there are two possibilities: either (2/p) = (3/p) = 1 or (2/p) = (3/p) = -1. The former case occurs if p satisfies $p \equiv \pm 1 \pmod 8$ as well as $p \equiv \pm 1 \pmod{12}$. Solving the corresponding systems of linear congruences leads to the condition $p \equiv \pm 1 \pmod{24}$. In the latter case, we get $p \equiv \pm 3 \pmod 8$ as well as $p \equiv \pm 5 \pmod{12}$, which together gives $p \equiv \pm 5 \pmod{24}$.
Let us remark that thanks to Dirichlet's theorem 11.2.7, the number of primes we were interested in is infinite in each of the three problems. □
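The three answers can be cross-checked numerically for small primes. The sketch below is only a sanity check under our own naming (`is_prime`, `is_residue`, `conditions_hold`); it decides quadratic residuosity by Euler's criterion and compares the outcome with the congruence conditions derived above for all primes p with 5 ≤ p < 2000.

```python
def is_prime(n: int) -> bool:
    """Primality by trial division (fine for small n)."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def is_residue(a: int, p: int) -> bool:
    """Euler's criterion: a is a quadratic residue modulo an odd prime p not dividing a."""
    return pow(a % p, (p - 1) // 2, p) == 1


def conditions_hold(limit: int = 2000) -> bool:
    for p in range(5, limit):
        if not is_prime(p):
            continue
        if is_residue(3, p) != (p % 12 in (1, 11)):       # p = +-1 (mod 12)
            return False
        if is_residue(-3, p) != (p % 12 in (1, 7)):       # p = 1 or -5 (mod 12)
            return False
        if is_residue(6, p) != (p % 24 in (1, 5, 19, 23)):  # p = +-1, +-5 (mod 24)
            return False
    return True


print(conditions_hold())
```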
11.C.36. The following exercise illustrates that if the modulus of a quadratic congruence is a prime p satisfying $p \equiv 3 \pmod 4$, then we are able not only to decide the solvability of the congruence, but also to describe all of its solutions in a simple way.
Consider a prime $p \equiv 3 \pmod 4$ and an integer a such that (a/p) = 1. Prove that the solution of the congruence $x^2 \equiv a \pmod p$ is
$$x \equiv \pm a^{\frac{p+1}{4}} \pmod p.$$
Solution. It can be easily verified (using lemma 11.4.12) that
$$\left(a^{\frac{p+1}{4}}\right)^2 = a^{\frac{p+1}{2}} = a\cdot a^{\frac{p-1}{2}} \equiv a\cdot\left(\frac{a}{p}\right) = a \pmod p.$$ □
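11.C.37. Determine whether the congruence
$$x^2 \equiv 3 \pmod{59}$$
is solvable. If so, find all of its solutions.
Solution. Calculating the Legendre symbol
$$\left(\frac{3}{59}\right) = -\left(\frac{59}{3}\right) = -\left(\frac{2}{3}\right) = -(-1) = 1,$$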
function modular_pow(base, exp, mod)
    result := 1
    while exp > 0
        if (exp % 2 == 1):
            result := (result * base) % mod
        exp := exp >> 1
        base := (base * base) % mod
    return result
The algorithm squares the base modulo n for every binary digit of the exponent (which can be done in quadratic time in the worst case) and it performs a multiplication for every one in the binary representation of the exponent. Altogether, we are able to do modular exponentiation in cubic time in the worst case. We can also notice that the complexity is a good deal dependent on the binary appearance of the exponent.
Example. Let us compute $2^{560} \pmod{561}$.
Since $560 = (1000110000)_2$, the mentioned algorithm gives
exp	base	result	last digit exp
560	2	1	0
280	4	1	0
140	16	1	0
70	256	1	0
35	460	1	1
17	103	460	1
8	511	256	0
4	256	256	0
2	460	256	0
1	103	256	1
0	511	1	0
Therefore, $2^{560} \equiv 1 \pmod{561}$.
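The table can be regenerated with a few lines of Python. The function below is a sketch of the same square-and-multiply loop that additionally records the state at the start of every iteration; the name `modular_pow_trace` and the row layout are our own.

```python
def modular_pow_trace(base: int, exp: int, mod: int):
    """Return (base**exp % mod, rows); each row is (exp, base, result, last binary digit)."""
    result = 1 % mod
    base %= mod
    rows = []
    while exp > 0:
        rows.append((exp, base, result, exp & 1))
        if exp & 1:                  # multiply for every 1 in the binary expansion
            result = (result * base) % mod
        exp >>= 1
        base = (base * base) % mod   # square for every binary digit
    return result, rows


value, rows = modular_pow_trace(2, 560, 561)
for row in rows:
    print(*row, sep="\t")
print(value)
```

The printed rows agree with the table above up to its final line, which shows the state after the loop has ended.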
11.5.4. Primality testing. Although we have the Fundamental theorem of arithmetic, which guarantees that every natural number can be uniquely factored to a product of primes, this operation is very hard from the computational point of view. In practice, it is usually done in the following steps:
(1) finding all divisors below a given threshold (by trying all primes up to the threshold, which is usually somewhere around $10^6$);
(2) testing the remaining factor for compositeness (deciding whether some necessary condition for primality holds);
(a) if the compositeness test did not find the integer to be composite, i.e., it is likely to be a prime, then we test it for primality to verify that it is indeed a prime;
(b) if the compositeness test proved that the integer was composite, then we try to find a non-trivial divisor.
The mentioned steps are executed in this order because the corresponding algorithms are gradually (and strongly) increasing in time complexity. In 2002, Agrawal, Kayal, and
we find out that the congruence has two solutions. Thanks to the statement above, we can immediately see (since $59 \equiv 3 \pmod 4$) that the congruence is satisfied by
$$x \equiv \pm 3^{\frac{59+1}{4}} = \pm 3^{15} = \pm(3^5)^3 \equiv \pm 7^3 = \pm 343 \equiv \mp 11 \pmod{59},$$
since $3^5 = 243 \equiv 7 \pmod{59}$. □
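The formula from exercise 11.C.36 translates directly into code. The following sketch assumes that p is a prime with p ≡ 3 (mod 4) and that a is not divisible by p; the function name `sqrt_mod_p3` is hypothetical.

```python
def sqrt_mod_p3(a: int, p: int) -> tuple[int, int]:
    """Both solutions of x^2 = a (mod p) for a prime p = 3 (mod 4) and a residue a."""
    if p % 4 != 3:
        raise ValueError("this shortcut needs p congruent to 3 modulo 4")
    x = pow(a, (p + 1) // 4, p)
    if (x * x - a) % p != 0:
        raise ValueError("a is not a quadratic residue modulo p")
    return tuple(sorted((x, p - x)))


print(sqrt_mod_p3(3, 59))  # (11, 48)
```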
D. Diophantine equations
As early as in the third century AD, Diophantus of Alexandria dealt with miscellaneous equations while admitting only integers as solutions. And there is no wonder - in many practical problems that lead to equations, non-integer solutions may fail to have a meaningful interpretation. As an example, we can consider the problem of how to pay an exact amount of money with coins of given values.
In honor of Diophantus, equations for which we are interested in integer solutions only are called Diophantine equations.
Another nice example of a Diophantine equation is Euler's relation
$$v - e + f = 2$$
from graph theory, connecting the number of vertices, edges, and faces of a planar graph. Furthermore, if we restrict ourselves to regular graphs only, we get to the problem about existence of the so-called Platonic solids, which can be smartly described just as a solution of this Diophantine equation - for more information, see 13.1.22.
Unfortunately, there is no universal method for solving this kind of equations. There is even no method (algorithm) to decide whether a given polynomial Diophantine equation has a solution. This question is well-known as Hilbert's tenth problem, and the proof of algorithmic unsolvability of this problem was given by Юрий Матиясевич (Yuri Matiyasevich) in 1970.¹
However, there are cases in which we are able to find the solution of a Diophantine equation, or - at least - to reduce the problem to solving congruences, which is besides the already mentioned applications another motivation for studying them. Now, we will describe several such types of Diophantine equations.
¹ See the elementary text M. Davis, Hilbert's Tenth Problem is Unsolvable, The American Mathematical Monthly 80(3): 233-269, 1973.
Saxena published an algorithm for primality testing in polynomial time, but it is still more efficient to use the above procedure in practice.
11.5.5. Compositeness tests - how to recognize composite numbers with certainty? The so-called compositeness tests check for some necessary condition for primality. The easiest of such conditions is Fermat's little theorem.
Proposition (Fermat's test). Let N be a natural number. If there is an $a \not\equiv 0 \pmod N$ such that $a^{N-1} \not\equiv 1 \pmod N$, then N is not a prime.
Unfortunately, having a composite N, it still may not be easy to find such an integer a which reveals the compositeness of N. There are even such exceptional integers N for which the only integers a with the mentioned property are those which are not coprime to N. To find them is thus equivalent to finding a divisor, and thus to factoring N to primes.
There are indeed such ugly (or extremely nice?) composite numbers N for which every integer a which is coprime to N satisfies $a^{N-1} \equiv 1 \pmod N$. These are called Carmichael numbers, the least of which¹⁰ is $561 = 3\cdot 11\cdot 17$, and it was no sooner than in 1992 that it was proved¹¹ that there are even infinitely many of them.
Example. We will prove that 561 is a Carmichael number, i.e., that it holds for every $a \in \mathbb{N}$ which is coprime to $3\cdot 11\cdot 17$ that $a^{560} \equiv 1 \pmod{561}$.
Thanks to the properties of congruences, we know that it suffices to prove this congruence modulo 3, 11, and 17. However, this can be obtained straight from Fermat's little theorem since such an integer a satisfies $a^2 \equiv 1 \pmod 3$, $a^{10} \equiv 1 \pmod{11}$, $a^{16} \equiv 1 \pmod{17}$, where all of 2, 10, and 16 divide 560, hence $a^{560} \equiv 1$ modulo 3, 11 as well as 17 for all integers a coprime to 561 (see also Korselt's criterion mentioned below).
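A brute-force check makes the phenomenon tangible. The sketch below (with the hypothetical helper `fermat_witnesses`) lists all bases that reveal the compositeness of N by Fermat's test; for N = 561, every such base shares a factor with 561, while the ordinary composite number 341 is already revealed by the coprime base 3.

```python
from math import gcd


def fermat_witnesses(n: int) -> list[int]:
    """Bases a in 1..n-1 with a^(n-1) not congruent to 1 modulo n."""
    return [a for a in range(1, n) if pow(a, n - 1, n) != 1]


witnesses_561 = fermat_witnesses(561)
# Every witness for 561 has a common divisor with 561 ...
print(all(gcd(a, 561) > 1 for a in witnesses_561))
# ... whereas 341 = 11 * 31 passes for base 2 but is revealed by base 3.
print(pow(2, 340, 341), pow(3, 340, 341))
```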
11.5.6. Proposition (Korselt's criterion). A composite number n is a Carmichael number if and only if both of the following conditions hold
• n is square-free (divisible by the square of no prime),
• p — 1 | n — 1 holds for all primes p which divide n.
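Korselt's criterion gives a direct way to list small Carmichael numbers once the factorization is known. The following sketch uses naive trial division, so it is meant for small n only; both function names are our own.

```python
def prime_factorization(n: int) -> dict[int, int]:
    """Map each prime divisor of n to its exponent (trial division)."""
    factors: dict[int, int] = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def is_carmichael_by_korselt(n: int) -> bool:
    """n is composite, square-free, and p - 1 divides n - 1 for every prime p | n."""
    factors = prime_factorization(n)
    is_composite = sum(factors.values()) >= 2
    square_free = all(e == 1 for e in factors.values())
    return is_composite and square_free and all((n - 1) % (p - 1) == 0 for p in factors)


print([n for n in range(2, 3000) if is_carmichael_by_korselt(n)])
```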
Proof. "$\Leftarrow$" We will show that if n satisfies the above two conditions and it is composite, then every $a \in \mathbb{Z}$ which is coprime to n satisfies $a^{n-1} \equiv 1 \pmod n$. Let us thus factor n to the product of distinct odd primes: $n = p_1 \cdots p_k$, where $p_i - 1 \mid n - 1$ for all $i \in \{1, \dots, k\}$. Since $(a, p_i) = 1$, we get from Fermat's little theorem that $a^{p_i - 1} \equiv 1 \pmod{p_i}$, whence (thanks to the condition $p_i - 1 \mid n - 1$) it also follows
¹⁰ The first discoverer of the first seven Carmichael numbers is the Czech priest and mathematician Václav Šimerka (1819-1887), who occupied himself with them much earlier than the American mathematician R. D. Carmichael (1879-1967), whose name they bear.
¹¹ W. R. Alford, A. Granville and C. Pomerance, There are Infinitely Many Carmichael Numbers, Annals of Mathematics, Vol. 139, No. 3 (1994), pp. 703-722.
Linear Diophantine equation. A linear Diophantine equation is an equation of the form
$$a_1x_1 + a_2x_2 + \cdots + a_nx_n = b,$$
where $x_1, \dots, x_n$ are unknowns and $a_1, \dots, a_n, b$ are given non-zero integers.
We can see that the ability to solve Diophantine equations is sometimes important in "practical" life as well, as is proved by Bruce Willis and Samuel Jackson in Die Hard with a Vengeance, where they have to do away with a bomb using 4 gallons of water, having only 3- and 5-gallon containers at their disposal. A mathematician would say that the gentlemen were to find a solution of the Diophantine equation 3x+5y = 4.
One can use congruences in order to solve these equations. Apparently, it is necessary for the equation to be solvable that the integer $d = (a_1, \dots, a_n)$ divides b. Provided that, dividing both sides of the equation by the number d leads to an equivalent equation
$$a'_1x_1 + a'_2x_2 + \cdots + a'_nx_n = b',$$
where $a'_i = a_i/d$ for $i = 1, \dots, n$ and $b' = b/d$. Here, we have
$$d\cdot(a'_1, \dots, a'_n) = (da'_1, \dots, da'_n) = (a_1, \dots, a_n) = d,$$
so $(a'_1, \dots, a'_n) = 1$.
Further, we will show that the equation
$$a_1x_1 + a_2x_2 + \cdots + a_nx_n = b,$$
that $a^{n-1} \equiv 1 \pmod{p_i}$. This is true for all indices i, hence $a^{n-1} \equiv 1 \pmod n$, so n is indeed a Carmichael number.
"$\Rightarrow$" A Carmichael number n cannot be even since then we would get for a = -1 that $a^{n-1} \equiv -1 \pmod n$, which would (since $a^{n-1} \equiv 1 \pmod n$) mean that n is equal to 2 (and thus is not composite). Therefore, let n factor as $n = p_1^{\alpha_1} \cdots p_k^{\alpha_k}$, where $p_i$ are distinct odd primes and $\alpha_i \in \mathbb{N}$. Thanks to theorem 11.3.6, we can choose for every i a primitive root $g_i$ modulo $p_i^{\alpha_i}$, and the Chinese remainder theorem then yields an integer a which satisfies $a \equiv g_i \pmod{p_i^{\alpha_i}}$ for all i and which is apparently coprime to n. Further, we know from the assumption that $a^{n-1} \equiv 1 \pmod n$, so this holds modulo $p_i^{\alpha_i}$, and thus $g_i^{n-1} \equiv 1 \pmod{p_i^{\alpha_i}}$ as well. Since $g_i$ is a primitive root modulo $p_i^{\alpha_i}$, the integer n - 1 must be a multiple of its order, i.e. a multiple of $\varphi(p_i^{\alpha_i}) = p_i^{\alpha_i - 1}(p_i - 1)$. At the same time, we have $(p_i, n - 1) = 1$ (since $p_i \mid n$), so necessarily $\alpha_i = 1$ and $p_i - 1 \mid n - 1$. □
Fermat's primality test can be slightly improved to Euler's test or even more with the help of the Jacobi symbol, yet this still does not mend the presented problem completely.
Proposition (Euler's test). Let N be an odd natural number. If there is an integer $a \not\equiv 0 \pmod N$ such that $a^{\frac{N-1}{2}} \not\equiv \pm 1 \pmod N$, then N is not a prime.
Proof. This follows directly from Fermat's theorem and the fact that for N odd, we have
$$a^{N-1} - 1 = \left(a^{\frac{N-1}{2}} - 1\right)\left(a^{\frac{N-1}{2}} + 1\right).$$
□
Proposition (Euler-Jacobi test). Let N be an odd natural number. If there is an integer $a \not\equiv 0 \pmod N$ such that $a^{\frac{N-1}{2}} \not\equiv \left(\frac{a}{N}\right) \pmod N$, then N is not a prime.
Proof. This follows immediately from lemma 11.4.12.
□
Example. Let us consider $N = 561 = 3\cdot 11\cdot 17$ as before and let a = 5. Then, we have $5^{280} \equiv 1 \pmod 3$ and $5^{280} \equiv 1 \pmod{11}$, but $5^{280} \equiv -1 \pmod{17}$, so surely $5^{280} \not\equiv \pm 1 \pmod{561}$. Here, it did not hold that $a^{(N-1)/2} \equiv \pm 1 \pmod N$, so we did not even need to check the value of the Jacobi symbol (5/561). However, the Euler-Jacobi test can often reveal a composite number even in the case when this power is equal to $\pm 1$.
Example. Euler's test cannot detect the compositeness of the integer $N = 1729 = 7\cdot 13\cdot 19$ since the integer $\frac{N-1}{2} = 864 = 2^5\cdot 3^3$ is divisible by 6, 12, and 18, and so it follows from Fermat's theorem that $a^{(N-1)/2} \equiv 1 \pmod N$ holds for all integers a coprime to N. On the other hand, we get already for a = 11 that (11/1729) = -1, so the Euler-Jacobi test is able to recognize the integer 1729 as composite.
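Both examples can be replayed in Python. In the sketch below, the Jacobi symbol is evaluated straight from its definition as a product of Legendre symbols over the known prime factors, each obtained by Euler's criterion; this is for illustration only, since a real test would not know the factors and would use the reciprocity law instead. All function names are our own.

```python
def legendre(a: int, p: int) -> int:
    """Legendre symbol (a/p) for an odd prime p, via Euler's criterion."""
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r


def jacobi_from_factors(a: int, prime_factors: list[int]) -> int:
    """Jacobi symbol (a/n) where n is the product of the listed odd primes."""
    value = 1
    for p in prime_factors:
        value *= legendre(a, p)
    return value


def euler_reveals(a: int, n: int) -> bool:
    """True if base a proves n composite by Euler's test."""
    return pow(a, (n - 1) // 2, n) not in (1, n - 1)


def euler_jacobi_reveals(a: int, n: int, prime_factors: list[int]) -> bool:
    """True if base a proves n composite by the Euler-Jacobi test."""
    return pow(a, (n - 1) // 2, n) != jacobi_from_factors(a, prime_factors) % n


print(pow(5, 280, 561), euler_reveals(5, 561))
print(euler_reveals(11, 1729), euler_jacobi_reveals(11, 1729, [7, 13, 19]))
```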
Let us notice that the value of the Legendre or Jacobi symbol (a/n) can be computed very efficiently thanks
where $a_1, a_2, \dots, a_n, b$ are integers such that $(a_1, \dots, a_n) = 1$, always has a solution in integers and all such solutions can be described in terms of n - 1 integer parameters.
We will prove this proposition by mathematical induction on n, the number of unknowns. The situation is trivial for n = 1 - there is a unique solution (which does not depend on any parameters). Further, let $n \ge 2$ and suppose that the statement holds for equations having n - 1 unknowns. Denoting $d = (a_1, \dots, a_{n-1})$, any n-tuple $x_1, \dots, x_n$ that satisfies the equation must also satisfy the congruence
$$a_1x_1 + a_2x_2 + \cdots + a_nx_n \equiv b \pmod d.$$
Since d is the greatest common divisor of the integers $a_1, \dots, a_{n-1}$, this congruence is of the form
$$a_nx_n \equiv b \pmod d,$$
which (since $(d, a_n) = (a_1, \dots, a_n) = 1$) has a unique solution
$$x_n \equiv c \pmod d,$$
where c is a suitable integer, i.e., $x_n = c + d\cdot t$, where $t \in \mathbb{Z}$ is arbitrary. Substituting into the original equation and rearranging leads to the equation
$$a_1x_1 + \cdots + a_{n-1}x_{n-1} = b - a_nc - a_ndt$$
with n - 1 unknowns and one parameter, t. However, the number $(b - a_nc)/d$ is an integer, so we can divide the equation by d. This leads to
$$a'_1x_1 + \cdots + a'_{n-1}x_{n-1} = b',$$
where $a'_i = a_i/d$ for $i = 1, \dots, n-1$ and $b' = ((b - a_nc)/d) - a_nt$, satisfying
$$(a'_1, \dots, a'_{n-1}) = (da'_1, \dots, da'_{n-1})\cdot\tfrac{1}{d} = (a_1, \dots, a_{n-1})\cdot\tfrac{1}{d} = 1.$$
By the induction hypothesis, this equation has, for any $t \in \mathbb{Z}$, a solution which can be described in terms of n - 2 integer parameters (different from t), which together with the condition $x_n = c + dt$ gives what we wanted.
11.D.1. Decide whether it is possible to use a balance scale to weigh 50 grams of given goods provided we have only (an arbitrary number of) three kinds of masses; their weights are 770, 630, and 330 grams, respectively. If so, how to do that?
Solution. Our task is to solve the equation
$$770x + 630y + 330z = 50,$$
to the law of quadratic reciprocity¹², namely in time $O((\log a)(\log n))$.
pseudoprimes
A composite number n is called a pseudoprime if it passes the corresponding test of compositeness without being revealed. We thus have
(1) Fermat pseudoprimes to base a,
(2) Euler (or Euler-Jacobi) pseudoprimes to base a,
(3) strong pseudoprimes to base a, which are composite numbers which pass the following compositeness test:
The subsequent test is simple, yet (as shown in theorem 11.5.8) very efficient. It is a specification of Fermat's test, which we have introduced at the beginning.
11.5.7. Theorem. Let p be an odd prime. Let us write $p - 1 = 2^t\cdot q$, where t is a natural number and q is odd. Then, every integer a which is not a multiple of p satisfies $a^q \equiv 1 \pmod p$ or there exists an $e \in \{0, 1, \dots, t-1\}$ such that $a^{2^e q} \equiv -1 \pmod p$.
Proof. It follows from Fermat's little theorem that
$$\begin{aligned}
p \mid a^{p-1} - 1 &= \left(a^{\frac{p-1}{2}} - 1\right)\left(a^{\frac{p-1}{2}} + 1\right) =\\
&= \left(a^{\frac{p-1}{4}} - 1\right)\left(a^{\frac{p-1}{4}} + 1\right)\left(a^{\frac{p-1}{2}} + 1\right) = \cdots =\\
&= (a^q - 1)(a^q + 1)(a^{2q} + 1)\cdots(a^{2^{t-1}q} + 1),
\end{aligned}$$
whence the statement follows easily since p is a prime. □
Proposition (Miller-Rabin compositeness test). Let N, t, q be natural numbers such that N is odd and $N - 1 = 2^t\cdot q$, $2 \nmid q$. If there is an integer $a \not\equiv 0 \pmod N$ such that
$$a^q \not\equiv 1 \pmod N \quad\text{and}\quad a^{2^e q} \not\equiv -1 \pmod N \ \text{ for all } e \in \{0, 1, \dots, t-1\},$$
then N is not a prime.
Proof. The correctness of the test follows directly from the previous theorem.
□
Miscellaneous types of pseudoprimes
¹² See H. Cohen, A Course in Computational Algebraic Number Theory, Springer, 1993.
where $x, y, z \in \mathbb{Z}$ (a negative value in the solution would mean that we put the corresponding masses on the other pan of the scale). Dividing both sides of the equation by (770, 630, 330) = 10, we get an equivalent equation
$$77x + 63y + 33z = 5.$$
Considering this equation modulo (77, 63) = 7, we get the following linear congruence:
$$33z \equiv 5 \pmod 7, \quad 5z \equiv 5 \pmod 7, \quad z \equiv 1 \pmod 7.$$
This congruence is thus satisfied by those integers z of the form z = 1 + 7t, where t is an integer parameter.
Substituting the form of z into the original equation, we get
$$77x + 63y = 5 - 33(1 + 7t), \quad 11x + 9y = -4 - 33t.$$
Now, we consider this (parametrized) equation modulo 11:
$$9y \equiv -4 - 33t \pmod{11}, \quad -2y \equiv -4 \pmod{11}, \quad y \equiv 2 \pmod{11}.$$
Therefore, this congruence is satisfied by integers y = 2 + 11s for any $s \in \mathbb{Z}$. Now, it only remains to calculate x:
$$11x = -4 - 33t - 9(2 + 11s), \quad 11x = -22 - 33t - 9\cdot 11s, \quad x = -2 - 3t - 9s.$$
We have found out that the equation is satisfied if and only if (x, y, z) is in the set
$$\{(-2 - 3t - 9s,\ 2 + 11s,\ 1 + 7t);\ s, t \in \mathbb{Z}\}.$$
Particular solutions can be obtained by evaluating the triple at concrete values of t, s. For instance, setting t = s = 0 gives the triple (-2, 2, 1); putting t = -4, s = 1 leads to (1, 13, -27).
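The parametrization is easy to double-check by machine. The sketch below (all names are ours) verifies that every triple of the family solves the original equation and, conversely, that every integer solution inside a small search box belongs to the family.

```python
def solution(t: int, s: int) -> tuple[int, int, int]:
    """The triple (x, y, z) of the derived family for parameters t and s."""
    return (-2 - 3 * t - 9 * s, 2 + 11 * s, 1 + 7 * t)


def satisfies(x: int, y: int, z: int) -> bool:
    return 770 * x + 630 * y + 330 * z == 50


def in_family(x: int, y: int, z: int) -> bool:
    """Recover s from y and t from z, then compare x."""
    if (y - 2) % 11 or (z - 1) % 7:
        return False
    s, t = (y - 2) // 11, (z - 1) // 7
    return x == -2 - 3 * t - 9 * s


box = range(-20, 21)
print(all(satisfies(*solution(t, s)) for t in box for s in box))
print(all(in_family(x, y, z) for x in box for y in box for z in box if satisfies(x, y, z)))
```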
Of course, the unknowns can be eliminated in any order - the result may seem "syntactically" different, but it must still describe the same set of solutions (that is given by a particular coset of an appropriate subgroup (in our case, it is the coset $(-2, 2, 1) + (-3, 0, 7)\mathbb{Z} + (-9, 11, 0)\mathbb{Z}$) in the commutative group $\mathbb{Z}^3$, which is an apparent analog to the fact that
In practice, this easy test rapidly increases the ability to recognize composite numbers. The least strong pseudoprime to base 2 is 2047 (while the least Fermat pseudoprime to base 2 was already 341), and considering the bases 2, 3, and 5, the least strong pseudoprime is 25326001. In other words, if we are to test integers below $2\cdot 10^7$, then it is sufficient to execute this compositeness test for the bases 2, 3, and 5 only. If the tested integer is not revealed to be composite, then it is surely a prime. On the other hand, it has been proved that no finite set of bases is sufficient for testing all natural numbers.
The Miller-Rabin test is a practical application of the previous statement, and we are even able to bound the probability of failure thanks to the following theorem, which we present without a proof.
11.5.8. Theorem. Let N > 10 be an odd composite number. Let us write $N - 1 = 2^t\cdot q$, where t is a natural number and q is odd. Then, at most a quarter of the integers from the set $\{a \in \mathbb{Z};\ 1 \le a < N,\ (a, N) = 1\}$ satisfies the following condition:
$$a^q \equiv 1 \pmod N$$
or there is an $e \in \{0, 1, \dots, t-1\}$ satisfying
$$a^{2^e q} \equiv -1 \pmod N.$$
In practical implementations, one usually tests about 20 random bases (or the least prime bases). In this case, the above theorem states that the probability of failing to reveal a composite number is less than $2^{-40}$.
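A compact Python sketch of the test may look as follows. The function names, the preliminary division by the primes 2, 3, 5, 7, and the use of the `random` module for choosing bases are our own assumptions, not part of the statement above.

```python
import random


def is_strong_probable_prime(n: int, a: int) -> bool:
    """True if the odd integer n > 2 passes the Miller-Rabin test to base a."""
    q, t = n - 1, 0
    while q % 2 == 0:          # write n - 1 = 2^t * q with q odd
        q //= 2
        t += 1
    x = pow(a, q, n)
    if x == 1 or x == n - 1:   # a^q = 1, or the case e = 0
        return True
    for _ in range(t - 1):     # the cases e = 1, ..., t - 1
        x = x * x % n
        if x == n - 1:
            return True
    return False


def miller_rabin(n: int, rounds: int = 20) -> bool:
    """False means n is certainly composite; True means n is very likely a prime."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    return all(is_strong_probable_prime(n, random.randrange(2, n - 1)) for _ in range(rounds))
```

For example, 2047 passes the test to base 2 but is revealed by base 3, and 25326001 passes to the bases 2, 3, and 5 but not to base 7.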
The time complexity of the algorithm is the same as in the case of modular exponentiation, i.e. cubic in the worst case. However, we should realize that the test is non-deterministic and the reliability of its deterministic version depends on the so-called generalized Riemann hypothesis (GRH).
11.5.9. Primality tests. Primality tests are usually applied when the used compositeness test claims that the examined integer is likely to be a prime, or they are executed straightaway for special types of integers. Let us first give a list of the most known tests, which includes historical tests as well as very modern ones.
(1) AKS - a general polynomial primality test discovered by Indian mathematicians Agrawal, Kayal, and Saxena in 2002.
(2) Pocklington-Lehmer test - primality test of subexponential complexity.
(3) Lucas-Lehmer test - primality test for Mersenne numbers.
(4) Pepin's test - primality test for Fermat numbers from 1877.
(5) ECPP - primality test based on the so-called elliptic curves.
Now, we will introduce a standard primality test for Mersenne numbers.
Wikipedia, Riemann hypothesis, http://en.wikipedia.org/wiki/Riemann_hypothesis (as of July 29, 2017).
the solution of such an equation over a field forms an affine subspace of the corresponding vector space). □
11.D.2. Other types of Diophantine equations reducible to congruences. Some Diophantine equations are such that one of the unknowns can be expressed explicitly as a function of the other ones. In this case, it makes sense to examine for which integer arguments it holds that the value of the function is also an integer.
For instance, having an equation of the form
$$mx_n = f(x_1, \dots, x_{n-1}),$$
where m is a natural number and $f(x_1, \dots, x_{n-1}) \in \mathbb{Z}[x_1, \dots, x_{n-1}]$ is a polynomial with integer coefficients, an n-tuple of integers $x_1, \dots, x_n$ is a solution of it if and only if
$$f(x_1, \dots, x_{n-1}) \equiv 0 \pmod m.$$
11.D.3. Solve the Diophantine equation $x(x + 3) = 4y - 1$. Solution. The equation can be rewritten as $4y = x^2 + 3x + 1$. Now, we will solve the congruence
$x^2 + 3x + 1 \equiv 0 \pmod{4}.$
This congruence has no solution since for any integer x, the polynomial $x^2 + 3x + 1$ evaluates to an odd integer (the fact that the congruence is not solvable can also be established by substituting all four possible remainders modulo 4 into it).
□
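The remark about trying all four remainders can be checked mechanically. The following lines are only a minimal sketch of that check; the names `residues` and `has_solution` are chosen here for illustration.

```python
# Values of x^2 + 3x + 1 modulo 4, one for every residue class of x.
residues = {(x * x + 3 * x + 1) % 4 for x in range(4)}

# The congruence is solvable exactly when 0 is among these values.
has_solution = 0 in residues
print(residues, has_solution)
```

Only odd residues appear, so the congruence (and hence the Diophantine equation) has no solution.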
11.D.4.   Solve the following equation in integers:
$379x + 314y + 183y^2 = 210$
Solution. The equation is linear in x, so the other unknown, y, must satisfy the congruence
$183y^2 + 314y - 210 \equiv 0 \pmod{379}.$
Now, we can complete the left-hand polynomial to a square in order to get rid of the linear term. First of all, we must find a $t \in \mathbb{Z}$ such that $183 \cdot t \equiv 1 \pmod{379}$. (In other words, we need to determine the inverse of the integer 183 modulo 379.) For this purpose, we will use the Euclidean algorithm:
$379 = 2 \cdot 183 + 13,$
$183 = 14 \cdot 13 + 1,$
whence
$1 = 183 - 14 \cdot 13 = 183 - 14 \cdot (379 - 2 \cdot 183) = 29 \cdot 183 - 14 \cdot 379.$
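The back-substitution above is the extended Euclidean algorithm. The sketch below shows one possible iterative form of it; the function names `extended_gcd` and `mod_inverse` are hypothetical helpers introduced here, not part of the text.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) and s*a + t*b == g."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def mod_inverse(a, m):
    """Inverse of a modulo m, if it exists."""
    g, s, _ = extended_gcd(a, m)
    if g != 1:
        raise ValueError("a is not invertible modulo m")
    return s % m


print(extended_gcd(183, 379))  # coefficients of Bezout's identity
print(mod_inverse(183, 379))   # the inverse of 183 modulo 379
```

For the pair (183, 379), the coefficients agree with the identity derived by hand.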
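Proposition (Lucas-Lehmer test). Let $q \neq 2$ be a prime, and let a sequence $(s_n)_{n=0}^{\infty}$ be defined recursively by
$s_0 = 4, \quad s_{n+1} = s_n^2 - 2.$
Then, the integer $M_q = 2^q - 1$ is a prime if and only if $M_q$ divides $s_{q-2}$.
Proof. We will be working in the ring $R = \mathbb{Z}[\sqrt{3}] = \{a + b\sqrt{3};\ a, b \in \mathbb{Z}\}$, where the division with remainder behaves similarly as in the integers (see also 12.2.5). Let us set $\alpha = 2 + \sqrt{3}$, $\beta = 2 - \sqrt{3}$ and note that $\alpha + \beta = 4$, $\alpha \cdot \beta = 1$.
First, we prove by induction that it holds for all $n \in \mathbb{N}_0$ that
(1) $s_n = \alpha^{2^n} + \beta^{2^n} = \beta^{2^n}\left(1 + \alpha^{2^{n+1}}\right).$
The statement is true for n = 0 since $s_0 = 4 = \alpha + \beta$. Now, let us suppose that it is true for n - 1; then $s_n = s_{n-1}^2 - 2$ is, by the induction hypothesis, equal to
$\left(\alpha^{2^{n-1}} + \beta^{2^{n-1}}\right)^2 - 2 = \alpha^{2^n} + \beta^{2^n}.$
The second equality in (1) follows from $\alpha\beta = 1$.
Further, since $M_q \equiv -1 \pmod{8}$, we have $(2/M_q) = 1$, and it follows from the law of quadratic reciprocity that
$\left(\frac{3}{M_q}\right) = -\left(\frac{M_q}{3}\right) = -\left(\frac{2^q - 1}{3}\right) = -1,$
since we have $2^q - 1 \equiv 1 \pmod{3}$ for q odd. Both of these expressions are valid even if $M_q$ is not a prime (in this case, it is the Jacobi symbol).
Let us note that in the last part of the proof, we will use the extension of the congruence relation to the elements of the domain $\mathbb{Z}[\sqrt{3}] = \{a + b\sqrt{3};\ a, b \in \mathbb{Z}\}$; just like in the case of the integers, we write for $\alpha, \beta \in \mathbb{Z}[\sqrt{3}]$ that $\alpha \equiv \beta \pmod{p}$ if $p \mid \alpha - \beta$. Further, an analog of proposition (ii) from 11.B.6 holds as well - if p is a prime, then $(\alpha + \beta)^p \equiv \alpha^p + \beta^p \pmod{p}$ (the proof is identical to the one for the integers).
"$\Rightarrow$" Suppose that $M_q$ is a prime. We will prove that $\alpha^{2^{q-1}} \equiv -1 \pmod{M_q}$, which will imply (thanks to (1)) that $M_q \mid s_{q-2}$. Since $2^{(M_q - 1)/2} \equiv (2/M_q) = 1 \pmod{M_q}$, there is a $y \in \mathbb{Z}$ such that $2y^2 \equiv 1 \pmod{M_q}$. We have
$\left(y(1 + \sqrt{3})\right)^2 = y^2(4 + 2\sqrt{3}) \equiv \alpha \pmod{M_q},$
whence, invoking Fermat's theorem and the relation $2^{q-1} = \frac{M_q + 1}{2}$, we get
$\alpha^{2^{q-1}} = \alpha^{\frac{M_q + 1}{2}} \equiv \left(y(1 + \sqrt{3})\right)^{M_q + 1} = y^{M_q + 1}\,(1 + \sqrt{3})^{M_q}\,(1 + \sqrt{3}) \equiv y^2\,(1 - \sqrt{3})(1 + \sqrt{3}) = -2y^2 \pmod{M_q},$
and $-2y^2 \equiv -1 \pmod{M_q}$ by the choice of y.
When deriving this, we made use of the fact that 3 is a quadratic nonresidue modulo $M_q$, so
$(1 + \sqrt{3})^{M_q} \equiv 1 + (\sqrt{3})^{M_q} = 1 + 3^{(M_q - 1)/2} \cdot \sqrt{3} \equiv 1 - \sqrt{3} \pmod{M_q}.$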
Therefore, we can take, for instance, the integer 29 to be our t. Now, multiplying both sides of the congruence by t = 29 and rearranging it, we get an equivalent congruence:
$y^2 + 10y - 26 \equiv 0 \pmod{379}.$
Now, we can complete the left-hand polynomial to a square, which leads to (substituting z = y + 5)
$(y + 5)^2 - 5^2 - 26 \equiv 0 \pmod{379}, \qquad z^2 \equiv 51 \pmod{379}.$
Invoking the law of quadratic reciprocity, we calculate the Legendre symbol (51/379):
$\left(\frac{51}{379}\right) = \left(\frac{3}{379}\right) \cdot \left(\frac{17}{379}\right) = -\left(\frac{379}{3}\right) \cdot \left(\frac{379}{17}\right) = -\left(\frac{1}{3}\right) \cdot \left(\frac{5}{17}\right) = 1,$
whence it follows that the congruence is solvable, and, in particular, it has two solutions modulo 379.
The proposition of exercise 11.C.36 implies that the solutions are of the form
$z \equiv \pm 51^{\frac{380}{4}} = \pm 51^{95} \pmod{379},$
where $51^3 \equiv 1 \pmod{379}$, whence $51^{95} = (51^3)^{31} \cdot 51^2 \equiv -52 \pmod{379}$. The solution is thus $z \equiv \pm 52 \pmod{379}$, which gives for the original unknown that
$y \equiv 47 \pmod{379}, \qquad y \equiv -57 \pmod{379}.$
Therefore, the given Diophantine equation is satisfied by those pairs (x, y) with $y \in \{47 + 379 \cdot k;\ k \in \mathbb{Z}\} \cup \{-57 + 379 \cdot k;\ k \in \mathbb{Z}\}$ and $x = \frac{1}{379} \cdot (210 - 314y - 183y^2)$; e.g. (-1105, 47) or (-1521, -57) (which are the only solutions with $|x| < 10^4$). □
11.D.5. Solve the equation $2^x = 1 + 3^y$ in integers.
Solution. If y < 0, then $1 < 1 + 3^y < 2$, whence 0 < x < 1, so x could not be an integer. Therefore, $y \geq 0$, hence $2^x = 1 + 3^y \geq 2$ and $x \geq 1$. We will show that we also must have $x \leq 2$. If not (i.e., if $x \geq 3$), then we would have
$1 + 3^y = 2^x \equiv 0 \pmod{8},$
whence it follows that
$3^y \equiv -1 \pmod{8}.$
"$\Leftarrow$" For the other direction, let $M_q \mid s_{q-2}$; then, by (1),
$M_q \mid s_{q-2} \cdot \alpha^{2^{q-2}} = 1 + \alpha^{2^{q-1}},$
i.e., $\alpha^{2^{q-1}} \equiv -1 \pmod{M_q}$.
If $p \neq 2, 3$ is a prime divisor of $M_q$, then $\alpha^{2^{q-1}} \equiv -1 \pmod{p}$ as well and $\alpha^{2^q} \equiv 1 \pmod{p}$. Hence it follows that $2^q$ is the order of $\alpha$ in the multiplicative group $T_p = \{a + b\sqrt{3};\ 0 \leq a, b < p\} \setminus \{0\}$.
If we had (3/p) = 1, then we would get
$\alpha^{p-1} = \beta \cdot \alpha^p \equiv \beta \cdot \left(2^p + (\sqrt{3})^p\right) \equiv \beta \cdot \left(2 + \sqrt{3} \cdot 3^{(p-1)/2}\right) \equiv \beta \cdot (2 + \sqrt{3}) = 1 \pmod{p},$
whence p - 1 would be a multiple of the order of $\alpha$, i.e. $2^q$. However, this would mean that $p > p - 1 \geq 2^q > 2^q - 1 = M_q$, which contradicts the fact that p is a divisor of $M_q$. Therefore, we have (3/p) = -1 and
$\alpha^{p+1} = (2 + \sqrt{3})\,(2 + \sqrt{3})^p \equiv (2 + \sqrt{3})\,(2 - \sqrt{3}) = 1 \pmod{p}.$
The order of $\alpha$ modulo p is $2^q$, hence $2^q \mid p + 1$ and especially $p \geq 2^q - 1 = M_q$. At the same time, p is a prime divisor of $M_q$, therefore, $M_q = p$ is a prime. □
Unlike the proof, implementation of this algorithm is very easy.
Algorithm (Lucas-Lehmer primality test):
function LL_is_prime(q)
    s := 4; M := 2^q - 1
    repeat q - 2 times
        s := s^2 - 2 (mod M)
    if s = 0 return PRIME, else return COMPOSITE.
The time complexity of this test is asymptotically the same as in the case of the Miller-Rabin test. It is, however, more efficient in concrete instances.
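The pseudocode above translates almost literally into Python. The following is a minimal sketch under the assumption that the caller passes an odd prime exponent q (the function does not check this); the name `lucas_lehmer_is_prime` is chosen here for illustration.

```python
def lucas_lehmer_is_prime(q):
    """Lucas-Lehmer test for M_q = 2**q - 1, assuming q is an odd prime."""
    m = (1 << q) - 1          # the Mersenne number M_q
    s = 4                     # s_0
    for _ in range(q - 2):    # compute s_{q-2} modulo M_q
        s = (s * s - 2) % m
    return s == 0


# Exponents of Mersenne primes among a few small odd primes.
print([q for q in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer_is_prime(q)])
```

Reducing modulo $M_q$ in every step keeps the intermediate values below $M_q^2$, which is what makes the test practical.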
Fermat numbers are integers of the form $F_n = 2^{2^n} + 1$.
Pierre de Fermat conjectured in the 17th century that all of the integers of this form are primes (apparently driven by the effort to generalize the observation for $F_0 = 3$, $F_1 = 5$, $F_2 = 17$, $F_3 = 257$, and $F_4 = 65537$). However, in the 18th century, Leonhard Euler found out that $F_5 = 641 \times 6700417$, and we have not been able to discover any other Fermat primes so far. Since their size increases rapidly, computing with them takes a lot of time and resources (so the following test is not much used). Nowadays, the least Fermat number which has not been tested is $F_{33}$, which has 2 585 827 973 digits, and it is thus much greater than the largest discovered prime.
However, this is impossible since the order of 3 modulo 8 equals 2, so the powers of three are congruent to 3 and 1 only. Now, it remains to examine the possibilities x = 1 and x = 2. For x = 1, we get
$3^y = 2^1 - 1 = 1,$
hence y = 0. If x = 2, we have
$3^y = 2^2 - 1 = 3,$
whence y = 1. Thus, the equation has two solutions: x = 1, y = 0; and x = 2, y = 1. □
11.D.6. Pythagorean equation. In this section, we will deal with the enumeration of all right triangles with integer side lengths. This is a Diophantine equation where we will only seldom use the methods described above; nevertheless, we will look at it in detail.
The task is to solve the equation
$x^2 + y^2 = z^2$
in integers.
Solution. Clearly, we can assume that (x,y,z) = 1 (otherwise, we simply divide both sides of the equation by the integer d = (x, y, z)).
Further, we can show that the integers x, y, z are pairwise coprime: if there were a prime p dividing two of them, then we can easily see that it would have to divide the third one as well, which it may not according to our assumption. Therefore, at most one of the integers x, y is even. If neither of them were, we would get
$z^2 = x^2 + y^2 \equiv 1 + 1 \pmod{8},$
which is impossible (see exercise 11.A.2). Altogether, we get that exactly one of the integers x, y is even. However, since the roles of these integers in the equation are symmetric, we can, without loss of generality, select x to be even and set x = 2r, $r \in \mathbb{N}$. Hence, we have
$4r^2 = z^2 - y^2,$
$r^2 = \frac{z + y}{2} \cdot \frac{z - y}{2}.$
Now, let us denote $u = \frac{1}{2}(z + y)$, $v = \frac{1}{2}(z - y)$ (then, the inverse substitution is z = u + v, y = u - v). Since y is coprime to z, so is u to v (if there were a prime p dividing
Proposition (Pepin's test). A necessary and sufficient condition for the n-th Fermat number Fn to be a prime is
$3^{\frac{F_n - 1}{2}} \equiv -1 \pmod{F_n}.$
We can see that this is a very simple test, which is actually a mere part of Euler's compositeness test.
Proof of correctness of Pepin's test. First, suppose that $3^{(F_n - 1)/2} \equiv -1 \pmod{F_n}$. Then, $3^{F_n - 1} \equiv 1 \pmod{F_n}$. Since $F_n - 1$ is a power of two, $F_n - 1$ is necessarily the order of 3 modulo $F_n$. However, the order of every integer modulo $F_n$ is at most $\varphi(F_n) \leq F_n - 1$, hence in this case, we have $\varphi(F_n) = F_n - 1$, which means that $F_n$ is a prime.
For the other direction, let $F_n$ be a prime. From part (i) of lemma 11.4.12, we get that $3^{(F_n - 1)/2} \equiv (3/F_n) \pmod{F_n}$, so it suffices to determine the value $(3/F_n)$. However, this is easy, because $F_n \equiv 2 \pmod{3}$, and thus $(F_n/3) = -1$. Further, we have $F_n \equiv 1 \pmod{4}$, and the law of quadratic reciprocity thus yields $(3/F_n) = -1$, which is what we wanted to prove. □
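Pepin's test amounts to a single modular exponentiation. The sketch below assumes $n \geq 1$ (the case $F_0 = 3$ is excluded, since the base 3 is then not coprime to the modulus); the function name `pepin_is_prime` is hypothetical.

```python
def pepin_is_prime(n):
    """Pepin's test for the Fermat number F_n = 2**(2**n) + 1, assuming n >= 1."""
    if n < 1:
        raise ValueError("the test is stated here for n >= 1 only")
    f = (1 << (1 << n)) + 1
    # F_n is prime iff 3^((F_n - 1)/2) is congruent to -1 modulo F_n.
    return pow(3, (f - 1) // 2, f) == f - 1


print([n for n in range(1, 8) if pepin_is_prime(n)])
```

Already for moderately large n, the modulus has millions of digits, which is the practical obstacle mentioned in the text.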
Now, we will introduce a primality test which is a bit old, yet it is still widely used in modern computation systems - the so-called Pocklington-Lehmer test. However, first of all, we will describe a simpler primality test for illustration, the so-called Lucas's test:
11.5.10. Theorem (Lucas). If for all prime divisors q of N - 1, there is an a such that $a^{N-1} \equiv 1 \pmod{N}$, $a^{\frac{N-1}{q}} \not\equiv 1 \pmod{N}$, then N is a prime.
Proof. It suffices to prove that N - 1 divides $\varphi(N)$ (which is a condition apparently unsatisfied by composite numbers). If not, then there is a prime q and $r \in \mathbb{N}$ such that $q^r$ divides N - 1, but it does not divide $\varphi(N)$. The order e of the integer a divides N - 1 (the first condition) and does not divide (N - 1)/q (the second condition), hence $q^r$ divides e. Furthermore, e divides $\varphi(N)$, and so $q^r$ does, a contradiction. □
The integer a from the previous theorem is called a primality witness for the integer N (in this as well as in the other primality tests).
Another general primality test is based on the above one. It is useful when we want to turn the high-probability answer of the Miller-Rabin compositeness test into certainty.
11.5.11. Theorem (Pocklington-Lehmer). Let N be a natural number, N > 1. Let p be a prime which divides N - 1. Further, let us suppose that there is an integer $a_p$ such that
$a_p^{N-1} \equiv 1 \pmod{N} \quad\text{and}\quad \left(a_p^{\frac{N-1}{p}} - 1,\ N\right) = 1.$
Let $p^{\alpha_p}$ be the highest power of p which divides N - 1. Then every positive divisor d of the integer N satisfies
$d \equiv 1 \pmod{p^{\alpha_p}}.$
both u and v, then it would divide their sum as well as their difference, i.e., the integers z and y). It follows from
$r^2 = u \cdot v$
that there are coprime positive integers a, b such that $u = a^2$, $v = b^2$. Moreover, since u > v, we must have a > b. Altogether, we get
$x = 2r = 2ab,$
$y = u - v = a^2 - b^2,$
$z = u + v = a^2 + b^2,$
which indeed satisfies the given equation for any coprime $a, b \in \mathbb{N}$ with a > b. Further solutions can be obtained by interchanging x and y. Finally, relinquishing the condition (x, y, z) = 1, each solution will yield infinitely many more if we multiply each of its components by a fixed positive integer d. □
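The parametrization just derived can be turned into a small generator of solutions. This is only an illustrative sketch; the function name `pythagorean_triples` and the bound `limit` on the parameter a are assumptions made here.

```python
from math import gcd


def pythagorean_triples(limit):
    """Triples (2ab, a^2 - b^2, a^2 + b^2) for coprime a > b >= 1 with a <= limit."""
    triples = []
    for a in range(2, limit + 1):
        for b in range(1, a):
            if gcd(a, b) == 1:
                triples.append((2 * a * b, a * a - b * b, a * a + b * b))
    return triples


for x, y, z in pythagorean_triples(4):
    print(x, y, z, x * x + y * y == z * z)
```

Every generated triple satisfies the equation. If a and b are both odd (for example a = 3, b = 1, giving 6, 8, 10), the three entries share the factor 2, so triples with (x, y, z) = 1 arise only from a and b of opposite parity.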
11.D.7. Fermat's Last Theorem for n = 4. Thanks to the parametrization of Pythagorean triples, we will be able to prove that the equation of the famous Fermat's Last Theorem,
$x^n + y^n = z^n,$
has no solution for n = 4 in positive integers. Prove that the equation $x^4 + y^4 = z^2$ has no solution in $\mathbb{N}$.
Solution. We will use the so-called method of infinite descent, which was introduced by Pierre de Fermat. This method utilizes the fact that every non-empty set of natural numbers has a least element (in other words, N is a well-ordered set).
Therefore, suppose that the set of solutions of the equation $x^4 + y^4 = z^2$ is non-empty and let (x, y, z) denote (any) solution with z as small as possible. The integers x, y, z are thus pairwise coprime (a common prime divisor could be cancelled, yielding a solution with smaller z). Since the equation can be written in the form
$(x^2)^2 + (y^2)^2 = z^2,$
it follows from the previous exercise that (with x even, without loss of generality) there exist coprime $r, s \in \mathbb{N}$ such that
$x^2 = 2rs, \qquad y^2 = r^2 - s^2, \qquad z = r^2 + s^2.$
Hence, $y^2 + s^2 = r^2$, where (y, s) = 1 (if there were a prime p dividing both y and s, then it would divide x as well as z, which contradicts that they are coprime). Making the
Proof of the Pocklington-Lehmer theorem. Every positive divisor d of the integer N is a product of prime divisors of N, so it suffices to prove the theorem for prime values of d. The condition $a_p^{N-1} \equiv 1 \pmod{N}$ implies that the integers $a_p$, N are coprime (any divisor they have in common must divide the right-hand side of the congruence as well). Then, $(a_p, d) = 1$ as well, and we have $a_p^{d-1} \equiv 1 \pmod{d}$ by Fermat's theorem. Since $\left(a_p^{(N-1)/p} - 1,\ N\right) = 1$, we get $a_p^{(N-1)/p} \not\equiv 1 \pmod{d}$.
Let e denote the order of $a_p$ modulo d. Then, $e \mid d - 1$, $e \mid N - 1$, and $e \nmid (N - 1)/p$.
If $p^{\alpha_p} \nmid e$, then $e \mid N - 1$ would imply that $e \mid \frac{N-1}{p}$, which is a contradiction. Therefore, $p^{\alpha_p} \mid e$, and so $p^{\alpha_p} \mid d - 1$. □
11.5.12. Theorem. Let $N \in \mathbb{N}$, N > 1. Suppose that we can write $N - 1 = F \cdot U$, where (F, U) = 1 and $F > \sqrt{N}$, and that we are familiar with the prime factorization of F. Then:
• if we can find for every prime $p \mid F$ an integer $a_p \in \mathbb{Z}$ from the above theorem, then N is a prime;
• if N is a prime, then for every prime $p \mid N - 1$, there is an integer $a_p \in \mathbb{Z}$ with the desired properties.
Proof. By theorem 11.5.11, the potential divisor d > 1 of the integer N satisfies $d \equiv 1 \pmod{p^{\alpha_p}}$ for all prime factors p of F, hence $d \equiv 1 \pmod{F}$, and so $d > \sqrt{N}$. If N has no non-trivial divisor less than or equal to $\sqrt{N}$, then it is necessarily a prime. On the other hand, it suffices to choose for $a_p$ a primitive root modulo the prime N (independently of p). Then, it follows from Fermat's theorem that $a_p^{N-1} \equiv 1 \pmod{N}$, and since $a_p$ is a primitive root, we get $a_p^{(N-1)/p} \not\equiv 1 \pmod{N}$ for any $p \mid N - 1$.
The integers $a_p$ are again called primality witnesses for the integer N. □
Remark. The previous test also contains Pepin's test in itself (here, for N = Fn, we have p = 2, which is satisfied by the primality witness ap = 3).
11.5.13. Looking for divisors. If one of the compositeness tests verifies that a given integer is indeed composite, we usually want to find one of its non-trivial divisors. However, this task is much more difficult than merely revealing that it is composite - let us recall that the compositeness tests can guarantee the compositeness, yet they provide us with no divisors (which is, on the other hand, advantageous for RSA and similar cryptographic protocols). Therefore, we will present a short summary of methods used in practice and a short sample for inspiration.
(1) Trial division
(2) Pollard's ρ-algorithm
(3) Pollard's p - 1 algorithm
(4) Elliptic curve method (ECM)
(5) Quadratic sieve (QS)
(6) Number field sieve (NFS)
Pythagorean substitution once again, we get natural numbers a, b with (y is odd)
$y = a^2 - b^2, \qquad s = 2ab, \qquad r = a^2 + b^2.$
The inverse substitution leads to
$x^2 = 2rs = 2 \cdot 2ab(a^2 + b^2),$
and since x is even, we get
$\left(\frac{x}{2}\right)^2 = ab(a^2 + b^2).$
The integers $a$, $b$, $a^2 + b^2$ are pairwise coprime (which can be derived easily from the fact that y is coprime to s). Therefore, each of them is a square of a natural number:
$a = c^2, \qquad b = d^2, \qquad a^2 + b^2 = e^2,$
whence $c^4 + d^4 = e^2$, and since $e \leq a^2 + b^2 = r < z$, we get a contradiction with the minimality of z. □
E. Primality tests
11.E.1. Mersenne primes. The following problems are in deep connection with testing Mersenne numbers for primality. For any $q \in \mathbb{N}$, consider the integer $M_q = 2^q - 1$ and prove:
i) If q is composite, then so is $M_q$.
ii) If q is a prime, $q \equiv 3 \pmod{4}$, then 2q + 1 divides $M_q$ if and only if 2q + 1 is a prime (hence it follows that if $q \equiv 3 \pmod{4}$ is a Sophie Germain prime², then $M_q$ is not a prime).
iii) If q is an odd prime and a prime p divides $M_q$, then $p \equiv \pm 1 \pmod{8}$ and $p \equiv 1 \pmod{q}$.
Solution.
² See Wikipedia, Sophie Germain prime, http://en.wikipedia.org/wiki/Sophie_Germain_prime (as of July 28, 2013, 14:43 GMT).
For illustration, we will demonstrate one of these algorithms - Pollard's ρ-method - on a concrete instance. This algorithm is especially suitable for finding relatively small divisors (since its expected complexity depends on the size of these divisors), and it is based on the idea that having a random function $f : S \to S$, where S is a finite set having n elements, the sequence $(x_n)_{n=0}^{\infty}$, where $x_{n+1} = f(x_n)$, must loop. The expected length of the tail as well as the period is then $\sqrt{\pi \cdot n/8}$.
The algorithm described below is again a straightforward implementation of the mentioned reasoning.
Algorithm (Pollard's ρ-method):
Input: n - the integer to be factored, and an appropriate function f(x)
a := 2; b := 2; d := 1
While d = 1 do
    a := f(a)
    b := f(f(b))
    d := gcd(a - b, n)
If d = n, return FAILURE. Else return d.
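The algorithm above can be sketched in Python as follows. The default choice $f(x) = x^2 + 1 \pmod{n}$ and the starting value 2 are assumptions made here for illustration (the text only asks for "an appropriate function"); the function name `pollard_rho` is likewise hypothetical, and `None` stands in for FAILURE.

```python
from math import gcd


def pollard_rho(n, f=None, x0=2):
    """Return a non-trivial divisor of n found by Pollard's rho method, or None on failure."""
    if f is None:
        f = lambda x: (x * x + 1) % n   # assumed "random-looking" function
    a = b = x0
    d = 1
    while d == 1:
        a = f(a)                        # the slow sequence: one step
        b = f(f(b))                     # the fast sequence: two steps
        d = gcd(abs(a - b), n)
    return None if d == n else d


print(pollard_rho(221))
```

On failure, one would typically retry with a different function f or a different starting value.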
11.5.14. Public-key cryptography. In present-day practice, the most important application of number theory is the so-called public-key cryptography. Its main objectives are to provide
• encryption; the message encrypted with the public key of the receiver can be decrypted by no one else (to be precise, by no one who does not know his private key);
• signature; the integrity of the message signed with the private key of the sender can be verified by anyone with access to his public key.
The most basic and most often used protocols in public-key cryptography are:
• RSA (encryption) and the derived system for signing messages,
• Digital Signature Algorithm - DSA and its variant based on elliptic curves (ECDSA),
i) If $n \mid q$, then it follows from exercise 11.A.6 that $2^n - 1 \mid 2^q - 1$, so $M_n \mid M_q$. Therefore, $M_q$ is not a prime whenever q has a divisor n with 1 < n < q.
ii) Let n = 2q + 1 be a divisor of $M_q$. We will show that n is a prime invoking Lucas' theorem 11.5.10. Since n - 1 = 2q has only two prime divisors, it suffices to find primality witnesses for the integers 2 and q. We have
$(-2)^{\frac{n-1}{q}} = (-2)^2 = 2^2 \not\equiv 1 \pmod{n}, \qquad (-2)^{\frac{n-1}{2}} = (-2)^q = -2^q \equiv -1 \not\equiv 1 \pmod{n},$
thanks to the assumption $n \mid M_q = 2^q - 1$. Further, since $(-2)^{n-1} - 1 = 2^{n-1} - 1 = 2^{2q} - 1 = (2^q + 1)M_q \equiv 0 \pmod{n}$, it follows from Lucas' theorem that n is a prime.
Now, let $p = 2q + 1 \equiv -1 \pmod{8}$ be a prime. Since (2/p) = 1, there exists an m such that $2 \equiv m^2 \pmod{p}$. Hence, $2^q = 2^{\frac{p-1}{2}} \equiv m^{p-1} \equiv 1 \pmod{p}$, so $p \mid 2^q - 1 = M_q$.
iii) If $p \mid M_q = 2^q - 1$, then the order of 2 modulo p must divide the prime q, hence it equals q. Therefore, $q \mid p - 1$, and there exists a $k \in \mathbb{Z}$ such that 2qk = p - 1. Altogether, we get
$(2/p) \equiv 2^{\frac{p-1}{2}} = 2^{qk} \equiv 1 \pmod{p},$
i.e., $p \equiv \pm 1 \pmod{8}$. □
11.E.2.    For each of the following Mersenne numbers, determine whether it is prime or composite:
$2^{11} - 1$, $2^{15} - 1$, $2^{23} - 1$, $2^{29} - 1$, and $2^{83} - 1$.
Solution. In the case of the integer $2^{15} - 1$, the exponent is composite; therefore, the whole integer is composite as well (we even know that it is divisible by $2^3 - 1$ and $2^5 - 1$). In the other cases, the exponent is always a prime. We can notice that these primes, namely q = 11, 23, 29, and 83, are even Sophie Germain primes (i.e., 2q + 1 is also a prime). It thus follows from part (ii) of the previous exercise that $23 \mid 2^{11} - 1$, $47 \mid 2^{23} - 1$, and $167 \mid 2^{83} - 1$.
We cannot use this proposition for the remaining case since $29 \not\equiv 3 \pmod{4}$ and, indeed, $59 \nmid 2^{29} - 1$. Now, however, it follows from part (iii) of the above exercise that if there is a prime p which divides $2^{29} - 1$, then it must satisfy
$p \equiv \pm 1 \pmod{8},$
$p \equiv 1 \pmod{29},$
• Rabin cryptosystem (and signature scheme),
• ElGamal cryptosystem (and signature scheme),
• elliptic curve cryptography (ECC),
• Diffie-Hellman key exchange protocol (DH).
11.5.15. Encryption - RSA. First, we describe the best-known public-key cipher - RSA. The principle of the protocol RSA¹⁴ is as follows:
• Every participant A needs a pair of keys - a public one ($V_A$) and a private one ($S_A$).
• Key generating: the user selects two large primes p, q, and calculates n = pq, $\varphi(n) = (p - 1)(q - 1)$. The integer n is public; the idea is that it is too hard to compute $\varphi(n)$ without knowing p and q.
• Then, the user chooses a public key e and verifies that $(e, \varphi(n)) = 1$.
• Using the Euclidean algorithm, the private key d is computed so that $e \cdot d \equiv 1 \pmod{\varphi(n)}$.
The principle of RSA
The secret communication then runs in the following steps (for the sake of simplicity, we will further identify the encryption procedure with the public key $V_A$ and the decryption procedure with the private key $S_A$):
• Encrypting a numerical code of a message M for participant A (by any other participant who has access to the public key $V_A$):
$C = V_A(M) \equiv M^e \pmod{n}.$
• Decrypting the cipher C by participant A:
$M = S_A(C) \equiv C^d \pmod{n}.$
The proof of correctness of this protocol (i.e., that A indeed receives what was meant) is a straightforward application of Euler's theorem: Thanks to 11.3.3, it holds for any message M which is coprime to n that $(M^e)^d = M^{ed} \equiv M^1 = M \pmod{n}$. In the (extremely unlikely) case that the message M is not coprime to n, the statement holds as well, although the proof needs to be modified with the help of the Chinese remainder theorem (however, we should realize that if the message M with property 0 < M < n is not coprime to n, then (M, n) is a non-trivial divisor of n, so the key of the receiver is actually compromised).
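The key generation, encryption and decryption steps described above can be sketched in a few lines. The primes 61 and 53 and the exponent 17 are toy values assumed here purely for illustration (far too small for real use), and the function names are hypothetical; the call `pow(e, -1, phi)` requires Python 3.8 or newer.

```python
from math import gcd


def rsa_keypair(p, q, e):
    """Return ((e, n), (d, n)) for the primes p, q and the public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to phi(n)")
    d = pow(e, -1, phi)            # e * d = 1 (mod phi(n))
    return (e, n), (d, n)


def rsa_apply(key, value):
    """Encrypt with a public key or decrypt with a private key."""
    exponent, n = key
    return pow(value, exponent, n)


public, private = rsa_keypair(61, 53, 17)   # assumed toy parameters
cipher = rsa_apply(public, 65)
print(private, cipher, rsa_apply(private, cipher))
```

Decrypting the encrypted value returns the original message, in line with the proof of correctness above.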
The security of RSA has been tested since it was invented in 1977, and no meaningful weakness (except for side channels or some singular keys) has been discovered yet (provided a sufficiently large key is used; nowadays it is recommended to use at least 2048 bits). Nevertheless, it has not been proved that breaking RSA is as hard as integer factorization.
i.e., $p \equiv 1 \pmod{232}$ or $p \equiv 175 \pmod{232}$. If we are looking for a prime divisor of the integer $n = 2^{29} - 1 =$
¹⁴ Ron Rivest, Adi Shamir, Leonard Adleman (1977); C. Cocks, the secret service GCHQ (not publicly) as early as 1973.
536 870 911, then it suffices to check the primes (of the above form) up to $\sqrt{n} \approx 23170$. There are 50 of them, so we are able to decide whether n is a prime quite easily (even with paper and pencil). In this case, fortunately, n is divisible already by the least such prime, 233. □
11.E.3. Show that the integer 341 is a Fermat pseudoprime to base 2, yet it is not a Euler-Jacobi pseudoprime to base 2. Further, prove that the integer 561 is a Euler-Jacobi pseudoprime to base 2, but not to base 3. Prove that, on the other hand, the integer 121 is a Euler-Jacobi pseudoprime to base 3, but not to base 2.
Solution. The integer 341 is a Fermat pseudoprime to base 2 since $2^{10} \equiv 1 \Rightarrow 2^{340} \equiv 1 \pmod{341}$. It is not a Euler-Jacobi pseudoprime since $2^{170} \equiv 1 \pmod{341}$, but $\left(\frac{2}{341}\right) = -1$, which follows from the fact that $341 \equiv -3 \pmod{8}$. For the integer 561, we have $2^{280} \equiv 1 \pmod{561}$ and $\left(\frac{2}{561}\right) = 1$, since $561 \equiv 1 \pmod{8}$. Therefore, it is a Euler-Jacobi pseudoprime to base 2, but not to base 3, since $3 \mid 561$. On the other hand, the integer 121 satisfies $3^5 \equiv 1 \pmod{121} \Rightarrow 3^{60} \equiv 1 \pmod{121}$ and $\left(\frac{3}{121}\right) = 1$, but $2^{60} \equiv 89 \not\equiv 1 \pmod{121}$. □
11.E.4. Prove that the integers 2465, 2821, and 6601 are Carmichael numbers, i.e., denoting any of them as n, every integer a coprime to n satisfies
$a^{n-1} \equiv 1 \pmod{n}.$
Solution. We have $2465 = 5 \cdot 17 \cdot 29$, $2821 = 7 \cdot 13 \cdot 31$, $6601 = 7 \cdot 23 \cdot 41$, and the proposition follows from Korselt's criterion 11.5.6 since all of the integers 4, 16, 28 divide $2464 = 2^5 \cdot 7 \cdot 11$, all of the integers 6, 12, 30 divide $2820 = 2^2 \cdot 3 \cdot 5 \cdot 47$, and 6, 22, 40 divide $6600 = 2^3 \cdot 3 \cdot 5^2 \cdot 11$.
□
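The divisibility checks of this solution can be automated. The sketch below assumes that Korselt's criterion 11.5.6 has its usual form (a composite n is a Carmichael number if and only if n is squarefree and p - 1 divides n - 1 for every prime p dividing n); both function names are hypothetical, and the factorization uses plain trial division.

```python
def prime_factorization(n):
    """Return {prime: exponent} for n >= 1, by trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def satisfies_korselt(n):
    """True if n is composite, squarefree, and p - 1 divides n - 1 for all primes p | n."""
    factors = prime_factorization(n)
    if len(factors) < 2:          # excludes 1, primes and prime powers
        return False
    squarefree = all(e == 1 for e in factors.values())
    return squarefree and all((n - 1) % (p - 1) == 0 for p in factors)


print([n for n in (341, 561, 2465, 2821, 6601) if satisfies_korselt(n)])
```

The integer 341 = 11 · 31 fails the check because 30 does not divide 340.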
11.E.5. Prove that the integer 2047 is a strong pseudoprime to base 2, but not to base 3. Further, prove that the integer 1905 is a Euler-Jacobi pseudoprime to base 2 but not a strong pseudoprime to this base.
Solution. In order to verify whether 2047 is a strong pseudoprime to base 2, we factor
$2^{2046} - 1 = (2^{1023} - 1)(2^{1023} + 1).$
The requirements on a secure choice of the key for practical reasons are:
• d is large enough (defense against the so-called Wiener's attack),
• p and q are not too close to each other (see the exercise 11.F.1),
• the public key is selected to be at least e = 65537 (although no direct attack against a small public key e has been noticed).
11.5.16. Rabin cryptosystem. Further, we mention a simplified variant of the protocol named Rabin cryptosystem¹⁵, which was the first public-key cryptosystem where one demonstrably needs to factorize the modulus to break it (unlike RSA, for which this has not been proved):
• Every participant A needs a pair of keys - a public one ($V_A$) and a private one ($S_A$).
• Key generating: A chooses two large primes of roughly the same size - $p, q \equiv 3 \pmod{4}$, and computes n = pq.
• The public key is $V_A = n$, the private key is the pair $S_A = (p, q)$.
The secret communication then runs as follows:
• Encryption of the numerical code of the message M: $C = V_A(M) \equiv M^2 \pmod{n}$.
• Decryption of the cipher C: the (four) roots of C modulo n are computed and it is easily found out which one of them is the original message (for instance, the other three make no sense or do not contain the agreed identification).
As we can see from the description of the protocol, the process of decryption requires the computation of the square root of C modulo n = pq, where $p \equiv q \equiv 3 \pmod{4}$. This can be done as follows:
• The values $r = C^{(p+1)/4} \pmod{p}$ and $s = C^{(q+1)/4} \pmod{q}$ are computed.
• Further, we need to determine the coefficients a, b in Bezout's identity, i.e., integers for which ap + bq = 1.
• We set x = (aps + bqr) (mod n), y = (aps — bqr) (mod n).
• The square roots of C modulo n are then ±x, ±y.
Let us mention that this is in fact an application of the Chinese remainder theorem and the fact that we are able to easily find the solutions of the quadratic congruence $x^2 \equiv a \pmod{p}$ provided $p \equiv 3 \pmod{4}$ (see exercise 11.C.36). Indeed, it holds that
$(\pm x)^2 = (aps + bqr)^2 \equiv (bqr)^2 \equiv r^2 = C^{(p+1)/2} \equiv C \pmod{p},$
¹⁵ Rabin, Michael. Digitalized Signatures and Public-Key Functions as Intractable as Factorization (in PDF). MIT Laboratory for Computer Science, January 1979.
Since $2^{1023} \equiv 1 \pmod{2047}$, the statement is true. However, it is not a strong pseudoprime to base 3 as
$3^{1023} \equiv 1565 \not\equiv \pm 1 \pmod{2047}.$
Notice that for the integer 2047, the strong pseudoprimality test is identical to the Euler one (this is because the integer 2046 is not divisible by four).
The integer 1905 is a Euler-Jacobi pseudoprime to base 2 since $2^{1904/2} \equiv 1 \pmod{1905}$ and the Jacobi symbol (2/1905) is equal to 1. Since $1904 = 2^4 \cdot 7 \cdot 17$, 1905 will be a strong pseudoprime to base 2 only if at least one of the following congruences holds:
$2^{952} \equiv -1 \pmod{1905},$
$2^{476} \equiv -1 \pmod{1905},$
$2^{238} \equiv -1 \pmod{1905},$
$2^{119} \equiv \pm 1 \pmod{1905}.$
However, $2^{952} \equiv 2^{476} \equiv 1 \pmod{1905}$, $2^{238} \equiv 1144 \pmod{1905}$, and $2^{119} \equiv 128 \pmod{1905}$. Therefore, 1905 is not a strong pseudoprime to base 2. □
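The chain of congruences examined in this solution is exactly what a strong pseudoprimality check tests. A minimal sketch, assuming an odd n > 2 and a base a; the function name `passes_strong_test` is hypothetical.

```python
def passes_strong_test(n, a):
    """True if the odd integer n > 2 is a prime or a strong pseudoprime to base a."""
    t, u = 0, n - 1
    while u % 2 == 0:             # write n - 1 = 2^t * u with u odd
        t += 1
        u //= 2
    x = pow(a, u, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(t - 1):        # look for -1 among a^(2u), a^(4u), ...
        x = (x * x) % n
        if x == n - 1:
            return True
    return False


print(passes_strong_test(2047, 2), passes_strong_test(2047, 3), passes_strong_test(1905, 2))
```

For n = 2047 we have t = 1, so only the single value $a^{1023}$ is inspected, which matches the remark that the strong test coincides with the Euler one here.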
11.E.6.   Applying Pocklington-Lehmer test, show that 1321 is a prime.
Solution. Let us set N = 1321, then $N - 1 = 1320 = 2^3 \cdot 3 \cdot 5 \cdot 11$. For the sake of simplicity, we will assume that the trial division is executed only for primes below 10, then $F = 2^3 \cdot 3 \cdot 5 = 120$, U = 11, where (F, U) = (120, 11) = 1.
In order to prove the primality of 1321 by the Pocklington-Lehmer test, we need to find a primality witness $a_p$ for each $p \in \{2, 3, 5\}$.
Since $\left(2^{\frac{1320}{3}} - 1,\ 1321\right) = 1$ and $\left(2^{\frac{1320}{5}} - 1,\ 1321\right) = 1$, we can set $a_3 = a_5 = 2$. However, for p = 2, we have $\left(2^{\frac{1320}{2}} - 1,\ 1321\right) = 1321$, so we have to look for another primality witness. We can take $a_2 = 7$ since $\left(7^{\frac{1320}{2}} - 1,\ 1321\right) = 1$. In both cases, we have $2^{1320} \equiv 7^{1320} \equiv 1 \pmod{1321}$. The primality witnesses of the integer 1321 are thus $a_2 = 7$, $a_3 = a_5 = 2$. Instead, we could also have chosen for all primes p the same number (e.g. 13), which is a primitive root modulo 1321. □
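The verification carried out in this solution can be sketched as a checker for theorem 11.5.12. The interface below (a dictionary describing the factored part F and a dictionary of witnesses) is an assumption made for illustration; the function returns True when the supplied data prove primality, and False when they do not (which by itself says nothing about N being composite).

```python
from math import gcd


def pocklington_lehmer(n, factored_part, witnesses):
    """Check a Pocklington-Lehmer certificate.

    factored_part: {prime p: exponent} describing F with N - 1 = F * U
    witnesses:     {prime p: a_p}
    """
    f = 1
    for p, e in factored_part.items():
        f *= p ** e
    # Hypotheses of the theorem: F | N - 1, (F, U) = 1 and F > sqrt(N).
    if (n - 1) % f != 0 or gcd(f, (n - 1) // f) != 1 or f * f <= n:
        return False
    for p in factored_part:
        a = witnesses[p]
        if pow(a, n - 1, n) != 1:
            return False
        if gcd(pow(a, (n - 1) // p, n) - 1, n) != 1:
            return False
    return True


print(pocklington_lehmer(1321, {2: 3, 3: 1, 5: 1}, {2: 7, 3: 2, 5: 2}))
```

With the witness 2 for p = 2 the check fails, as in the solution, because $2^{660} \equiv 1 \pmod{1321}$.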
11.E.7. Factor the integer 221 to primes by Pollard's ρ-method. Use the function $f(x) = x^2 + 1$ with initial value $x_0 = 2$.
where we made use of the fact that $bq \equiv 1 \pmod{p}$ and that $C \equiv M^2 \pmod{p}$ is a quadratic residue modulo p, hence $C^{(p-1)/2} \equiv (C/p) = 1 \pmod{p}$. Similarly, we have $(\pm x)^2 \equiv C \pmod{q}$ as well, thus $\pm x$ is a square root of C modulo n. The derivation of the same result for y is nearly identical.
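The decryption procedure of the Rabin cryptosystem described above can be sketched as follows. The primes 7 and 11 and the message 20 are toy values assumed here for illustration only; `extended_gcd` and `rabin_roots` are hypothetical helper names.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) and s*a + t*b == g."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        k = old_r // r
        old_r, r = r, old_r - k * r
        old_s, s = s, old_s - k * s
        old_t, t = t, old_t - k * t
    return old_r, old_s, old_t


def rabin_roots(c, p, q):
    """The four square roots of c modulo n = p*q, for primes p, q = 3 (mod 4)."""
    n = p * q
    r = pow(c, (p + 1) // 4, p)
    s = pow(c, (q + 1) // 4, q)
    _, a, b = extended_gcd(p, q)          # a*p + b*q == 1
    x = (a * p * s + b * q * r) % n
    y = (a * p * s - b * q * r) % n
    return {x, n - x, y, n - y}


p, q = 7, 11                              # assumed toy primes, both 3 (mod 4)
message = 20
cipher = (message * message) % (p * q)
print(cipher, sorted(rabin_roots(cipher, p, q)))
```

The original message is one of the four roots; in practice, the receiver recognizes it by some agreed redundancy, as noted in the description of the protocol.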
11.5.17. Digital signature. Now, let us briefly describe the principle of digital signature.
Principle of the digital signature
Creating the signature:
(1) A digest (hash) $H_M$ of the message is generated, the length of the hash is fixed (160 or 256 bits, for instance) - we should realize that such a mapping is surely not injective (there will be many messages sharing the same hash).
(2) The signature of the message $S_A(H_M)$ is created from this hash using the knowledge of the private key of the signer (similarly to decryption of a message's text).
(3) The message M is sent (optionally encrypted with the public key of the receiver) together with the created signature.
The signature verification then runs as follows:
(1) A digest $H'_M$ is generated for the received message M (after decryption, if it has been encrypted).
(2) Using the public key of the (declared) sender of the message, the original digest of the message is reconstructed: $V_A(S_A(H_M)) = H_M$.
(3) The digests are then compared; i.e., it is found out whether $H_M = H'_M$.
The (cryptographic) hash function mentioned above must have the following properties:
• It is easy to find the hash of any message.
• It is infeasible (in reasonable time) to find (any) message with the desired hash.
• It is infeasible (in reasonable time) to find two messages with the same hash (the function must be collision-resistant).
• Even a small change of the message should change the hash as well.
The most known examples of such functions are:
• MD5 (128 bit, Rivest 1992) - not collision-resistant
• SHA-1 (160 bit, NSA 1995) - from 2005 considered insufficiently collision-resistant
• RIPEMD-320
• SHA-3
Solution. Let us set x = y = 2. The procedure from 11.5.13 gives (all values modulo 221):
x := f(x)	y := f(f(y))	(|x - y|, 221)
5	26	1
26	197	1
14	104	1
197	145	13
We have thus found a non-trivial divisor, so now it is easy to calculate 221 = 13 ■ 17. □
11.E.8.   Find a non-trivial divisor of the integer 455459.
Solution. Consider the function $f(x) = x^2 + 1$ (we silently assume that this function behaves randomly modulo an unknown prime divisor p of the integer n and has the required properties). In the particular iterations, we compute $a \leftarrow f(a) \pmod{n}$, $b \leftarrow f(f(b)) \pmod{n}$ while evaluating d = (a - b, n).
a	b	d
5	26	1
26	2871	1
677	179685	1
2871	155260	1
44380	416250	1
179685	43670	1
121634	164403	1
155260	247944	1
44567	68343	743
We have found a divisor 743, and now we can easily compute that $455459 = 613 \cdot 743$. □
F. Encryption
11.F.1. RSA. We have overheard that the integers 29, 7, 21 were sent by means of RSA with public key (7,33). Try to break the cipher and find the messages (integers) that were originally sent.
Solution. In order to find the private key d, we need to solve the congruence $7d \equiv 1 \pmod{\varphi(33)}$. However, since the integer 33 is quite small, we can factor it and easily compute that $\varphi(33) = (3 - 1)(11 - 1) = 20$. We are thus looking for a d such that $7d \equiv 1 \pmod{20}$, which is satisfied by $d \equiv 3 \pmod{20}$. Since $29^3 \equiv (-4)^3 \equiv 2$, $7^3 \equiv 13$, and $21^3 \equiv 21 \pmod{33}$, the messages that were encrypted are 2, 13, and 21.
□
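The attack from this solution works for any modulus small enough to be factored by trial division. The sketch below follows the same steps; the function name `break_small_rsa` is hypothetical, and `pow(e, -1, phi)` requires Python 3.8 or newer.

```python
def break_small_rsa(e, n, ciphertexts):
    """Recover the private exponent and the plaintexts when n = p*q is easy to factor."""
    p = next(k for k in range(2, n) if n % k == 0)   # least prime divisor of n
    q = n // p
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)                              # e * d = 1 (mod phi(n))
    return d, [pow(c, d, n) for c in ciphertexts]


print(break_small_rsa(7, 33, [29, 7, 21]))
```

This is precisely why the modulus must be large enough to resist all known factoring methods.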
11.5.18. Diffie-Hellman key exchange system. Another important type of protocol, which is very often used in practice, is a protocol for key exchange in symmetric cryptography - Diffie-Hellman key exchange¹⁶, whose discovery was a breakthrough in this discipline, making it possible to replace one-time keys, messengers with cases (and similar) with mathematical means, in particular without the necessity of prior communication of both sides.
The protocol for the agreement of two sides (Alice, Bob) on a common key (integer) is as follows:
Principle of DH key exchange protocol
• Both sides agree on a prime p and a primitive root g modulo p (this need not be done secretly).
• Alice chooses a random integer a and sends g^a (mod p).
• Bob chooses a random integer b and sends g^b (mod p).
• The common key for the communication is then g^{ab} (mod p).
The security of this protocol relies on the fact that it is hard to compute the discrete logarithm (the so-called discrete logarithm problem) - see also part 11.3.5.
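A toy run of the protocol may make the steps concrete. The parameters p = 23 and g = 5 are an assumption made here purely for illustration (5 is a primitive root modulo 23); real deployments use primes of thousands of bits.

```python
import secrets

# Public parameters (toy values, assumed for illustration only).
p, g = 23, 5

a = secrets.randbelow(p - 2) + 1   # Alice's secret exponent, 1 <= a <= p - 2
b = secrets.randbelow(p - 2) + 1   # Bob's secret exponent

A = pow(g, a, p)                   # sent by Alice over the public channel
B = pow(g, b, p)                   # sent by Bob over the public channel

key_alice = pow(B, a, p)           # Alice computes (g^b)^a
key_bob = pow(A, b, p)             # Bob computes (g^a)^b

print(key_alice == key_bob)
```

An eavesdropper sees only p, g, A and B; recovering a or b from them is an instance of the discrete logarithm problem.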
There is another encryption algorithm which is based on the Diffie-Hellman key exchange protocol, the ElGamal algorithm, which we will also describe in short:
• Every participant chooses a prime p with a primitive root g.
• Further, they choose a private key x, compute h = g^x (mod p), and publish the public key (p, g, h).
The secret communication then runs as follows:
• Encryption of the numerical code of the message M: choosing a random y and computing C_1 = g^y (mod p) and C_2 = M · h^y (mod p), then sending the pair (C_1, C_2) to participant A.
• The participant A then decrypts the message by computing C_2 / C_1^x (mod p).
Remark. The mechanism of digital signature can be derived from the ElGamal algorithm just like in the case of RSA.
16 Whitfield Diffie, Martin Hellman (1976); M. Williamson (secret service GCHQ) as early as 1974 (not published).
Attacks against RSA.
Using the so-called Fermat's factorization method, we can try to factor n = p · q if we think that the difference between p and q is small.
Then,
n = ((p + q)/2)^2 − ((p − q)/2)^2,
where s = (p − q)/2 is small and t = (p + q)/2 is only a bit greater than √n. Therefore, it suffices to check the values t = ⌈√n⌉, t = ⌈√n⌉ + 1, t = ⌈√n⌉ + 2, ..., until t^2 − n is a square (this condition can, of course, be checked efficiently).
11.F.2. Now, we will try to factor the integer n = 23104222007 this way. (We anticipate that it is a product of two close primes.)
Solution. We compute
√n ≈ 152000.731
and check the candidates for t:
For t = 152001, we have √(t^2 − n) ≈ 286.345. For t = 152002, it is √(t^2 − n) ≈ 621.287. For t = 152003, √(t^2 − n) ≈ 830.664. Finally, for t = 152004, we get √(t^2 − n) = 997 ∈ Z. Therefore, s = 997 and we can easily calculate the prime divisors of n: p = t + s = 153001, q = t − s = 151007.
□
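The search over the candidates t is easy to automate. The following sketch assumes that n is an odd composite number; the name fermat_factor is illustrative and math.isqrt (Python 3.8 or later) is used for exact integer square roots.

```python
from math import isqrt

def fermat_factor(n):
    """Factor an odd n as (t - s) * (t + s) by searching t = ceil(sqrt(n)), ..."""
    t = isqrt(n)
    if t * t < n:
        t += 1                   # t is now the ceiling of sqrt(n)
    while True:
        s_squared = t * t - n
        s = isqrt(s_squared)
        if s * s == s_squared:   # t^2 - n is a perfect square
            return t - s, t + s
        t += 1

print(fermat_factor(23104222007))
```

The loop is fast only when the two factors are close to each other; for factors far apart it degenerates to a very slow search.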
11.F.3. The RSA modulus n = p · q can also be easily factored if the integer φ(n) is known (compromised). Then,
φ(n) = (p − 1)(q − 1) = pq − (p + q) + 1, whence p + q = n + 1 − φ(n).
We are thus to find two integers whose sum and product are known, which can be done by Viète's formulas relating the roots and the coefficients of a polynomial, whence it follows that p and q are the roots of the polynomial
x^2 − (n + 1 − φ(n))x + n.
The symbol ⌈x⌉ denotes the ceiling of a real number x, i.e., it is the integer which satisfies ⌈x⌉ − 1 < x ≤ ⌈x⌉.
11.F.4. Consider (as above) the integer n = 23104222007 and factor it with the additional knowledge that φ(n) = 23103918000.
Solution. Following the procedure described above, we get the quadratic equation
x^2 − 304008x + 23104222007 = 0,
whose solutions are
p = (1/2)(304008 + √(304008^2 − 4 · 23104222007)) = 153001,
q = (1/2)(304008 − √(304008^2 − 4 · 23104222007)) = 151007.
□
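The same computation can be written as a short routine. It is a sketch under the assumption that n is a product of two distinct primes and that the supplied value of φ(n) is correct; the name factor_from_phi is chosen here for illustration.

```python
from math import isqrt

def factor_from_phi(n, phi):
    """Recover p and q from n = p*q and phi = (p-1)*(q-1)."""
    total = n + 1 - phi                    # p + q
    discriminant = total * total - 4 * n   # equals (p - q)^2
    root = isqrt(discriminant)
    if root * root != discriminant:
        raise ValueError("phi is not consistent with n = p*q")
    return (total - root) // 2, (total + root) // 2

print(factor_from_phi(23104222007, 23103918000))
```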
11.F.5. ElGamal. Martin and John want to communicate using the ElGamal encryption (designed by Egyptian mathematician Taher ElGamal, who was inspired by the Diffie-Hellman key exchange protocol). Martin chose the prime 41 and its primitive root g = 11 as well as the integer 10. Then, he published the triple (41, 11, A), where A = 11^10 (mod 41); he kept the integer 10 to himself, as it is his private key. John used a public channel to send the pair (22, 6) to him. What is the original message John sent?
Solution. For completeness, we will first compute the whole public key A = 9 (however, this integer was only needed by John when he encrypted the message for Martin; it is no longer necessary for decryption). The message M can be obtained as M = 6 / 22^10 (mod 41). First, we compute
22^10 = 22^2 · (22^2)^2 · (22^2)^2 ≡ (−8) · (−8)^2 · (−8)^2 ≡ (−8) · 23 · 23 ≡ −9 (mod 41)
and (−9)^{−1} ≡ 9 (mod 41). Therefore, the decrypted message is the integer
M ≡ 9 · 6 ≡ 13 (mod 41). □
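The decryption in exercise 11.F.5 can be checked numerically. The sketch assumes Python 3.8 or later for the modular inverse via pow; the variable names are illustrative.

```python
# Martin's parameters and John's ciphertext from exercise 11.F.5.
p, g, x = 41, 11, 10         # prime, primitive root, private key
c1, c2 = 22, 6               # the pair sent by John

h = pow(g, x, p)             # the published part A of the public key
shared = pow(c1, x, p)       # c1^x, the mask John multiplied the message by
M = c2 * pow(shared, -1, p) % p   # divide the mask out again

print(h, M)
```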
11.F.6. Rabin cryptosystem. Alice has chosen p = 23, q = 31 as her private key in the Rabin cryptosystem. The public key is then n = pq = 713. Encrypt the message M = 327 for Alice and show how Alice will decrypt it.
Solution. We compute C = 327^2 ≡ 692 (mod 713) and send this cipher to Alice. According to the decryption procedure, we determine
r = C^{(p+1)/4} = 692^6 ≡ 18 (mod 23),
s = C^{(q+1)/4} = 692^8 ≡ 14 (mod 31),
and further the coefficients a, b in Bezout's identity 23 · a + 31 · b = 1 (using the Euclidean algorithm). We get a = −4, b = 3; the candidates for the original message are thus the integers ±4 · 23 · 14 ± 3 · 31 · 18 (mod 713). We thus know that one of the integers
386, 603, 110, 327
is the message that was sent. □
11.F.7. Show how to encrypt and decrypt the message M = 321 in the Rabin cryptosystem with n = 437.
Solution. The encrypted text can be obtained as the square modulo n: C = 321^2 ≡ (−116)^2 = 13456 ≡ 346 (mod 437). On the other hand, when decrypting, we will use the factorization (its knowledge is the private key of the message receiver) n = 437 = 19 · 23, and we compute r = 346^{(19+1)/4} = 346^5 ≡ 17 ≡ −2 (mod 19) and s = 346^{(23+1)/4} = 346^6 ≡ 1 (mod 23). Applying the Euclidean algorithm to the pair (19, 23) = 1, we determine the coefficients in Bezout's identity
19 · (−6) + 23 · 5 = 1.
Then, the message is one of the integers ±6 · 19 · 1 ± 5 · 23 · (−2) (mod 437), i.e., M = ±116 or M = ±344. Indeed, M = −116 ≡ 321 (mod 437). □
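The decryption procedure of the last two exercises can be sketched as follows. The code assumes, as in both exercises, that the primes p and q are congruent to 3 modulo 4; the helper names ext_gcd and rabin_decrypt are illustrative.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

def rabin_decrypt(c, p, q):
    """Return the four square roots of c modulo n = p*q (p, q = 3 mod 4)."""
    n = p * q
    r = pow(c, (p + 1) // 4, p)      # square root of c modulo p
    s = pow(c, (q + 1) // 4, q)      # square root of c modulo q
    _, a, b = ext_gcd(p, q)          # Bezout: a*p + b*q = 1
    x = (a * p * s + b * q * r) % n
    y = (a * p * s - b * q * r) % n
    return sorted({x, n - x, y, n - y})

print(rabin_decrypt(692, 23, 31))
print(rabin_decrypt(346, 19, 23))
```

The receiver still has to decide which of the four candidates is the intended message, for instance by some agreed redundancy in the plaintext.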
G. Additional exercises to the whole chapter
11.G.1. Prove that there are infinitely many odd natural numbers k such that the integer 2^{2^n} + k is composite for every n ∈ N. ○
11.G.2. Prove that for every integer k ≠ 1, there are infinitely many natural numbers n such that the integer 2^{2^n} + k is composite. ○
11.G.3. Consider the sequence (a_n)_{n=1}^∞ defined by
a_n = 2^n + 3^n + 6^n − 1.
Prove that for every prime p, this sequence contains a multiple of p. ○
11.G.4. Prove that no natural number n greater than 1 satisfies n | 2^n − 1. ○
11.G.5. Prove that for every odd prime p, there are infinitely many natural numbers n such that p | n · 2^n + 1. ○
11.G.6. Let a function f : N → N satisfy (f(a), f(b)) = (f(a), f(|a − b|)) for all a, b ∈ N. Prove that (f(a), f(b)) = f((a, b)). Show that this implies the result of exercise 11.A.6 as well as the fact that (F_a, F_b) = F_{(a,b)}, where F_a denotes the a-th term of the Fibonacci sequence. ○
Key to the exercises
9.B.6. 4π. 9.B.7. 36π.
9. B.8. ff.
10. C.10. | • | + | • 1 = |.
10.E.17. Simply, a = 3/8. Thus, the distribution function of the random variable X is F_X(t) = t^3/8 for t ∈ (0, 2), zero for smaller values of t, and one for greater. Let Z = X^3 denote the random variable corresponding to the volume of the considered cube. It lies in the interval (0, 8). Thus, for t ∈ (0, 8) and the distribution function F_Z of the random variable Z, we can write F_Z(t) = P[Z < t] = P[X^3 < t] = P[X < ∛t] = F_X(∛t) = t/8. Then, the density is f_Z(t) = 1/8 on the interval (0, 8) and zero elsewhere. Since this is the uniform distribution on the given interval, the expected value is equal to 4.
10.F.9. EU = 1 · 0.6 + 2 · 0.4 = 1.4, EU^2 = 0.4 + 4 · 0.6 = 2.8, EV = 0.4 + 0.6 + 1.2 = 2.1, EV^2 = 0.3 + 1.2 + 3.6 = 5.1, E(UV) = 2.8, var(U) = 2.8 − 1.4^2 = 2.8 − 1.96 = 0.84, var(V) = 5.1 − 4.41 = 0.69, cov(U, V) = 2.8 − 1.4 · 2.1 = −0.14,
ρ_{U,V} = −0.14 / √(0.84 · 0.69).
10.F.10. EX = 1/3, var X = 4/45.
10.F.11. ρ_{X,Y} = −1.
10.F.12. ρ_{U,V} = −0.421.
11.B.21.
i) The integer 3 has order 4 modulo 10, so it suffices to determine the remainder of the exponent when divided by 4. This remainder is equal to 1, so the last digit is 3^1 = 3.
ii) 37 ≡ −3 (mod 10) is of order 4. Again, it suffices to compute the remainder of the exponent upon division by 4. However, we apparently have 37 ≡ 1 (mod 4), so the wanted remainder upon division by 10 equals (−3)^1 ≡ 7, and the last digit is thus 7.
iii) Since (12, 10) > 1, it makes no sense to talk about the order of 12 modulo 10. However, the examined integer is clearly even, so it suffices to find its remainder upon division by 5. The order of 12 ≡ 2 (mod 5) is 4, and the exponent satisfies 13^14 ≡ 1^14 = 1 (mod 4). We thus have 12^{13^{14}} ≡ 2^1 (mod 5), and since 2 is an even integer, it is the wanted digit as well.
11.B.23. Since φ(n) < n, we surely have φ(n) | n!, whence the statement already follows as odd positive integers n satisfy 2^{φ(n)} ≡ 1 (mod n).
11.C.7. i) The greatest common divisor of the moduli is 3, and 1 ≢ −1 (mod 3), so the system has no solution.
ii) The condition for solvability of linear congruences, (8, 12345678910111213) = 1, is clearly true, so this congruence has a unique solution.
iii) The moduli are coprime, so by the Chinese remainder theorem, there is a unique solution modulo 29 · 47.
11.C.10. Since 2 is a primitive root modulo both 5 and 13, we get that
2^n ≡ 3 (mod 5) ⟺ 2^n ≡ 2^3 (mod 5) ⟺ n ≡ 3 (mod 4)
and
2^n ≡ 3 (mod 13) ⟺ 2^n ≡ 2^4 (mod 13) ⟺ n ≡ 4 (mod 12).
This apparently implies the infinitude of the multiples of both 5 and 13 among the integers 2^n − 3 in question. On the other hand, we can see that none of them can be a multiple of 5 and 13 simultaneously since the system of congruences n ≡ 3 (mod 4), n ≡ 4 (mod 12) has no solution.
11.G.1. If n is an arbitrary natural number, then 2^{2^n} ≡ 1 (mod 3), so it suffices to choose for k odd positive integers with k ≡ 2 (mod 3). And there are surely infinitely many of them: they are those which satisfy k ≡ 5 (mod 6). For these values of k, we always have that 2^{2^n} + k is a multiple of 3 and greater than 3, so it is a composite number.
11.G.2. Let us fix an integer k ∈ Z ∖ {1} and an arbitrary a ∈ N. We will show that for an arbitrarily large a, we can find a positive integer n such that 2^{2^n} + k is composite and greater than a. That will complete the proof.
Let us fix s ∈ N_0, h ∈ Z so that k − 1 = 2^s · h, 2 ∤ h, and m ∈ N satisfying 2^{2^m} > a − k. Now, let an ℓ satisfy ℓ > s, ℓ > m. If the integer 2^{2^ℓ} + k is composite, then we are done, since 2^{2^ℓ} + k > 2^{2^m} + k > a. Therefore, let us assume that the integer 2^{2^ℓ} + k is a prime and denote it by p. With help of Euler's theorem, we can find an integer of the desired form which is a multiple of p. We have
p − 1 = 2^{2^ℓ} + 2^s · h = 2^s · h_1,
where h_1 ∈ N is odd. We thus have 2^{φ(h_1)} ≡ 1 (mod h_1), whence 2^{s+φ(h_1)} ≡ 2^s (mod p − 1), and since ℓ > s, we also have
2^{ℓ+φ(h_1)} ≡ 2^ℓ (mod p − 1).
Now, it follows from Fermat's little theorem that
2^{2^{ℓ+φ(h_1)}} + k ≡ 2^{2^ℓ} + k ≡ 0 (mod p).
However, since 2^{ℓ+φ(h_1)} > 2^ℓ, we also have 2^{2^{ℓ+φ(h_1)}} + k > 2^{2^ℓ} + k = p > a. We have thus found a composite number which is of the wanted form and greater than an arbitrarily large value of a.
Let us mention that the case of k = 1 is a well-known open problem examining the infinitude of Fermat primes.
11.G.3. We can easily see that 2 | a_1 = 10 and 3 | a_2 = 48. Further, we can show that p | a_{p−2} holds for any prime p > 3. By Fermat's theorem, we have 2^{p−1} ≡ 3^{p−1} ≡ 6^{p−1} ≡ 1 (mod p). Therefore,
6a_{p−2} = 3 · 2^{p−1} + 2 · 3^{p−1} + 6^{p−1} − 6 ≡ 3 + 2 + 1 − 6 = 0 (mod p).
Let us remark that knowledge of algebra allows us to proceed more directly: for p > 3, we can consider the p-element field F_p, which contains multiplicative inverses of the elements 2, 3, and 6, and their sum is 1/2 + 1/3 + 1/6 = 1.
11.G.4. We could reason about the factorization of n to primes, which is a bit complicated. Instead, we will use a little trick. Suppose that there is an n satisfying the conditions n | 2^n − 1, n > 1, and let us select the least one. Surely, n is odd, hence n | 2^{φ(n)} − 1. Utilizing the result of exercise 11.A.6, we get that n | 2^d − 1, where d = (n, φ(n)) (which especially implies that 2^d − 1 > 1 and d > 1). At the same time, d ≤ φ(n) < n and d | n, whence it follows that d | 2^d − 1, which contradicts the assumption that our n is the least one that meets the conditions.
11.G.5. Since 2^{p−1} ≡ 1 (mod p), it suffices to choose appropriate multiples of p − 1 for n, i.e., to find a k so that n = k(p − 1) would satisfy the condition n · 2^n ≡ −1 (mod p). However, thanks to p − 1 | n, this is equivalent to k ≡ 1 (mod p), and there are clearly infinitely many such values k.
11.G.6. Analyze the Euclidean algorithm for computing the greatest common divisor.
CHAPTER 12
Algebraic structures
The more abstraction, the more chaos? - no, it is often the other way round...
A. Boolean algebras and lattices
12.A.1. Find the (complete) disjunctive normal form of the proposition
(B' ⇒ C) ∧ [(A ∨ C) ∧ B]'.
Solution.
If the propositional formula contains only a few variables (in our case, it is three), the most advantageous procedure is to build the truth table of the formula and build the disjunctive normal form from that. The table will consist of 2^3 = 8 rows. The examined formula is denoted φ.
In this chapter, we begin a seemingly very formal study. But the concepts reflect many properties of things and phenomena surrounding us. This is one of the parts of the book which is not in the prerequisites of any other chapter. Large parts serve as a quick illustration of interesting uses of mathematical tools and models.
The simplest properties of real objects are used for encoding in terms of algebraic operations. Thus, "algebra" considers algorithmic manipulations with letters which usually correspond to computations or descriptions of processes.
Strictly speaking, this chapter builds only on the first and sixth parts of chapter one, where abstract views on numbers and relations between objects are introduced. But it is a focal point for abstract versions of many concepts already met.
The first two sections aim at direct generalizations of the familiar algebraic structure of numbers. This leads to a discussion of rings of polynomials. Only then we provide an introduction to group theory, for which there is only a single operation.
The last two sections provide some glimpses of direct applications. The construction of (self-correcting) codes often used in data transfer is considered. The last section explains the elementary foundations of computer algebra. This includes solving polynomial equations and algorithmic methods for manipulation and calculations with formal expressions.
1. Posets and Boolean algebras
Familiarity with the properties of addition and multiplication of scalars and matrices is assumed, and likewise with the binary operations of set intersection and union from elementary set theory, as indicated at the end of the first chapter. We proceed to work with symbols which stand for miscellaneous objects, which results in the universal applicability of the results.
This allows us to relate the basic set operations to propositional logic, which formalizes methods for expressing propositions and evaluating truth values.
12.1.1. Algebraic operations. For any set M, there is a set K = 2^M consisting of all subsets of M, together with the operations of union ∨ : K × K → K and intersection ∧ :
A	B	C	B' ⇒ C	[(A ∨ C) ∧ B]'	φ
0	0	0	0	1	0
0	0	1	1	1	1
0	1	0	1	1	1
0	1	1	1	0	0
1	0	0	0	1	0
1	0	1	1	1	1
1	1	0	1	0	0
1	1	1	1	0	0
K × K → K. This is an instance of an algebraic structure on the set K with two binary operations. In general, write (K, ∨, ∧). In the special case of sets, these binary operations are denoted rather by ∪ and ∩, respectively.
To every set A ∈ K, its complement A' = M ∖ A can be assigned. This is another operation ' : K → K with only one argument. Such operations are called unary operations.
In general, there are algebraic structures with k operations μ_1, ..., μ_k, each of them
The resulting complete disjunctive normal form is the disjunction of the formulas that correspond to the rows with one in the last column (the formula is true for the given valuation of the atomic propositions). Each such row corresponds to the conjunction of the variables (if the corresponding value is 1) or their negations (if it is 0). In our case, it is the disjunction of conjunctions corresponding to the second, third, and sixth rows, i.e., the result is
(A' ∧ B' ∧ C) ∨ (A' ∧ B ∧ C') ∨ (A ∧ B' ∧ C).
We can also rewrite the formula by expanding the connective ⇒ with ∧ and ∨, using the De Morgan laws and distributivity:
(B' ⇒ C) ∧ [(A ∨ C) ∧ B]' ⟺ (B ∨ C) ∧ [(A ∨ C)' ∨ B']
⟺ (B ∨ C) ∧ [(A' ∧ C') ∨ B']
⟺ [(B ∨ C) ∧ (A' ∧ C')] ∨ [(B ∨ C) ∧ B']
⟺ [(B ∧ A' ∧ C') ∨ (C ∧ A' ∧ C')] ∨ [(B ∧ B') ∨ (C ∧ B')]
⟺ (B ∧ A' ∧ C') ∨ (C ∧ B'),
which is an (incomplete) disjunctive normal form of the given formula. Clearly, it is equivalent to our result above (the word "complete" means that each disjunct (called clause in this context) contains each of the three variables or their negations (these are called literals)). □
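The truth-table construction of exercise 12.A.1 is mechanical, so it can be cross-checked by a short program. The sketch below is only an illustration; the helper name complete_dnf and the textual output format are choices made here, not notation from the text.

```python
from itertools import product

def phi(a, b, c):
    # (B' => C) and not((A or C) and B); note that B' => C is just B or C.
    return (b or c) and not ((a or c) and b)

def complete_dnf(formula, names=("A", "B", "C")):
    """Build the complete DNF from the rows of the truth table that are true."""
    clauses = []
    for values in product((False, True), repeat=len(names)):
        if formula(*values):
            literals = [n if v else n + "'" for n, v in zip(names, values)]
            clauses.append("(" + " ∧ ".join(literals) + ")")
    return " ∨ ".join(clauses)

dnf = complete_dnf(phi)

# The shorter (incomplete) form obtained by algebraic manipulation.
def short(a, b, c):
    return (b and not a and not c) or (c and not b)

equivalent = all(bool(phi(*v)) == bool(short(*v))
                 for v in product((False, True), repeat=3))
print(dnf, equivalent)
```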
12.A.2. Find a disjunctive normal form of the formula
((A ∧ B) ∨ C)' ∧ (A ∨ (B ∧ C ∧ D))
O
We know several logical connectives: ∧, ∨, ⇒, ≡ and the unary '. Any propositional formula with these connectives can be equivalently written using only some of them, for instance ∨ and '. There are also connectives which alone suffice to express any propositional formula. From binary connectives, these are NAND and NOR (A NAND B = (A ∧ B)',
μ_j : K × ⋯ × K → K   (i_j factors)
with i_j arguments, and write (K, μ_1, ..., μ_k) for such a structure. The number i_j of arguments is called the arity of the operation ("unary", "binary", etc.). If i_j = 0, then the operation has no arguments, which means it is a distinguished element in K.
With subsets in K = 2^M, there is the unique "greatest object", i.e. the entire set M, which is neutral for the ∧ operation. Similarly, the empty set ∅ ∈ K is the only neutral element for ∨.
12.1.2. Set algebra. View the algebraic structure on the set K = 2^M from the previous paragraph as (K, ∨, ∧, ', 1, 0), with two binary operations, one unary operation (the complement), and two special elements 1 = M, 0 = ∅.
It is easily verified that all elements A, B, C ∈ K satisfy the following properties:
Axioms of Boolean algebras
(1) A ∧ (B ∧ C) = (A ∧ B) ∧ C,
(2) A ∨ (B ∨ C) = (A ∨ B) ∨ C,
(3) A ∧ B = B ∧ A,   A ∨ B = B ∨ A,
(4) A ∧ (B ∨ C) = (A ∧ B) ∨ (A ∧ C),
(5) A ∨ (B ∧ C) = (A ∨ B) ∧ (A ∨ C),
(6) there is a 0 ∈ K such that A ∨ 0 = A,
(7) there is a 1 ∈ K such that A ∧ 1 = A,
(8) A ∧ A' = 0,   A ∨ A' = 1.
Compare these properties with those of the scalars (K, +, ·, 0, 1): Properties (1) and (2) say that both the operations ∧ and ∨ are associative. Property (3) says that both operations are also commutative. So far, this is the same as for the addition and multiplication of scalars. Also there are neutral elements for both operations there.
However, the properties (4) and (5) are stronger now: they require the distributivity of ∧ over ∨ as well as ∨ over ∧. Of course, this cannot be the case for addition and multiplication of numbers. In the case of numbers, multiplication distributes over addition but not vice versa.
Properties (6)-(8) require the existence of neutral elements for both operations as well as the existence of an analogy to the "inverse" of each element. (Note however, that the
A NOR B = (A ∨ B)'). Try to express each of the known connectives using only NAND, and then only NOR. These connectives are implemented in electric circuits as the so-called "gates".
12.A.3. Express the propositional formula (A => B) using only the NAND gates. O
12.A.4.   Simplify the formula
((A ∧ B) ∨ (A ⇒ B)) ∧ ((B' ⇒ C) ∨ (B ∧ C')).
Solution. In Boolean algebra, we obtain
(a · b + a' + b) · (b + c + b · c') = ⋯ = a' · c + b.
This means that the given formula is equivalent to (A' A C) V B. □
12.A.5. Anne, Brenda, Kate, and Dana want to set out on a trip. Find out which of the girls will go if the following must hold: At least one of Brenda and Dana will go; at most one of Anne and Kate will go; at least one of Anne and Dana will go; at most one of Brenda and Kate will go; Brenda will not go unless Anne goes; and Kate will go if Dana goes.
Solution. Transforming the problem to Boolean algebra, simplifying it, and transforming it back, we find out that either Anne and Brenda will go or Kate and Dana will go. □
12.A.6. Solve the following problem by transforming it to Boolean algebra: Tom, Paul, Sam, Ralph, and Lucas are suspected of having committed a murder. It is certain that at the crime scene, there were: at least one of Tom and Ralph, at most one of Lucas and Paul, and at least one of Lucas and Tom. Sam could be there only if so was Ralph. However, if Sam was there, then so was Tom. Paul could never cooperate with Ralph, but Paul and Tom are an inseparable pair. Who committed the murder?
Solution. Transforming into Boolean algebra, using the first letter of each name, we get
(t + r)(l' + p')(l + t)(r + s')(s' + t)(p' + r')(pt + p't')
and thanks to x^2 = x, xx' = 0, x + x' = 1, we can rearrange the above to s'r'ptl' + s'rp't'l. Thus, the murder was committed either by Tom and Paul or by Ralph and Lucas. □
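With only five suspects there are just 32 possible valuations, so the Boolean product from exercise 12.A.6 can also be evaluated exhaustively. The sketch below encodes the seven factors directly; the function name consistent is illustrative.

```python
from itertools import product

names = ("Tom", "Paul", "Sam", "Ralph", "Lucas")

def consistent(t, p, s, r, l):
    """The product (t+r)(l'+p')(l+t)(r+s')(s'+t)(p'+r')(pt+p't')."""
    return ((t or r) and not (l and p) and (l or t)
            and (r or not s) and (t or not s)
            and not (p and r) and (p == t))

solutions = [
    {name for name, present in zip(names, values) if present}
    for values in product((False, True), repeat=5)
    if consistent(*values)
]
print(solutions)
```

The same brute-force pattern applies to exercise 12.A.5 after replacing the list of conditions.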
intersection with the complement results in the neutral element for union and vice versa. This is the other way round for numbers.)
There are only a few structures that possess such restrictive properties.
Boolean algebras
Definition. A set K together with two binary operations ∧, ∨, a unary operation ', and two special elements 1, 0, which satisfy the properties (1)-(8) is called a Boolean algebra. The ∧ operation is called infimum or meet, and the ∨ operation is called supremum or join. The element A' is called the complement of A.
Note that the axioms of Boolean algebras are symmetric with respect to interchanging the operations ∧ and ∨ together with the interchange of 0 and 1. This means that any proposition that can be derived from the axioms has a valid dual proposition, created by interchanging ∧ with ∨ and 0 with 1. This is the principle of duality.
12.1.3. Properties of Boolean algebras. As usual, we derive several elementary corollaries of the axioms. In particular, note that in the special case of the Boolean algebra of all subsets of a given set, this proves all the elementary properties known from set algebra. The special elements 1 and 0 are unique as the neutral elements for ∧ and ∨. If there is another element 0̃ with the same properties as 0, then 0̃ = 0̃ ∨ 0 = 0. Similarly, 1 is also unique. Next, if B, C ∈ K both satisfy the properties of A' (axiom (8) in the above definition), then
B = B ∨ 0 = B ∨ (A ∧ C) = (B ∨ A) ∧ (B ∨ C) = 1 ∧ (B ∨ C) = B ∨ C
and similarly
C = C ∨ B = B ∨ C.
Therefore, B = C and so the complement to any A ∈ K is determined uniquely by its properties.
Generally the above observation means that given (K, ∧, ∨), there exists at most one operation ' which completes it to a Boolean algebra (K, ∧, ∨, ', 1, 0), together with unique elements 1 and 0. Generally (K, ∧, ∨) is written, omitting the other three symbols for operations.
The properties of the following proposition have their own names in set algebra: (2) is called absorption laws, (3) is the idempotency of the operations ∧ and ∨, and the equalities of (4) are the De Morgan laws.
12.A.7. A vote box for three voters is a box which processes three votes and outputs "yes" if and only if a majority of the voters is for. Design this box using switch circuits.
Solution.
□
12.A.8. Find a finite subset of the set of positive integers which is not a lattice with respect to divisibility. O
12.A.9. Find the number of partial orders on a given 4-element set. Draw the Hasse diagram of each isomorphism class and determine whether it is a lattice. Is one of them a Boolean algebra?
Solution. We go through all Hasse diagrams of the partial orders on a 4-element set M and for each diagram, we count the number of partial orders (i. e. subsets of M x M) that correspond to it; see the picture:
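[Figure: Hasse diagrams of the isomorphism classes of partial orders on a 4-element set, each labelled with the number of partial orders corresponding to it.]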
Therefore, there are 219 partial orders on a given 4-element set.
Note that the condition of existence of suprema and infima of any pair of elements in a lattice implies (by induction) the existence of them for any finite non-empty subset. In particular, this means that every non-empty finite lattice has a greatest element as well as a least element.
Using this criterion, we can see that only the last two Hasse diagrams may be lattices. Indeed, they are lattices; the first one is even a Boolean algebra. □
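The count of 219 partial orders in exercise 12.A.9 can be confirmed by brute force. The sketch below enumerates all relations on the off-diagonal pairs (a reflexive partial order is determined by its strict part) and keeps those that are antisymmetric and transitive; the function name is illustrative.

```python
from itertools import product

def count_partial_orders(n):
    """Count partial orders on {0, ..., n-1} via their strict parts."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    count = 0
    for bits in product((False, True), repeat=len(pairs)):
        rel = {pair for pair, bit in zip(pairs, bits) if bit}
        antisymmetric = all((j, i) not in rel for (i, j) in rel)
        transitive = all(
            (i, l) in rel
            for (i, j) in rel for (k, l) in rel
            if j == k and i != l
        )
        if antisymmetric and transitive:
            count += 1
    return count

print(count_partial_orders(4))
```

For n = 4 there are only 2^12 = 4096 candidate relations, so the search is instantaneous; the number of candidates grows as 2^(n(n−1)), so this approach does not scale far.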
12.A.10. Find the number of partial orders on the set {1,2,3,4,5} such that there are exactly two pairs of incomparable elements. O
Further properties of Boolean algebras
Proposition. In every Boolean algebra (K, ∧, ∨):
(1) A ∧ 0 = 0,   A ∨ 1 = 1,
(2) A ∧ (A ∨ B) = A,   A ∨ (A ∧ B) = A,
(3) A ∧ A = A,   A ∨ A = A,
(4) (A ∧ B)' = A' ∨ B',   (A ∨ B)' = A' ∧ B',
(5) (A')' = A,
for all A, B ∈ K.
Proof. By the principle of duality, it suffices to prove only one of the claims in each item. Begin with (3), of course using just the axioms of the Boolean algebras:
A = A ∧ 1 = A ∧ (A ∨ A') = (A ∧ A) ∨ 0 = A ∧ A.
Now, (1) is proved easily:
A ∧ 0 = A ∧ (A ∧ A') = (A ∧ A) ∧ A' = A ∧ A' = 0.
(2) is also easy (read the second equality from right to left):
A ∧ (A ∨ B) = (A ∨ 0) ∧ (A ∨ B) = A ∨ (0 ∧ B) = A ∨ 0 = A.
In order to prove the De Morgan laws, it suffices to verify that A' ∨ B' has the properties of the complement of A ∧ B. By the above, it must be the complement. Using (1), compute
(A ∧ B) ∧ (A' ∨ B') = ((A ∧ B) ∧ A') ∨ ((A ∧ B) ∧ B') = (0 ∧ B) ∨ (A ∧ 0) = 0.
Similarly,
(A ∧ B) ∨ (A' ∨ B') = (A ∨ (A' ∨ B')) ∧ (B ∨ (A' ∨ B')) = (1 ∨ B') ∧ (1 ∨ A') = 1.
Finally, from the definition, A' ∧ A = 0 and A' ∨ A = 1. Hence, A has the required properties of the complement of A', which means that A = (A')'. □
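As a sanity check (not a proof), the five items of the proposition can be tested exhaustively in the Boolean algebra of all subsets of a small set. The sketch below assumes the three-element set M = {1, 2, 3} and uses Python frozensets, with & and | playing the roles of ∧ and ∨.

```python
from itertools import combinations

M = frozenset({1, 2, 3})
# K = 2^M, the set of all subsets of M.
K = [frozenset(c) for r in range(len(M) + 1)
     for c in combinations(sorted(M), r)]

zero, one = frozenset(), M

def comp(A):
    return M - A          # the complement A'

identities_hold = all(
    (A & zero == zero) and (A | one == one)             # (1)
    and (A & (A | B) == A) and (A | (A & B) == A)       # (2) absorption
    and (A & A == A) and (A | A == A)                   # (3) idempotency
    and comp(A & B) == comp(A) | comp(B)                # (4) De Morgan
    and comp(A | B) == comp(A) & comp(B)
    and comp(comp(A)) == A                              # (5)
    for A in K for B in K
)
print(len(K), identities_hold)
```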
12.1.4. Examples of Boolean algebras. The intersection and union of subsets in a given set M always define a Boolean algebra. The smallest is the set of all subsets of a singleton M. It contains two elements, namely 0 = ∅ and 1 = M with the obvious equalities 0 ∧ 1 = 0, 0 ∨ 1 = 1, etc. The operation ∧ is the same as multiplication in the remainder class ring Z_2 of even and odd numbers, while ∨ agrees with the addition there except for 1 ∨ 1 = 1. This is called the Boolean algebra Z_2. This is the only case when the underlying set of a Boolean algebra carries the structure of a field of scalars at the same time!
As in the case of rings of scalars or vector spaces, the algebraic structure of a Boolean algebra can be extended to all spaces of functions whose codomain is a Boolean algebra. For the set S = {f : M → K} of all functions from a set M to a Boolean algebra (K, ∧, ∨), the necessary operations and the distinguished elements 0 and 1 on S can be defined
12.A.11. Draw the Hasse diagram of the lattice of all (positive) divisors of 36. Is this lattice distributive? Is it a Boolean algebra?
Solution. The lattice is distributive (it does not contain a sublattice isomorphic to the diamond or the pentagon). It is not a Boolean algebra, since it has 9 elements and the size of a finite Boolean algebra is always a power of 2.
12.A.12. Draw the Hasse diagram of the lattice of all (positive) divisors of 30. Is this lattice distributive? Is it a Boolean algebra?
Solution. This lattice is a Boolean algebra, and it has 8 elements. All finite Boolean algebras are of size 2^n for an appropriate n, and they are all isomorphic for a fixed n (see 12.1.16). This Boolean algebra is a "cube": its graph can be drawn as a projection of a cube onto the plane.
□
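The answers to exercises 12.A.11 and 12.A.12 can be cross-checked computationally. The sketch relies on two standard facts that are assumed here rather than proved: the divisor lattice with gcd as meet and lcm as join is always distributive, and a bounded distributive lattice is a Boolean algebra exactly when every element has a complement. The function names are illustrative.

```python
from math import gcd

def divisors(p):
    return [q for q in range(1, p + 1) if p % q == 0]

def lcm(a, b):
    return a * b // gcd(a, b)

def has_all_complements(p):
    """Does every divisor q of p have an r with gcd(q, r) = 1 and lcm(q, r) = p?"""
    D = divisors(p)
    return all(
        any(gcd(q, r) == 1 and lcm(q, r) == p for r in D)
        for q in D
    )

print(len(divisors(30)), has_all_complements(30))
print(len(divisors(36)), has_all_complements(36))
```

For 36 the divisor 2 already fails: the only divisors coprime to it are 1, 3 and 9, and none of them has least common multiple 36 with 2.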
as functions of an argument x ∈ M as follows:
(f_1 ∧ f_2)(x) = (f_1(x)) ∧ (f_2(x)) ∈ K,
(f_1 ∨ f_2)(x) = (f_1(x)) ∨ (f_2(x)) ∈ K,
(1)(x) = 1 ∈ K,   (0)(x) = 0 ∈ K,   (f)'(x) = (f(x))' ∈ K.
It is easy and straightforward to verify that these new operations define a Boolean algebra.
Recall that the subsets of a given set M can be viewed as mappings M → Z_2 (the elements of the subset in question are mapped to 1 while all others go to 0). Then, the union and intersection can be defined in the above manner. For instance, evaluating the expression (A ∨ B)(x) for a point x ∈ M means determining whether x lies in A and whether it lies in B, and then taking the join of the two results in Z_2. The result is 1 if and only if x lies in the union.
12.1.5. Propositional logic. The latter simple observation brings us close to the calculus of elementary logic. View the notations for operations in a Boolean algebra as creating "words" from the elements A, B, ... ∈ K, the operations ∨, ∧, ' and parentheses, which clarify the desired precedence of the operations.
The axioms of the Boolean algebras and their corollaries say how different words may produce the same result in K.
This is clear in the case of K = 2^M, the set of all subsets of a given set; it is just equality of subsets. Now, another interpretation is mentioned in terms of operations in formal logic.
Work with words as above but view them as propositions composed from elementary (atomic) propositions A, B, ... and the logical operations AND (the binary operation ∧), OR (the binary operation ∨), and the negation NOT (the unary operation '). These words are called propositions. They are assigned a truth value depending on the truth values of the individual atomic propositions. The truth value is an element of the trivial Boolean algebra Z_2, i.e. either 0 or 1.
The truth value of a proposition is completely determined by assigning the truth values for the simplest propositions A ∧ B, A ∨ B and A'. A ∧ B is defined to be true if and only if both A and B are true. A ∨ B is false if and only if both A and B are false. The value of A' is complementary to A.
A proposition with n elementary propositions defines a function (Z_2)^n → Z_2. Two propositions are called logically equivalent if and only if they define the same function. In the previous paragraph, it is already verified that the set of all classes of logically equivalent propositions has the structure of a Boolean algebra. Propositional logic satisfies everything proved for general Boolean algebras.
Next, we consider how other usual simple propositions of propositional logic are represented as elements of the Boolean algebra. Expressions always correspond to a class of logically equivalent propositions:
12.A.13. Decide whether every lattice on a 3-element set is a chain, i.e., whether each pair of elements is necessarily comparable.
Solution. As we have noticed in exercise 12.A.9, every finite non-empty lattice must contain a greatest element and a least element. Each of these is thus comparable to any other, which means that the remaining one is comparable with these two; and there are no other elements. □
12.A.14. Find an example of two lattices and a poset homomorphism between them which is not a lattice homomorphism.
Solution. Again, we return to exercise 12.A.9 and consider the following mapping:
[Figure: the mapping between the Hasse diagrams of two lattices.]
□
12.A.15. Decide whether every lattice homomorphism between finite non-empty lattices K, L maps the least element of K to the least element of L.
Solution. No, any constant mapping between two lattices is a lattice homomorphism. Thus, if we sent everything to an element different from the least one, we get the wanted counterexample homomorphism. □
12.A.16. Decide whether every chain which has a greatest element and a least element is a complete lattice.
Solution. No. Consider the set of non-zero integers and order it as follows: any positive integer is greater than any negative integer, but the ordering among the positive integers will be reversed, as well as among the negative integers. Then, 1 will be the greatest element of the resulting chain, and —1 will be the least element. However, the subset of all positive integers does not have an infimum in this poset.
The standard logical operators
(1) $A$ AND $B$ corresponds to $A \wedge B$,
(2) $A$ OR $B$ corresponds to $A \vee B$,
(3) the implication $A \Rightarrow B$ can be obtained as $A' \vee B$,
(4) the equivalence $A \Leftrightarrow B$ corresponds to $(A \wedge B) \vee (A' \wedge B')$,
(5) the exclusive OR, known as $A$ XOR $B$, is given as $(A \wedge B') \vee (A' \wedge B)$,
(6) the negation of OR, $A$ NOR $B$, is expressed as $A' \wedge B'$,
(7) the negation of AND, $A$ NAND $B$, is given as $A' \vee B'$,
(8) tautology (a proposition which is always true) is given in terms of an arbitrary atomic proposition as $A \vee A'$, and its negation (always false) is $A \wedge A'$.
Note that in set algebra, XOR corresponds to the symmetric difference.
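The expressions for the derived operators can be checked mechanically on the two-element Boolean algebra $\mathbb{Z}_2$. The following is only a small sketch under our own naming (`meet`, `join`, `comp`, `formulas`, `reference` are illustrative names, not notation from the text); it compares each expression built from $\wedge$, $\vee$ and $'$ with the usual truth table of the operator.

```python
from itertools import product

# The Boolean algebra Z_2 = {0, 1}: meet, join and complement.
def meet(a, b):
    return a & b

def join(a, b):
    return a | b

def comp(a):
    return 1 - a

# Each operator written only in terms of meet, join and complement.
formulas = {
    "IMPLIES": lambda a, b: join(comp(a), b),
    "IFF": lambda a, b: join(meet(a, b), meet(comp(a), comp(b))),
    "XOR": lambda a, b: join(meet(a, comp(b)), meet(comp(a), b)),
    "NOR": lambda a, b: meet(comp(a), comp(b)),
    "NAND": lambda a, b: join(comp(a), comp(b)),
}

# Reference truth functions, written with Python's own operators.
reference = {
    "IMPLIES": lambda a, b: int(a <= b),
    "IFF": lambda a, b: int(a == b),
    "XOR": lambda a, b: a ^ b,
    "NOR": lambda a, b: int(not (a or b)),
    "NAND": lambda a, b: int(not (a and b)),
}

def agrees(name):
    """Compare the formula with the reference on all four inputs."""
    return all(formulas[name](a, b) == reference[name](a, b)
               for a, b in product((0, 1), repeat=2))

all_agree = all(agrees(name) for name in formulas)
```

Since there are only four pairs of truth values, the exhaustive comparison is a complete verification for two elementary propositions.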
12.1.6. Switch boards as Boolean algebras. A switch is a black box with only two states - it is either on (and the signal goes through) or off (and the signal does not go through).
One or more switches may be interconnected in a series circuit or a parallel circuit. The series circuit corresponds to the binary operation $\wedge$, while the parallel circuit corresponds to $\vee$. The unary operation $A'$ defines a switch whose state is always the opposite of that of $A$. Every finite word created from the switches $A, B, \dots$ and the operations $\wedge$, $\vee$, and $'$ can be transformed into a diagram that represents a system of switches connected by wires, similarly to the above subsection, where each choice of states of the individual switches gives the value "on/off" for the entire system. We thus speak of switchboards and their logical evaluation functions.
Again, it is easy to verify all the axioms of Boolean algebras for the system. The following diagram illustrates one of the distributivity axioms.
[Figure: switch diagram illustrating one of the distributivity axioms.]
The circuit without a switch corresponds to 1. When the endpoints are not connected, this corresponds to 0 (consider a series circuit of A and A'). Draw diagrams for all axioms of Boolean algebras and verify them!
We return to this example shortly, showing that each expression in propositional logic can be modeled by a switch board.
12.1.7. Algebra of divisors. There are other natural examples of Boolean algebras. Choose a positive integer $p \in \mathbb{N}$. The underlying set $D_p$ is the set of all divisors $q$ of $p$. For two such divisors $q$, $r$, define $q \wedge r$ to be the greatest common divisor of $q$ and $r$, and $q \vee r$ to be their least common multiple (cf. the
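Formally, we define the linear order $\prec$ on $\mathbb{Z} \setminus \{0\}$ by:
$$a \prec b \iff \left[(\operatorname{sgn}(a) \cdot \operatorname{sgn}(b) = 1 \wedge b < a) \vee (\operatorname{sgn}(a) < \operatorname{sgn}(b))\right]$$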
□
12.A.17. Give an example of an infinite chain which is a complete lattice.
Solution. We can take the set of real numbers together with $-\infty$, $\infty$, where $-\infty$ is the least element (and thus the supremum of the empty set) and $\infty$ is the greatest element (and thus the infimum of the empty set). The lattice suprema and infima are thus defined in accordance with these concepts in the real numbers. Moreover, $-\infty$ is the infimum of subsets which are not bounded from below, and similarly $\infty$ is the supremum of subsets which are not bounded from above. □
12.A.18. Decide whether the set of all convex subsets of $\mathbb{R}^3$ is a lattice (with respect to suitably defined operations of suprema and infima). If so, is this lattice complete, distributive?
Solution. It is a lattice. The infimum is simply the intersection, since the intersection of convex subsets is again convex. The supremum is the convex hull of the union. It is clear that the lattice axioms are indeed satisfied for these operations (think this out!).
The lattice is complete, since the above operations work for infinite subsets as well, and clearly, the lattice has both a least element (the empty set) and a greatest element (the entire space).
However, the lattice is not distributive. For example, consider three unit balls $B_1$, $B_2$, $B_3$ centered at $[3,0,0]$, $[-3,0,0]$, $[0,0,0]$, respectively. Then,
$$B_1 \vee (B_2 \wedge B_3) = B_1 \neq (B_1 \vee B_3) \wedge (B_1 \vee B_2).$$
□
previous chapter for the definitions and context). The distinguished element $1 \in D_p$ is defined to be $p$ itself. The neutral element $0$ for join on $D_p$ is the integer $1 \in \mathbb{N}$. The unary operation $'$ is defined using division as $q' = p/q$.
Proposition. The set $D_p$ together with the above operations $\wedge$, $\vee$, and $'$ is a Boolean algebra if and only if the factorization of $p$ contains no squares (i.e., in the unique factorization $p = q_1 \cdots q_n$, the irreducible factors $q_i$ are pairwise distinct).
Proof. It is easy to verify the axioms of Boolean algebras under the assumptions of the proposition. It might be interesting to see where the squarefree assumption is needed.
The greatest common divisor of a finite number of integers is independent of their order. This also holds for the least common multiple. This corresponds to the axioms (1) and (2) in 12.1.2. The commutativity (3) is clear.
For any three elements $a$, $b$, $c$, write their factorizations without loss of generality as $a = q_1^{n_1} \cdots q_s^{n_s}$, $b = q_1^{m_1} \cdots q_s^{m_s}$, $c = q_1^{k_1} \cdots q_s^{k_s}$. Zero powers are allowed and all $q_j$ are pairwise coprime. Thus, $a \wedge b \in D_p$ corresponds to the element in which each $q_i$ occurs with the power that is the minimum of the powers in $a$ and $b$. This holds analogously for $a \vee b$ and the maximum. The distributivity laws (4) and (5) of 12.1.2 now follow easily.
There is no problem with the existence of the distinguished elements $0$ and $1$. These are already defined directly and clearly satisfy the axioms (6) and (7). However, if there are squares in the factorization, then this prevents the existence of complements. For instance, in $D_{12} = \{1, 2, 3, 4, 6, 12\}$, the equality $6 \wedge 6' = 1$ cannot be reached, since $6$ shares a non-trivial divisor with every other element of $D_{12}$ except for $1$, but $6 \vee 1 = 6 \neq 12$. (The number $1$ is the potential smallest element in $D_{12}$; it plays the role of $0$ from the axioms.)
Nevertheless, if there are no squares in the factorization of the integer $p$, then the complement can be defined as $q' = p/q$, and it can be verified easily that this definition satisfies the axiom 12.1.2(8). □
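The role of the squarefree assumption can be explored numerically. The sketch below uses helper names of our own choosing (`divisors`, `has_complements`, `is_squarefree` are illustrative, not from the text): it tests whether every divisor $q$ of $p$ has a complement $r$ in $D_p$, meaning $\gcd(q, r) = 1$ and $\operatorname{lcm}(q, r) = p$, and compares the outcome with squarefreeness of $p$.

```python
from math import gcd, isqrt

def divisors(p):
    """All positive divisors of p, i.e. the underlying set D_p."""
    return [q for q in range(1, p + 1) if p % q == 0]

def lcm(a, b):
    return a * b // gcd(a, b)

def has_complements(p):
    """True if every q in D_p has some r in D_p with
    q ∧ r = gcd(q, r) = 1 (the zero of D_p) and q ∨ r = lcm(q, r) = p."""
    d = divisors(p)
    return all(any(gcd(q, r) == 1 and lcm(q, r) == p for r in d) for q in d)

def is_squarefree(p):
    """True if no square k*k with k >= 2 divides p."""
    return all(p % (k * k) != 0 for k in range(2, isqrt(p) + 1))
```

For example, `has_complements(30)` holds while `has_complements(12)` fails, in agreement with the discussion of $D_{12}$ in the proof. Such a finite check is of course only an experiment, not a replacement for the argument.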
If there are no squares in the decomposition of p, then the number of all divisors is a power of 2. This suggests that these Boolean algebras are very similar to the set algebras we started with. We return to the classification of all finite Boolean algebras. Before that, we consider structures like the divisors above for general p.
12.1.8. Partial order. There is a much more fundamental concept, the partial order; see the end of chapter 1. Recall that a partial order is a reflexive, antisymmetric, and transitive relation $\le$ on a set $K$. A set with a partial order $(K, \le)$ is called a partially ordered set, or poset for short. The adjective "partial" means that in general, this relation does not say whether $a \le b$ or $b \le a$ for every two different elements $a, b \in K$. If it does for each pair, it is called a linear order or a total order.
12.A.19. Decide whether the set of all vector subspaces of $\mathbb{R}^3$ is a lattice (with respect to suitably defined operations of suprema and infima). If so, is this lattice complete, distributive?
Solution. This is a lattice: infima correspond to intersections and suprema to sums of vector spaces (it is easy to verify that these operations satisfy the lattice axioms).
This lattice is complete (the operations work for infinite subsets as well, the least element is the zero-dimensional subspace, and the greatest element is the entire space).
However, it is not distributive (consider three lines in a plane). □
B. Rings
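12.B.1. Decide whether the set $R$ with the operations $\oplus$, $\odot$ forms a ring, a commutative ring, an integral domain or a field:
i) $R = \mathbb{Z}$, $a \oplus b = a + b + 3$, $a \odot b = -3$,
ii) $R = \mathbb{Z}$, $a \oplus b = a + b - 3$, $a \odot b = a - b - 1$,
iii) $R = \mathbb{Z}$, $a \oplus b = a + b - 1$, $a \odot b = a + b - a \cdot b$,
iv) $R = \mathbb{Q}$, $a \oplus b = a + b$, $a \odot b = b$,
v) $R = \mathbb{Q}$, $a \oplus b = a + b + 1$, $a \odot b = a + b + a \cdot b$,
vi) $R = \mathbb{Q}$, $a \oplus b = a + b - 1$, $a \odot b = a + b + a \cdot b$.
○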
Solution.
i) not a ring (but is a commutative rng),
ii) not a ring,
iii) an integral domain,
iv) not a ring,
v) a field,
vi) not a ring. □
12.B.2. Prove that the subset $\mathbb{Z}[i] = \{a + bi \mid a, b \in \mathbb{Z}\}$ of the complex numbers is an integral domain. Is it a field?
Solution. Any subring of an integral domain must be an integral domain again. In this case, we are talking about a subset of the field $\mathbb{C}$ (thus also an integral domain). Since the subset is closed with respect to all the operations (sum, additive inverse, multiplication) and contains both $0$ and $1$, it is indeed a subring. However, multiplicative inverses exist only for the numbers $1$, $i$, $-1$, $-i$ (these form the so-called subgroup of units, i.e. invertible elements), so it is not a field. □
There is always a partial order on the set $K = 2^M$ of all subsets of a given set $M$, namely the inclusion. In terms of intersections or unions, the inclusion can be defined as $A \subseteq B$ if and only if $A \wedge B = A$, or equivalently, $A \subseteq B$ if and only if $A \vee B = B$. In general, each Boolean algebra is a very special poset:
Lemma. Let $(K, \wedge, \vee)$ be a Boolean algebra. Then the relation $\le$ defined by $A \le B$ if and only if $A \wedge B = A$ is a partial order. Moreover, for all $A, B, C \in K$:
(1) $A \wedge B \le A$,
(2) $A \le A \vee B$,
(3) if $A \le C$ and $B \le C$, then $A \vee B \le C$,
(4) $A \le B$ if and only if $A \wedge B' = 0$,
(5) $0 \le A$ and $A \le 1$.
Proof. All the properties to be proved are results of simple calculations in the Boolean algebra $K$. Begin with the properties of a partial order for $\le$. Reflexivity is a direct corollary of idempotency: $A \wedge A = A$, i.e. $A \le A$. Similarly, the commutativity of $\wedge$ guarantees the antisymmetry of $\le$, since if both $A \wedge B = A$ and $B \wedge A = B$, then
$$A = A \wedge B = B \wedge A = B.$$
Finally, if $A \wedge B = A$ and $B \wedge C = B$, then
$$A \wedge C = (A \wedge B) \wedge C = A \wedge (B \wedge C) = A \wedge B = A,$$
which verifies the transitivity of $\le$.
Similarly, $(A \wedge B) \wedge A = (A \wedge A) \wedge B = A \wedge B$, that is, $A \wedge B \le A$. This proves the claim (1).
It follows from $A \wedge (A \vee B) = A$, see 12.1.3(2), that $A \le A \vee B$. This proves the claim (2).
Distributivity together with the assumptions of (3) provides
$$(A \vee B) \wedge C = (A \wedge C) \vee (B \wedge C) = A \vee B,$$
so that (3) holds.
The proposition (5) follows directly from the axioms for the distinguished elements $1$ and $0$.
It remains to prove (4). If $A \le B$, then $A \wedge B' = A \wedge B \wedge B' = 0$. On the other hand, if $A \wedge B' = 0$, then $A = A \wedge 1 = A \wedge (B \vee B') = (A \wedge B) \vee (A \wedge B') = (A \wedge B) \vee 0 = A \wedge B$. Hence $A \le B$, and the proof is finished. □
Note that, as for the algebra of subsets, in all Boolean algebras $A \wedge B = A$ if and only if $A \vee B = B$. If $A \wedge B = A$, then the absorption laws imply that $A \vee B = (A \wedge B) \vee B = B$, and vice versa. Therefore, the operation $\vee$ can also be used in the definition of a partial order.
Every poset $(K, \le)$ corresponds to an (oriented) graph (cf. the beginning of chapter 13 for definitions if necessary): the vertex set is $K$, and there is an edge leading from $a$ to $b$ if and only if $a \le b$. This is a convenient way to represent finite posets.
A Hasse diagram of a poset is a drawing of this graph in the plane so that greater elements are drawn above lower ones. Since the edge orientation is implicitly given by this, it need not be drawn explicitly. Furthermore, loops and edges
12.B.3. In the ring of 2-by-2 matrices over the real numbers, consider the subring of matrices of the form $\begin{pmatrix} a & -b \\ b & a \end{pmatrix}$. Prove that this subring is isomorphic to $\mathbb{C}$.
Solution. We will show that the isomorphism is given by the mapping $\varphi : \begin{pmatrix} a & -b \\ b & a \end{pmatrix} \mapsto a + ib$. The multiplication in the subring works as follows:
$$\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \begin{pmatrix} c & -d \\ d & c \end{pmatrix} = \begin{pmatrix} ac - bd & -bc - ad \\ bc + ad & ac - bd \end{pmatrix},$$
and in $\mathbb{C}$, we have $(a + ib)(c + id) = ac - bd + i(bc + ad)$. Hence we can see that $\varphi$ is a homomorphism with respect to multiplication. Since addition is defined componentwise, $\varphi$ is a homomorphism with respect to addition as well. Moreover, this mapping is clearly both injective and surjective, thus it is an isomorphism.
□
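The homomorphism property in 12.B.3 can be spot-checked numerically. The sketch below works with the inverse direction of the mapping from the solution, sending $a + ib$ to the matrix with rows $(a, -b)$ and $(b, a)$; the function names `to_matrix`, `mat_mul`, `mat_add` and the sample numbers are our own illustrative choices.

```python
def to_matrix(z):
    """Matrix ((a, -b), (b, a)) corresponding to z = a + ib."""
    a, b = z.real, z.imag
    return ((a, -b), (b, a))

def mat_mul(m, n):
    """Product of two 2-by-2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def mat_add(m, n):
    """Componentwise sum of two 2-by-2 matrices."""
    return tuple(tuple(m[i][j] + n[i][j] for j in range(2)) for i in range(2))

# Sample values with small integer parts, so float arithmetic is exact here.
z, w = complex(2, 3), complex(-1, 4)
product_ok = mat_mul(to_matrix(z), to_matrix(w)) == to_matrix(z * w)
sum_ok = mat_add(to_matrix(z), to_matrix(w)) == to_matrix(z + w)
```

A check on one pair of numbers illustrates the computation in the solution but does not prove it; the general identity is the matrix product displayed there.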
12.B.4. Prove that the identity is the only automorphism of the field of real numbers.
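Solution. Consider an automorphism $\varphi : \mathbb{R} \to \mathbb{R}$. Clearly, it must satisfy $\varphi(0) = 0$ and $\varphi(1) = 1$. Since $\varphi$ respects addition, we must have for all positive integers $n$ that $\varphi(n) = \varphi(1 + 1 + \cdots + 1) = n\varphi(1) = n$ and $\varphi(-n) = -n$. Since it respects multiplication, we must have for any integers $p$, $q$ ($q \neq 0$) that $\varphi(p) = \varphi\left(q \cdot \frac{p}{q}\right) = \varphi(q) \cdot \varphi\left(\frac{p}{q}\right)$. Hence, $\varphi\left(\frac{p}{q}\right) = \frac{p}{q}$, i.e., $\varphi(r) = r$ for all rational numbers $r$.
Consider a positive number $x \in \mathbb{R}$. Then, $\varphi(x) = \varphi\left((\sqrt{x})^2\right) = \varphi(\sqrt{x})^2 > 0$. Thus, for any $x, y \in \mathbb{R}$ such that $x < y$, we have $\varphi(y) - \varphi(x) = \varphi(y - x) > 0$, i.e., $\varphi(x) < \varphi(y)$. Now, assume that $\varphi$ is not the identity, i.e., there exists a $z \in \mathbb{R}$ such that $\varphi(z) \neq z$. We can assume without loss of generality that $\varphi(z) < z$. Since $\mathbb{Q}$ is dense in $\mathbb{R}$, there exists a rational $r$ for which $\varphi(z) < r < z$. However, we know that $\varphi(r) = r$, and $r < z$ implies $\varphi(r) < \varphi(z)$. Altogether, we get the desired contradiction $\varphi(z) < \varphi(r) < \varphi(z)$. □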
12.B.5. Let $p$ be a prime and $R$ a ring which contains $p^2$ elements. Prove that $R$ is commutative.
Solution. Since $(R, +)$ is a finite commutative group with $p^2$ elements, it is by 12.3.8 isomorphic to either $\mathbb{Z}_{p^2}$ or $\mathbb{Z}_p \times \mathbb{Z}_p$. In the first case, $(R, +)$ is cyclic, so there exists an element $x \in R$ such that each element of $R$ is of the form $nx$ for some $1 \le n \le p^2$. Since all these elements commute, we get that the entire $R$ is commutative.
In the second case, each element (except $0$) must have order $p$ with respect to addition. Let $x \in R$ be any element
which are implied by transitivity and reflexivity are omitted in the diagram. Especially when K has only a few elements, this is a very transparent way of discussing several cases; see the examples in the exercise column.
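The edges that remain in a Hasse diagram form the so-called covering relation: $a$ is joined to $b$ when $a < b$ and no element lies strictly between them. As an illustration only, the following sketch computes these edges for the divisors of 12 ordered by divisibility; `hasse_edges`, `divides` and the choice of example are ours, not taken from the text.

```python
def divisors(p):
    """All positive divisors of p."""
    return [q for q in range(1, p + 1) if p % q == 0]

def hasse_edges(elements, leq):
    """Pairs (a, b) with a < b and no c strictly between a and b."""
    edges = []
    for a in elements:
        for b in elements:
            if a != b and leq(a, b):
                between = any(
                    c not in (a, b) and leq(a, c) and leq(c, b)
                    for c in elements
                )
                if not between:
                    edges.append((a, b))
    return edges

def divides(a, b):
    """The partial order 'a divides b' on positive integers."""
    return b % a == 0

edges_12 = hasse_edges(divisors(12), divides)
```

The resulting list contains seven edges, for instance `(1, 2)`, `(2, 4)` and `(6, 12)`, whereas the full order relation on the six divisors contains many more pairs; loops and transitive edges such as `(1, 12)` are dropped.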
12.1.9. Lattices. Not every poset arises in this way from a Boolean algebra. For instance, the trivial partial order is defined on any set by $A \le A$ for each $A$, with all pairs of different elements incomparable. Such a poset cannot arise from a Boolean algebra if $K$ contains more than one element (as seen, the least and greatest elements of a Boolean algebra are comparable to every element).
Think to what extent the operations A and V can be built from a partial order. They are the suprema and infima in the following definition:
lower and upper bounds, suprema, infima
Consider a fixed poset $(K, \le)$. An element $C \in K$ is said to be a lower bound for a subset $L \subseteq K$ if and only if $C \le A$ for all $A \in L$. An element $C \in K$ is said to be the greatest lower bound (or infimum) of a subset $L \subseteq K$ if and only if it is a lower bound and for every lower bound $D$ of $L$, $D \le C$.
By replacing $\le$ with $\ge$ in the above, the definitions of an upper bound and of the least upper bound (or supremum) of a subset $L$ are obtained.
If the suprema and infima exist for all pairs $A$, $B$, they define the binary operations $\vee$ and $\wedge$, respectively.
Lattices
Definition. A lattice is a poset $(K, \le)$ where every two-element set $\{A, B\}$ has a supremum $A \vee B$ and an infimum $A \wedge B$.
The poset $(K, \le)$ is said to be a complete lattice if and only if every subset of $K$ has a supremum and an infimum.
The binary operations $\wedge$ and $\vee$ on a lattice $(K, \le)$ are clearly commutative and associative (prove this in detail!). These properties ensure that all finite non-empty subsets in $K$ possess infima and suprema.
Note that any element of a lattice $K$ is an upper bound for the empty set. Thus in a complete lattice, the supremum of the empty set is the least element $0$ of $K$. Similarly, the infimum of the empty set is the greatest element $1$ of $K$. Of course, a finite lattice $(K, \le)$ is always complete (with $1$ being the supremum of all elements in $K$ and $0$ the infimum of all elements in $K$).
A lattice is said to be distributive if and only if the operations $\wedge$ and $\vee$ satisfy the distributivity axioms (4) and (5) of subsection 12.1.2 on page 797. There are lattices which are not distributive; see the Hasse diagrams of two such simple lattices below (and check that in both cases $x \wedge (y \vee z) \neq (x \wedge y) \vee (x \wedge z)$).
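To see how $\wedge$ and $\vee$ are recovered from the order alone, and how distributivity can fail, one can compute infima and suprema by brute force. The sketch below assumes the standard five-element "diamond" lattice (a least element, a greatest element, and three pairwise incomparable elements $x$, $y$, $z$ in between); whether this is one of the two lattices drawn in the book's figure is our assumption, and all names in the code are illustrative.

```python
# The "diamond" lattice: 0 < x, y, z < 1 with x, y, z pairwise incomparable.
elements = ["0", "x", "y", "z", "1"]

def leq(a, b):
    """The partial order of the diamond lattice."""
    return a == b or a == "0" or b == "1"

def meet(a, b):
    """Infimum of {a, b}: the greatest among all lower bounds."""
    lower = [c for c in elements if leq(c, a) and leq(c, b)]
    return next(c for c in lower if all(leq(d, c) for d in lower))

def join(a, b):
    """Supremum of {a, b}: the least among all upper bounds."""
    upper = [c for c in elements if leq(a, c) and leq(b, c)]
    return next(c for c in upper if all(leq(c, d) for d in upper))

lhs = meet("x", join("y", "z"))             # x ∧ (y ∨ z)
rhs = join(meet("x", "y"), meet("x", "z"))  # (x ∧ y) ∨ (x ∧ z)
```

Here `lhs` evaluates to `"x"` while `rhs` evaluates to `"0"`, so the distributivity law fails for this triple.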
that is not in the additive subgroup generated by $1$. Then, each element of $R$ is of the form $m + nx$, where $1 \le m, n \le p$. Again, all these elements commute, so $R$ is commutative. □
12.B.6. Find the inverses of 17, 18, and 19 in $(\mathbb{Z}_{131}^{\times}, \cdot)$ (the group of all invertible elements in $\mathbb{Z}_{131}$ with multiplication).
Solution. Applying the Euclidean algorithm, we get
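$$131 = 7 \cdot 17 + 12, \quad 17 = 1 \cdot 12 + 5, \quad 12 = 2 \cdot 5 + 2, \quad 5 = 2 \cdot 2 + 1.$$
Therefore, $1 = 5 - 2 \cdot 2 = 5 - 2(12 - 2 \cdot 5) = 5 \cdot 5 - 2 \cdot 12 = 5 \cdot (17 - 12) - 2 \cdot 12 = 5 \cdot 17 - 7 \cdot 12 = 5 \cdot 17 - 7 \cdot (131 - 7 \cdot 17) = 54 \cdot 17 - 7 \cdot 131$. The inverse of 17 is 54. Similarly, $[18]^{-1} = 51$ and $[19]^{-1} = 69$. □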
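The back-substitution above is what the extended Euclidean algorithm does in general. The following is a minimal sketch of that algorithm; the function names `extended_gcd` and `inverse_mod` are our own and are not part of the text.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) and s*a + t*b == g."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        # Each remainder is kept together with its expression in a and b.
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t

def inverse_mod(a, n):
    """Inverse of a modulo n, if a is invertible in Z_n."""
    g, s, _ = extended_gcd(a, n)
    if g != 1:
        raise ValueError(f"{a} is not invertible modulo {n}")
    return s % n
```

For instance, `inverse_mod(17, 131)` returns `54`, matching the hand computation. The same function can be used to check answers to the exercises that follow.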
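12.B.7. Find the inverse of $[49]_{\mathbb{Z}_{253}}$ in $\mathbb{Z}_{253}$. ○
12.B.8. Find the inverse of $[37]_{\mathbb{Z}_{208}}$ in $\mathbb{Z}_{208}$. ○
12.B.9. Find the inverse of $[57]_{\mathbb{Z}_{359}}$ in $\mathbb{Z}_{359}$. ○
12.B.10. Find the inverse of $[17]_{\mathbb{Z}_{40}}$ in $\mathbb{Z}_{40}$. ○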
Now, Boolean algebras can be defined in terms of lattices: a Boolean algebra is a distributive lattice with a least element $0$ and a greatest element $1$ such that each element has its complement (i.e. the axiom 12.1.2(8) is satisfied).
It is already verified that the latter requirement implies that complements are unique (see the ideas at the beginning of subsection 12.1.3), which means that the alternative definition of Boolean algebras is correct.
During the discussion of divisors of any given integer p, distributive lattices Dp are encountered. These distributive lattices are Boolean algebras if and only if p is squarefree, see 12.1.7.
C. Polynomial rings
12.C.1. Eisenstein's irreducibility criterion. This criterion provides a sufficient condition for a polynomial over $\mathbb{Z}$ to be irreducible over $\mathbb{Q}$ (which is the same as to be irreducible over $\mathbb{Z}$):
Let
$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
be a polynomial over $\mathbb{Z}$ and $p$ be a prime such that
• $p$ divides $a_j$, $j = 0, \dots, n - 1$,
• $p$ does not divide $a_n$,
• $p^2$ does not divide $a_0$.
Then, $f(x)$ is irreducible over $\mathbb{Z}$ ($\mathbb{Q}$). Prove this criterion.
○
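The three divisibility conditions of the criterion are easy to test mechanically. The sketch below is only an illustration: the function name `eisenstein_applies` and the convention of listing coefficients from $a_0$ up to $a_n$ are our assumptions, and the caller is assumed to supply a prime $p$ and a polynomial of degree at least one.

```python
def eisenstein_applies(coeffs, p):
    """Check Eisenstein's conditions for coeffs = [a_0, a_1, ..., a_n]
    and a prime p (primality of p is assumed, not verified)."""
    *lower, leading = coeffs
    return (
        all(a % p == 0 for a in lower)      # p divides a_0, ..., a_{n-1}
        and leading % p != 0                # p does not divide a_n
        and coeffs[0] % (p * p) != 0        # p^2 does not divide a_0
    )

# x^5 + 3x^3 + 3 with the prime 3
example = eisenstein_applies([3, 0, 0, 3, 0, 1], 3)
```

Note that the criterion is only sufficient: a result of `False` for a given prime says nothing about reducibility of the polynomial.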
12.C.2. Factorize over $\mathbb{C}$ and $\mathbb{R}$ the polynomial
$$x^4 + 2x^3 + 3x^2 + 2x + 1.$$
Solution. This polynomial can be factorized either by looking for multiple roots or as a reciprocal equation:
12.1.10. Homomorphisms. When dealing with mathematical structures, most information about objects can be obtained and understood from the homomorphisms. These are mappings which preserve the corresponding operations. The linear mappings between vector spaces, or continuous mappings on $\mathbb{R}^n$ or any metric spaces with the given topology of open neighbourhoods, are very good examples. This concept is particularly simple for posets:
POSET HOMOMORPHISMS
Let $(K, \le_K)$ and $(L, \le_L)$ be posets. A mapping $f : K \to L$ is called a poset homomorphism (also order-preserving mapping, monotone mapping or isotone mapping) if for all $A \le_K B$ the relation $f(A) \le_L f(B)$ holds.
Although the structure of any Boolean algebra is completely determined by its underlying poset structure, an isotone mapping does not necessarily respect the suprema and infima. Non-comparable elements $A$, $B$ can be mapped to the same image $f(A) = f(B)$, while their supremum could map to $f(A \vee B)$, which is strictly larger.
In the case of Boolean algebras, homomorphisms are defined as follows:
Let us compute the greatest common divisor of the polynomial and its derivative $4x^3 + 6x^2 + 6x + 2$, using the Euclidean algorithm. The greatest common divisor is given in any ring up to a multiple by a unit, and during the Euclidean algorithm, we may multiply the partial results by units of the ring. In the case of a polynomial ring over a field of scalars, the units are exactly the nonzero scalars. We perform the multiplication in such a way as to avoid calculations with fractions as much as possible.
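$$(2x^4 + 4x^3 + 6x^2 + 4x + 2) : (2x^3 + 3x^2 + 3x + 1) = x + \tfrac{1}{2}$$
$$\begin{array}{r}
2x^4 + 4x^3 + 6x^2 + 4x + 2 \\
-(2x^4 + 3x^3 + 3x^2 + x) \\ \hline
x^3 + 3x^2 + 3x + 2 \\
-\left(x^3 + \tfrac{3}{2}x^2 + \tfrac{3}{2}x + \tfrac{1}{2}\right) \\ \hline
\tfrac{3}{2}x^2 + \tfrac{3}{2}x + \tfrac{3}{2}
\end{array}$$
Further, we divide the polynomial $2x^3 + 3x^2 + 3x + 1$ by the remainder $\tfrac{3}{2}x^2 + \tfrac{3}{2}x + \tfrac{3}{2}$ (multiplied by the unit $\tfrac{2}{3}$):
$$(2x^3 + 3x^2 + 3x + 1) : (x^2 + x + 1) = 2x + 1$$
$$\begin{array}{r}
2x^3 + 3x^2 + 3x + 1 \\
-(2x^3 + 2x^2 + 2x) \\ \hline
x^2 + x + 1 \\
-(x^2 + x + 1) \\ \hline
0
\end{array}$$
The remainder is zero, so $x^2 + x + 1$ is the greatest common divisor.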
The roots of the greatest common divisor of the original polynomial and its derivative are exactly the multiple roots of the original polynomial. In this case, the roots of $x^2 + x + 1$ are $-\tfrac{1}{2} \pm i\tfrac{\sqrt{3}}{2}$, which are thus double roots of the original polynomial. The factorization over $\mathbb{C}$ is thus into root factors (this is always the case over $\mathbb{C}$, as stated by the fundamental theorem of algebra):
$$x^4 + 2x^3 + 3x^2 + 2x + 1 = \left(x + \tfrac{1}{2} - i\tfrac{\sqrt{3}}{2}\right)^2 \left(x + \tfrac{1}{2} + i\tfrac{\sqrt{3}}{2}\right)^2.$$
The factorization over $\mathbb{R}$ can be obtained by multiplying the factors corresponding to pairs of complex-conjugated roots of the polynomial (verify that such a product must always result in a polynomial with real coefficients!):
$$x^4 + 2x^3 + 3x^2 + 2x + 1 = (x^2 + x + 1)^2.$$
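The search for multiple roots via the greatest common divisor of a polynomial and its derivative can be reproduced with exact rational arithmetic. This is a small sketch, not the book's procedure verbatim: polynomials are represented as coefficient lists with the highest degree first, the result is normalized to be monic, and the names `poly_divmod`, `strip_zeros` and `poly_gcd` are ours.

```python
from fractions import Fraction

def poly_divmod(num, den):
    """Divide polynomials (coefficient lists, highest degree first).
    Returns (quotient, remainder); den must have a nonzero leading term."""
    num = [Fraction(c) for c in num]
    den = [Fraction(c) for c in den]
    quotient = []
    while len(num) >= len(den):
        factor = num[0] / den[0]
        quotient.append(factor)
        # Subtract factor * den aligned at the leading term, then drop it.
        for i, c in enumerate(den):
            num[i] -= factor * c
        num.pop(0)
    return quotient, num

def strip_zeros(poly):
    """Remove leading zero coefficients."""
    while poly and poly[0] == 0:
        poly = poly[1:]
    return poly

def poly_gcd(a, b):
    """Monic greatest common divisor via the Euclidean algorithm."""
    a, b = strip_zeros(list(a)), strip_zeros(list(b))
    while b:
        _, r = poly_divmod(a, b)
        a, b = b, strip_zeros(r)
    return [Fraction(c) / a[0] for c in a]

f = [1, 2, 3, 2, 1]   # x^4 + 2x^3 + 3x^2 + 2x + 1
df = [4, 6, 6, 2]     # its derivative 4x^3 + 6x^2 + 6x + 2
g = poly_gcd(f, df)
```

The value of `g` is the coefficient list of $x^2 + x + 1$. Unlike the hand computation, the code does not rescale intermediate results by units; it only normalizes the final answer, which is harmless because the greatest common divisor is determined up to a unit.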
Lattice and Boolean-algebra homomorphisms
A mapping $f : (K, \wedge, \vee) \to (L, \wedge, \vee)$ is a homomorphism of Boolean algebras if and only if for all $A, B \in K$
(1) $f(A \wedge B) = f(A) \wedge f(B)$,
(2) $f(A \vee B) = f(A) \vee f(B)$,
(3) $f(A') = f(A)'$.
Moreover, if $f$ is bijective, it is an isomorphism of Boolean algebras.
Similarly, lattice homomorphisms are defined as mappings which satisfy the properties (1) and (2).
It is easily verified that if a homomorphism $f$ is bijective, then $f^{-1}$ is also a homomorphism.
It is clear from the definition of the partial order on Boolean algebras or lattices that every homomorphism $f : K \to L$ also satisfies $f(A) \le f(B)$ for all $A, B \in K$ such that $A \le B$, i.e. it is in particular a poset homomorphism.
The converse of the above is generally not true, that is, it may happen that a poset homomorphism is not a lattice homomorphism.
12.1.11. Fixed-point theorems. Many practical problems lead to a discussion of the existence and properties of fixed points of a mapping $f : K \to K$ on a set $K$, i.e. of elements $x \in K$ such that $f(x) = x$. The concepts of infima and suprema allow the derivation of very strong propositions of this type surprisingly easily. Here follows a classical theorem proved by Knaster and Tarski¹:
Tarski's fixed-point theorem
Theorem. Let $(K, \wedge, \vee)$ be a complete lattice and $f : K \to K$ a poset homomorphism. Then, $f$ has a fixed point, and the set of all fixed points of $f$, together with the restricted ordering from $K$, is again a complete lattice.
Proof. Denote $M = \{x \in K;\ x \le f(x)\}$. Since $K$ has a least element, $M$ is non-empty. Since $f$ is order-preserving, $f(M) \subseteq M$. Moreover, denote $z_1 = \sup M$. Then, for $x \in M$, $x \le z_1$, which means that $f(x) \le f(z_1)$. At the same time, $x \le f(x)$, hence $f(z_1)$ is an upper bound for $M$. Then $z_1 \le f(z_1)$, so $z_1 \in M$ and also $f(z_1) \in M$, and hence $f(z_1) \le z_1$.
It follows that $f(z_1) = z_1$, so a fixed point is found.
¹ Knaster and Tarski proved this in the special case of the Boolean algebra of all subsets in a given set already in 1928, cf. Ann. Soc. Polon. Math. 6: 133-134. Much later in 1955, Tarski published the general result, cf. Pacific Journal of Mathematics. 5:2: 285-309. Alfred Tarski (1901-1983) was a renowned and influential Polish logician, mathematician and philosopher, who worked most of his active career in Berkeley, California. His elder colleague Bronislaw Knaster (1893-1980) was also a Polish mathematician.
Let us solve the equation
$$x^4 + 2x^3 + 3x^2 + 2x + 1 = 0.$$
Dividing by $x^2$ and substituting $t = x + \frac{1}{x}$, we get the equation
$$t^2 + 2t + 1 = 0$$
with double root $-1$. Now, substituting this into the definition of $t$, we get the known equation $x^2 + x + 1 = 0$, which was solved above.
Remark. Let us remark that the only irreducible polynomials over $\mathbb{R}$ are linear polynomials and quadratic polynomials with negative discriminant. This also follows from the reasoning in the above exercise.
12.C.3. Factorize the polynomial $x^5 + 3x^3 + 3$ into irreducible factors over
i) Q,
ii) Z7.
Solution.
i) By Eisenstein's criterion, the given polynomial is irreducible over Z and Q (we use the prime 3).
ii) $(x - 1)^2(x^3 + 2x^2 - x + 3)$. Using Horner's scheme, for instance, we find the double root $1$. When divided by the polynomial $(x - 1)^2$, we get $x^3 + 2x^2 - x + 3$, which has no roots over $\mathbb{Z}_7$. Since it is only of degree 3, this means that it must be irreducible (if it were reducible, one of the factors would have to be linear, which means that the cubic polynomial $x^3 + 2x^2 - x + 3$ would have a root).
12.C.4. Factorize the polynomial $x^4 + 1$ over
• $\mathbb{Z}_3$,
• $\mathbb{C}$,
• $\mathbb{R}$.
Solution.
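• Over $\mathbb{Z}_3$: $(x^2 + x + 2)(x^2 + 2x + 2)$.
• Over $\mathbb{C}$: The roots are the fourth roots of $-1$, which lie in the complex plane on the unit circle, and their arguments are $\pi/4$, $\pi/4 + \pi/2$, $\pi/4 + \pi$, and $\pi/4 + 3\pi/2$, i.e., they are the numbers $\pm\frac{\sqrt{2}}{2} \pm i\frac{\sqrt{2}}{2}$. Thus, the factorization is
$$\left(x - \tfrac{\sqrt{2}}{2} - i\tfrac{\sqrt{2}}{2}\right)\left(x - \tfrac{\sqrt{2}}{2} + i\tfrac{\sqrt{2}}{2}\right)\left(x + \tfrac{\sqrt{2}}{2} - i\tfrac{\sqrt{2}}{2}\right)\left(x + \tfrac{\sqrt{2}}{2} + i\tfrac{\sqrt{2}}{2}\right).$$
• Over $\mathbb{R}$: Multiplying the root factors of complex-conjugated roots in the factorization over $\mathbb{C}$, we get the factorization over $\mathbb{R}$:
$$\left(x^2 - \sqrt{2}x + 1\right)\left(x^2 + \sqrt{2}x + 1\right).$$
□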
It is more difficult to verify the last statement of the theorem, namely that the set $Z \subseteq K$ of all fixed points of $f$ is a complete lattice. The greatest element $z_1 = \max Z$ is found already. Using the infimum and the property $f(x) \le x$ in the definition of $M$, we could analogously find the least element $z_0 = \min Z$.
Consider any non-empty set $Q \subseteq Z$ and denote $y = \sup Q$. This supremum need not lie in $Z$. However, as we shall see, the set $Q$ has a supremum in $Z$ with respect to the partial order in $K$, restricted to $Z$. For that purpose, denote $R = \{x \in K;\ y \le x\}$. It is clear from the definitions that this set together with the partial order in $K$, restricted to $R$, is again a complete lattice, and that the restriction of $f$ to $R$ is again a poset homomorphism $f|_R : R \to R$. By the above, $f|_R$ has a least fixed point $\bar{y}$. Of course, $\bar{y} \in Z$, and $\bar{y}$ is the supremum of the set $Q$ with respect to the inherited order on $Z$. Note that it is possible that $\bar{y} > y$. Analogously, the infimum of any non-empty subset of $Z$ can be found. Since the least and greatest elements are already found, the proof is finished.
□
Remark. In the literature, one may find many variants of fixed-point theorems in various contexts. One very useful variant is Kleene's recursion theorem, which can be derived from the theorem just proved and formulated as follows: Consider a poset homomorphism $f$ and a countable subset of $K$ (using the notation of Tarski's fixed-point theorem), formed by the Kleene chain
$$0 \le f(0) \le f(f(0)) \le \cdots.$$
Then, the supremum $z$ of this subset is less than or equal to every fixed point of $f$: if $y$ is a fixed point of $f$, it follows from $0 \le y$ that $f(0) \le f(y) = y$, etc. Moreover, if $f$ is assumed continuous in a certain sense of reasonably preserving suprema, then it can be shown that $f(z)$ is also the supremum of this chain and hence $z$ is a fixed point. Therefore, it is the smallest fixed point. This theorem is called Kleene's fixed-point theorem. It has many applications in recursion theory, when discussing termination of algorithms, etc. We omit details about the necessary "continuity" of mappings between posets and further generalizations.² We point out the added value compared to the general formulation of Tarski's theorem: Kleene's theorem provides an iterative computational process approaching the fixed point from the given "seed", the minimal point.
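The iterative process along the Kleene chain is easy to sketch on a finite lattice, where the chain must stabilize after finitely many steps. The example below is hypothetical and chosen by us: the lattice is the set of all subsets of the vertices of a small directed graph, ordered by inclusion, and the monotone mapping `step` adds a start vertex and all successors, so its least fixed point is the set of vertices reachable from the start.

```python
def least_fixed_point(f, bottom):
    """Iterate bottom, f(bottom), f(f(bottom)), ... until nothing changes.
    Termination is only guaranteed when the chain stabilizes, e.g. for a
    monotone f on a finite lattice."""
    current = bottom
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt

# Hypothetical example data: successors of each vertex, and a start vertex.
edges = {1: {2}, 2: {3}, 3: {1}, 4: {5}, 5: set()}
start = 1

def step(s):
    """Monotone map on subsets: add the start vertex and all successors."""
    return frozenset({start}) | s | frozenset(v for u in s for v in edges[u])

reachable = least_fixed_point(step, frozenset())
```

Starting from the empty set (the least element), the chain is $\emptyset \subseteq \{1\} \subseteq \{1,2\} \subseteq \{1,2,3\}$, and `reachable` is the set $\{1, 2, 3\}$. The set of all five vertices is also a fixed point of `step`, which illustrates that the iteration finds the smallest one.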
12.C.5. Find a polynomial with rational coefficients of the lowest degree possible which has $\sqrt[2007]{2}$ as a root.
Solution. $P(x) = x^{2007} - 2$. Let us show that there is no polynomial of lower degree with root $\sqrt[2007]{2}$: Let $Q(x)$ be a nonzero polynomial of the lowest degree with root $\sqrt[2007]{2}$. Then,
² Stephen Cole Kleene (1909-1994) was a famous American mathematician working with Church, Turing, Post and others. The interested reader may consult the full exposition of the above-mentioned theorem in chapter 1 of the book: Mathematical Theory of Domains, Cambridge Tracts in Theoretical Computer Science, Cambridge University Press, 1994, by V. Stoltenberg-Hansen, I. Lindström, E. R. Griffor.
$\deg Q(x) \le 2007$. Let us divide $P(x)$ by $Q(x)$ with remainder: $P(x) = Q(x) \cdot D(x) + R(x)$, where $D(x)$ is the quotient and $R(x)$ is the remainder, and either $\deg R(x) < \deg Q(x)$ or $R(x) = 0$. Substituting the number $\sqrt[2007]{2}$ into the last equation, we can see that $\sqrt[2007]{2}$ is also a root of $R(x)$. By the definition of $Q(x)$, this means that $R(x)$ must be the zero polynomial, which means that $Q(x)$ divides $P(x)$. However, $P(x)$ is irreducible (by Eisenstein's criterion for the prime 2), so its only non-trivial divisor is itself (up to multiplication by a unit of the polynomial ring over $\mathbb{Q}$, i.e. a non-zero rational constant). Thus, we have $Q(x) = P(x)$ up to multiplication by a unit. For instance, the polynomial $\frac{1}{3}x^{2007} - \frac{2}{3}$ also satisfies the stated conditions. However, if we require the polynomial to be monic (i.e., with leading coefficient 1), then the only solution is the mentioned polynomial $P(x)$. □
12.C.6. Find all irreducible polynomials of degree at most 2 over Z3.
Solution. By definition, all linear polynomials are irreducible. As for quadratic irreducible polynomials, the easiest way is to simply enumerate them all and leave out the reducible ones, i.e. those which are a product of two linear polynomials. The reducible polynomials are $(x + 1)^2 = x^2 + 2x + 1$, $(x + 2)^2 = x^2 + x + 1$, $(x + 1)(x + 2) = x^2 + 2$, $x^2$, $x(x + 1) = x^2 + x$, $x(x + 2) = x^2 + 2x$. (It suffices to consider monic polynomials, since the non-monic ones can be obtained by multiplication by 2.) The remaining quadratic polynomials over $\mathbb{Z}_3$ are irreducible; these are $x^2 + 2x + 2$, $x^2 + x + 2$, $x^2 + 1$. □
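The enumeration in 12.C.6 can be double-checked by brute force, using the fact that a quadratic polynomial over a field is reducible exactly when it has a root. The sketch below uses names of our own (`P`, `has_root`, `irreducible_monic`) and encodes $x^2 + bx + c$ as the triple `(1, b, c)`.

```python
from itertools import product

P = 3  # we work over the field Z_3

def has_root(b, c):
    """Does x^2 + b*x + c have a root in Z_P?"""
    return any((x * x + b * x + c) % P == 0 for x in range(P))

# A monic quadratic is irreducible exactly when it has no root in Z_P.
irreducible_monic = [
    (1, b, c) for b, c in product(range(P), repeat=2) if not has_root(b, c)
]
```

The list contains three triples, corresponding to $x^2 + 1$, $x^2 + x + 2$ and $x^2 + 2x + 2$, in agreement with the solution. Changing `P` to another prime gives the analogous list over that field.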
12.C.7. Decide whether the following polynomial is irreducible over $\mathbb{Z}_3$; if not, factorize it:
$$x^4 + x^3 + x + 2.$$
Solution. Evaluating the polynomial at 0, 1, 2, we find that it has no root in $\mathbb{Z}_3$. This means that it is either irreducible or a product of two quadratic polynomials. Assume it is reducible. Then, we may assume without loss of generality that it is a product of two monic polynomials (the only other option is that it is a product of two polynomials with leading coefficients equal to 2; then both can be multiplied by 2 in order to become monic). Thus, let us look for constants $a$, $b$,
12.1.12. Back to Boolean algebras. When discussing propositional logic, there is the problem of what exactly the elements of the corresponding Boolean algebra are. Formally, they are defined as the classes of equivalent propositions. In other words, we work with truth-value functions for propositions with a given number of arguments. There is the problem of recognizing propositions which are equivalent in this sense. There is also the question of whether every function $(\mathbb{Z}_2)^n \to \mathbb{Z}_2$ can be defined in terms of the basic logical operations. Clearly all such functions form a Boolean algebra, since their values are in the Boolean algebra $\mathbb{Z}_2$.
Similarly, there is the problem of deciding whether or not two systems of switches can have the same function. Just as for propositions, a system consisting of $n$ switches corresponds to a function $(\mathbb{Z}_2)^n \to \mathbb{Z}_2$. There are $2^{2^n}$ such functions. A Boolean algebra can be naturally defined on these functions (again using the fact that the function values are in the Boolean algebra $\mathbb{Z}_2$).
We summarize a few such questions:
some basic questions
Question 1: Are all finite Boolean algebras $(K, \wedge, \vee)$ defined on sets $K$ with $2^n$ elements?
Question 2: Can each function $(\mathbb{Z}_2)^n \to \mathbb{Z}_2$ be the truth function of some logical expression built of $n$ elementary propositions and the logical operators?
Question 3: How to recognize whether two such expressions represent the same function?
Question 4: Can each function $(\mathbb{Z}_2)^n \to \mathbb{Z}_2$ be realized by some switchboard with $n$ switches?
Question 5: How to recognize whether two switchboards represent the same function?
All these questions are answered by finding the normal form of every element of a general Boolean algebra. This is achieved by writing it as the join of certain particularly simple elements. By comparing the normal forms of any pair of elements, it is easily determined whether or not they are the same.
This helps to classify all finite Boolean algebras, giving the affirmative answer to question 1.
12.1.13. Atoms and normal forms. First, define the "simplest" elements of a Boolean algebra:
Atoms in a Boolean algebra
Let $K$ be a Boolean algebra. An element $A \in K$, $A \neq 0$, is called an atom if and only if for all $B \in K$, either $A \wedge B = A$ or $A \wedge B = 0$.
In other words, $A \neq 0$ is an atom if and only if there are only two elements $B$ such that $B \leq A$, namely $B = 0$ and $B = A$.
Note that 0 is not considered an atom, just as the integer 1 is not considered a prime.
808
CHAPTER 12. ALGEBRAIC STRUCTURES
$c, d \in \mathbb{Z}_3$ so that
$$x^4 + x^3 + x + 2 = (x^2 + ax + b)(x^2 + cx + d) = x^4 + (a + c)x^3 + (ac + b + d)x^2 + (ad + bc)x + bd.$$
Comparing the coefficients of the individual powers of $x$, we get the following system of four equations in four variables:
$$1 = a + c, \qquad 0 = ac + b + d, \qquad 1 = ad + bc, \qquad 2 = bd.$$
From the last equation, we get that one of the numbers $b, d$ is equal to 1 and the other one to 2. Thanks to the symmetry of the system in the pairs $(a, b)$ and $(c, d)$, we can choose $b = 1$, $d = 2$. From the second equation, we get $ac = 0$, i.e., at least one of the numbers $a, c$ is 0. From the first equation, we get that the other one is 1. From the third equation, we get $2a + c = 1$, i.e., $a = 0$, $c = 1$. Altogether,
$$x^4 + x^3 + x + 2 = (x^2 + 1)(x^2 + x + 2).$$
□
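The coefficient comparison above can be cross-checked by brute force. The following is only a minimal sketch: the helper `poly_mul` and the coefficient-list representation (lowest degree first) are assumptions made for this illustration, not notation from the text.

```python
from itertools import product

P = 3  # we work over Z_3

def poly_mul(f, g, p=P):
    """Multiply polynomials given as coefficient lists (lowest degree first) modulo p."""
    result = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            result[i + j] = (result[i + j] + a * b) % p
    return result

target = [2, 1, 0, 1, 1]  # 2 + x + 0*x^2 + x^3 + x^4

# Try every pair of monic quadratics x^2 + a x + b and x^2 + c x + d over Z_3.
factorizations = [
    ((a, b), (c, d))
    for a, b, c, d in product(range(P), repeat=4)
    if poly_mul([b, a, 1], [d, c, 1]) == target
]
print(factorizations)
```

Each pair `(a, b)` stands for the factor $x^2 + ax + b$; the search returns the factorization found above together with its mirror image obtained by swapping the two factors.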
12.C.8. For any odd prime $p$, find all roots of the polynomial
$$P(x) = x^{p-2} + x^{p-3} + \dots + x + 2$$
over the field $\mathbb{Z}_p$.
Solution. Considering the equality
$$x^{p-1} - 1 = (x - 1)(P(x) - 1),$$
we can see that all numbers of $\mathbb{Z}_p$, except 0 and 1, are roots of $P(x) - 1$ (by Fermat's little theorem, every non-zero element is a root of $x^{p-1} - 1$), so they cannot be roots of $P(x)$. Clearly, 0 is never a root of $P(x)$, since $P(0) = 2 \neq 0$, and 1 is always a root, since $P(1) = p = 0$ in $\mathbb{Z}_p$. Hence 1 is the only root. □
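The claim can be tested numerically for a few small primes. This is a minimal sketch; the function name `roots_of_P` is hypothetical and the check is plain exhaustive evaluation, not a proof.

```python
def roots_of_P(p):
    """Roots in Z_p of P(x) = x^(p-2) + x^(p-3) + ... + x + 2."""
    def P(x):
        return (sum(pow(x, k, p) for k in range(1, p - 1)) + 2) % p
    return [x for x in range(p) if P(x) == 0]

for p in (3, 5, 7, 11, 13):
    print(p, roots_of_P(p))
```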
12.C.9. Factorize the polynomial $p(x) = x^2 + x + 1$ in $\mathbb{Z}_5[x]$ and $\mathbb{Z}_7[x]$.
Solution. Irreducible in $\mathbb{Z}_5[x]$; $p(x) = (x - 2)(x - 4)$ in $\mathbb{Z}_7[x]$. □
12.C.10. Factorize the polynomial $p(x) = x^6 - x^4 - 5x^2 - 3$ in $\mathbb{C}[x]$, $\mathbb{R}[x]$, $\mathbb{Q}[x]$, $\mathbb{Z}[x]$, $\mathbb{Z}_5[x]$, $\mathbb{Z}_7[x]$, knowing that it has a multiple root.
Solution. Applying the Euclidean algorithm, we find out that the greatest common divisor of $p$ and its derivative $p'$ is $x^2 + 1$. Dividing the polynomial $p(x)$ twice by this factor, we get
$$p(x) = (x^2 + 1)^2(x^2 - 3).$$
The situation is very simple in the Boolean algebra of all subsets of a given finite set $M$. Clearly, the atoms are precisely the singletons $A = \{x\}$. For every subset $B$, either $A \wedge B = A$ (if $x \in B$) or $A \wedge B = 0$ (if $x \notin B$). The requirements fail whenever there is more than one element in $A$.
Next, consider which elements are atoms in the Boolean algebra of functions of the switchboards with $n$ switches $A_1, \dots, A_n$. It can be easily verified that there are $2^n$ atoms, which are of the form $A_1^{\sigma_1} \wedge \dots \wedge A_n^{\sigma_n}$, where either $A_i^{\sigma_i} = A_i$ or $A_i^{\sigma_i} = A_i'$.
The infimum $\varphi \wedge \psi$ of functions $\varphi$ and $\psi$ is the function whose values are given by the products of the corresponding values in $\mathbb{Z}_2$. Therefore, $\varphi \leq \psi$ if $\varphi$ takes the value $1 \in \mathbb{Z}_2$ only on arguments where $\psi$ also has the value 1. Hence in the Boolean algebra of truth-value functions, a function $\varphi$ is an atom if and only if $\varphi$ returns $1 \in \mathbb{Z}_2$ for exactly one of the $2^n$ possible choices of arguments. All these functions can be created in the above mentioned manner.
Now, the promised theorem can be formulated. While this one is called the disjunctive normal form, there is also the opposite version with the suprema and infima interchanged (the conjunctive normal form).
Disjunctive normal form
Theorem. Each element $B$ of a finite Boolean algebra $(K, \wedge, \vee)$ can be written as a supremum of atoms
$$B = A_1 \vee \dots \vee A_k.$$
This expression is unique up to the order of the atoms.
The proof takes several paragraphs, but the basic idea is quite simple: Consider all atoms $A_1, A_2, \dots, A_k$ in $K$ which are less than or equal to $B$. From the properties of the order on $K$ (see 12.1.8(3)), it follows that
$$Y = A_1 \vee \dots \vee A_k \leq B.$$
The main step of the proof is to verify that $B \wedge Y' = 0$, which by 12.1.8(4) guarantees that $B \leq Y$. That proves the equality $B = Y$.
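For truth-value functions, the disjunctive normal form can be computed directly: the atoms below a function are exactly the argument tuples on which it takes the value 1. The sketch below assumes functions of $n$ variables with values in {0, 1}; the names `atoms_below` and `join_of_atoms` and the sample function `f` are hypothetical and serve only as an illustration.

```python
from itertools import product

def atoms_below(f, n):
    """Return the argument tuples on which f takes the value 1.

    Each such tuple corresponds to one atom (a function equal to 1 on
    exactly that tuple), so the list encodes the disjunctive normal form of f.
    """
    return [args for args in product((0, 1), repeat=n) if f(*args) == 1]

def join_of_atoms(atom_points):
    """Rebuild the function as the supremum (logical OR) of its atoms."""
    points = set(atom_points)
    return lambda *args: 1 if args in points else 0

# Hypothetical example: (a implies b) and c.
f = lambda a, b, c: ((1 - a) | b) & c
dnf = atoms_below(f, 3)
g = join_of_atoms(dnf)
print(dnf)
```

Two functions are equal exactly when their lists of atoms coincide, which is the comparison of normal forms described in the text.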
12.1.14. Three useful claims. We derive several technical properties of atoms, in order to complete the proof of the theorem on disjunctive normal form. We retain the notation of the previous subsection.
Proposition. (1) If $Y, X_1, \dots, X_\ell$ are atoms in $K$, then $Y \leq X_1 \vee \dots \vee X_\ell$ if and only if $Y = X_i$ for some $i = 1, \dots, \ell$.
(2) For each $Y \in K$, $Y \neq 0$, there is an atom $X \in K$ such that $X \leq Y$.
(3) If $X_1, \dots, X_r$ are precisely all the atoms of $K$, then $Y = 0$ if and only if $Y \wedge X_i = 0$ for all $i = 1, \dots, r$.
Proof. (1) If the inequality of the proposition holds, then
$$Y \wedge (X_1 \vee \dots \vee X_\ell) = Y.$$
Clearly, these factors are irreducible in the rings $\mathbb{Q}[x]$ and $\mathbb{Z}[x]$.
In $\mathbb{C}[x]$, we can always factorize a polynomial into linear factors. In this case, it suffices to factorize $x^2 + 1$, which is easy: $x^2 + 1 = (x + i)(x - i)$. The factor $x^2 - 3$ is equal to $(x - \sqrt{3})(x + \sqrt{3})$ even in $\mathbb{R}[x]$. Thus, in $\mathbb{C}[x]$, we have
$$p(x) = (x + i)^2(x - i)^2(x - \sqrt{3})(x + \sqrt{3}),$$
while in $\mathbb{R}[x]$, we have
$$p(x) = (x^2 + 1)^2(x - \sqrt{3})(x + \sqrt{3}).$$
In $\mathbb{Z}_5[x]$, the polynomial $x^2 + 1$ has roots $\pm 2$, and the polynomial $x^2 - 3$ has no roots, which means that
$$p(x) = (x - 2)^2(x + 2)^2(x^2 - 3).$$
In $\mathbb{Z}_7[x]$, neither polynomial has a root, so the factorization into irreducible factors is identical to that in $\mathbb{Q}[x]$ and $\mathbb{Z}[x]$:
$$p(x) = (x^2 + 1)^2(x^2 - 3).$$ □
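The first step of the solution, the greatest common divisor of $p$ and $p'$, can be reproduced with exact rational arithmetic. This is a minimal sketch of the Euclidean algorithm for polynomials over $\mathbb{Q}$; the helper names and the coefficient-list representation (lowest degree first) are assumptions of this illustration.

```python
from fractions import Fraction

def trim(f):
    """Drop trailing zero coefficients (lists store the lowest degree first)."""
    while f and f[-1] == 0:
        f = f[:-1]
    return f

def poly_rem(f, g):
    """Remainder of f after division by g over the rationals."""
    f = trim([Fraction(c) for c in f])
    g = trim([Fraction(c) for c in g])
    while len(f) >= len(g):
        factor = f[-1] / g[-1]
        shift = len(f) - len(g)
        for i, c in enumerate(g):
            f[shift + i] -= factor * c
        f = trim(f)
    return f

def poly_gcd(f, g):
    """Monic greatest common divisor via the Euclidean algorithm."""
    f, g = trim(list(f)), trim(list(g))
    while g:
        f, g = g, poly_rem(f, g)
    return [Fraction(c) / f[-1] for c in f]

p = [-3, 0, -5, 0, -1, 0, 1]               # x^6 - x^4 - 5x^2 - 3
dp = [i * c for i, c in enumerate(p)][1:]  # formal derivative of p
print(poly_gcd(p, dp))                     # coefficients of the gcd, lowest degree first
```

The result is normalized to be monic, so it can be compared directly with $x^2 + 1$.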
12.C.11. Knowing that the polynomial $p = x^6 + x^5 + 4x^4 + 2x^3 + 5x^2 + x + 2$ has the multiple root $x = i$, factorize it into irreducible polynomials in $\mathbb{C}[x]$, $\mathbb{R}[x]$, $\mathbb{Z}_2[x]$, $\mathbb{Z}_5[x]$, and $\mathbb{Z}_7[x]$. Divide the polynomial $q = x^2y^2 + y^2 + xy + x^2y + 2y + 1$ by the irreducible factors of $p$ in $\mathbb{R}[x]$, and use the result to solve the system of polynomial equations $p = q = 0$ over $\mathbb{C}$.
Solution. $p = (x^2 + 1)^2(x^2 + x + 2)$; in $\mathbb{Z}_2$: $p = x(x + 1)^5$; in $\mathbb{Z}_5$: $p = (x - 2)^2(x + 2)^2(x^2 + x + 2)$; in $\mathbb{Z}_7$: $p = (x^2 + 1)^2(x + 4)^2$. For the second polynomial, we get $q = (y^2 + y)(x^2 + x + 2) - y^2(x + 1) + 1$ and $q = (y^2 + y)(x^2 + 1) + y(x + 1) + 1$. Thus, if $x = \alpha$ is a root of $x^2 + x + 2$, i.e., $\alpha = -\frac{1}{2} \pm \frac{\sqrt{7}}{2}i$, then $y^2 = \frac{1}{1 + \alpha}$, i.e., $y = \pm\frac{1}{\sqrt{1 + \alpha}}$. If $x = \beta$ is a root of $x^2 + 1$, i.e., $\beta = \pm i$, then $y = -\frac{1}{\beta + 1}$. □
12.C.12. Factorize the following polynomial into irreducible polynomials in $\mathbb{R}[x]$ and in $\mathbb{C}[x]$:
$$4x^5 - 8x^4 + 9x^3 - 7x^2 + 3x - 1.$$
○
12.C.13. Factorize the following polynomial into irreducible polynomials in $\mathbb{R}[x]$ and in $\mathbb{C}[x]$:
$$x^5 + 3x^4 + 7x^3 + 9x^2 + 8x + 4.$$
○
12.C.14. Factorize $x^4 - 4x^3 + 10x^2 - 12x + 9$ into irreducible polynomials in $\mathbb{R}[x]$ and in $\mathbb{C}[x]$. ○
By distributivity, the equality can be rewritten as
$$(Y \wedge X_1) \vee \dots \vee (Y \wedge X_\ell) = Y.$$
However, since the $X_i$ are atoms, for all $i$ either $Y \wedge X_i = 0$ or $Y \wedge X_i = X_i$. If all these intersections are 0, then $Y = 0$, which is impossible for an atom. Thus, there is an $i$ for which $Y \wedge X_i = X_i$, i.e., $X_i \leq Y$. Since $Y$ is also an atom, the desired equality $Y = X_i$ is proved.
The other implication is trivial.
(2) If $Y$ is an atom itself, choose $X = Y$. If $Y$ is not an atom, then it follows from the definition that there must exist a non-zero element $Z_1$ for which $Z_1 < Y$. If $Z_1$ is not an atom either, then similarly find a non-zero $Z_2 < Z_1$, etc., leading to a sequence of pairwise distinct elements
$$\dots < Z_k < Z_{k-1} < \dots < Z_1 < Y,$$
which cannot be infinite since the entire Boolean algebra $K$ is finite. Therefore, it must end with an atom $Z_k$.
(3) Assume that $Y \wedge X_i = 0$ for all indices $i$. If $Y \neq 0$, then due to the above claim, there must exist an atom $X_j$ for which $X_j \wedge Y = X_j$, which is a contradiction.
The other implication is trivial. □
12.1.15. Proof of theorem 12.1.13. Write
$$Y = A_1 \vee \dots \vee A_k \leq B,$$
where $A_i$ are all the atoms in $K$ which are less than or equal to $B$. Compute
$$B \wedge Y' = B \wedge (A_1 \vee \dots \vee A_k)' = B \wedge A_1' \wedge \dots \wedge A_k'.$$
If an atom $A = A_i$ is contained in the join $Y$, then $B \wedge Y' \wedge A = 0$, since $A_i' \wedge A_i = 0$. However, if $A$ is an atom which does not occur in $Y$, then also $B \wedge Y' \wedge A = 0$: the join $Y$ contains exactly those atoms which are $\leq B$, hence $B \wedge A = 0$.
Thus it is proved that the intersection of $B \wedge Y'$ and any atom is zero, which means that it must be zero itself, by the third claim in the latter proposition. Therefore, $B \leq Y$ (cf. 12.1.8(4)). The definition of $Y$ implies $Y \leq B$, so the antisymmetry of the order implies that $B = Y$.
It remains to prove the uniqueness of the expression, up to order. Thus, suppose $B$ can be written in two ways as
$$B = A_1 \vee \dots \vee A_k = \tilde{A}_1 \vee \dots \vee \tilde{A}_\ell.$$
Since each $A_i$ satisfies $A_i \leq B$, the first claim in the proposition above ensures it must equal one of the $\tilde{A}_j$. Repeating this argument with the roles of the two expressions interchanged gives the desired uniqueness and finishes the proof.
12.1.16. Classification. To end the discussion of Boolean algebras, we prove that all the examples of finite Boolean algebras (of given size) are isomorphic. In particular, each of the $2^{2^n}$ truth-value functions for $n$ atomic propositions can be expressed as an appropriate proposition, just like each of the $2^{2^n}$ switchboard functions can be defined in terms of $n$ suitably arranged switches. In both cases, the algebra in question behaves the same way as the Boolean algebra of all subsets of a given $2^n$-element set.
12.C.15. Decide whether the following polynomial over Z3 is irreducible; if not, factorize it to irreducible polynomials:
x5 + x2 + 2x + 1.
O
12.C.16. Decide whether the following polynomial over Z3 is irreducible; if not, factorize it to irreducible polynomials:
x4 + 2x3 + 2.
O
12.C.17. Find all monic quadratic irreducible polynomials over $\mathbb{Z}_5$.
Solution. We write out all monic quadratic polynomials over $\mathbb{Z}_5$ and exclude those which are not irreducible, i.e., have a root. The remaining ten polynomials are
$$x^2 \pm 2, \quad x^2 \pm x + 2, \quad x^2 \pm 2x - 2, \quad x^2 \pm x + 1, \quad x^2 \pm 2x - 1.$$
□
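The exclusion procedure from the solution is easy to automate, since a quadratic polynomial over a field is reducible exactly when it has a root. The following is a minimal sketch; the pair `(b, c)` is an ad hoc encoding of $x^2 + bx + c$ chosen for this illustration.

```python
P = 5

# A quadratic is reducible over a field exactly when it has a root there.
irreducible = [
    (b, c)                      # stands for x^2 + b x + c
    for b in range(P)
    for c in range(P)
    if all((x * x + b * x + c) % P != 0 for x in range(P))
]
print(len(irreducible), irreducible)
```

Recall that coefficients are reduced modulo 5, so for instance the pair `(3, 4)` encodes $x^2 - 2x - 1$.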
D. Rings of multivariate polynomials
12.D.1. Find the remainder of the polynomial $x^3y + x + yz + yz^4$ with respect to the basis $(x^2y + z, y + z)$ and the orderings $<_{\mathrm{lex}}$, $<_{\mathrm{grlex}}$.
Solution. □
For illustration, we present examples of several varieties defined by polynomials.
12.D.2. Curves in the affine plane $\mathbb{R}^2$. Every non-zero polynomial $f(x, y)$ in two variables defines a "curve" in $\mathbb{R}^2$ by the equation $f(x, y) = 0$. Thus, it is the set of zero points of the polynomial $f$, and it will be denoted $\mathfrak{V}(f)$. You can derive that if $f = f_1 \cdots f_k$, then $\mathfrak{V}(f) = \mathfrak{V}(f_1) \cup \dots \cup \mathfrak{V}(f_k)$. The subsequent pictures depict examples of such curves.
12.D.3. Using your favorite software, draw the curve given by the equation $x^3 + x^2 - y^2 = 0$ in the plane.
Solution. See picture 1. □
12.D.4. Using your favorite software, draw the curve given by the equation $2x^4 - 3x^2y + y^2 - 2y^3 + y^4 = 0$ in the plane.
Solution. See picture 2. □
We can also attempt to define curves by equations $x = f(t)$, $y = g(t)$, where $f, g \in \mathbb{R}[t]$. In that case, the curve is defined as a "polynomial inclusion" of the real line into the plane.
Moreover, each of these expressions can be written in a unique normal form, so it can be decided algorithmically whether two switchboards have the same behaviour without comparing their values for all $2^n$ possible inputs (which, on the other hand, might still be faster, since the resulting normal form tends to be exponentially large).
Theorem. Every finite Boolean algebra $K$ is isomorphic to the Boolean algebra $2^M$, where $M$ is the set of atoms in $K$.
Proof. The idea of the proof is quite straightforward. Every isomorphism of Boolean algebras must map atoms to atoms. Let $M$ be the set of all atoms in $K$ and consider the Boolean algebra $(2^M, \cap, \cup)$. Its atoms are the singletons $\{A\}$, $A \in M$, which defines a natural correspondence $A \mapsto \{A\}$ between the atoms of $K$ and the atoms of $2^M$.
Next, use the disjunctive normal form to extend the mapping to all of $K$. Each element $X \in K$ can be written uniquely (up to order) as a join of atoms:
$$X = A_1 \vee \dots \vee A_k.$$
Define the function $f : K \to 2^M$ by
$$f(X) = f(A_1) \cup \dots \cup f(A_k) = \{A_1, \dots, A_k\},$$
as the union of the singletons $\{A_i\} \subset M$ that occur in the expression.
The existence and uniqueness of the normal form imply that $f$ is a bijection. It remains to show that it is a homomorphism of Boolean algebras.
Let $X, Y \in K$. The normal form of their supremum contains exactly the atoms which occur in at least one of $X$, $Y$, while the infimum involves just those atoms which occur in both. This verifies that $f$ preserves the operations $\wedge$ and $\vee$. As for the complements, note that an atom $A$ occurs in the normal form of $X'$ if and only if $X \wedge A = 0$. Hence $f$ preserves complements, which finishes the proof. □
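The isomorphism from the proof can be observed on a small concrete algebra. The sketch below uses the divisors of 30 with the greatest common divisor as infimum and the least common multiple as supremum; this particular example and all names in the code are assumptions chosen for illustration, they do not come from the text.

```python
from math import gcd

N = 30                                   # square-free, so its divisors form a Boolean algebra
divisors = [d for d in range(1, N + 1) if N % d == 0]

def meet(a, b):
    return gcd(a, b)

def join(a, b):
    return a * b // gcd(a, b)            # least common multiple

# Atoms: elements different from the least element 1 whose meet with anything is 1 or itself.
atoms = [d for d in divisors if d > 1 and all(meet(d, e) in (1, d) for e in divisors)]

def f(x):
    """The map from the theorem: send x to the set of atoms below it."""
    return frozenset(a for a in atoms if x % a == 0)

print(atoms, sorted(sorted(f(d)) for d in divisors))
```

Here the atoms are the prime divisors, and `f` identifies each divisor with the set of primes dividing it, so the algebra has $2^3 = 8$ elements.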
The classification of infinite Boolean algebras is far more complicated. It is not the case that each would be isomorphic to the Boolean algebra of all subsets of an appropriate set $M$. However, every Boolean algebra is isomorphic to a Boolean subalgebra of a Boolean algebra $2^M$ for an appropriate set $M$. This result is known as Stone's representation theorem.3
3 The American mathematician Marshall Harvey Stone (1903 - 1989) proved this theorem in 1936 when dealing with the spectral theory of operators on Hilbert spaces, required for analysis and topology. Nowadays, it belongs to standard material in advanced textbooks.
2. Polynomial rings
The operations of addition and multiplication are fundamental in the case of scalars as well as vectors. There are other similar structures. Besides the integers $\mathbb{Z}$, rational numbers $\mathbb{Q}$ and complex numbers $\mathbb{C}$, there are polynomials over similar scalars $K$ to be considered.
Figure 1. $\mathfrak{V}(x^3 + x^2 - y^2)$
Figure 2. $\mathfrak{V}(2x^4 - 3x^2y + y^2 - 2y^3 + y^4)$
12.D.5. Parametrize the curve (variety) $\mathfrak{V}(x^3 + x^2 - y^2)$.
Solution. The parametrization can be found by computing the intersections of the lines $y = tx$ with the given curve, i.e., we parametrize by the slope $t$ of these lines. Technically, this means that we substitute $tx$ for $y$ and express $x$ in terms of $t$ from the equation:
$$x^3 + x^2 - t^2x^2 = x^2(x + 1 - t^2) = 0 \implies x = t^2 - 1 \ \vee\ x = 0.$$
Among others, the abstract algebraic theory can be in many aspects viewed as a straightforward generalization of divisibility properties of integers.
12.2.1. Rings and fields. Recall that the integers and all other scalars K have the following properties:
Commutative rings and integral domains
Definition. Let $(M, +, \cdot)$ be an algebraic structure with two binary operations $+$ and $\cdot$. It is a commutative ring if it satisfies
• $(a + b) + c = a + (b + c)$ for all $a, b, c \in M$;
• $a + b = b + a$ for all $a, b \in M$;
• there is an element 0 such that for all $a \in M$, $0 + a = a$;
• for each $a \in M$, there is a unique element $-a \in M$ such that $a + (-a) = 0$;
• $(a \cdot b) \cdot c = a \cdot (b \cdot c)$ for all $a, b, c \in M$;
• $a \cdot b = b \cdot a$ for all $a, b \in M$;
• there is an element 1 such that for all $a \in M$, $1 \cdot a = a$;
• $a \cdot (b + c) = a \cdot b + a \cdot c$ for all $a, b, c \in M$.
If the ring is such that $c \cdot d = 0$ implies that either $c$ or $d$ is zero, then it is called an integral domain.
The first four properties define the algebraic structure of a commutative group (M, +). Groups are considered in more detail in the next part of this chapter. The last property in the list of ring axioms is called distributivity of multiplication over addition. There are similar axioms for Boolean algebras where each of the operations is distributive over the other.
If the operation "•" is commutative for all elements, then the ring is called a commutative ring. Otherwise, the ring is called a non-commutative ring. In the sequel, rings are commutative unless otherwise stated. Traditionally, the operation "+" is called addition, and the operation "•" multiplication, even if they are not the standard operations on one of the known rings of numbers.
In the literature, there are structures without the assumption of having the identity for multiplication. These are not discussed here, so it is always assumed that a ring has an identity denoted by 1. The identity for addition is denoted by 0.
Fields
A non-trivial ring where all non-zero elements are invertible with respect to multiplication is called a division ring. If the multiplication is commutative, it is called a field.
Typical examples of fields are the rational numbers $\mathbb{Q}$, the real numbers $\mathbb{R}$, and the complex numbers $\mathbb{C}$. Furthermore, every remainder class set $\mathbb{Z}_n$ is a commutative ring, while only $\mathbb{Z}_p$ for prime $p$ are also fields.
Recall the useful example of a non-commutative ring, the set $\mathrm{Mat}_k(K)$ of all $k$-by-$k$ matrices over a ring $K$, $k \geq 2$. As can be checked for $K = \mathbb{Z}_2$ and $k = 2$, these rings are never integral domains (see 2.1.5 on page 75 for the full argument).
Then, $y = t(t^2 - 1)$, or for $x = 0$, the only point on the curve is $y = 0$. The point $[0, 0]$ can be obtained by choosing $t = 1$ in the mentioned parametrization, so it suffices to consider only this parametrization. □
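The parametrization $x = t^2 - 1$, $y = t(t^2 - 1)$ of the curve $x^3 + x^2 - y^2 = 0$, obtained by substituting $y = tx$, can be verified with exact integer arithmetic. This is a minimal sketch; the function names are hypothetical.

```python
def curve(x, y):
    """Left-hand side of the defining equation x^3 + x^2 - y^2 = 0."""
    return x**3 + x**2 - y**2

def point(t):
    """Intersection of the curve with the line y = t x, away from the double point."""
    x = t * t - 1
    return x, t * x

samples = [point(t) for t in range(-5, 6)]
print(all(curve(x, y) == 0 for x, y in samples))
```

Note that both `t = 1` and `t = -1` give the origin, which reflects the fact that the curve crosses itself there.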
We obtain more curves if we consider quotients of polynomials in the parametrization, i.e., $f = \frac{f_1}{f_2}$, $g = \frac{g_1}{g_2}$. Then, we talk about a rational parametrization.
12.D.6. Derive the parametrization of a circle using stereographic projection (see the picture).
Solution. Substituting the equation of the line $y = -\frac{x}{t} + 1$ into the equation of the circle, we get the equation
$$x^2 + \frac{x^2}{t^2} - \frac{2x}{t} + 1 = 1,$$
with solution $x = 0$ or the parametric expression
$$x = \frac{2t}{1 + t^2}, \qquad y = \frac{t^2 - 1}{1 + t^2},$$
which does not include the point $[0, 1]$, however. □
Remark. Note that in this case, the inclusion of the real line gives only "almost all points" of the parametrized variety, since one of them (i. e., the point from which we project) is not reachable for any value of the parameter t. This is not our fault - it follows from different topological properties of the line and the circle that there exists no global parametrization.
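The behaviour described in the remark can be checked with exact rational arithmetic. The sketch assumes the unit circle $x^2 + y^2 = 1$ and the projection from $[0, 1]$ along the line through $[t, 0]$, which gives $x = 2t/(1 + t^2)$, $y = (t^2 - 1)/(1 + t^2)$; the sign convention and the function name are assumptions of this illustration.

```python
from fractions import Fraction

def circle_point(t):
    """Second intersection of the unit circle with the line through [0, 1] and [t, 0]."""
    t = Fraction(t)
    return 2 * t / (1 + t * t), (t * t - 1) / (1 + t * t)

points = [circle_point(Fraction(n, 7)) for n in range(-20, 21)]
print(all(x * x + y * y == 1 for x, y in points))
print((0, 1) in points)
```

Every rational value of the parameter yields a rational point of the circle, while the centre of projection $[0, 1]$ is never reached.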
Remark. Since $\mathbb{R}$ is not an algebraically closed field, we have problems with the existence of roots of polynomials. As a result, a mere perturbation of the coefficients of the defining equation may drastically change the resulting variety. It is possible to work with complex polynomials $\mathbb{C}[x, y]$ and with the subsets they define in $\mathbb{C}^2$. We need not be scared by that; on the contrary, our originally real curves are contained in their "complexifications" (real polynomials are simply viewed as complex ones which happen to have real coefficients), and we just
As an example of a division ring which is not a field, consider the ring of quaternions $\mathbb{H}$. This is constructed as an extension of the complex numbers by adding another imaginary unit $j$, i.e. $\mathbb{H} = \mathbb{C} \oplus j\mathbb{C} \cong \mathbb{R}^4$, just as the complex numbers are obtained from the reals. Another "new" element $ij$ is usually denoted $k$. It follows from the construction that $ij = -ji$. This structure is a division ring. Think out the details as a not completely easy exercise!
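Part of the exercise can be explored numerically. The sketch below stores a quaternion $a + bi + cj + dk$ as a 4-tuple and uses the standard multiplication rules $i^2 = j^2 = k^2 = -1$, $ij = k$; the tuple encoding and the function names are assumptions of this illustration, and the code checks examples rather than proving the axioms.

```python
def qmul(p, q):
    """Product of quaternions written as tuples (a, b, c, d) = a + b i + c j + d k."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def qinv(q):
    """Inverse of a non-zero quaternion: the conjugate divided by the squared norm."""
    a, b, c, d = q
    n = a * a + b * b + c * c + d * d
    return (a / n, -b / n, -c / n, -d / n)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
print(qmul(i, j), qmul(j, i))            # ij = k, ji = -k
print(qmul((1, 2, 3, 4), qinv((1, 2, 3, 4))))
```

The existence of `qinv` for every non-zero element is what makes the quaternions a division ring, while `qmul(i, j) != qmul(j, i)` shows that they are not a field.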
12.2.2. Elementary properties of rings. The following lemma collects properties which all seem to be obvious for rings of scalars. But the properties need proof to build an abstract theory:
Lemma. In every commutative ring $K$, the following holds:
(1) $0 \cdot c = c \cdot 0 = 0$ for all $c \in K$,
(2) $-c = (-1) \cdot c = c \cdot (-1)$ for all $c \in K$,
(3) $-(c \cdot d) = (-c) \cdot d = c \cdot (-d)$ for all $c, d \in K$,
(4) $a \cdot (b - c) = a \cdot b - a \cdot c$ for all $a, b, c \in K$,
(5) the entire ring $K$ collapses to the trivial set $\{0\} = \{1\}$ if and only if $0 = 1$.
Proof. All of the propositions are direct consequences of the axioms in the definition. In the first case, for any $c, a \in K$,
$$c \cdot a = c \cdot (a + 0) = c \cdot a + c \cdot 0,$$
and since 0 is the only element that is neutral with respect to addition, $c \cdot 0 = 0$.
In the second case, it suffices to compute
$$0 = c \cdot 0 = c \cdot (1 + (-1)) = c + c \cdot (-1).$$
This means that $c \cdot (-1)$ is the additive inverse of $c$, as desired.
The following two propositions are direct consequences of the second proposition and the axioms. If the ring contains only one element, then $0 = 1$. On the other hand, if $1 = 0$, then for any $c \in K$, necessarily $c = 1 \cdot c = 0 \cdot c = 0$. □
12.2.3. Polynomials over rings. The definition of a commutative ring uses precisely the properties that are expected for multiplication and addition. The concept of polynomials can now be extended. A polynomial is any expression that can be built from (known) constant elements of $K$ and an (unknown) variable using a finite number of additions and multiplications. Formally, polynomials are defined as follows:4
4 It is not by accident that the symbol $K$ is used for the ring; you can imagine e.g. any of the number rings behind that.
obtain richer tools for description of their properties (imaginary tangent lines, etc.).
Moreover, we are missing "improper points". For instance, when parametrizing a circle, we can describe the missing points as the image of the only improper point of the real line, i. e. the point "at infinity". These problems can be best avoided by working in the so-called projective extension of the (real or complex) plane.
The projective extension is advantageous in various problems. We will also use it when defining the group operation on the points of an elliptic curve (see J).
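12.D.7. (The complex circle). Consider the sets of points $X_\varepsilon = \mathfrak{V}(z_1^2 + z_2^2 - \varepsilon) \subset \mathbb{C}^2$ for any $\varepsilon \in \mathbb{R} \setminus \{0\}$. The corresponding real curves are
$$X_\varepsilon^{\mathbb{R}} = X_\varepsilon \cap \mathbb{R}^2 = \begin{cases} \text{circle with radius } \sqrt{\varepsilon} & \varepsilon > 0, \\ \emptyset & \varepsilon < 0. \end{cases}$$
We will write $z_j = x_j + iy_j = x_j + \sqrt{-1}\,y_j$. Therefore, $X_\varepsilon$ is given as a subset in $\mathbb{R}^4$ by a system of two real equations:
$$\operatorname{Re}(z_1^2 + z_2^2 - \varepsilon) = x_1^2 + x_2^2 - y_1^2 - y_2^2 - \varepsilon = 0, \qquad \operatorname{Im}(z_1^2 + z_2^2 - \varepsilon) = 2(x_1y_1 + x_2y_2) = 0.$$
Thus, we can expect that $X_\varepsilon$ will be a "two-dimensional surface" in $\mathbb{R}^4$. We will try to imagine it as a surface in $\mathbb{R}^3$ in a suitable projection $\mathbb{R}^4 \to \mathbb{R}^3$. For this purpose, we choose the mapping
$$\varphi_+ : (x_1, x_2, y_1, y_2) \mapsto \left(x_1, x_2, \frac{x_1y_2 - x_2y_1}{\sqrt{x_1^2 + x_2^2}}\right).$$
Denote by $V$ the subset of $\mathbb{R}^4$ which is given by our second equation, i.e.,
$$V = \{(x_1, x_2, y_1, y_2);\ x_1y_1 + x_2y_2 = 0,\ (x_1, x_2) \neq (0, 0)\}.$$
The restriction of $\varphi_+$ to $V$ is invertible, and its inverse is given by
$$\psi_+ : (u, v, w) \mapsto \left(u, v, -\frac{vw}{\sqrt{u^2 + v^2}}, \frac{uw}{\sqrt{u^2 + v^2}}\right).$$
Now, note that on $V$,
$$\left(\frac{x_1y_2 - x_2y_1}{\sqrt{x_1^2 + x_2^2}}\right)^2 = y_1^2 + y_2^2,$$
and hence it follows that
$$\varphi_+(V \cap X_\varepsilon) = H_\varepsilon = \{(u, v, w);\ u^2 + v^2 - w^2 - |\varepsilon| = 0\}.$$
Now, we can compose the constructed mappings
$$\rho_\varepsilon : X_\varepsilon \hookrightarrow V \xrightarrow{\ \varphi_+\ } \mathbb{R}^3 \setminus \{(0, 0, 0)\} \supset H_\varepsilon,$$
Polynomials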
Definition. Let $K$ be a commutative ring. A polynomial over $K$ is a finite expression
$$f(x) = \sum_{i=0}^{k} a_i x^i,$$
where $a_i \in K$, $i = 0, 1, \dots, k$, are the coefficients of the polynomial. If $a_k \neq 0$, then by definition, $f(x)$ has degree $k$, written $\deg f = k$. The zero polynomial is not assigned a degree. Polynomials of degree zero (called constant polynomials) are exactly the non-zero elements of $K$.
Polynomials $f(x)$ and $g(x)$ are equal if they have the same coefficients. The set of all polynomials over a ring $K$ is denoted $K[x]$.
Every polynomial defines a mapping $f : K \to K$ by substituting the argument $c$ for the variable $x$ and evaluating the resulting expression, i.e.
$$f(c) = a_0 + a_1c + \dots + a_kc^k.$$
Note that constant polynomials define constant mappings in this manner.
A root of a polynomial $f(x)$ is an element $c \in K$ for which $f(c) = 0 \in K$.
It may happen that different polynomials define the same mapping. For instance, the polynomial $x^2 + x \in \mathbb{Z}_2[x]$ defines the mapping which is constantly equal to zero. More generally, for every finite ring $K = \{a_0, a_1, \dots, a_k\}$, the polynomial $f(x) = (x - a_0)(x - a_1) \cdots (x - a_k)$ defines the constant-zero mapping.
Polynomials $f(x) = \sum_i a_i x^i$ and $g(x) = \sum_i b_i x^i$ can be added and multiplied in a natural way (just introduce the structure of a ring again and invoke the expected distributivity of multiplication over addition):
$$(f + g)(x) = (a_0 + b_0) + (a_1 + b_1)x + \dots + (a_k + b_k)x^k,$$
$$(f \cdot g)(x) = (a_0b_0) + (a_0b_1 + a_1b_0)x + \dots + (a_0b_r + a_1b_{r-1} + \dots + a_rb_0)x^r + \dots + a_kb_\ell x^{k+\ell},$$
where $k \geq \ell$ are the degrees of $f$ and $g$, respectively. Zero coefficients are assumed everywhere where there is no coefficient in the original expression.5
This definition corresponds to the addition and multiplication of the function values of $f, g : K \to K$, by the properties of "coefficients" in the original ring $K$.
It follows directly from the definition that the set of polynomials $K[x]$ over a commutative ring $K$ is again a commutative ring, where the multiplicative identity is the element $1 \in K$, perceived as a polynomial of degree zero. The additive identity is the zero polynomial. You should check all the axioms carefully!
5 To avoid this formal hassle, a polynomial can be defined as an infinite expression (like a formal power series over the ring in question) with the condition that only finitely many coefficients are non-zero.
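and for every $\varepsilon > 0$, we get a bijection $\rho_\varepsilon : X_\varepsilon \to H_\varepsilon$. The real part of this variety is the "thinnest circle" on the one-sheeted hyperboloid of revolution $H_\varepsilon$; see the picture.
For $\varepsilon < 0$, we can repeat the above reasoning, merely interchanging $x$ and $y$ and the signs in the definition of $\varphi_+$:
$$\varphi_- : (x_1, x_2, y_1, y_2) \mapsto \left(-y_1, -y_2, \frac{-y_1x_2 + y_2x_1}{\sqrt{y_1^2 + y_2^2}}\right),$$
which changes the inverse to
$$\psi_- : (u, v, w) \mapsto \left(-\frac{vw}{\sqrt{u^2 + v^2}}, \frac{uw}{\sqrt{u^2 + v^2}}, -u, -v\right).$$
Now, $H_\varepsilon$ is again a one-sheeted hyperboloid of revolution, but its real part is $X_\varepsilon^{\mathbb{R}} = \emptyset$.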
In the complex case, we can observe that when continuously changing the coefficients, the resulting variety changes only a bit, except for certain "catastrophic" points, where a qualitative leap may occur. This is called the principle of permanence. In the real case, this principle does not hold at all.
12.D.8. The projective extension of the line and the plane.
The real projective space $\mathbb{P}_1(\mathbb{R})$ is defined as the set of all directions in $\mathbb{R}^2$, i.e., its points are one-dimensional subspaces of $\mathbb{R}^2$.
The complex projective space $\mathbb{P}_1(\mathbb{C})$ is defined as the set of all directions in $\mathbb{C}^2$, i.e., its points are one-dimensional subspaces of $\mathbb{C}^2$.
Lemma. A polynomial ring over an integral domain is again an integral domain.
Proof. The task is to show that $K[x]$ can contain non-trivial divisors of zero only if they are in $K$. However, this is clear from the expression for polynomial multiplication. If $f(x)$ and $g(x)$ are polynomials of degree $k$ and $\ell$ as above, then the coefficient at $x^{k+\ell}$ in the product $f(x) \cdot g(x)$ is the product $a_k \cdot b_\ell$, which is non-zero unless there are zero divisors in $K$. □
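The formulae for polynomial arithmetic and the role of zero divisors can be tried out over the rings $\mathbb{Z}_n$. The following is a minimal sketch; polynomials are stored as coefficient lists with the lowest degree first, and all function names are assumptions made for this illustration.

```python
def poly_add(f, g, n):
    """Sum of polynomials over Z_n (coefficient lists, lowest degree first)."""
    size = max(len(f), len(g))
    f = f + [0] * (size - len(f))
    g = g + [0] * (size - len(g))
    return [(a + b) % n for a, b in zip(f, g)]

def poly_mul(f, g, n):
    """Product over Z_n: the coefficient at x^r is the sum of a_i * b_(r-i)."""
    result = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            result[i + j] = (result[i + j] + a * b) % n
    return result

def evaluate(f, c, n):
    """Value of the mapping defined by f at the point c of Z_n."""
    return sum(a * pow(c, i, n) for i, a in enumerate(f)) % n

# x^2 + x is a non-zero polynomial over Z_2, yet it defines the zero mapping.
print([evaluate([0, 1, 1], c, 2) for c in range(2)])
# Z_6 is not an integral domain, and neither is Z_6[x]: (2x) * (3x) = 0.
print(poly_mul([0, 2], [0, 3], 6))
# Over the field Z_5 the leading coefficients multiply to a non-zero element.
print(poly_mul([1, 2], [3, 4], 5))
```

The second example shows why the lemma needs the assumption on $K$: the leading coefficient $2 \cdot 3$ vanishes in $\mathbb{Z}_6$.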
12.2.4. Multivariate polynomials. Some objects can be described using polynomials in more variables. For instance, consider a circle in the plane $\mathbb{R}^2$ whose center is at $S = (x_0, y_0)$ and whose radius is $R$. This circle can be defined by the equation
$$(x - x_0)^2 + (y - y_0)^2 - R^2 = 0.$$
Rings of polynomials in variables $x_1, \dots, x_r$ can be defined similarly as in the case of $K[x]$. Instead of the powers $x^k$ of a single variable $x$, consider the monomials
$$x_1^{k_1} \cdots x_r^{k_r}$$
and their formal linear combinations with coefficients $a_{k_1 \dots k_r} \in K$.
However, it is simpler, both formally and technically, to define them inductively by
$$K[x_1, \dots, x_r] := (K[x_1, \dots, x_{r-1}])[x_r].$$
For instance, $K[x, y] = K[x][y]$, i.e., one considers polynomials in the variable $y$ over the ring $K[x]$. It can be shown (check this in detail!) that polynomials in variables $x_1, \dots, x_r$ can be viewed, even with this definition, as expressions created from the variables $x_1, \dots, x_r$ and the elements of the ring $K$ with a finite number of (formal) additions and multiplications in a commutative ring. For example, the elements of $K[x, y]$ are of the form
$$f = a_n(x)y^n + a_{n-1}(x)y^{n-1} + \dots + a_0(x) = (a_{mn}x^m + \dots + a_{0n})y^n + \dots + (b_{p0}x^p + \dots + b_{00}) = c_{00} + c_{10}x + c_{01}y + c_{20}x^2 + c_{11}xy + c_{02}y^2 + \dots$$
To simplify the notation, we use the multi-index notation (as we did with real polynomials and partial derivatives in infinitesimal analysis).
Similarly, the points of the real and complex two-dimensional projective spaces are defined as directions in R3 and C3, respectively.
E. Algebraic structures
First of all, we practice general properties of operations and we find out what structures the known sets and operations actually are.
12.E.1. Decide about the following sets and operations what algebraic structures they are (groupoid, semigroup (with potential one-sided neutral elements), monoid, group):
i) the set of all subsets of the integers with union,
ii) the set of positive integers with the greatest common divisor as the binary operation,
iii) the set of positive integers with the least common multiple as the binary operation,
iv) the set of all 2-by-2 invertible matrices over R with addition,
v) the set of all 2-by-2 matrices over R with multiplication,
vi) the set of all 2-by-2 matrices over R with subtraction,
vii) the set of all 2-by-2 invertible matrices over Z2 with multiplication,
viii) the set $\mathbb{Z}_6$ with multiplication (modulo 6),
ix) the set $\mathbb{Z}_7$ with multiplication (modulo 7).
Construct the table of the operation for the last-but-two structure.
Solution.
i) a monoid (the empty set being neutral),
ii) a semigroup (with no neutral elements),
iii) a monoid (1 being neutral),
iv) not even a groupoid (consider A+(—A) for an invertible matrix A),
v) a monoid,
vi) a groupoid (not associative),
vii) a group,
viii) a monoid (the class [1] being neutral),
ix) a monoid (the class [1] being neutral).
The group in vii) consists of the following elements:
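$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix},$$
$$D = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}, \quad E = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}, \quad F = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$$
The table
Multi-indices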
A multi-index $\alpha$ of length $r$ is an $r$-tuple of non-negative integers $(\alpha_1, \dots, \alpha_r)$. The integer $|\alpha| = \alpha_1 + \dots + \alpha_r$ is called the size of the multi-index $\alpha$.
Monomials are written shortly as $x^\alpha$ instead of $x_1^{\alpha_1}x_2^{\alpha_2} \cdots x_r^{\alpha_r}$. Polynomials in $r$ variables can be symbolically expressed in a similar way as univariate polynomials:
$$f = \sum_{|\alpha| \leq n} a_\alpha x^\alpha, \qquad g = \sum_{|\beta| \leq m} b_\beta x^\beta \in K[x_1, \dots, x_r].$$
$f$ is said to have total degree $n$ if at least one coefficient with a multi-index $\alpha$ of size $n$ is non-zero, while all the coefficients with multi-indices of larger sizes vanish.
Analogous formulae can be defined for addition and multiplication of multivariate polynomials $f$ and $g$ of degrees $n$ and $m$, respectively:
$$f + g = \sum_{|\alpha| \leq \max(m, n)} (a_\alpha + b_\alpha)x^\alpha, \qquad f \cdot g = \sum_{|\gamma| = 0}^{m+n} \Big( \sum_{\alpha + \beta = \gamma} a_\alpha b_\beta \Big) x^\gamma,$$
where the multi-indices are added componentwise, and the formally non-existing coefficients are assumed to be zero.
Lemma. These formulae describe addition and multiplication in the inductively defined ring of polynomials in r variables.
Proof. The proposition is easily proved by induction on the number of variables. Suppose that the formulae are valid in $K[x_1, \dots, x_{r-1}]$, and calculate the sum of
$$f = a_k(x_1, \dots, x_{r-1})x_r^k + \dots + a_0(x_1, \dots, x_{r-1}),$$
$$g = b_\ell(x_1, \dots, x_{r-1})x_r^\ell + \dots + b_0(x_1, \dots, x_{r-1}):$$
$$f + g = \big(a_0(x_1, \dots, x_{r-1}) + b_0(x_1, \dots, x_{r-1})\big) + \big(a_1(x_1, \dots, x_{r-1}) + b_1(x_1, \dots, x_{r-1})\big)x_r + \dots = \sum_{(\gamma, j)} (a_{\gamma, j} + b_{\gamma, j})\, x_1^{\gamma_1} \cdots x_{r-1}^{\gamma_{r-1}} x_r^j.$$
The proof for multiplication is similar (do it yourselves!). □
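The multi-index formulae translate directly into code if a polynomial is stored as a dictionary from multi-indices to coefficients. This is a minimal sketch over the integers; the dictionary representation and the names `mv_add`, `mv_mul`, `total_degree` are assumptions chosen for illustration.

```python
from collections import defaultdict

def mv_add(f, g):
    """Add polynomials stored as {multi-index tuple: coefficient}."""
    result = defaultdict(int)
    for poly in (f, g):
        for alpha, coeff in poly.items():
            result[alpha] += coeff
    return {alpha: c for alpha, c in result.items() if c != 0}

def mv_mul(f, g):
    """Multiply: the coefficient at x^gamma collects all a_alpha * b_beta with alpha + beta = gamma."""
    result = defaultdict(int)
    for alpha, a in f.items():
        for beta, b in g.items():
            gamma = tuple(s + t for s, t in zip(alpha, beta))  # componentwise sum
            result[gamma] += a * b
    return {gamma: c for gamma, c in result.items() if c != 0}

def total_degree(f):
    """Largest size |alpha| among the multi-indices with non-zero coefficient."""
    return max(sum(alpha) for alpha in f)

# (x + y) * (x - y) = x^2 - y^2 in Z[x, y]
f = {(1, 0): 1, (0, 1): 1}
g = {(1, 0): 1, (0, 1): -1}
print(mv_mul(f, g), mv_add(f, g))
```

Missing keys play the role of the formally non-existing coefficients, which are treated as zero.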
of the matrix multiplication looks as follows:
     A  B  C  D  E  F
A    A  B  C  D  E  F
B    B  A  E  F  C  D
C    C  D  A  B  F  E
D    D  C  F  E  A  B
E    E  F  B  A  D  C
F    F  E  D  C  B  A
Note that each row and column (disregarding the heading ones) contains each element exactly once (why is it so?). Thus, we do not have to calculate each product and instead we can play "sudoku" as soon as we have filled enough entries of the table. □
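The "sudoku" property can be confirmed by computing the whole table mechanically. The sketch below enumerates the invertible 2-by-2 matrices over $\mathbb{Z}_2$ without relying on any particular labelling by letters; the tuple representation and the names used are assumptions of this illustration.

```python
from itertools import product

def matmul(X, Y):
    """Product of 2-by-2 matrices over Z_2, stored as ((a, b), (c, d))."""
    return tuple(
        tuple(sum(X[r][k] * Y[k][c] for k in range(2)) % 2 for c in range(2))
        for r in range(2)
    )

# All invertible matrices: those with determinant 1 in Z_2.
group = [
    ((a, b), (c, d))
    for a, b, c, d in product((0, 1), repeat=4)
    if (a * d - b * c) % 2 == 1
]

table = {(X, Y): matmul(X, Y) for X in group for Y in group}

# Latin-square ("sudoku") property: every row and column is a permutation of the group.
rows_ok = all(sorted(table[X, Y] for Y in group) == sorted(group) for X in group)
cols_ok = all(sorted(table[X, Y] for X in group) == sorted(group) for Y in group)
print(len(group), rows_ok, cols_ok)
```

The property holds in every group, because multiplication by a fixed element is a bijection whose inverse is multiplication by the inverse element.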
12.E.2. Let $X$ be a set and $\mathcal{P}(X)$ denote the set of all subsets of $X$. Decide whether the set $\mathcal{P}(X)$ together with each of the following operations forms a groupoid, semigroup, monoid, group and whether the operation is commutative:
i) set intersection,
ii) set union,
iii) set symmetric difference (xor).
Solution. If the set $X$ is empty, then $\mathcal{P}(X)$ together with any of the mentioned operations is the trivial (1-element) group. Otherwise:
i) with the set intersection, the resulting structure is a commutative monoid (the set $X$ being neutral),
ii) with the set union, the resulting structure is a commutative monoid (the empty set being neutral),
iii) with the set xor, the resulting structure is a commutative group where the empty set is neutral and each element is self-inverse, $A^{-1} = A$. □
12.E.3. Decide about the following sets and operations what algebraic structures they are (groupoid, semigroup, group), whether they have one-sided or two-sided neutral elements, and whether the operation is commutative:
i) the set of all 3-by-3 invertible matrices over R with addition,
ii) the set of all 3-by-3 matrices over R with multiplication,
iii) the set of all 3-by-3 matrices over R with addition,
iv) the set of all 3-by-3 invertible matrices over Z2 with multiplication,
v) (Z9,+),
vi) (Z9,-).
o
The definition and the above results for polynomials over general commutative rings yield the following corollary:
Corollary. If a ring K is an integral domain, then the ring $K[x_1, \dots, x_r]$ is also an integral domain.
Proof. Proceed by induction on the number r of variables.⁶ Univariate polynomials are of the form $f = a_n x^n + \dots + a_0$ and $g = b_m x^m + \dots + b_0$, where $b_m \neq 0$ and $a_n \neq 0$. The leading term of the product fg is $a_n b_m x^{n+m}$, since $a_n b_m \neq 0$. In particular, the product of non-zero polynomials is again non-zero.
If the proposition holds for r − 1 variables, then the above calculations can be used for the ring of univariate polynomials in $x_r$ with coefficients from $K[x_1, \dots, x_{r-1}]$. □
12.2.5. Divisibility and irreducibility. The next goal is to understand how polynomials over a general integral domain can be expressed as products of simpler polynomials. In the case of univariate polynomials, this means finding the roots of a polynomial. Since multivariate polynomials can be defined inductively, it suffices to consider univariate polynomials over a general integral domain. This leads to a generalization of the concept of divisibility which forms the basis of elementary number theory in chapter eleven.
Consider an integral domain K (for instance, the integers Z or the ring Zp for prime p).
Divisibility in rings
$a \in K$ divides $c \in K$ if and only if there is some $b \in K$ such that $a \cdot b = c$. This is written $a \mid c$.
The divisors of 1 (the multiplicative identity), i.e. the invertible elements of K, are called units.
The units of a commutative ring always form a commutative group with respect to multiplication, in the sense used for the properties of addition in the definition of rings. This is called the group of units in K. The group of units in $\mathbb{Z}$ is $\{-1, 1\}$, while all non-zero elements in a field are units there.
In an integral domain, the divisors are determined uniquely. If $b = a \cdot c$ and $b \neq 0$, then c is determined by the choice of a and b: if $b = ac = ac'$, then $0 = a \cdot (c - c')$, and since there are no zero divisors, $a \neq 0$ implies $c = c'$.
Just as for integers, the following propositions are direct corollaries of the definitions (check the details yourself!):
Lemma. Let $a, b, c \in K$. Then,
(1) if $a \mid b$ and $b \mid c$, then $a \mid c$;
(2) if $a \mid b$ and $a \mid c$, then $a \mid (\alpha b + \beta c)$ for all $\alpha, \beta \in K$;
(3) $a \mid 0$ (since $a \cdot 0 = 0$);
(4) every $a \in K$ is divisible by each unit $e \in K$ and by its a-multiple $a \cdot e$ (this follows from the existence of $e^{-1}$).
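The notions of divisor and unit can be tried out on residue rings. The sketch below assumes the rings $\mathbb{Z}_n$ as sample commutative rings (for composite n they are not integral domains, but units still form a group); the functions `units` and `divides` are illustrative helpers that follow the definitions literally.

```python
def units(n):
    """Units of Z_n: residues a for which some b satisfies a*b = 1 (mod n)."""
    return [a for a in range(n) if any(a * b % n == 1 for b in range(n))]

def divides(a, c, n):
    """a | c in Z_n: there is some b with a*b = c (mod n)."""
    return any(a * b % n == c % n for b in range(n))

print(units(12))   # the group of units of Z_12
print(units(7))    # Z_7 is a field, so every non-zero element is a unit

# the units are closed under multiplication
U = units(12)
print(all((u * v) % 12 in U for u in U for v in U))

# property (4) of the lemma: a unit divides every element
print(all(divides(u, c, 12) for u in U for c in range(12)))
```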
⁶ Alternatively, proceed directly using multi-index formulae for the product, provided an appropriate ordering on monomials is defined, see the last part of this chapter.
12.E.4. Decide about the following subsets G of the complex numbers what algebraic structures they form together with multiplication (groupoid, semigroup, group), and whether the operation is commutative:
i) $G = \{a + bi \mid a, b \in \mathbb{Z}\}$,
ii) $G = \{a + bi \mid a, b \in \mathbb{R}, a^2 + b^2 = 1\}$,
iii) $G = \{a + b \cdot \sqrt{5} \mid a, b \in \mathbb{Q}, a^2 + b^2 \neq 0\}$.
o
12.E.5. Decide whether $\mathbb{Z}$ together with the operation $\heartsuit$ forms a groupoid, semigroup, monoid, group, and whether $\heartsuit$ is commutative, provided it is defined by:
i)	$a \heartsuit b = (a, b)$,
ii)	$a \heartsuit b =$	
iii)	$a \heartsuit b = 2a + b$,
iv)	$a \heartsuit b =$	H>
v)	$a \heartsuit b = a + b + a \cdot b$,
vi)	$a \heartsuit b = a + b - a \cdot b$,
vii)	$a \heartsuit b = a + (-1)^a b$.
o
12.E.6. In how many ways can we fill the following table so that ({a, b, c}, *) would be a
i) groupoid
ii) commutative groupoid
iii) a groupoid with a neutral element
iv) a monoid
v) a group
*	a	b	c
a	c	b	a
b			b
c			
Solution.
i) $3^5 = 243$
ii) 9
iii) 9
iv) 1
v) 0
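The five counts can be confirmed by enumerating all fillings of the table. This is a brute-force sketch; the dictionary `FIXED` encodes the four prefilled cells of the table above (a∗a = c, a∗b = b, a∗c = a, b∗c = b), and all function names are illustrative.

```python
from itertools import product

ELEMS = "abc"
# Entries fixed by the table in the exercise; the remaining five cells are free.
FIXED = {("a", "a"): "c", ("a", "b"): "b", ("a", "c"): "a", ("b", "c"): "b"}
FREE = [(x, y) for x in ELEMS for y in ELEMS if (x, y) not in FIXED]

def tables():
    """Yield every completion of the table as a dict (x, y) -> x*y."""
    for values in product(ELEMS, repeat=len(FREE)):
        t = dict(FIXED)
        t.update(zip(FREE, values))
        yield t

def commutative(t):
    return all(t[x, y] == t[y, x] for x in ELEMS for y in ELEMS)

def neutral(t):
    for e in ELEMS:
        if all(t[e, x] == x and t[x, e] == x for x in ELEMS):
            return e
    return None

def associative(t):
    return all(t[t[x, y], z] == t[x, t[y, z]]
               for x in ELEMS for y in ELEMS for z in ELEMS)

def is_monoid(t):
    return associative(t) and neutral(t) is not None

def is_group(t):
    e = neutral(t)
    return is_monoid(t) and all(
        any(t[x, y] == e and t[y, x] == e for y in ELEMS) for x in ELEMS)

all_tables = list(tables())
counts = (
    len(all_tables),                                   # groupoids
    sum(commutative(t) for t in all_tables),           # commutative groupoids
    sum(neutral(t) is not None for t in all_tables),   # with a neutral element
    sum(is_monoid(t) for t in all_tables),             # monoids
    sum(is_group(t) for t in all_tables),              # groups
)
print(counts)
```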
Unique factorization domain
An element $a \in K$ is said to be irreducible if and only if it is divisible only by units $e \in K$ and their a-multiples.
A ring K is called a unique factorization domain if and only if the following conditions hold:
• For every non-zero element $a \in K$, there are irreducible elements $a_1, \dots, a_r \in K$ such that $a = a_1 \cdot a_2 \dots a_r$. This is called a factorization of a.
• If there are two factorizations of a into irreducible non-unit elements $a = a_1 a_2 \dots a_r = b_1 b_2 \dots b_s$, then r = s and, up to a permutation of the order of the factors $b_j$, $a_j = e_j b_j$ for suitable units $e_j$, $j = 1, \dots, r$.
$\mathbb{Z}$ is a unique factorization domain. So is every field, since every non-zero element is a unit.
There are examples of integral domains without the unique factorization property. The construction is similar to polynomials. Instead of powers, consider conveniently combined roots of powers of the unknown variable x. An integral domain K consists of finite expressions of the form
$a_0 + a_1 x^{m_1/2^n} + \dots + a_k x^{m_k/2^n}$,
where $a_0, \dots, a_k \in \mathbb{Z}$, $m_j, n \in \mathbb{Z}_{>0}$. The multiplication and addition are defined as with polynomials, assuming the standard behaviour of rational powers of x. Then, the only units in K are ±1, and all elements with $a_0 = 0$ are reducible, but the expression x, for example, cannot be expressed as a product of irreducible elements. There are simply very few irreducible elements in K.
12.2.6. Euclidean division and roots of polynomials. The fundamental tool for the discussion of divisibility, common divisors, etc. in the ring of integers $\mathbb{Z}$ is the procedure of division with remainder, and the Euclidean algorithm for the greatest common divisor. These procedures can be generalized.
Consider univariate polynomials $a_k x^k + \dots + a_0$ over a general integral domain K. If $a_k \neq 0$, then $a_k x^k$ is called the leading monomial, while $a_k$ is the leading coefficient.
Lemma (An algorithm for division with remainder). Let K be an integral domain and $f, g \in K[x]$ polynomials, $g \neq 0$. Then, there exist an $a \in K$, $a \neq 0$, and polynomials q and r such that $af = qg + r$, where either r = 0 or deg r < deg g. Moreover, if K is a field or if the leading coefficient of g is one, and if the choice a = 1 is made, then the polynomials q and r are unique.
Proof. If deg f < deg g or f = 0, then the choice a = 1, q = 0, r = f satisfies all the conditions. If g is constant, set a = g, q = f, r = 0. Continue by induction on the degree of f. Suppose $\deg f \geq \deg g > 0$, and write
□
$f = a_0 + \dots + a_n x^n, \quad g = b_0 + \dots + b_m x^m$.
12.E.7. Find the number of groupoids on a given three-element set.
Solution. Since the set is given, it remains to define the binary operation. In a groupoid, there is no restriction except that the result of the operation must be an element of the underlying set. Thus, for each of the $3 \cdot 3 = 9$ ordered pairs of elements, there are three possibilities for the result. By the product rule, this gives
$3^9 = 19683$
groupoids.
□
12.E.8. Decide whether the set $G = (\mathbb{R} \setminus \{0\}) \times \mathbb{R}$ together with the operation $\triangle$ defined by $(x, y) \triangle (u, v) = (xu, xv + y)$ for all $(x, y), (u, v) \in G$ is a groupoid, semigroup, monoid, group, and whether $\triangle$ is commutative. O
F. Groups
We begin with recalling permutations and their properties. We have already met permutations in chapter two, see ??, where we used them to define the determinant of a matrix.
12.F.1. For each of the following conditions, find all permutations $\pi \in S_7$ which satisfy it:
i) $\pi^2 = (1, 2, 3, 4, 5, 6, 7)$
ii) $\pi^2 = (1, 2, 3) \circ (4, 5, 6)$
iii) $\pi^2 = (1, 2, 3, 4)$
O
12.F.2. Find the signature (parity) of each of the following permutations:
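i) $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & \dots & 3n-2 & 3n-1 & 3n \\ 2 & 3 & 1 & 5 & 6 & 4 & \dots & 3n-1 & 3n & 3n-2 \end{pmatrix}$,
ii) $\begin{pmatrix} 1 & 2 & 3 & \dots & n & n+1 & n+2 & \dots & 2n \\ 2 & 4 & 6 & \dots & 2n & 1 & 3 & \dots & 2n-1 \end{pmatrix}$.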
Solution. The parity of a permutation corresponds to the number of transpositions from which it is built or, equivalently, to the number of its inversions, see 2.2.2. The number of inversions can be read easily from the two-row representation of the permutation. For each number of the second row, we count the numbers that are less than it and lie to its right. Thus, the first permutation is even (the signature is 1). In the second case, each even number 2k precedes exactly the k smaller odd numbers 1, 3, ..., 2k − 1, so there are $1 + 2 + \dots + n$ inversions; the signature therefore depends on n and is equal to $(-1)^{n(n+1)/2}$. □
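The inversion count described in the solution is easy to automate. The sketch below stores a permutation as the second row of its two-row notation; `perm_i` and `perm_ii` are illustrative constructors for the two permutations of this exercise.

```python
def signature(perm):
    """Signature of a permutation given as its second row (images of 1, 2, ...)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]      # a larger number standing to the left
    )
    return -1 if inversions % 2 else 1

def perm_i(n):
    """2 3 1 5 6 4 ... 3n-1 3n 3n-2: a product of n three-cycles."""
    row = []
    for k in range(n):
        row += [3 * k + 2, 3 * k + 3, 3 * k + 1]
    return row

def perm_ii(n):
    """2 4 6 ... 2n 1 3 ... 2n-1."""
    return list(range(2, 2 * n + 1, 2)) + list(range(1, 2 * n, 2))

for n in range(1, 6):
    print(n, signature(perm_i(n)), signature(perm_ii(n)), (-1) ** (n * (n + 1) // 2))
```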
Either $b_m f - a_n x^{n-m} g = 0$ or $\deg(b_m f - a_n x^{n-m} g) < \deg f$. In the former case, the proof is finished. In the latter case, it follows by the induction hypothesis that there exist $a', q', r'$ satisfying
$a'(b_m f - a_n x^{n-m} g) = q'g + r'$
and either $r' = 0$ or $\deg r' < \deg g$. This means that
$a' b_m f = (q' + a' a_n x^{n-m}) g + r'$.
If $b_m = 1$ or K is a field, then the induction hypothesis can be used to choose $a' = 1$, and then q', r' are unique. In this case,
$b_m f = (q' + a_n x^{n-m}) g + r'$.
If K is a field, then this equation can be multiplied by $b_m^{-1}$.
Assume that there is another solution $f = q_1 g + r_1$. Then, $0 = f - f = (q - q_1)g + (r - r_1)$, and either $r = r_1$ or $\deg(r - r_1) < \deg g$. In the former case, it follows that $q = q_1$ as well, since there are no zero divisors in K[x]. In the latter case, let $ax^s$ be the term of the highest degree in $q - q_1 \neq 0$ (it must exist). Then, its product with the term of the highest degree in g must be zero (since the term of the highest degree in a product is just the product of the terms of the highest degrees, and $(q - q_1)g = r_1 - r$ has degree less than deg g). However, this means that a = 0. Since $ax^s$ was the term with the highest degree, $q - q_1$ contains no non-zero monomials, so it equals zero. But then $r = r_1$. □
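The induction in the proof translates into a loop. The following is a minimal sketch for $K = \mathbb{Z}$, assuming polynomials are stored as coefficient lists `[a0, a1, ...]`; the names `degree`, `pseudo_divide`, `poly_mul`, `poly_add` and `trim` are illustrative. In each step the identity $af = qg + r$ is multiplied by the leading coefficient $b_m$ of g so that the leading term of r can be cancelled, exactly as in the proof.

```python
def trim(p):
    """Drop trailing zero coefficients."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def degree(p):
    """Degree of a coefficient list; -1 for the zero polynomial."""
    return len(trim(p)) - 1

def poly_add(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0) for i in range(n)]

def poly_mul(p, q):
    if not p or not q:
        return []
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def pseudo_divide(f, g):
    """Return (a, q, r) with a*f = q*g + r and r = 0 or deg r < deg g."""
    dg = degree(g)
    if dg < 0:
        raise ZeroDivisionError("g must be non-zero")
    bm = g[dg]
    a = 1
    q = [0] * max(degree(f) - dg + 1, 1)
    r = list(f)
    while degree(r) >= dg:
        dr = degree(r)
        lead = r[dr]
        # multiply the whole identity a*f = q*g + r by bm ...
        a *= bm
        q = [bm * c for c in q]
        r = [bm * c for c in r]
        # ... and move lead * x^(dr-dg) * g from the remainder into the quotient
        q[dr - dg] += lead
        for i in range(dg + 1):
            r[i + dr - dg] -= lead * g[i]
    return a, q, r

f = [5, 0, 3, 2]      # 2x^3 + 3x^2 + 5
g = [1, 0, 3]         # 3x^2 + 1
a, q, r = pseudo_divide(f, g)
print(a, q, trim(r))
```

When the leading coefficient of g is one, the loop never changes a, so the result is the ordinary division with remainder.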
The procedure for the Euclidean division can be used to discuss the roots of a polynomial.
Consider a polynomial $f \in K[x]$, $\deg f > 0$, and divide it by the polynomial $x - b$, $b \in K$. Since the leading coefficient is one, the algorithm produces a unique result. It follows that there are unique polynomials q and r which satisfy $f = q(x - b) + r$, where r = 0 or deg r = 0, i.e. $r \in K$. This means that the value of the polynomial f at $b \in K$ equals $f(b) = r$. It follows that the element $b \in K$ is a root of the polynomial f if and only if $(x - b) \mid f$. Since division by a polynomial of degree one decreases the degree of the original polynomial by at least one, the following proposition is proved:
Corollary. Every non-zero polynomial $f \in K[x]$ has at most deg f roots. In particular, polynomials over an infinite integral domain define the same mapping $K \to K$ if and only if they are the same polynomial.
If two polynomials over an integral domain define the same mapping $K \to K$, then their difference has every element of K as a root. This means that if their difference is not the zero polynomial, then K has at most as many elements as the maximum of the degrees of the polynomials in question.
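Over a finite field the bound on the number of roots still holds, but distinct polynomials may define the same mapping. The sketch below assumes $K = \mathbb{Z}_5$ as an example: the polynomial $x^5 - x$ vanishes at every element, so it defines the same mapping as the zero polynomial although it is not the zero polynomial. The helpers `evaluate` and `roots` are illustrative.

```python
def evaluate(coeffs, x, p):
    """Horner evaluation of a0 + a1*x + ... modulo the prime p."""
    value = 0
    for c in reversed(coeffs):
        value = (value * x + c) % p
    return value

def roots(coeffs, p):
    """All roots in Z_p, found by trying every element."""
    return [x for x in range(p) if evaluate(coeffs, x, p) == 0]

p = 5
f = [0, -1, 0, 0, 0, 1]        # x^5 - x
print(roots(f, p))             # every element of Z_5 is a root
print(roots([1, 0, 1], p))     # x^2 + 1 has two roots in Z_5
print(roots([2, 0, 1], p))     # x^2 + 2 has none
```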
12.2.7. Multiple roots and derivatives. We shall now work over infinite integral domains K and so we may identify the algebraic expressions for the polynomials with the mappings.
The differentiation of polynomials over real or complex numbers is an algebraic operation which makes sense for all K, and it still satisfies the Leibniz rule:
12.F.3. Find all permutations $\rho \in S_9$ such that
$[\rho \circ (1, 2, 3)]^2 \circ [\rho \circ (2, 3, 4)]^2 = (1, 2, 3, 4)$.
Solution. No such permutation exists, since the left-hand side is always an even permutation, while the right-hand side is an odd one. □
Derivative of polynomials
Let $f(x) = a_0 + a_1 x + \dots + a_n x^n$ and $g(x) = b_0 + b_1 x + \dots + b_m x^m$ be polynomials of degrees n and m over a commutative ring K. The derivative $' : f(x) \mapsto f'(x) = a_1 + 2a_2 x + \dots + n a_n x^{n-1}$ respects the addition of polynomials and their multiplication by the elements of K. Moreover it satisfies the Leibniz rule
(1) $(f(x)g(x))' = f'(x)g(x) + f(x)g'(x)$.
12.F.4. Find all permutations $\rho \in S_9$ such that
$\rho^2 \circ (1, 2) \circ \rho^2 = (1, 2) \circ \rho^2 \circ (1, 2)$.
While the claim on the additive structure is obvious, let us check the Leibniz rule:
o
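12.F.5. Consider the permutation
$\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 6 & 5 & 7 & 1 & 2 & 4 \end{pmatrix}$.
Find the order of $\sigma$ in the group $(S_7, \circ)$, the inverse of $\sigma$, and compute $\sigma^{2013}$. Show that $\sigma$ does not commute with the transposition $\tau = (2, 3)$.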
Solution. $\sigma = (1, 3, 5) \circ (2, 6) \circ (4, 7)$. Therefore, the order of $\sigma$ is the least common multiple of the cycle lengths 3, 2, 2, which is 6. Furthermore, $\sigma^{-1} = (1, 5, 3) \circ (2, 6) \circ (4, 7)$ and
$\sigma^{2013} = (\sigma^{335})^6 \circ \sigma^3 = \sigma^3 = (2, 6) \circ (4, 7)$.
Finally, we have $\sigma \circ \tau = (1, 3, 6, 2, 5) \circ (4, 7)$, but $\tau \circ \sigma = (1, 2, 6, 3, 5) \circ (4, 7)$. □
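These computations can be replayed mechanically. The sketch below represents a permutation as a dictionary and composes from right to left, so that $(p \circ q)(x) = p(q(x))$, which is the convention the solution uses; the helper names are illustrative, and `from_cycles` assumes the given cycles are disjoint.

```python
def from_cycles(cycles, n):
    """Permutation of {1, ..., n} given by disjoint cycles, as a dict x -> image."""
    perm = {i: i for i in range(1, n + 1)}
    for cyc in cycles:
        for k, x in enumerate(cyc):
            perm[x] = cyc[(k + 1) % len(cyc)]
    return perm

def compose(p, q):
    """(p o q)(x) = p(q(x)): apply q first."""
    return {x: p[q[x]] for x in q}

def inverse(p):
    return {y: x for x, y in p.items()}

def power(p, k):
    result = {x: x for x in p}
    for _ in range(k):
        result = compose(p, result)
    return result

def order(p):
    identity = {x: x for x in p}
    k, q = 1, dict(p)
    while q != identity:
        q = compose(p, q)
        k += 1
    return k

sigma = from_cycles([(1, 3, 5), (2, 6), (4, 7)], 7)
tau = from_cycles([(2, 3)], 7)

print(order(sigma))
print(power(sigma, 2013) == from_cycles([(2, 6), (4, 7)], 7))
print(compose(sigma, tau) == compose(tau, sigma))   # False: they do not commute
```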
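12.F.6. Find $\sigma^{-1}$ and $\sigma^{2013}$, where
(a) $\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 5 & 7 & 6 & 1 & 2 & 3 \end{pmatrix}$ in the symmetric group $(S_7, \circ)$.
(b) $\sigma = [4]_{11}$ in the group $(\mathbb{Z}_{11}^\times, \cdot)$.
Solution. (a) $\sigma = (1, 4, 6, 2, 5) \circ (3, 7)$, $\sigma^{-1} = (1, 5, 2, 6, 4) \circ (3, 7)$. Since the order of (1, 4, 6, 2, 5) is 5 and the order of the transposition (3, 7) is 2, we get that the order of $\sigma$ is the least common multiple of 2 and 5, which is 10, i. e., $\sigma^{10} = 1$. Then,
$\sigma^{2013} = (\sigma^{10})^{201} \circ \sigma^3 = \sigma^3 = (1, 2, 4, 5, 6) \circ (3, 7)$.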
$f(x) \cdot g(x) = \sum_{k=0}^{n+m} c_k x^k, \quad c_k = \sum_{i+j=k} a_i b_j,$
(b) For the sake of simplicity, we will write only k for the residue class $[k]_{11}$, $k \in \mathbb{Z}$. Then,
$4^5 \equiv 1 \pmod{11} \implies \sigma^{-1} = 4^4 \equiv 3 \pmod{11}$,
$\sigma^{2013} = 4^{2013} \equiv 4^3 \equiv 9 \pmod{11}$.
□
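Part (b) can be checked with Python's built-in three-argument `pow`. This is only a quick sanity check; the negative exponent for modular inverses assumes Python 3.8 or newer.

```python
n = 11
a = 4

print(pow(a, 5, n))      # 4 has order 5 in the group of units modulo 11
print(pow(a, -1, n))     # modular inverse of 4 (requires Python 3.8+)
print(pow(a, 4, n))      # the same element, computed as 4^4
print(pow(a, 2013, n))   # 4^2013, which equals 4^3 because 2013 = 5*402 + 3
```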
and thus, expanding $f'(x) \cdot g(x) + f(x) \cdot g'(x)$ yields exactly the expression for the derivative of the product,
$\sum_{k=0}^{n+m} k c_k x^{k-1}$.
In particular, the derivative is not a homomorphism K[x] —> K[x] of the ring of polynomials, in view of (1). In much more general context, the homomorphisms of the additive structure of a ring satisfying the Leibniz rule are called derivatives. For polynomial rings, we see inductively that the only derivative there is our operation '.
Differentiation can be used for discussing multiple roots of polynomials.
Consider a polynomial $f(x) \in K[x]$ over an infinite integral domain K, with a root $c \in K$ of multiplicity k. Thus, in view of the division of polynomials discussed in the previous paragraph 12.2.6,
$f(x) = (x - c)^k g(x)$,
with a unique polynomial g, $g(c) \neq 0$. Differentiating f(x) and applying the Leibniz rule, we obtain
$f'(x) = k(x - c)^{k-1} g(x) + (x - c)^k g'(x)$
$\quad = (x - c)^{k-1}(kg(x) + (x - c)g'(x))$.
Clearly the polynomial $h(x) = kg(x) + (x - c)g'(x)$ does not admit c as a root, i.e. $h(c) = kg(c) \neq 0$ (provided $k \cdot 1 \neq 0$ in K, as is the case e.g. over $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$ or $\mathbb{C}$). Thus we have arrived at the following very useful claim:
Proposition. A polynomial f(x) over an infinite integral domain K admits the root $c \in K$ of multiplicity k if and only if c is a root of f'(x) of multiplicity k − 1.
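A small computation illustrates the proposition. The sketch below assumes integer coefficients and the sample polynomial $f = (x - 2)^3 (x + 1) = x^4 - 5x^3 + 6x^2 + 4x - 8$; the multiplicity is found by repeated division by $x - c$ (synthetic division), and all function names are illustrative.

```python
def derivative(coeffs):
    """Formal derivative of a0 + a1*x + ... + an*x^n."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def divide_by_linear(coeffs, c):
    """Synthetic division by (x - c): returns (quotient, remainder)."""
    acc = 0
    out = []
    for a in reversed(coeffs):
        acc = acc * c + a
        out.append(acc)
    remainder = out.pop()          # the last value is f(c)
    return out[::-1], remainder

def multiplicity(coeffs, c):
    """Largest k such that (x - c)^k divides the polynomial (0 if c is not a root)."""
    k = 0
    while any(coeffs):
        quotient, remainder = divide_by_linear(coeffs, c)
        if remainder != 0:
            break
        coeffs, k = quotient, k + 1
    return k

f = [-8, 4, 6, -5, 1]              # (x - 2)^3 (x + 1)
print(multiplicity(f, 2), multiplicity(derivative(f), 2))     # k and k - 1
print(multiplicity(f, -1), multiplicity(derivative(f), -1))   # a simple root is lost
```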
12.2.8. The fundamental theorem of algebra. While it may happen that a polynomial over the real numbers has no roots, every non-constant polynomial over the complex numbers has a root. This is the statement of the so-called fundamental theorem of algebra, which is presented here with an (almost) complete proof. By this result, every polynomial $f \in \mathbb{C}[x]$ has as many roots (including multiplicity) as its degree deg f = k. Hence it always admits a factorization of the form
$f(x) = b(x - \alpha_1) \cdot (x - \alpha_2) \dots (x - \alpha_k)$
for the complex roots $\alpha_i$ and an appropriate leading coefficient b.
12.F.7. Prove that every group whose number of elements is even contains a non-trivial (i. e., different from the identity) element which is self-inverse.
Solution. Since each element which is not self-inverse can be paired with its inverse, we see that there is an even number of elements which are not self-inverse. Thus, there remains an even number of elements which are self-inverse, and one of them is the identity, so there must be at least one more such element. □
12.F.8. Prove that there exists no non-commutative group of order 4.
Solution. By Lagrange's theorem (see 12.3.10), the non-trivial elements of a 4-element group are of order 2 or 4. If there is an element of order 4, then the group is cyclic, and thus commutative. So the only remaining case is that there are (besides the identity e) three elements of order 2, call them a, b, c. We are going to show that we must have ab = c: It cannot be that ab = e, since the inverse of a is a itself, and not b. It cannot be that ab = a, since this would mean that b = e, and similarly, it cannot be that ab = b, since this would mean that a = e. Therefore, the only remaining possibility is that indeed ab = c, and it can be shown analogously that the product of any two non-trivial elements, regardless of the order, must be equal to the third one, so this group is commutative, too. Altogether, we have shown that there are exactly two groups of order 4, up to isomorphism. The latter is called the Klein group, and one instance of it is the group Z2 x Z2.
□
12.F.9. Show that there exists no non-commutative group of order 5.
Solution. By Lagrange's theorem (see 12.3.10), the non-trivial elements of a 5-element group are of order 5, so the group must be cyclic, and thus commutative. □
Remark. The same argumentation shows that each group of prime order must be cyclic, and thus commutative. In particular, there are neither 2-element nor 3-element non-commutative groups. As we have shown (see 12.F.8), there is even no 4-element non-commutative group. Therefore, the smallest non-commutative group may be of order 6. As we have seen (see 12.E.1(vii)), this is indeed the case.
12.F.10. Prove that any group G where each element is self-inverse must be commutative.
Theorem. The field C is algebraically closed, i.e. every polynomial of degree at least one has a root.
Proof. Suppose that $f \in \mathbb{C}[z]$ is a non-zero polynomial with no root, i.e. $f(z) \neq 0$ for all $z \in \mathbb{C}$. Consider the mapping defined by
$\varphi : \mathbb{C} \to \mathbb{C}, \quad z \mapsto \frac{f(z)}{|f(z)|}$.
Then $\varphi$ maps the entire $\mathbb{C}$ into the unit circle
$K_1 = \{e^{it}, t \in \mathbb{R}\} \subset \mathbb{C}$.
By the assumption that f(z) is never zero, this mapping is well-defined. Next, we shall consider the restrictions of $\varphi$ to the individual circles $K_r \subset \mathbb{C}$ with center at zero and radius $r > 0$.
We can parameterize these circles by the mappings
$\psi_r : \mathbb{R} \to K_r, \quad t \mapsto \psi_r(t) = re^{it}$.
For all r, the composition $\kappa : (0, \infty) \times \mathbb{R} \to K_1$, $\kappa(r, t) = \varphi \circ \psi_r(t)$, is continuous in both t and r. Thus, for each r, there exists a continuous mapping $\alpha_r : \mathbb{R} \to \mathbb{R}$ which is uniquely given by the conditions $0 \leq \alpha_r(0) < 2\pi$ and $\kappa(r, t) = e^{i\alpha_r(t)}$. Again, the obtained mapping $\alpha_r$ continuously depends on r. Altogether, there is a continuous mapping
$\alpha : \mathbb{R} \times (0, \infty) \to \mathbb{R}, \quad (t, r) \mapsto \alpha_r(t)$.
It follows from its construction that for every r, $\frac{1}{2\pi}(\alpha_r(2\pi) - \alpha_r(0)) = n_r \in \mathbb{Z}$. Since $\alpha$ is continuous in r, it means that $n_r$ is an integer constant which is independent of r.
In order to complete the proof, it suffices to note that if $f = a_0 + \dots + a_d z^d$ and $a_d \neq 0$, then for small values of r, $\alpha_r$ behaves nearly as a constant mapping, while for large values of r, it behaves almost as if $f = z^d$. First, calculate $n_r$ for $f = z^d$, then make this statement more precise. This completes the proof.
The complex functions $z \mapsto z^d$, $z \mapsto \frac{z^d}{|z^d|}$ can be expressed easily using the trigonometric form of the complex
Solution. Let $a, b \in G$. Since each of ba, b, a is assumed to be self-inverse, we get
$ab = ab((ba)(ba)) = a(bb)aba = (aa)ba = ba$.
□
12.F.11. Prove that every group G of order 6 is isomorphic to $\mathbb{Z}_6$ or $S_3$.
Solution.
By Lagrange's theorem (see 12.3.10), the non-trivial elements of a 6-element group are of order 2, 3, or 6. If there is an element of order 6, then G is cyclic, and thus isomorphic to $\mathbb{Z}_6$.
Therefore, assume from now on that the order of each non-trivial element is 2 or 3. Since an element a of order 3 is not self-inverse (we have $a^{-1} = a^2$ since $a \cdot a^2 = a^3 = 1$), we get from exercise 12.F.7 that there must be at least one element of order 2.
As we are going to show, there must also be an element of order 3. For the sake of contradiction, assume that each element of G is self-inverse, and let $a \neq b$ be any two elements different from the identity e. The same argumentation as in 12.F.8 shows that the product ab cannot be any of e, a, b. Thus, $H = \{e, a, b, ab\}$ is a 4-element subset of G. Thanks to the self-inverseness, we can see that H is closed under the operation, with the possible exception of $b \cdot a$, $b \cdot ab$, and $ab \cdot a$. However, we get from the above exercise that G is commutative, so that these 3 products also lie in H, and it follows that H is actually a subgroup of G. However, this contradicts theorem 12.3.10, by which a 6-element group cannot have a 4-element subgroup.
The only remaining case is that there is an element of order 2 (call it a) as well as an element of order 3 (call it b). Then, $b^2$ is also of order 3 (and different from b), so G contains the 4 elements $e, a, b, b^2$. Furthermore, G must also contain $ab, ba, ab^2, b^2a$, and by uniqueness of inverses, none of these is equal to e. Moreover, none of these may be equal to any of $a, b, b^2$ (e. g. if we had a = ab, then multiplication by $a^{-1}$ from the left yields e = b; the other equalities can be refuted similarly). Since G contains only 6 elements, the set $\{ab, ba, ab^2, b^2a\}$ has at most two elements. Again, we can have neither $ab = ab^2$ nor $ba = b^2a$. If ab = ba, then $(ab)^2 = a^2b^2 = b^2 \neq e$ and $(ab)^3 = a^3b^3 = a \neq e$, so the order of ab is greater than 3, which contradicts our assumption. Therefore, it must be that $ab = b^2a$ and $ba = ab^2$, so
numbers $z = r(\cos\theta + i \sin\theta) = re^{i\theta}$:
$z^d = r^d(\cos d\theta + i \sin d\theta) = r^d e^{id\theta}$, $\quad \frac{z^d}{|z^d|} = 1 \cdot (\cos d\theta + i \sin d\theta) = e^{id\theta}$.
In this case, the mapping $\varphi$ is simply a rotation of the complex plane, followed by the central projection onto the unit circle. Then, $\kappa(r, t) = e^{idt}$, and so $\alpha_r(t) = dt$, regardless of r.
It follows that $n_r = d$ for the choice $f = z^d$. If $f = az^d$ is chosen, $a \neq 0$, then there is no impact on the above result (verify this yourselves!).
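Consider a general polynomial $f = a_0 + \dots + a_d z^d$ with no root. Then $a_0 \neq 0$ ($a_0 = 0$ implies that 0 is a root). For $z \neq 0$,
$\frac{f(z)}{a_d z^d} = 1 + \frac{1}{a_d}(a_0 z^{-d} + \dots + a_{d-1} z^{-1})$.
Hence, $\lim_{|z| \to \infty} \frac{f(z)}{a_d z^d} = 1$. Knowing this, calculate
$\lim_{|z| \to \infty} \left| \frac{f(z)}{|f(z)|} - \frac{a_d z^d}{|a_d z^d|} \right| = \lim_{|z| \to \infty} \left| \frac{a_d z^d}{|a_d z^d|} \right| \cdot \left| \frac{f(z)}{a_d z^d} \cdot \frac{|a_d z^d|}{|f(z)|} - 1 \right| = 0$.
Hence, $n_r = d$ for large values of r.
A similar computation can be done for small values of r. Recall that $a_0 \neq 0$:
$\frac{f(z)}{a_0} = 1 + \frac{1}{a_0}(a_1 z + \dots + a_d z^d)$.
Thus, $\lim_{|z| \to 0} \frac{f(z)}{a_0} = 1$. Hence, $\lim_{|z| \to 0} \frac{|a_0|}{|f(z)|} = 1$. In addition,
$\lim_{|z| \to 0} \left| \frac{f(z)}{|f(z)|} - \frac{a_0}{|a_0|} \right| = \lim_{|z| \to 0} \left| \frac{a_0}{|a_0|} \right| \cdot \left| \frac{f(z)}{a_0} \cdot \frac{|a_0|}{|f(z)|} - 1 \right| = 0$.
Hence, $n_r = 0$ for small values of r. Altogether, the degree of the polynomial is d = 0. □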
12.2.9. The greatest common divisor. Consider a polynomial ring K[x] over an integral domain K. A polynomial h is called the greatest common divisor of polynomials f and $g \in K[x]$ if and only if the following hold:
• $h \mid f$ and $h \mid g$,
• for any k, if both $k \mid f$ and $k \mid g$, then $k \mid h$.
As a direct corollary of the existence of an algorithm for unique division with remainder, there is the very important Bezout's identity (it is proved using the Euclidean division similarly as in the case of the integers in Chapter 11).
Theorem. Let K be a field and $f, g \in K[x]$. Then, there exists a greatest common divisor h of the polynomials f and g. The polynomial h is unique up to a multiple by a non-zero scalar. In addition, there exist polynomials $A, B \in K[x]$ such that $h = Af + Bg$.
Proof. The polynomials h, A, B can be constructed directly using the Euclidean algorithm. Continue dividing with
that G is indeed isomorphic to $S_3$ (a corresponds to a transposition and b to a cycle of length 3). This group can also be viewed as the group of symmetries of an equilateral triangle (a corresponds to a reflection and b to a rotation by 120°), see also 12.3.3.
We have discussed all possibilities, so the proof is finished. □
12.F.12. Find all commutative groups of order 8 (up to isomorphism). Then, for each of the following groups, decide to which of the found ones it is isomorphic (the operation is always multiplication):
• $\mathbb{Z}_{15}^\times$,
• $\mathbb{Z}_{16}^\times$,
• $\mathbb{Z}_{17}^\times / (\{[1], [-1] = [16]\}, \cdot)$,
• the complex roots of the polynomial $z^8 - 1$.
Solution.
By theorem 12.3.8, every finite commutative group is a product of cyclic groups. By 12.3.10, their orders divide 8. This means that there are only 3 possibilities: $\mathbb{Z}_8$, $\mathbb{Z}_2 \times \mathbb{Z}_4$, and $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$.
• The group $\mathbb{Z}_{15}^\times$ contains the residue classes which are coprime to 15. There are $\varphi(15) = (5 - 1)(3 - 1) = 8$ of them, so indeed $|\mathbb{Z}_{15}^\times| = 8$. In particular, these are 1, 2, 4, 7, 8, 11, 13, 14. Their orders are either 2 (for 4, 11, 14) or 4 (for 2, 7, 8, 13), which means that $\mathbb{Z}_{15}^\times$ is isomorphic to $\mathbb{Z}_2 \times \mathbb{Z}_4$.
• $\mathbb{Z}_{16}^\times = \{1, 3, 5, 7, 9, 11, 13, 15\}$. Again, this group contains 8 elements, and their orders are either 2 (for 7, 9, 15) or 4 (for 3, 5, 11, 13), which means that $\mathbb{Z}_{16}^\times$ is also isomorphic to $\mathbb{Z}_2 \times \mathbb{Z}_4$.
• $\mathbb{Z}_{17}^\times = \{\pm 1, \pm 2, \dots, \pm 8\}$. Thus, the quotient $\mathbb{Z}_{17}^\times / (\pm 1) = \{1, 2, \dots, 8\}$ has 8 elements. We can easily calculate that the order of 3 is 8. Therefore, 3 generates the entire group, which means that $\mathbb{Z}_{17}^\times / (\pm 1) \cong \mathbb{Z}_8$.
• The complex roots of the polynomial $z^8 - 1$ are $e^{2\pi i n / 8}$, where n = 1, 2, ..., 8. Clearly, these form a cyclic group of order 8, isomorphic to $\mathbb{Z}_8$.
□
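The element orders used in this solution can be recomputed directly. The following is a sketch with illustrative helper names: `unit_orders` lists the order of every unit modulo n, and `quotient_order` computes the order of a class in the quotient by {±1}, i.e. the least k with $a^k \equiv \pm 1$.

```python
from math import gcd

def unit_orders(n):
    """Map each unit a of Z_n to its multiplicative order."""
    orders = {}
    for a in range(1, n):
        if gcd(a, n) == 1:
            k, x = 1, a
            while x != 1:
                x = x * a % n
                k += 1
            orders[a] = k
    return orders

def quotient_order(a, n):
    """Order of the class of a in the quotient of the units of Z_n by {1, -1}."""
    k, x = 1, a % n
    while x not in (1, n - 1):
        x = x * a % n
        k += 1
    return k

print(sorted(unit_orders(15).values()))   # order profile of Z_2 x Z_4
print(sorted(unit_orders(16).values()))   # the same profile
print(quotient_order(3, 17))              # 3 generates the quotient
```

The profile with three elements of order 2 and four of order 4 distinguishes $\mathbb{Z}_2 \times \mathbb{Z}_4$ from $\mathbb{Z}_8$ (which has elements of order 8) and from $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$ (which has none of order 4).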
12.F.13. Let G be a commutative group and denote $H = \{g \in G \mid g^2 = e\}$, where e is the identity of G. Prove that H is a subgroup of G.
Solution. Clearly, $e \in H$. If $a \in H$, then we also have $a^{-1} \in H$, because $a = a^{-1}$ (since $a^2 = e$). Moreover, if
remainder (since K is a field, there is always a unique way to do this; see the above lemma 12.2.6):
$f = q_1 g + r_1$,
$g = q_2 r_1 + r_2$,
$r_1 = q_3 r_2 + r_3$,
$\vdots$
$r_{p-1} = q_{p+1} r_p + 0$.
In this procedure, the degrees of the polynomials $r_i$ are strictly decreasing; hence the equality from the last line must occur (for some p), and this says that $r_p \mid r_{p-1}$. It follows from the line above that $r_p \mid r_{p-2}$, etc. Continue sequentially up to the first and second lines, to obtain $r_p \mid g$ and $r_p \mid f$.
If $h \mid f$ and $h \mid g$, then the same equalities imply that h divides all the $r_i$. In particular, it divides $r_p$. In this way, the greatest common divisor $h = r_p$ of the polynomials f and g is obtained.
Substitute upwards, starting with the last equation:
$h = r_p = r_{p-2} - q_p r_{p-1}$
$\quad = r_{p-2} - q_p (r_{p-3} - q_{p-1} r_{p-2})$
$\quad = -q_p r_{p-3} + (1 + q_{p-1} q_p) r_{p-2}$
$\quad = -q_p r_{p-3} + (1 + q_p q_{p-1})(r_{p-4} - q_{p-2} r_{p-3})$
$\quad = \dots = Af + Bg$. □
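The Euclidean algorithm with back-substitution can be sketched in a few lines. The example assumes $K = \mathbb{Q}$, represented by `fractions.Fraction`, and stores polynomials as coefficient lists `[a0, a1, ...]`; instead of substituting upwards at the end, it carries the coefficients A and B along with each remainder, which is an equivalent bookkeeping. All function names are illustrative.

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients; the zero polynomial is []."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def add(p, q):
    n = max(len(p), len(q))
    return trim([(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
                 for i in range(n)])

def mul(p, q):
    if not p or not q:
        return []
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return trim(out)

def divmod_poly(f, g):
    """Division with remainder over Q: f = q*g + r with r = 0 or deg r < deg g."""
    f, g = trim(f), trim(g)
    q = [Fraction(0)] * max(len(f) - len(g) + 1, 0)
    r = [Fraction(c) for c in f]
    while len(r) >= len(g):
        coef = r[-1] / g[-1]
        shift = len(r) - len(g)
        q[shift] = coef
        for i, b in enumerate(g):
            r[shift + i] -= coef * b
        r = trim(r)
    return trim(q), r

def extended_gcd(f, g):
    """Return (h, A, B) with h = A*f + B*g a greatest common divisor of f and g."""
    r0, r1 = trim(f), trim(g)
    a0, a1 = [Fraction(1)], []
    b0, b1 = [], [Fraction(1)]
    while r1:
        q, r = divmod_poly(r0, r1)
        neg_q = [-c for c in q]
        r0, r1 = r1, r
        a0, a1 = a1, add(a0, mul(neg_q, a1))
        b0, b1 = b1, add(b0, mul(neg_q, b1))
    return r0, a0, b0

f = [-2, 1, 1]      # (x - 1)(x + 2)
g = [3, -4, 1]      # (x - 1)(x - 3)
h, A, B = extended_gcd(f, g)
print(h, A, B)      # h is a scalar multiple of x - 1
```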
12.2.10. Fields of fractions. When dealing with integer calculations, it is often more advantageous to work with rational numbers and verify only at the end of the procedure that the result is an integer. This method is useful in the case of polynomials, too.
Let K be an integral domain. Its field of fractions is defined as the set of equivalence classes of the pairs $(a, b) \in K \times K$, $b \neq 0$. These classes are written $\frac{a}{b}$, and the equivalence is defined by
$\frac{a}{b} = \frac{a'}{b'} \iff ab' = a'b$.
Addition and multiplication are defined in terms of representatives:
$\frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd}, \quad \frac{a}{b} \cdot \frac{c}{d} = \frac{ac}{bd}$.
It is easily verified that this definition is correct and that the resulting structure satisfies all field axioms. In particular, $\frac{0}{1}$ is the additive identity, and $\frac{1}{1}$ is the multiplicative identity. If $a \neq 0$, $b \neq 0$, then $\frac{a}{b} \cdot \frac{b}{a} = \frac{1}{1}$. All the details of the arguments are in fact identical with the discussion of rational numbers in 1.6.6.
The field of fractions of a ring K[x±,... ,xr] is called the field of rational functions (of r variables) and denoted
$b \in H$, then $(ab)^2 = a^2 b^2 = e$ (this is where we use the commutativity of G), which means that $ab \in H$. Thus, H is closed under the operation, and it is indeed a subgroup. □
12.F.14. Let $GL_n(\mathbb{R})$ denote the set of all n-by-n regular matrices with real coefficients. Prove that $G = GL_2(\mathbb{R})$ with multiplication is a group and decide for each of the following subsets H of G whether it is a subgroup of G:
i) $H = GL_2(\mathbb{Q})$,
ii) $H = GL_2(\mathbb{Z})$,
iii) $H = \{A \in GL_2(\mathbb{Z}) \mid |A| = 1\}$,
iv) H = J r   J G G I a, b e Q
v) H =
vi) # =
vii) H =
viii) if =
12.F.15.
1 0
a 1
1 a
a 1
0 a & c
1 a c
e G I a e Z G G I a e Q £ G I a, 5, c £
^   ni e G \ a,b,c e
{a e
i) Decide whether the set H subgroup of the group (R*, ■)
ii) Decide whether the set ii = {a e subgroup of the group (R, +)
$K(x_1, \dots, x_r)$. In software systems like Maple or Mathematica, all algebraic operations with polynomials are performed in the corresponding field of fractions, i.e. in the field of rational functions, usually using $K = \mathbb{Q}$.
Now follows a very useful (and elegant) statement, the proof of which is straightforward, yet it requires many technical details (and it concerns the field of rational functions). It is recommended to read the following paragraph carefully. Then maybe, at the first reading, skip the following three lemmas of the proof.
12.2.11. Theorem. Let K be a unique factorization domain. Then, K[x] is also a unique factorization domain.
Proof. The idea of the proof is very simple. Consider a polynomial $f \in K[x]$. If f is reducible, then $f = f_1 \cdot f_2$, where neither of the polynomials $f_1, f_2 \in K[x]$ is a unit. Moreover, assume for a while that if f is divisible by an irreducible polynomial h, then so is $f_1$ or $f_2$.
If this is always the case, this procedure can be applied step by step to reach a unique factorization. If $f_1$ is further reducible, then $f_1 = g_1 \cdot g_2$, where $g_1, g_2$ are not units, and either both the polynomials $g_1$ and $g_2$ have degree less than that of f, or the number of irreducible factors in the leading terms of $g_1$ and $g_2$ decreases (for instance, over the integers $\mathbb{Z}$, $2x^2 + 2x + 2 = 2(x^2 + x + 1)$). After a finite number of steps, a factorization $f = f_1 \dots f_r$ is obtained, where the polynomials $f_1, \dots, f_r$ are irreducible.
It follows from the additional assumption that every irreducible polynomial h which divides f also divides one of $f_1, \dots, f_r$. Therefore, for every other factorization $f = f'_1 f'_2 \cdots f'_s$, each factor $f_i$ divides one of the $f'_j$, and in this case, $f'_j = e f_i$ for an appropriate unit e. Cancel such pairs step by step, to conclude that r = s and that the individual factors differ only by a unit multiple. □
is a
o
12.F.16. Find all positive integers $m \neq 5$ such that the group $\mathbb{Z}_m^\times$ is isomorphic to $\mathbb{Z}_5^\times$. O
12.F.17. How many cycles of length p ($1 < p \leq n$) are there in $S_n$?
Solution. The elements of the cycle (i. e., the non-fixed points of the permutation) can be selected in $\binom{n}{p}$ ways. Now, without loss of generality, we can proclaim one of the p elements to be the first element in the cycle representation (for instance the least one, if we are working with numbers). This element can be mapped to any of the p − 1 remaining elements, that one can be mapped to any of the p − 2 remaining elements, etc. Altogether, we get by the product rule that there are $\binom{n}{p} \cdot (p - 1)!$ cycles of length p. □
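The formula $\binom{n}{p} \cdot (p - 1)!$ can be compared with a brute-force count for small n. In this sketch permutations act on {0, ..., n − 1} rather than {1, ..., n}, purely for indexing convenience; the function names are illustrative, and `math.comb` assumes Python 3.8 or newer.

```python
from itertools import permutations
from math import comb, factorial

def is_cycle_of_length(perm, p):
    """perm is a tuple with perm[i] the image of i; test whether it is a single p-cycle."""
    moved = [i for i, x in enumerate(perm) if x != i]
    if len(moved) != p:
        return False
    # follow the orbit of one moved point; a single cycle visits all moved points
    start = moved[0]
    x, length = perm[start], 1
    while x != start:
        x = perm[x]
        length += 1
    return length == p

def count_cycles(n, p):
    return sum(is_cycle_of_length(perm, p) for perm in permutations(range(n)))

for n in range(2, 6):
    for p in range(2, n + 1):
        print(n, p, count_cycles(n, p), comb(n, p) * factorial(p - 1))
```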
The direct consequence of the latter theorem for the multivariate polynomials can be formulated (due to their inductive definition):
Corollary. Let K be a unique factorization domain. Then, $K[x_1, \dots, x_r]$ is also a unique factorization domain.
Every polynomial over a unique factorization domain can be factored in a similar way to the case of polynomials with real or complex coefficients.
In particular, this holds for polynomials over every field of scalars.
12.2.12. Completion of the proof. It remains to prove that if a polynomial $f = f_1 f_2$ is divisible by an irreducible polynomial h, then h divides either $f_1$ or $f_2$ or both. This statement is proved in the following three lemmas.
Lemma. Let K be a unique factorization domain. Then:
(1) If $a, b, c \in K$, a is irreducible and $a \mid bc$, then either $a \mid b$ or $a \mid c$.
12.F.18. Let G be the set of real-valued 3-by-3 matrices with zeros above the diagonal and ones on it. Prove that G with matrix multiplication forms a group, i. e., a subgroup in $GL(3, \mathbb{R})$, and find the center of G (i. e., the subgroup defined by $Z(G) = \{z \in G \mid \forall g \in G : zg = gz\}$).
Solution. We can either verify all the group axioms or make use of the known fact that $GL(3, \mathbb{R})$ is a group, and verify only that G is closed with respect to multiplication and inverses. Clearly, the neutral element (the identity matrix) lies in G.
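For the product of two elements of G, we get
$\begin{pmatrix} 1 & 0 & 0 \\ a & 1 & 0 \\ b & c & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ a_1 & 1 & 0 \\ b_1 & c_1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ a + a_1 & 1 & 0 \\ b + ca_1 + b_1 & c + c_1 & 1 \end{pmatrix} \in G$.
It follows from the form of the products in G that the center contains precisely the matrices of the form
$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ b & 0 & 1 \end{pmatrix}$, $b \in \mathbb{R}$.
□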
12.F.19. For any subset $X \subseteq G$, we define its centralizer as $C_G(X) = \{y \in G \mid xy = yx \text{ for all } x \in X\}$. Prove that if $X \subseteq Y$, then $C_G(Y) \subseteq C_G(X)$. Further, prove that $X \subseteq C_G(C_G(X))$ and $C_G(X) = C_G(C_G(C_G(X)))$.
Solution. The first proposition is clear: the elements of G which commute with everything from Y also commute with everything from X. We have from the definition that
$C_G(C_G(X)) = \{y \in G \mid xy = yx \text{ for all } x \in C_G(X)\}$, and this is in particular satisfied by the elements $y \in X$. The last statement follows simply using the two above. Substituting $X := C_G(X)$ into the second one, we get $C_G(X) \subseteq C_G(C_G(C_G(X)))$, and applying the first one to the inclusion $X \subseteq C_G(C_G(X))$, we obtain $C_G(X) \supseteq C_G(C_G(C_G(X)))$. □
12.F.20. Suppose that a group G has a non-trivial subgroup H which is contained in every non-trivial subgroup of G. Prove that H is contained in the center of G.
Solution. For each $g \in G$, the centralizer $C_G(g) = \{x \in G \mid xg = gx\}$ is a non-trivial subgroup, since $g \in C_G(g)$ and $C_G(e) = G$. Thus, the group H is contained in every $C_G(g)$. Therefore, it is contained in their intersection (over all $g \in G$), which is exactly the center of G. □
(2) If a constant polynomial $a \in K[x]$ divides $f \in K[x]$, then $a$ divides all coefficients of $f$.
(3) If $a$ is an irreducible constant polynomial in $K[x]$ and $a \mid fg$, $f, g \in K[x]$, then $a \mid f$ or $a \mid g$.
Proof. (1) By the assumption, $bc = ad$ for a suitable $d \in K$. Let $d = d_1 \cdots d_r$, $b = b_1 \cdots b_s$, $c = c_1 \cdots c_q$ be the factorizations into irreducible factors. This means that
$$a d_1 \cdots d_r = b_1 \cdots b_s c_1 \cdots c_q.$$
Since $ad$ factors in a unique way, it follows that $a = e b_j$ or $a = e c_i$ for a suitable unit $e$ and a suitable index, i.e. $a \mid b$ or $a \mid c$.
(2) Let $f = b_0 + b_1 x + \cdots + b_n x^n$. Since $a \mid f$, there must exist a polynomial $g = c_0 + c_1 x + \cdots + c_k x^k$ such that $f = ag$. Hence it immediately follows that $k = n$, $a c_0 = b_0, \ldots, a c_n = b_n$.
(3) Consider $f, g \in K[x]$ as above, with coefficients $b_i$ and $c_j$ respectively, and suppose that $a$ divides neither $f$ nor $g$. By the previous claim, there exists an $i$ such that $a$ does not divide $b_i$, and there exists a $j$ such that $a$ does not divide $c_j$. Choose the least such $i$ and $j$. The coefficient at $x^{i+j}$ of the polynomial $fg$ is $b_0 c_{i+j} + b_1 c_{i+j-1} + \cdots + b_{i+j} c_0$. By the choice of $i$ and $j$, $a$ divides all of $b_0 c_{i+j}, \ldots, b_{i-1} c_{j+1}, b_{i+1} c_{j-1}, \ldots, b_{i+j} c_0$. At the same time, by (1), it does not divide $b_i c_j$. Therefore, it cannot divide the coefficient, and so $a$ does not divide $fg$. □
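Claim (3) can be tested numerically in the special case $K = \mathbb{Z}$. The sketch below is an illustration under our own assumptions: polynomials are lists of integer coefficients, only small coefficients and degrees are searched, and the function names are hypothetical. It also shows that irreducibility of $a$ matters.

```python
from itertools import product

def poly_mul(f, g):
    """Multiply polynomials given as coefficient lists [b0, b1, ...]."""
    result = [0] * (len(f) + len(g) - 1)
    for i, bi in enumerate(f):
        for j, cj in enumerate(g):
            result[i + j] += bi * cj
    return result

def divides_poly(a, f):
    """By claim (2), a constant a divides f iff it divides every coefficient."""
    return all(coeff % a == 0 for coeff in f)

def check_claim_3(a, max_coeff=3, degree=2):
    """Search small f, g in Z[x]: does a | f*g always force a | f or a | g?"""
    coeffs = range(-max_coeff, max_coeff + 1)
    polys = [list(p) for p in product(coeffs, repeat=degree + 1)]
    return all(
        divides_poly(a, f) or divides_poly(a, g)
        for f in polys for g in polys
        if divides_poly(a, poly_mul(f, g))
    )

print(check_claim_3(3))  # 3 is irreducible in Z
print(check_claim_3(4))  # 4 is not irreducible: f = g = 2 is a counterexample
```

The search space is deliberately tiny, so a positive answer is only evidence, whereas a negative answer (as for $a = 4$) is a genuine counterexample.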
12.2.13. Lemma. Consider the field of fractions $L$ of a unique factorization domain $K$. If a polynomial $f$ is irreducible in $K[x]$, then it is irreducible in $L[x]$, too.
Proof. Each coefficient $a \in K$ can be considered as an element $\frac{a}{1} \in L$. Therefore, every non-zero polynomial $f \in K[x]$ can be considered a polynomial in $L[x]$.
Suppose that $f = g'h'$ for some $g', h' \in L[x]$, where the polynomials $g'$, $h'$ are not units in $L[x]$ (i.e. they are not constant polynomials, since $L$ is a field). Let $a$ be a common multiple of the denominators of the coefficients in $g'$ and $b$ be a common multiple of the denominators of the coefficients in $h'$. Then $bh', ag' \in K[x]$, and so $abf = (bh')(ag')$. Let $c$ be an irreducible factor in the factorization of $ab$. Then $c$ divides $(bh')(ag')$, and hence $c$ divides $bh'$ or $ag'$ (by the previous lemma). This means that $c$ can be canceled out. After a finite number of such cancellations, the conclusion is that $f = gh$ for polynomials $g, h \in K[x]$. Since the degrees of the polynomials are not changed, neither $g$ nor $h$ is constant.
Thus if $f$ is reducible in $L[x]$, then it is also reducible in $K[x]$, which is the contrapositive of the implication to be proved. □
12.2.14. Lemma. Let $K$ be a unique factorization domain and $f, g, h \in K[x]$. Suppose that $f$ is irreducible and $f \mid gh$. Then, either $f \mid g$ or $f \mid h$.
Proof. This statement is already proved in one of the previous lemmas for the case that $f$ is a constant polynomial (i.e. an element of $K$).
Suppose that $\deg f > 0$. Then $f$ is irreducible in $L[x]$ as well, where $L$ is the field of fractions of the ring $K$.
12.F.21. Let $G$ be a finite group. The conjugation class for $a \in G$ is the set
$Cl(a) = \{xax^{-1} \mid x \in G\}$.
Prove that:
i) the set of conjugation classes of all elements of G is a partition of G,
ii) the size of each conjugation class divides the order of G,
iii) if G has only two conjugation classes, then its order is 2.
Solution. (i) It suffices to show that for any $a, b \in G$, either $Cl(a) = Cl(b)$ or $Cl(a) \cap Cl(b) = \emptyset$. Thus, assume that the intersection of $Cl(a)$ and $Cl(b)$ is non-empty. Then, by definition, there are $x, y \in G$ such that $xax^{-1} = yby^{-1}$. Multiplying this equality by $y^{-1}$ from the left and by $y$ from the right leads to $y^{-1}xax^{-1}y = b$. However, $(y^{-1}x)^{-1} = x^{-1}y$, which means that $b$ is of the form $zaz^{-1}$ for $z = y^{-1}x$ and thus lies in $Cl(a)$. Analogously, we get $a \in Cl(b)$, so that both conjugation classes coincide.
(ii) Note that the elements of $Cl(a)$ are in one-to-one correspondence with the cosets corresponding to the centralizer $C_G(a) = \{x \in G \mid xax^{-1} = a\}$. Indeed, if elements $b$ and $c$ lie in the same coset (i.e., they satisfy $b = cz$ for some $z \in C_G(a)$), then
$bab^{-1} = cza(cz)^{-1} = czaz^{-1}c^{-1} = czz^{-1}ac^{-1} = cac^{-1}$.
Conversely, $bab^{-1} = cac^{-1}$ implies $c^{-1}b \in C_G(a)$. By 10.2.1, we have $|G| = |C_G(a)| \cdot |G/C_G(a)|$, which means that $|Cl(a)| = |G/C_G(a)|$ divides $|G|$.
(iii) The neutral element always forms its own conjugation class $Cl(e) = \{e\}$. Therefore, if there are only two conjugation classes, then all the other elements $a \ne e$ must lie in one class. Thus, its size is $|G| - 1$, and by (ii), this integer must divide $|G|$. Since it also divides $|G| - 1$, it divides the difference $1$, which means that $|G| = 2$. □
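Parts (i) and (ii) can be observed on a concrete group. The following sketch assumes the symmetric group on four elements as the example, with permutations encoded as tuples of images; the helper names are ours.

```python
from itertools import permutations

def compose(p, q):
    """Return p ∘ q (q is applied first); a permutation is a tuple of images."""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    """Inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def conjugacy_classes(group):
    """Collect the classes Cl(a) = {x a x^-1 : x in G}, each class once."""
    classes, seen = [], set()
    for a in group:
        if a in seen:
            continue
        cl = {compose(compose(x, a), inverse(x)) for x in group}
        classes.append(cl)
        seen |= cl
    return classes

# Assumption: the example group is the symmetric group on {0, 1, 2, 3}.
S4 = list(permutations(range(4)))
classes = conjugacy_classes(S4)
sizes = sorted(len(c) for c in classes)
print(sizes)
```

The class sizes add up to the order of the group, so the classes are pairwise disjoint, and each size divides 24.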
12.F.22. Let $G$ be a commutative group. Suppose that the order $r$ of an element $a \in G$ and the order $s$ of an element $b \in G$ are coprime. Prove that the order of $ab$ is $rs$.
Solution. We have $(ab)^{rs} = a^{rs}b^{rs} = (a^r)^s(b^s)^r = e^s e^r = e$, so the order is at most $rs$. For the sake of contradiction, assume that $(ab)^q = e$ for some positive $q < rs$. Since $q$ is less than the least common multiple of $r$ and $s$ (recall that $r$, $s$ are coprime), at least one of them does not divide $q$. Assume that it is $r$ (the other case can be refuted analogously). Taking the $s$-th power of the equality $(ab)^q = e$, we get $e = ((ab)^q)^s = (ab)^{qs} = a^{qs}b^{qs} = a^{qs}(b^s)^q = a^{qs}e^q = a^{qs}$. Since $r$ does
Suppose that $K$ itself is a field (and as such equals its field of fractions). Moreover, suppose that $f \mid gh$ and $f$ does not divide $g$. The greatest common divisor of the polynomials $g$ and $f$ must be a constant polynomial in $K$. Therefore, there are $A, B \in K[x]$ such that $1 = Af + Bg$. Hence, $h = Afh + Bgh$. Since $f \mid gh$, it follows that $f \mid h$ as well.
Return to the general case. It follows from the assumptions that $f \mid g$ or $f \mid h$ in the polynomial ring $L[x]$ over the field of fractions $L$ of the ring $K$. For instance, let $h = kf$ in $L[x]$, and choose an $a \in K$ so that $ak \in K[x]$. Then, $ah = akf$ and it must hold for every irreducible factor $c$ of $a$ that $c \mid ak$, because $f$ is irreducible and not constant.
It follows that $c$ can be canceled. After a finite number of such cancellations, $a$ becomes a unit, i.e. $h = k'f$ for an appropriate $k' \in K[x]$. □
The proof of this lemma completes the proof of theorem 12.2.11.
3. Groups
As an illustration of the most abstract approach to an algebraic theory, structures with just one operation are considered. The focus is on objects and situations where equations of the form $a \cdot x = b$ always have a unique solution (as usual with linear equations, the objects $a$ and $b$ are given, while $x$ is sought). This is group theory. Note that nothing is known about the "nature" of the objects, or even what the dot stands for. The only assumption is that two objects $a$ and $x$ are assigned an object $a \cdot x$.
In a previous part of this chapter, such operations appeared as addition or multiplication in rings. The concepts and vocabulary concerning such operations are now extended. First come examples, among them numbers and transformations of the plane and space, where such "group" objects are met. Then follow the foundations of a general theory.
12.3.1. Examples and concepts. Let $A$ be a set. A binary operation on $A$ is defined to be any mapping $A \times A \to A$. The result of such an operation is often denoted
$(a, b) \mapsto a \cdot b$
and called the product of $a$ and $b$. A set together with a binary operation is called a groupoid or a magma.
Further properties of the operation must be assumed in order to be able to say something interesting.
A binary operation is said to be associative if and only if
$a \cdot (b \cdot c) = (a \cdot b) \cdot c$
for all $a, b, c \in A$.
Binary operations and semigroups
A groupoid where the operation is associative is called a semigroup.
A binary operation is said to be commutative if and only if $a \cdot b = b \cdot a$ for all $a, b \in A$.
not divide $q$ and is coprime to $s$, we get that $r$ (the order of $a$) does not divide $qs$, but $a^{qs} = e$, which is a contradiction. □
12.F.23. Prove that every finite group G whose order is greater than 2 has a non-trivial automorphism.
Solution. If $G$ is not commutative and $a$ is an element that does not lie in the center, then the conjugation $x \mapsto axa^{-1}$ defines a non-trivial automorphism. For a cyclic group of order $m > 2$, we have, for any $n$ coprime to $m$ with $n \not\equiv 1 \pmod{m}$ (for instance $n = m - 1$), the non-trivial automorphism $x \mapsto x^n$. If $G$ is commutative, then it is a product of cyclic groups (see 10.1.8). If the order of at least one of the factors is greater than 2, then we can use the above automorphism for cyclic groups. If the order of each factor is 2, then permuting any pair of factors is a non-trivial automorphism. □
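12.F.24. Consider the group $(\mathbb{Q}, +)$ of the rational numbers with addition and the group $(\mathbb{Q}^+, \cdot)$ of the positive rational numbers with multiplication. Find all homomorphisms
$(\mathbb{Q}, +) \to (\mathbb{Q}^+, \cdot)$.
Solution. There is only one homomorphism, the trivial one. For the sake of contradiction, assume that there exists a non-trivial homomorphism $\varphi$, i.e., $\varphi(a) = b \ne 1$ for some $a \in \mathbb{Q}$, $b \in \mathbb{Q}^+$. Then, for all positive $n \in \mathbb{N}$, we have $b = \varphi(a) = \varphi\left(n \cdot \frac{a}{n}\right) = \varphi\left(\frac{a}{n}\right)^n$, so that $b$ would have a rational $n$-th root for every $n$. This is a contradiction, since only some $n$-th roots of $b$ are rational (cf. 1.G.1). □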
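12.F.25. Let $G$ be the group of matrices of the form $\begin{pmatrix} a & 0 \\ b & a^{-1} \end{pmatrix}$, where $a, b \in \mathbb{R}$ and $a > 0$, and let $N$ be the set of matrices of the form $\begin{pmatrix} 1 & 0 \\ b & 1 \end{pmatrix}$, where $b \in \mathbb{R}$. Show that $N$ is a normal subgroup of $G$ and prove that $G/N$ is isomorphic to $\mathbb{R}$.
Solution. The key to the proof is the formula for multiplication in $G$:
$$\begin{pmatrix} a & 0 \\ b & a^{-1} \end{pmatrix} \begin{pmatrix} a_1 & 0 \\ b_1 & a_1^{-1} \end{pmatrix} = \begin{pmatrix} a a_1 & 0 \\ b a_1 + a^{-1} b_1 & a^{-1} a_1^{-1} \end{pmatrix}.$$
Hence we can see that the mapping $\begin{pmatrix} a & 0 \\ b & a^{-1} \end{pmatrix} \mapsto a$ is a homomorphism with kernel $N$. Thus, $N$ is a normal subgroup of $G$. Moreover, $G/N$ is isomorphic to the multiplicative group $\mathbb{R}^+$, which is isomorphic to the additive group $\mathbb{R}$.
□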
The natural numbers $\mathbb{N} = \{0, 1, 2, \ldots\}$ together with either addition or multiplication form a groupoid. These operations are both commutative and associative. The integers $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ form a groupoid with any of addition, subtraction, and multiplication. Subtraction is neither associative, for example
$(5 - 3) - 2 = 0 \ne 5 - (3 - 2) = 4$,
nor commutative, since $a - b = -(b - a)$, which is in general different from $b - a$.
Neutral elements, inverses, and groups⁷
A left identity (or left neutral element) in a groupoid $(A, \cdot)$ is an element $e \in A$ such that $e \cdot a = a$ for all $a \in A$. Similarly, $e \in A$ is a right identity (right neutral element) iff $a \cdot e = a$ for all $a \in A$. If $e$ satisfies both these properties, it is called an identity (or a neutral element).
In a semigroup $(A, \cdot)$ with identity $e$, an element $b$ is a left inverse of an element $a$ iff $b \cdot a = e$; it is a right inverse of $a$ iff $a \cdot b = e$. If $b$ satisfies both these properties, it is called an inverse of $a$.
A monoid $(M, \cdot)$ is a semigroup which has a neutral element. A group $(G, \cdot)$ is a monoid where each element has an inverse.
A commutative semigroup is a semigroup where the operation is commutative, similarly for a commutative monoid or a commutative group. A commutative group is also often called an Abelian group.
Consider direct consequences of the definitions. A groupoid cannot have both a left identity and a different right identity (if it had, what would be their product equal to?). Thus, if a groupoid has a (two-sided) identity, then it is the only identity element, called the identity.
Similarly, in a monoid, an element $x$ cannot have both a left inverse $a$ and a different right inverse $b$, since if $a \cdot x = x \cdot b = e$, then also
$a = a \cdot (x \cdot b) = (a \cdot x) \cdot b = b$.
Note that associativity of the operation is needed here. It follows that if $x$ has an inverse, then it is unique. It is usually denoted by $x^{-1}$.
As an example, consider again the subtraction on integers. This operation is not associative. There is a right identity (zero), i.e. a — 0 = a for any integer a, but it is not a left identity. There is no left identity for subtraction.
The integers are a semigroup with respect to either addition or multiplication. They form a group only with addition, since with respect to multiplication, only the integers ±1 have an inverse.
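The definitions above can be tested mechanically on a finite set. Since the integers are infinite, the following sketch uses the residues modulo 6 as a finite stand-in (our assumption, not an example from the text); the function names are illustrative.

```python
from itertools import product

def is_associative(elements, op):
    """Check a·(b·c) == (a·b)·c for all triples."""
    return all(op(a, op(b, c)) == op(op(a, b), c)
               for a, b, c in product(elements, repeat=3))

def identity(elements, op):
    """Return the two-sided identity, or None if there is none."""
    for e in elements:
        if all(op(e, a) == a and op(a, e) == a for a in elements):
            return e
    return None

def classify(elements, op):
    """Name the strongest structure: groupoid, semigroup, monoid or group."""
    if not is_associative(elements, op):
        return "groupoid"
    e = identity(elements, op)
    if e is None:
        return "semigroup"
    has_inverses = all(
        any(op(a, b) == e and op(b, a) == e for b in elements)
        for a in elements
    )
    return "group" if has_inverses else "monoid"

Z6 = range(6)  # assumption: residues modulo 6 replace the integers
print(classify(Z6, lambda a, b: (a + b) % 6))  # addition
print(classify(Z6, lambda a, b: (a * b) % 6))  # multiplication
print(classify(Z6, lambda a, b: (a - b) % 6))  # subtraction
```

As with the integers, addition gives a group, multiplication only a monoid, and subtraction is not even associative.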
If $(A, \cdot)$ is a group, then any subset $B \subseteq A$ which is closed with respect to the restriction of $\cdot$ (i.e. $a \cdot b \in B$ for any $a, b \in B$) and forms a group with this operation is called
⁷ The name "Abelian" is in honour of the young mathematician Niels Henrik Abel. The adjective is so widely used that it is common to write it with a lower-case 'a', abelian, although it is derived from a surname.
12.F.26. Let G be a group of order 14 which has a normal subgroup N of order 2. Prove that G is commutative.
Solution. Clearly, the order of the group $G/N$ is $|G/N| = \frac{|G|}{|N|} = 7$. By Lagrange's theorem 12.3.10, the orders of its elements are 1 or 7. Since only the identity has order 1, this means that there is an element of order 7, so that the group $G/N$ is cyclic. Let $N = \{e, n\}$, where $e$ is the identity of $G$, and let $[a]$ be a generator of $G/N$. Since $N$ is normal, we have $ana^{-1} \in N$, but $ana^{-1} = e$ implies $n = e$, which means that we must have $ana^{-1} = n$, i.e., $na = an$. Since $[a]$ generates $G/N$, each element of $G/N$ is of the form $[a]^k = [a^k]$, $k = 0, \ldots, 6$. Then, each element of $G$ is of the form $a^k$ or $a^k n$, and since $a$ and $n$ commute, all elements of $G$ commute. □
12.F.27. Decide whether the following holds: If the quotient G/N of a group G by a normal subgroup N is commutative, then G itself is commutative. O
12.F.28. Prove that any subgroup $H$ of the symmetric group $S_n$ contains either only even permutations or the same number of even and odd permutations.
Solution. Consider the homomorphism $p : H \to \mathbb{Z}_2$ which maps each permutation to its parity (0 for even and 1 for odd). Then, $p^{-1}(0) = \operatorname{Ker}(p)$ is a normal subgroup of $H$: let $h \in \operatorname{Ker}(p)$ and $g \in H$, then
$p(ghg^{-1}) = p(g) + p(h) + p(g^{-1}) = p(g) + p(g^{-1}) = p(gg^{-1}) = p(e) = 0$,
which means that $ghg^{-1} \in \operatorname{Ker}(p)$, i.e., $\operatorname{Ker}(p)$ is normal. Since $\mathbb{Z}_2$ has only two elements, it follows that $H/\operatorname{Ker}(p)$ has either only one coset (i.e., all permutations are even) or two cosets, which must be of equal size (i.e., there are the same number of even and odd permutations). □
12.F.29. Describe the group of symmetries of a regular tetrahedron and find all of its subgroups.
Solution. Let us denote the vertices of the tetrahedron as $a, b, c, d$. Each symmetry can be described as a permutation of the vertices (to which vertex each one goes). Thus, the group of symmetries of the tetrahedron is isomorphic to a certain subgroup of the symmetric group $S_4$. Given any pair of vertices, there exists a symmetry which swaps this pair and keeps the other two vertices fixed (this is the reflection with respect to the plane that is perpendicular to the line segment of the pair and
a subgroup. Both conditions are essential. For instance, consider the integers as a subset of the rational numbers and the multiplication there.
Let $G$ be a group and $M \subseteq G$. The subgroup generated by $M$ is the smallest (with respect to set inclusion) subgroup of $G$ which contains all the elements of $M$. Clearly, this is the intersection of all subgroups containing $M$.
Here are a few very well known examples of groups. The rational numbers $\mathbb{Q}$ are a commutative group with respect to addition. The integers are one of their subgroups. The non-zero rational numbers are a commutative group with respect to multiplication.
For every positive integer $k$, the set of all $k$-th roots of unity, i.e. the set $\{z \in \mathbb{C};\ z^k = 1\}$, is a finite commutative group with respect to multiplication of complex numbers. For $k = 2$, this is the two-element group $\{-1, 1\}$, both of whose elements are self-inverse. For $k = 4$, this is the group $G = \{1, i, -1, -i\}$.
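A small numerical sketch of the roots of unity follows. It works in floating point, so equality is only tested up to a tolerance `tol`, which is our assumption; the function names are illustrative.

```python
import cmath

def roots_of_unity(k):
    """The k complex numbers exp(2*pi*i*n/k), n = 0, ..., k-1."""
    return [cmath.exp(2j * cmath.pi * n / k) for n in range(k)]

def is_closed(roots, tol=1e-9):
    """Is every product of two roots (numerically) again one of the roots?"""
    return all(any(abs(z * w - r) < tol for r in roots)
               for z in roots for w in roots)

def has_inverses(roots, tol=1e-9):
    """Does every root have an inverse among the roots?"""
    return all(any(abs(z * w - 1) < tol for w in roots) for z in roots)

fourth = roots_of_unity(4)
print([complex(round(z.real), round(z.imag)) for z in fourth])
print(is_closed(fourth), has_inverses(fourth))
```

For $k = 4$ the rounded output lists the four numbers $1, i, -1, -i$, and both group checks succeed.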
The set $\mathrm{Mat}_n$, $n > 1$, of all square matrices of order $n$ is a (non-commutative) monoid with respect to multiplication and a commutative group with respect to addition (see subsections 2.1.2-2.1.5).
The set of all linear mappings $\mathrm{Hom}(V, V)$ on a vector space $V$ is a monoid with respect to mapping composition and a commutative group with respect to addition (see subsection 2.3.12).
In every monoid, the subset of all invertible elements forms a group. In the former of the above examples, it was the group of invertible matrices. In the latter case, it was the group of linear transformations of the corresponding vector space.
In previous chapters, there are several (semi)group structures, sometimes met quite unexpectedly. For example, recall various subgroups of the group of matrices or the group structure on elliptic curves.
12.3.2. Permutation groups. Groups and semigroups often arise as sets of mappings on a fixed set $M$, which are closed with respect to mapping composition.
This is easily seen on finite sets M, where every subset of invertible mappings generates a group with respect to composition.
Such a set $M$ consisting of $m = |M| \in \mathbb{N}$ elements (the empty set has 0 elements) allows for $m^m$ possible mappings (each of the $m$ elements can be sent to an arbitrary element of $M$), and all of these mappings can be composed. Since mapping composition is associative, the set of all mappings on $M$ is a semigroup.
If a mapping $\sigma : M \to M$ is required to have an inverse $\sigma^{-1}$, then $\sigma$ must be a bijection. Composition of two bijections yields again a bijection; hence the set $\Sigma_m$ of all bijections on an $m$-element set $M$ is a group. This is called the
goes through its center). Thus, the wanted subgroup is generated by all transpositions in $S_4$. However, this is the group $S_4$ itself.
Thus, let us describe all subgroups of the group $S_4$. This group has 24 elements, which means that the order of any subgroup must be one of 1, 2, 3, 4, 6, 8, 12, 24 (see 12.3.10). Clearly, the only subgroup of order 1 is the trivial subgroup $\{\mathrm{id}\}$. Similarly, the only subgroup of order 24 is the entire group $S_4$. Now, let us look at the remaining orders of a potential subgroup $H \subseteq S_4$.
(i) $|H| = 2$. $H$ must consist of the identity and another self-inverse element ($x^2 = \mathrm{id}$). These are transpositions and double transpositions (compositions of two disjoint transpositions). Geometrically, double transpositions correspond to rotations by 180° around the axis that goes through the centers of opposite edges. Thus, we get nine subgroups: $\{\mathrm{id}, (a,b)\}$, $\{\mathrm{id}, (a,c)\}$, $\{\mathrm{id}, (a,d)\}$, $\{\mathrm{id}, (b,c)\}$, $\{\mathrm{id}, (b,d)\}$, $\{\mathrm{id}, (c,d)\}$, $\{\mathrm{id}, (a,b) \circ (c,d)\}$, $\{\mathrm{id}, (a,c) \circ (b,d)\}$, $\{\mathrm{id}, (a,d) \circ (b,c)\}$.
(ii) $|H| = 3$. By Lagrange's theorem, such a subgroup must be cyclic, i.e., it must be of the form $\{\mathrm{id}, p, p^2\}$, $p^3 = \mathrm{id}$. Thus, the factorization of $p$ to independent cycles must contain a cycle of length 3, which means that $p$ cannot contain anything else. By 12.F.17, there are $4 \cdot 2$ cycles of length 3, which give rise to the following four subgroups: $\{\mathrm{id}, (a,b,c), (a,c,b)\}$, $\{\mathrm{id}, (a,c,d), (a,d,c)\}$, $\{\mathrm{id}, (a,b,d), (a,d,b)\}$, $\{\mathrm{id}, (b,c,d), (b,d,c)\}$. The cycles of length 3 correspond to rotations by 120° around the axis that goes through a vertex and the center of the opposite face.
(iii) $|H| = 4$. Such a subgroup must be isomorphic to $\mathbb{Z}_4$ or $\mathbb{Z}_2 \times \mathbb{Z}_2$. Considering the factorization to independent cycles, we find out that the only permutations of order 4 are cycles of length 4. Thus, a cyclic subgroup must contain a cycle of length 4, namely exactly two of them, since if $p$ has order 4, then $p^{-1} = p^3$ is also of order 4, i.e., a cycle of length 4. Then, the permutation $p^2$ has order 2, so it must be a double transposition (it is not a single transposition, since $p^2$ clearly does not have a fixed point). There are six cycles of length 4 (see 12.F.17), and they pair up to the following three subgroups of this type:
$\{\mathrm{id}, (a,b,c,d), (a,c) \circ (b,d), (a,d,c,b)\}$, $\{\mathrm{id}, (a,c,b,d), (a,b) \circ (c,d), (a,d,b,c)\}$, $\{\mathrm{id}, (a,b,d,c), (a,d) \circ (b,c), (a,c,d,b)\}$. As for subgroups isomorphic to $\mathbb{Z}_2 \times \mathbb{Z}_2$, they must contain (besides the identity) only elements of order 2, which
symmetric group (on $m$ elements). It is an example of a finite group.⁸
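The count of mappings on a finite set can be reproduced by enumeration. In this sketch the set $M$ is taken to be $\{0, \ldots, m-1\}$ and a mapping is encoded as the tuple of its images (both are our conventions); the number $m!$ of bijections is a standard fact added here for comparison.

```python
from itertools import product
from math import factorial

def all_mappings(m):
    """All mappings {0..m-1} -> {0..m-1}, each encoded as a tuple of images."""
    return list(product(range(m), repeat=m))

def is_bijection(mapping):
    """On a finite set, a mapping is bijective iff no image repeats."""
    return len(set(mapping)) == len(mapping)

m = 3
maps = all_mappings(m)
bijections = [s for s in maps if is_bijection(s)]
print(len(maps), len(bijections))  # m**m mappings, m! of them bijections
```

For $m = 3$ this gives 27 mappings, 6 of which are invertible.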
The name of the group $\Sigma_m$ brings another connection: Instead of bijections on a finite set, permutations can be viewed as the rearranging of distinguished objects. Permutations are encountered in this sense when studying determinants, for example, see subsection 2.2.1 on page 84 for a few elementary results. Let us briefly recollect them now in view of the general concepts of groups and their homomorphisms.
What the operation in this group looks like needs more thought. In the case of a (small) finite group, one can build a complete table of the operation results for all pairs of operands. Considering the group $\Sigma_3$ on the numbers $\{1, 2, 3\}$ and denoting the particular permutations by the ordering of the numbers
$a = (1, 2, 3)$, $b = (2, 3, 1)$, $c = (3, 1, 2)$,
$d = (1, 3, 2)$, $e = (3, 2, 1)$, $f = (2, 1, 3)$,
then the composition is given by the following table:
	a	b	c	d	e	f
a	a	b	c	d	e	f
b	b	c	a	f	d	e
c	c	a	b	e	f	d
d	d	e	f	a	b	c
e	e	f	d	c	a	b
f	f	d	e	b	c	a
Note that there is a fundamental difference between the permutations $a$, $b$, $c$ and the other three. The former three form a cycle, generated by either $b$ or $c$:
$b^2 = c$, $b^3 = a$, $c^2 = b$, $c^3 = a$.
It follows that these three permutations form a commutative subgroup. Here (as well as in the whole group), $a$ is the neutral element and $b$ and $c$ are inverses of each other. Therefore, this subgroup is the same as the group $\mathbb{Z}_3$ of residue classes modulo 3, or as the group of third roots of unity.
The other three permutations are self-inverse, which means that any one of them together with the identity $a$ forms a subgroup, the same one as $\mathbb{Z}_2$. The elements $b$ and $c$ are of order 3, i.e. the third power is the first one equal to the identity $a$, while $d$, $e$, and $f$ are of order 2.
Since the table is not symmetric with respect to the diagonal, the composition $\cdot$ is not commutative.
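The table can be regenerated by a few lines of code. The sketch below assumes that the entry in row $r$ and column $s$ is the composition $r \circ s$ with $s$ applied first; the text does not state the convention, but this choice reproduces the printed table. The names `names`, `perm`, `compose`, `table` are ours.

```python
# The six permutations of {1, 2, 3}, written as orderings (tuples of images).
names = {
    (1, 2, 3): "a", (2, 3, 1): "b", (3, 1, 2): "c",
    (1, 3, 2): "d", (3, 2, 1): "e", (2, 1, 3): "f",
}
perm = {name: p for p, name in names.items()}

def compose(p, q):
    """p ∘ q: apply q first, then p; p[i - 1] is the image of i."""
    return tuple(p[q[i] - 1] for i in range(3))

def table():
    """Map (row, column) to the name of row ∘ column."""
    order = "abcdef"
    return {(r, s): names[compose(perm[r], perm[s])]
            for r in order for s in order}

t = table()
for r in "abcdef":
    print(r, " ".join(t[(r, s)] for s in "abcdef"))

# The table is not symmetric, e.g. b∘d differs from d∘b.
print(t[("b", "d")], t[("d", "b")])
```

With the opposite convention (row applied first) the table would be the transpose of the printed one.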
Other permutation groups $\Sigma_m$ of finite $m$-element sets behave similarly. Each permutation $\sigma$ partitions the set $M$ into a disjoint union of maximal invariant subsets, which are obtained by taking unprocessed elements $x \in M$ step by step and putting all iteration results $\sigma^k(x)$, $k = 1, 2, \ldots$, into the class $M_x$ until $\sigma^k(x) = x$.
Each permutation is obtained as a composition of the cycles, which behave as the identity outside $M_x$ and as $\sigma$ on $M_x$. If the elements of $M_x$ are numbered as $(1, 2, \ldots, |M_x|)$ so that $i$ corresponds to $\sigma^i(x)$, then the permutation is simply
⁸ It can be proved that every finite group is a subgroup of an appropriate finite symmetric group. This can be interpreted so that the groups $\Sigma_m$ are as non-commutative and complex as possible.
are transpositions and double transpositions. By 12.F.28, the subgroup must contain either no or exactly two transpositions. Moreover, it cannot contain two dependent transpositions, since their composition is a cycle of length 3. Thus, the subgroup contains (besides the identity) either two independent transpositions and the double transposition which is their composition (this gives rise to three subgroups), or the three double transpositions. Altogether, we have found: $\{\mathrm{id}, (a,b), (a,b) \circ (c,d), (c,d)\}$, $\{\mathrm{id}, (a,c), (a,c) \circ (b,d), (b,d)\}$, $\{\mathrm{id}, (a,d), (a,d) \circ (b,c), (b,c)\}$ and $\{\mathrm{id}, (a,b) \circ (c,d), (a,c) \circ (b,d), (a,d) \circ (b,c)\}$.
(iv) $|H| = 6$. By 12.F.11, this subgroup is isomorphic to $S_3$ (it cannot be isomorphic to $\mathbb{Z}_6$ since there is no element of order 6 in $S_4$), so it contains (besides the identity) two elements $x$, $x^{-1}$ of order 3 and three elements of order 2. Thus, $x$ and $x^{-1}$ are cycles of length 3 which fix the same vertex (say $a$). What are the other three elements? There cannot be a double transposition, since its composition with $x$ yields another cycle of length 3. There cannot be a transposition which does not fix $a$, since its composition with $x$ yields a cycle of length 4. Thus, the only possibility is that there are the three transpositions which also fix $a$. Since there are four possibilities which vertex is the fixed one, we obtain four subgroups of order 6.
(v) $|H| = 8$. The group cannot be a subgroup of the group $A_4$ of even permutations (since there are 12 of them, and 8 does not divide 12). Thus, by 12.F.28, $H$ must contain four even and four odd permutations. The even permutations must form a subgroup of $A_4$, and we could see in (iii) that the only such 4-element subgroup is $\{\mathrm{id}, (a,b) \circ (c,d), (a,c) \circ (b,d), (a,d) \circ (b,c)\}$, which is normal. Considering any odd permutation and the coset (with respect to the above normal subgroup) which contains it, we can see that the coset together with the above 4 elements form a subgroup of $S_4$. We thus get three subgroups of $S_4$. It is not hard to realize that each of them is isomorphic to the group of symmetries of a square (the so-called dihedral group $D_4$). From the geometrical point of view, we can describe it as follows: Consider the orthogonal projection of the tetrahedron onto the plane that is perpendicular to the line that goes through the centers of opposite edges. The boundary of this projection is a square. Out of all the symmetries of the tetrahedron, we take only those which induce a symmetry of this square (for instance, it will not be a symmetry which only swaps adjacent vertices of the
a one-place shift in the cycle (i.e. the last element is mapped back to the first one). Hence the name cycle. These cycles commute, so it does not matter in which order the permutation $\sigma$ is composed from them. (Of course, if we pick two arbitrary cycles on $M$, they do not have to commute.)
The simplest cycles are one-element fixed points of $\sigma$ and two-element subsets $(x, \sigma(x))$, where $\sigma(\sigma(x)) = x$. The latter are called transpositions. Since every cycle can be composed from transpositions of adjacent elements (just let the last element "bubble" back to the beginning), every permutation can be written as a composition of transpositions of adjacent elements.
Return to the case of $\Sigma_3$. Two elements, $b$, $c$, represent cycles which include all the three elements; each of them generates $\{a, b, c\} = \mathbb{Z}_3$. Besides those, $d$, $e$, $f$ are composed of cycles of length 2 and 1; finally $a$ is composed of three cycles of length one. There are no more possibilities. However, it is clear from the procedure that for more elements, there are very many possibilities.
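The procedure for splitting a permutation into cycles can be sketched as follows. A permutation of $\{1, \ldots, m\}$ is assumed to be given as the tuple of its images, as in the table of the six permutations above; the function name `cycles` is ours.

```python
def cycles(sigma):
    """Decompose a permutation of {1..m} (tuple of images) into its cycles."""
    seen, result = set(), []
    for x in range(1, len(sigma) + 1):
        if x in seen:
            continue  # x already belongs to an earlier cycle
        cycle, y = [], x
        while y not in seen:  # iterate sigma until we return to x
            seen.add(y)
            cycle.append(y)
            y = sigma[y - 1]
        result.append(tuple(cycle))
    return result

print(cycles((2, 3, 1)))  # the permutation b: one cycle of length 3
print(cycles((1, 3, 2)))  # the permutation d: a fixed point and a transposition
print(cycles((1, 2, 3)))  # the identity a: three cycles of length one
```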
In general, there are many ways of expressing a permutation as a composition of transpositions. However, for a given permutation, the parity of the number of transpositions is fixed and independent of the choice of particular transpositions. This can be seen from the number of inversions of a permutation, since each transposition changes the number of inversions by an odd number (see the discussion in subsection 2.2.2 on page 85).
It follows that there is a well-defined mapping
$\operatorname{sgn} : \Sigma_m \to \mathbb{Z}_2 = \{\pm 1\}$,
the permutation parity. This recovers the proposition crucial for building the determinants (see 2.2.1 and on):
Theorem. Every permutation of a finite set can be written as a composition of cycles. A cycle of length $\ell$ can be expressed as a composition of $\ell - 1$ transpositions. The parity of this cycle is $(-1)^{\ell - 1}$.
The parity of the composition $\sigma \circ \tau$ is equal to the product of the parities of the composed permutations $\sigma$ and $\tau$.
The last proposition suggests that the mapping $\operatorname{sgn}$ transforms the permutation composition $\sigma \circ \tau$ to the product $\operatorname{sgn} \sigma \cdot \operatorname{sgn} \tau$ in the commutative group $\mathbb{Z}_2$.
(Semi)group homomorphisms
In general, a mapping $f : G_1 \to G_2$ is a (semi)group homomorphism if and only if it respects the operation, i.e.
$f(a \cdot b) = f(a) \cdot f(b)$.
In particular, the permutation parity is a homomorphism
$\operatorname{sgn} : \Sigma_m \to \mathbb{Z}_2$.
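The homomorphism property of the parity can be verified exhaustively for a small symmetric group. The sketch computes the parity in two ways, by counting inversions and by multiplying $(-1)^{\ell-1}$ over the cycles, and compares them; permutations of $\{0, 1, 2, 3\}$ as tuples of images and all function names are our assumptions.

```python
from itertools import permutations

def sign_by_inversions(p):
    """(-1) to the number of pairs i < j with p[i] > p[j]."""
    n = len(p)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if p[i] > p[j])
    return -1 if inversions % 2 else 1

def sign_by_cycles(p):
    """Product of (-1)**(length - 1) over the cycles of p."""
    seen, sign = set(), 1
    for x in range(len(p)):
        length, y = 0, x
        while y not in seen:
            seen.add(y)
            y = p[y]
            length += 1
        if length:  # length is 0 when x was already visited
            sign *= (-1) ** (length - 1)
    return sign

def compose(p, q):
    """p ∘ q: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

S4 = list(permutations(range(4)))
agree = all(sign_by_inversions(p) == sign_by_cycles(p) for p in S4)
homomorphism = all(
    sign_by_inversions(compose(p, q))
    == sign_by_inversions(p) * sign_by_inversions(q)
    for p in S4 for q in S4
)
print(agree, homomorphism)
```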
12.3.3. Symmetries of plane figures. In the fifth part of chapter one, the connections between invertible 2-by-2 matrices and linear transformations in the plane are thoroughly considered.
resulting square). Since there are three pairs of opposite edges in the tetrahedron, we get three 8-element subgroups, isomorphic to the dihedral group $D_4$.
(vi) $|H| = 12$. By 12.F.28, such a subgroup contains either only even permutations, or six even and six odd permutations, and the six even permutations must form a subgroup of $S_4$. However, we could see in (iv) that there is no 6-element subgroup of $S_4$ consisting only of even permutations. Thus, the only possibility is the alternating group $A_4$ of all even permutations in $S_4$. From the geometric point of view, these are the so-called direct symmetries, which are realized by rotations (not reflections), and thus can be performed in the space. □
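The list of subgroups found in 12.F.29 can be cross-checked by brute force. The sketch below generates the subgroup spanned by every pair of elements and counts the distinct results by order. It relies on the assumption that every subgroup of $S_4$ can be generated by at most two elements, which holds for this group but not for finite groups in general; vertices are numbered 0 to 3 and all names are ours.

```python
from collections import Counter
from itertools import permutations

def compose(p, q):
    """p ∘ q: apply q first, then p; permutations are tuples of images."""
    return tuple(p[q[i]] for i in range(len(q)))

def generated(gens, identity):
    """Subgroup generated by gens: in a finite group, closing the set under
    left multiplication by the generators is enough."""
    elems, frontier = {identity}, [identity]
    while frontier:
        x = frontier.pop()
        for g in gens:
            y = compose(g, x)
            if y not in elems:
                elems.add(y)
                frontier.append(y)
    return frozenset(elems)

S4 = list(permutations(range(4)))
e = tuple(range(4))
# Assumption: two generators suffice for every subgroup of S4.
subgroups = {generated((p, q), e) for p in S4 for q in S4}
by_order = Counter(len(h) for h in subgroups)
print(sorted(by_order.items()))
```

The counts per order (1, 9, 4, 7, 4, 3, 1, 1 for the orders 1, 2, 3, 4, 6, 8, 12, 24) agree with the cases (i) to (vi), giving 30 subgroups in total.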
Remark. In general, the group of symmetries of a solid with $n$ vertices is a subgroup of the symmetric group $S_n$.
12.F.30. Which subgroups of the group $S_4$ are normal?
Solution. By definition, a subgroup $H \subseteq S_4$ is normal iff it is closed under conjugation, i.e., $ghg^{-1} \in H$ for any $g \in S_4$, $h \in H$. Since conjugation in symmetric groups only renames the permuted items but preserves the permutation structure (i.e. the cycle lengths in the factorization to independent cycles), we can see that $H$ is normal if and only if it contains either no or all permutations of each type. Examining all the subgroups which we found in the previous exercise, we find that the normal ones are the trivial group $\{\mathrm{id}\}$, the so-called Klein group (consisting of the identity and the three double transpositions, which we already met in 12.F.8), the alternating group $A_4$ of all even permutations, and the entire group $S_4$. □
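The normality criterion $ghg^{-1} \in H$ is easy to test directly. The following sketch compares the Klein group with a cyclic subgroup of order 4; the items are numbered 0 to 3 and the names `klein`, `cyclic4`, `is_normal` are illustrative.

```python
from itertools import permutations

def compose(p, q):
    """p ∘ q: apply q first, then p; permutations are tuples of images."""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    """Inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def is_normal(H, G):
    """H is normal in G iff g h g^-1 lies in H for all g in G, h in H."""
    return all(compose(compose(g, h), inverse(g)) in H for g in G for h in H)

S4 = list(permutations(range(4)))
identity = (0, 1, 2, 3)
# Identity and the three double transpositions (0 1)(2 3), (0 2)(1 3), (0 3)(1 2).
klein = {identity, (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)}
# Powers of the 4-cycle 0 -> 1 -> 2 -> 3 -> 0.
cyclic4 = {identity, (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2)}

print(is_normal(klein, S4))    # contains all permutations of its cycle type
print(is_normal(cyclic4, S4))  # contains only two of the six 4-cycles
```

The cyclic subgroup fails because it contains some, but not all, permutations of the same cycle type.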
12.F.31. Find the group of symmetries of a cube (describe all symmetries). Is this group commutative?
Solution. The group has 48 elements; 24 of them are generated by rotations (these are the so-called direct symmetries), the other 24 are compositions of a direct symmetry and a reflection. The group is not commutative (consider the composition of a reflection with respect to the plane containing the centers of four parallel edges and a rotation by 90° around the axis that lies in the plane and goes through the centers of two opposite sides). The group is isomorphic to $S_4 \times \mathbb{Z}_2$. □
12.F.32. In the group of symmetries of a cube, find the subgroup generated by a reflection with respect to the plane containing the centers of four parallel edges and the rotation by 180° around the axis that lies in the plane and goes through
A matrix in $\mathrm{Mat}_2(\mathbb{R})$ defines a linear mapping $\mathbb{R}^2 \to \mathbb{R}^2$ that preserves standard distances if and only if its columns form an orthonormal basis of $\mathbb{R}^2$ (which is a simple condition for the matrix entries, see subsection 1.5.7 on page 33). It is easy to prove that every mapping of the plane into itself which preserves distances is affine Euclidean. That is, it is a composition of a linear mapping and an appropriate translation.⁹
As observed, the linear part of this mapping is orthogonal. Thus, all these mappings form a group of all orthogonal transformations (also called Euclidean transformations) in the plane. Moreover, it is shown that besides translations $T_a$ by a vector $a$, these are only rotations around the origin by any angle $\varphi$, and reflections $F_\ell$ with respect to any line $\ell$ that goes through the origin (also note that the central inversion is the same as the rotation by $\pi$).
Now, general group concepts are illustrated on the problem of symmetries of plane figures. For example, consider tiles. First, consider them individually, in the form of a bounded diagram in the plane. Then consider with the condition of tiling in a band, and then in the entire plane.
As an example, consider a line segment and an equilateral triangle. The question is how symmetric these objects are; that is, with respect to which transformations (that preserve distances) they are invariant. In other words, we want the image of the figure to be identical to the original one (unless some significant points are labeled, for example the vertices of the triangle $A$, $B$, $C$ or the endpoints of the line segment). It is clear that all symmetries of a fixed object form a group (usually with only one element, the identity).
In the case of the line segment, the situation is very simple: it is clear that the only non-trivial symmetries are rotation by $\pi$ around the center of the segment, reflection with
⁹ If a mapping $F : \mathbb{R}^2 \to \mathbb{R}^2$ preserves distances, then this must also hold for the mapped vectors of velocity, i.e. the Jacobi matrix $DF(x, y)$ must be orthogonal at every point. Expanding this condition for the given mapping $F = (f(x, y), g(x, y)) : \mathbb{R}^2 \to \mathbb{R}^2$ leads to a system of differential equations which has only affine solutions, since all second derivatives of $F$ must be zero (and then, the proposition is an immediate consequence of Taylor's remainder theorem). Try to think out the details! The same procedure leads to the result for Euclidean spaces of arbitrary dimension. Note that the condition to be proved is independent of the choice of affine coordinates. Composing $F$ with a linear mapping does not change the result. Hence, for a fixed point $(x, y)$, compose $(DF)^{-1} \circ F$ and assume, without loss of generality, that $DF(x, y)$ is the identity matrix. Differentiation of the equations then yields the desired proposition.
CHAPTER 12. ALGEBRAIC STRUCTURES
the centers of two opposite sides. Is this subgroup normal?
o
12.F.33. For each of the following permutations, decide whether the subgroup it generates is normal in the corresponding group:
• $(1,2,3)$ in $S_3$,
• $(1,2,3,4)$ in $S_4$,
• $(1,2,3)$ in $A_4$.
For the last case, find the right cosets of $A_4$ by the considered subgroup. Find all $n \ge 3$ for which the subset of all cycles of length $n$ together with the identity is a subgroup of $S_n$. Show that if this is so, then it is even a normal subgroup.
Solution.
• It is a normal subgroup (it is the group $A_3$).
• It is not a normal subgroup, since $(1,2) \circ (1,3) \circ (2,4) \circ (1,2) = (4,1) \circ (2,3)$, which does not lie in the subgroup.
• It is not a normal subgroup. The right cosets are
$\{(1,2,4), (2,4,3), (1,3) \circ (2,4)\}$, $\{(1,4,2), (1,4,3), (1,4) \circ (2,3)\}$, $\{(2,3,4), (1,2) \circ (3,4), (1,3,4)\}$, $\{\mathrm{id}, (1,2,3), (1,3,2)\}$.
The mentioned subset is a subgroup only for $n = 3$. In this case, it is the alternating group $A_3$ of all even permutations in $S_3$, which is a normal subgroup. (For greater values of $n$, we can find two cycles of length $n$ whose composition is neither a cycle of length $n$ nor the identity.) □
12.F.34. Find the subgroup of $S_6$ that is generated by the permutations $(1,2) \circ (3,4) \circ (5,6)$, $(1,2,3,4)$, and $(5,6)$. Is this subgroup normal? If so, describe the set of (two-sided) cosets $S_6/H$.
Solution. First of all, note that all of the generating permutations lie in the subgroup $S_4 \times S_2 \subseteq S_6$ (considering the natural inclusion of $S_4 \times S_2$, i.e., for $s \in S_4 \times S_2$, the restriction of $s$ to $\{1,2,3,4\}$ is a permutation on this set and so is the restriction of $s$ to $\{5,6\}$). This means that the group they generate is also a subgroup of $S_4 \times S_2$. Moreover, since there is $(5,6)$ among the generators, we can see that the subgroup is of the form $H \times S_2$, where $H \subseteq S_4$. Thus, it suffices to describe $H$. This group is generated by the elements $(1,2) \circ (3,4)$ and
respect to the axis of the segment, and reflection with respect to the line itself. All these symmetries are self-inverse. Hence the group of symmetries has four elements. Its table looks as follows:
          $R_0$     $R_\pi$   $F_H$     $F_V$
$R_0$     $R_0$     $R_\pi$   $F_H$     $F_V$
$R_\pi$   $R_\pi$   $R_0$     $F_V$     $F_H$
$F_H$     $F_H$     $F_V$     $R_0$     $R_\pi$
$F_V$     $F_V$     $F_H$     $R_\pi$   $R_0$
This group is commutative.
For the equilateral triangle, there are more symmetries: one can rotate by $2\pi/3$ or one can mirror with respect to the axes of the sides. In order to obtain the entire group, all compositions of these transformations must be added in. In 1.5.7 it is shown that the composition of two reflections is always a rotation. At the same time, it is clear that changing the order of composition of two fixed reflections leads to a rotation by the same angle but in the other orientation. It follows that the reflections around two axes generate all the symmetries, of which there are six altogether. Placing the triangle as is shown in the diagram, the six transformations are given by the following matrices:
A comparison of the table of the operation with that of the permutation group $\Sigma_3$ shows that it is the same. For the sake of clarity, the vertices are labeled with numbers, so the corresponding permutations can be easily understood.
Similarly, there are groups of symmetries with $k$ rotations and $k$ reflections. It suffices to consider a regular $k$-gon. These groups are usually denoted $D_k$ and are called the dihedral groups of order $k$. They are not commutative for $k \ge 3$ ($D_2$ is commutative). The name comes from the fact that $D_2$ is the group of symmetries of the hydrogen molecule $H_2$, which contains two hydrogen atoms and can be imagined as a line segment.
Similarly, there are figures whose only symmetries are rotations, and hence the corresponding groups are commutative. They are denoted $C_k$ and called cyclic groups of order $k$. For that, it suffices to consider a regular polygon whose sides are changed non-symmetrically, but in the same manner (see the extension of the triangle in the diagram). Note that the group $C_2$ can be realized in two ways: either using the rotation by $\pi$ or a single reflection.
As the first illustration of the power of abstraction, we prove the following theorem. A figure is said to have a discrete group of symmetries if and only if the set of images of an arbitrary point over all the symmetries is a discrete subset of the plane (i.e. each of its points has a neighbourhood where there is no other point of the set).
$(1,2,3,4)$ (projection of the generators on $S_4$). We have
$(1,2,3,4)^2 = (1,3) \circ (2,4)$,
$(1,2,3,4)^3 = (4,3,2,1)$,
$(1,2,3,4)^4 = \mathrm{id}$,
$[(1,2) \circ (3,4)]^2 = \mathrm{id}$,
$[(1,2) \circ (3,4)] \circ (1,2,3,4) = (2,4)$,
$(1,2,3,4) \circ [(1,2) \circ (3,4)] = (1,3)$,
$[(1,2) \circ (3,4)] \circ (4,3,2,1) = (1,3)$,
$(4,3,2,1) \circ [(1,2) \circ (3,4)] = (2,4)$,
$[(1,2) \circ (3,4)] \circ [(1,3) \circ (2,4)] = (1,4) \circ (2,3)$,
$[(1,3) \circ (2,4)] \circ [(1,2) \circ (3,4)] = (1,4) \circ (2,3)$,
$[(1,2) \circ (3,4)] \circ (4,2) = (1,2,3,4)$,
$(1,3) \circ (4,2) = (1,3) \circ (2,4)$.
Now, we can note that the generating permutations $(1,2,3,4)$ and $(1,2) \circ (3,4)$ are symmetries of a square on vertices 1, 2, 3, 4. Therefore, they cannot generate more than the 8-element group $D_4$, which has already happened. This means that no more permutations can be obtained by further compositions. Thus, the subgroup $H \subseteq S_4$ has 8 elements (which is possible by Lagrange's theorem, since 8 divides 24):
$H = \{\mathrm{id}, (1,2,3,4), (1,3) \circ (2,4), (4,3,2,1), (1,2) \circ (3,4), (1,3), (2,4), (1,4) \circ (2,3)\}$.
Altogether, the examined subgroup in $S_6$ has 16 elements: for each $h \in H$, it contains $(h, \mathrm{id})$ and $(h, (5,6))$. □
12.F.35. Find the subgroup in $S_4$ that is generated by the permutations $(1,2) \circ (3,4)$, $(1,2,3)$.
Solution. Since both the generating permutations are even, they can generate only even permutations. Thus, the examined group is a subgroup of the alternating group $A_4$ of all even permutations. We have
$[(1,2) \circ (3,4)]^2 = \mathrm{id}$,
$(1,2,3)^2 = (3,2,1)$,
$[(1,2) \circ (3,4)] \circ (1,2,3) = (2,4,3)$,
$(1,2,3) \circ [(1,2) \circ (3,4)] = (1,3,4)$,
$[(1,2) \circ (3,4)] \circ (3,2,1) = (3,1,4)$,
$(3,2,1) \circ [(1,2) \circ (3,4)] = (2,3,4)$,
and now, we already have seven elements of the examined subgroup of A4, and since A4 has 12 elements and the order
Note that every discrete group of symmetries of a bounded figure is necessarily finite.
Theorem. Let $M$ be a bounded set in the plane $\mathbb{R}^2$ with discrete group of symmetries $G$. Then, $G$ is either trivial or one of the groups $C_k$, $D_k$ for $k > 1$.
Proof. If there were a set $M$ with translation as one of its symmetries, then it could not be bounded. If $M$ had, as one of its symmetries, a rotation by an angle which is an irrational multiple of $2\pi$, then iterating this rotation would lead to a dense subset of images on the corresponding circle. It follows that the group is not discrete.
If $M$ had non-trivial rotations with different centers as symmetries, then again it could not be bounded. To see this, write the corresponding rotations in the complex plane as
$R : z \mapsto (z - a)\zeta + a$,
$Q : z \mapsto z\eta$
for complex units $\zeta = e^{2\pi i/k}$, $\eta = e^{2\pi i/\ell}$, and arbitrary $a \neq 0$ in $\mathbb{C}$. Then, it is immediate (a straightforward computation with complex numbers) that
$Q \circ R \circ Q^{-1} \circ R^{-1} : z \mapsto z + a(-1 + \zeta + \eta - \zeta\eta)$,
which is a translation by a non-trivial vector unless the angle of one of the rotations is zero. It follows that M is not bounded.
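The "straightforward computation" with the commutator of two rotations can be checked numerically. The following is only a sketch: the function names `rotation` and `commutator_shift` and the sample points are illustrative assumptions, not part of the text.

```python
import cmath

def rotation(center, unit):
    """Rotation of the complex plane about `center`; `unit` is a complex unit."""
    return lambda z: (z - center) * unit + center

def commutator_shift(a, k, l):
    """Compare Q o R o Q^-1 o R^-1 with the predicted translation vector.

    R rotates about a by 2*pi/k, Q rotates about the origin by 2*pi/l.
    Returns the observed shifts at a few sample points and the predicted shift.
    """
    zeta = cmath.exp(2j * cmath.pi / k)
    eta = cmath.exp(2j * cmath.pi / l)
    R, R_inv = rotation(a, zeta), rotation(a, 1 / zeta)
    Q, Q_inv = rotation(0, eta), rotation(0, 1 / eta)
    commutator = lambda z: Q(R(Q_inv(R_inv(z))))
    samples = (0, 1, 2 - 3j, -0.5 + 4j)  # arbitrary test points
    shifts = [commutator(z) - z for z in samples]
    predicted = a * (-1 + zeta + eta - zeta * eta)
    return shifts, predicted

shifts, predicted = commutator_shift(1 + 2j, 3, 4)
print(predicted, shifts)
```

Every sample point is moved by the same vector, and since $-1 + \zeta + \eta - \zeta\eta = -(1 - \zeta)(1 - \eta)$, that vector vanishes only if one of the rotations is trivial.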
The same holds for the case of a rotation and a reflection with respect to a line which does not go through the center of the rotation. Check this case yourself!
The only symmetries available are rotations with a common center and reflections with respect to lines which pass through this center. It remains to prove that the entire group is composed either only from rotations, or from the same number of rotations and reflections.
Recall that the composition of two different reflections yields a rotation whose angle equals twice the angle enclosed by the corresponding axes (see 1.5.7). Therefore, composing a reflection with respect to a line $p$ with a rotation by angle $\varphi$ is again a reflection, with respect to the line which is at angle $\varphi/2$ from $p$ (draw a diagram!).
The proof is almost complete. Observe that the subgroup of all rotations in the group of symmetries contains a rotation by the smallest non-trivial angle $\varphi_0$ (there are only finitely many of them there). But then it is impossible to allow a rotation $R_\varphi$ where $\varphi$ is not a multiple of $\varphi_0$ (for then $\varphi \in (k\varphi_0, (k+1)\varphi_0)$ for some $k$ and the composition $R_{-k\varphi_0} \circ R_\varphi$ would have a smaller angle than $\varphi_0$). This subgroup coincides with one of the groups $C_\ell$. Next, adding one reflection produces exactly one different reflection for each nontrivial element in $C_\ell$, as seen above. □
12.3.4. Symmetries of plane tilings. There is more complicated behaviour in the case of plane figures in bands or in the entire plane (for example, symmetries of various tilings).
of a subgroup must divide that of the group, it is clear that the subgroup is the whole A4. □
12.F.36. Find all subgroups of the group of invertible 2-by-2 matrices over Z2 (with matrix multiplication). Is any of them normal?
Solution. In exercise 12.E.1, we built the table of the operation in this group. By Lagrange's theorem (12.3.10), the order of any subgroup must divide the order of the group, which is six. Thus, besides the trivial subgroup $\{A\}$ and the entire group, each subgroup must have two or three elements. In a 2-element subgroup, the non-trivial element must be self-inverse, which is also sufficient for the subset to be a subgroup. We thus get the subgroups $\{A, B\}$, $\{A, C\}$, $\{A, F\}$, which are not normal, as can be easily verified. The identity is $A$. Since $B$, $C$, $F$ have order 2, they cannot be in a 3-element subgroup. Thus, the only remaining possibility is $\{A, D, E\}$, which is indeed a subgroup. Moreover, checking the conjugations $BDB = E$, $CDC = E$, $FDF = E$ (whence it follows that $BEB = D$, $CEC = D$, $FEF = D$), we find out that this subgroup is normal. □
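The case analysis above can be confirmed by brute force. The sketch below stores a matrix as a tuple `(a, b, c, d)` of its entries row by row; this encoding and all function names are assumptions made for illustration, and the labels $A, \dots, F$ of exercise 12.E.1 are not used.

```python
from itertools import combinations, product

def mul(x, y):
    """Product of two 2-by-2 matrices over Z_2, stored as (a, b, c, d)."""
    a, b, c, d = x
    e, f, g, h = y
    return ((a*e + b*g) % 2, (a*f + b*h) % 2, (c*e + d*g) % 2, (c*f + d*h) % 2)

IDENTITY = (1, 0, 0, 1)
# Invertible matrices are exactly those with determinant 1 modulo 2.
GROUP = [m for m in product((0, 1), repeat=4) if (m[0]*m[3] - m[1]*m[2]) % 2 == 1]

def inverse(x):
    return next(y for y in GROUP if mul(x, y) == IDENTITY)

def subgroups():
    """All subsets containing the identity that are closed under multiplication.

    In a finite group, such a subset is automatically a subgroup.
    """
    others = [m for m in GROUP if m != IDENTITY]
    found = []
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            s = {IDENTITY, *extra}
            if all(mul(x, y) in s for x in s for y in s):
                found.append(frozenset(s))
    return found

def is_normal(h):
    return all(mul(mul(g, x), inverse(g)) in h for g in GROUP for x in h)

for h in subgroups():
    print(len(h), "normal" if is_normal(h) else "not normal")
```

The enumeration finds six subgroups (orders 1, 2, 2, 2, 3, 6), and only those of orders 1, 3 and 6 are normal, in agreement with the solution.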
12.F.37. Find all subgroups of the group $(\mathbb{Z}_{10}, +)$.
Solution. The subgroups are isomorphic to $(\mathbb{Z}_d, +)$, where $d \mid 10$, i.e., $\{0\} \cong \mathbb{Z}_1$, $\{0, 5\} \cong \mathbb{Z}_2$, $\{0, 2, 4, 6, 8\} \cong \mathbb{Z}_5$, and $\mathbb{Z}_{10}$. □
12.F.38. Find the orders of the elements 2, 4, 5 in $(\mathbb{Z}_{35}^\times, \cdot)$ and in $(\mathbb{Z}_{35}, +)$.
Solution. By definition, the order of $x$ in the group $(\mathbb{Z}_{35}^\times, \cdot)$ is the least positive integer $k$ such that $x^k \equiv 1 \pmod{35}$. By Euler's theorem, the order of $x = 2$ and $x = 4$ is $k \le \varphi(35) = 24$. Computing the corresponding modular powers, we find out that the order of $x = 2$ is 12. Hence it immediately follows that the order of $x = 4$ is 6. The number $x = 5$ does not lie in the group $(\mathbb{Z}_{35}^\times, \cdot)$. Specifically, we have (modulo 35):
$k$      1   2   3   4   5   6   7   8   9   10  11  12
$2^k$    2   4   8   16  32  29  23  11  22  9   18  1
$4^k$    4   16  29  11  9   1
In the group $(\mathbb{Z}_{35}, +)$, the order of $x$ is the least positive integer $k$ such that $k \cdot x \equiv 0 \pmod{35}$. This can be calculated simply as $k = \frac{35}{\gcd(35, x)}$. Therefore, the order of 2 and 4 is 35, while the order of 5 is 7. □
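Both notions of order used in this solution are easy to compute directly. The helper names `multiplicative_order` and `additive_order` below are illustrative assumptions; the first returns `None` when the element is not invertible modulo $n$.

```python
from math import gcd

def multiplicative_order(x, n):
    """Least k >= 1 with x**k = 1 (mod n), or None if x is not a unit mod n."""
    if gcd(x, n) != 1:
        return None
    k, power = 1, x % n
    while power != 1:
        power = (power * x) % n
        k += 1
    return k

def additive_order(x, n):
    """Least k >= 1 with k*x = 0 (mod n)."""
    return n // gcd(x, n)

for x in (2, 4, 5):
    print(x, multiplicative_order(x, 35), additive_order(x, 35))
```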
Consider first the set containing the points that lie between two fixed parallel lines. Suppose that this band is covered with disjoint images of a bounded subset M by some translation. Of course, this translation is a symmetry of the chosen tiling of the band. So the group of symmetries is necessarily infinite.
Such a set allows for no other rotation symmetries than $R_\pi$, and the only possible reflections are either horizontal with respect to the axis of the band, or vertical with respect to any line which is perpendicular to the boundary lines. In addition, there are translations given by a vector which is parallel to the axis of the band.
A not-too-complicated discussion leads to a description of all discrete groups of symmetries for these bands. Such a group is generated by some of the following symmetries: translation $T$, shifted reflection $G$ (i.e. composition of horizontal reflection and translation), vertical reflection $V$, horizontal reflection $H$ and rotation $R$ by $\pi$.
Theorem. Every discrete group of symmetries of a band in the plane is isomorphic to one of the groups generated by the following symmetries:
(1) a single translation T,
(2) a single shifted reflection G,
(3) a single translation T and a vertical reflection V,
(4) a single translation T and the rotation R,
(5) a single shifted reflection G and the rotation R,
(6) a single translation T and the horizontal reflection H,
(7) a single translation T, the horizontal reflection H and a vertical reflection V.
The proof is not presented here. The following diagram shows examples of schematic patterns with corresponding symmetries:
[Diagram: schematic band patterns illustrating the symmetry groups listed in the theorem.]
It is even more complicated with symmetries of tilings which cover the entire plane. There is insufficient space here to consider further details. It can be shown that there are 17 such groups of symmetries, known as the two-dimensional crystallographic groups.
A similar complete discussion is known even for three-dimensional discrete groups of symmetries. The rich theory was created mainly in the 19th century in connection with studying symmetries of crystals and molecules of chemical elements.
12.F.39. Find all finite subgroups of the group $(\mathbb{R}^*, \cdot)$.¹
Solution. If a given subgroup of the group $(\mathbb{R}^*, \cdot)$ contains an element $a$, $|a| \neq 1$, then the elements $a, a^2, a^3, \ldots$ form an infinite geometric progression of pairwise distinct elements all of which must lie in the considered subgroup, so it is infinite. Thus, a finite subgroup may contain only the numbers 1 and $-1$, which means that there are two finite subgroups: $\{1\}$, $\{-1, 1\}$. □
12.F.40. For each of the following formulas, decide whether it correctly defines a mapping $\varphi$. If so, decide whether it is a homomorphism, and if so, find its kernel. Moreover, decide whether it is surjective and injective:
i) $\varphi : \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [a - b]_{12}$,
ii) $\varphi : \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [6a + 4b]_{12}$,
iii) $\varphi : \mathbb{Z}_4 \times \mathbb{Z}_3 \to \mathbb{Z}_{12}$, $\varphi(([a]_4, [b]_3)) = [0]_{12}$.
Solution.
i) Not a mapping. For instance, if we take two representatives $([6]_4, [1]_3) = ([2]_4, [1]_3)$ of the same element in $\mathbb{Z}_4 \times \mathbb{Z}_3$, then we get $\varphi(([6]_4, [1]_3)) = [5]_{12}$ and $\varphi(([2]_4, [1]_3)) = [1]_{12}$, so this is not a correct definition of a mapping.
ii) A homomorphism, neither injective, nor surjective. Its kernel $\mathrm{Ker}(\varphi)$ is the set $\{([2]_4, [0]_3), ([0]_4, [0]_3)\}$.
iii) A homomorphism, neither injective, nor surjective. Its kernel is the entire group $\mathbb{Z}_4 \times \mathbb{Z}_3$. □
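The three cases can also be examined mechanically. The sketch below is an illustration only: the function `analyse` and its result keys are assumptions, and the well-definedness test is a finite spot check over a few shifted representatives rather than a proof.

```python
from itertools import product

def analyse(formula, m=4, n=3, target=12):
    """Examine a candidate map Z_m x Z_n -> Z_target given on representatives."""
    pairs = list(product(range(m), range(n)))
    # Spot check: changing representatives must not change the value.
    well_defined = all(
        formula(a, b) % target == formula(a + m * s, b + n * t) % target
        for a, b in pairs
        for s, t in product(range(-2, 3), repeat=2)
    )
    if not well_defined:
        return {"well_defined": False}
    phi = {p: formula(*p) % target for p in pairs}
    homomorphism = all(
        phi[((a + c) % m, (b + d) % n)] == (phi[(a, b)] + phi[(c, d)]) % target
        for (a, b) in pairs for (c, d) in pairs
    )
    image = set(phi.values())
    return {
        "well_defined": True,
        "homomorphism": homomorphism,
        "kernel": sorted(p for p, v in phi.items() if v == 0),
        "injective": len(image) == len(pairs),
        "surjective": len(image) == target,
    }

print(analyse(lambda a, b: a - b))
print(analyse(lambda a, b: 6 * a + 4 * b))
print(analyse(lambda a, b: 0))
```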
12.F.41. For each of the following formulas, decide whether it correctly defines a mapping $\varphi$. If so, decide whether it is a homomorphism, and if so, find its kernel. Moreover, decide whether it is surjective and injective:
i) $\varphi : \mathbb{Z}_4 \to \mathbb{C}^*$, $\varphi([a]_4) = i^a$,
ii) $\varphi : \mathbb{Z}_5 \to \mathbb{C}^*$, $\varphi([a]_5) = i^a$,
iii) $\varphi : \mathbb{Z}_4 \to \mathbb{C}^*$, $\varphi([a]_4) = (-1)^a$,
iv) $\varphi : \mathbb{Z} \to \mathbb{C}^*$, $\varphi(a) = i^a$.
Solution.
i) We have $\varphi([a]_4 + [b]_4) = i^{a+b} = i^a \cdot i^b = \varphi([a]_4) \cdot \varphi([b]_4)$, and $\varphi([4]_4) = i^4 = 1$, which means that if $[c]_4 = [d]_4$, i.e., $c = d + 4k$, $k \in \mathbb{Z}$, then $\varphi([c]_4) = i^c = i^{d+4k} = i^d = \varphi([d]_4)$, so the mapping is a well-defined homomorphism. It is injective (which is equivalent to saying that $\mathrm{Ker}(\varphi) = \{[0]_4\}$), but it is clearly not surjective.
¹The group of all invertible elements of $\mathbb{R}$ and $\mathbb{C}$ is denoted by $\mathbb{R}^*$ and $\mathbb{C}^*$, respectively, and by $\mathbb{Z}_n^\times$ for $\mathbb{Z}_n$.
12.3.5. Group homomorphisms. Recall that a mapping $f : G \to H$ from a group $G$ to a group $H$ is called a group homomorphism if and only if it respects the operation, i.e. for all $a, b \in G$,
$f(a \cdot b) = f(a) \cdot f(b)$.
Note that the operation on the left-hand side is the operation in $G$, before $f$ is applied, while the operation on the right-hand side is the operation in $H$, after $f$ is applied.
The following properties of homomorphisms follow easily from the definition:
Proposition. Every group homomorphism $f : G \to H$ satisfies:
(1) the identity of $G$ is mapped to the identity of $H$,
(2) the inverse of an element of $G$ is mapped to the inverse of its image, i.e.
$f(a^{-1}) = f(a)^{-1}$,
(3) the image of a subgroup $K \subseteq G$ is a subgroup $f(K) \subseteq H$,
(4) the preimage $f^{-1}(K) \subseteq G$ of a subgroup $K \subseteq H$ is again a subgroup,
(5) if $f$ is also a bijection, then the inverse mapping $f^{-1}$ is also a homomorphism,
(6) $f$ is injective if and only if $f^{-1}(e_H) = \{e_G\}$.
Proof. (3), (2), and (1). If $K \subseteq G$ is a subgroup, then for each $y = f(a)$, $z = f(b)$ in $f(K)$, the product $y \cdot z = f(a \cdot b)$ also lies in the image, so the image of a subgroup is a subsemigroup. In particular, the image of the trivial subgroup $\{e_G\}$ is a subsemigroup $\{z\}$. Since $z \cdot z = z$ in $H$, it follows that $z = e_H$ after multiplying by $z^{-1}$.
So the only singleton subsemigroup in a group is the trivial subgroup $\{e_H\}$. Hence $f(e_G) = e_H$.
It follows directly from the definition of a homomorphism that
$f(a^{-1}) \cdot f(a) = f(e_G) = e_H$,
i.e. $f(a)^{-1} = f(a^{-1})$. Thus the image $f(K)$ also contains the identity and the inverse of each of its elements, which proves the first three propositions.
Proceed similarly in the case of preimages: if $a, b \in G$ satisfy $f(a), f(b) \in K \subseteq H$, then also $f(a \cdot b) \in K$.
Suppose there exists an inverse mapping $g = f^{-1}$. Fix arbitrary $y = f(a)$, $z = f(b) \in H$. Then, $f(a \cdot b) = y \cdot z = f(a) \cdot f(b)$, which is equivalent to the expression $g(y) \cdot g(z) = a \cdot b = g(y \cdot z)$. Thus the inverse mapping is also a homomorphism.
If $f(a) = f(b)$, then $f(a \cdot b^{-1}) = e_H$. Therefore, if the only element that is mapped to $e_H$ is $e_G$, then $a \cdot b^{-1} = e_G$, i.e. $a = b$. The other implication is trivial. □
The subgroup $f^{-1}(e_H)$ (the preimage of the identity in $H$) is called the kernel of the homomorphism $f$ and is denoted $\ker f$. A bijective group homomorphism is called a group isomorphism.
ii) Not a mapping, since we have $[0]_5 = [5]_5$ and $\varphi([0]_5) = i^0 = 1$ but $\varphi([5]_5) = i^5 = i$.
iii) Is a homomorphism, neither injective (we have $\varphi([1]_4) = -1$ and $\varphi([3]_4) = (-1)^3 = -1$), nor surjective. The kernel is $\mathrm{Ker}(\varphi) = \{[0]_4, [2]_4\}$.
iv) Is a homomorphism, neither injective nor surjective. The kernel is $\mathrm{Ker}(\varphi) = 4\mathbb{Z} = \{4k \mid k \in \mathbb{Z}\}$. □
12.F.42. For each of the following formulas, decide whether it correctly defines a mapping $\varphi$. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) P = Q* -5
ii) V : Q* h
iii) p : Q* -s
o
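12.F.43. For each of the following formulas, decide whether it correctly defines a mapping $\varphi$. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi : \mathbb{C} \to \mathbb{R}$, $\varphi(a + bi) = a + b$,
ii) $\varphi : \mathbb{C} \to \mathbb{R}$, $\varphi(a + bi) = a$,
iii) $\varphi : \mathbb{C}^* \to \mathbb{R}^*$, $\varphi(a + bi) = a^2 + b^2$,
iv) $\varphi : \mathbb{C}^* \to \mathbb{R}^*$, $\varphi(c) = 2|c|$,
v) $\varphi : \mathbb{C}^* \to \mathbb{R}^*$, $\varphi(c) = |c|^3$,
vi) $\varphi : \mathbb{C}^* \to \mathbb{R}^*$, $\varphi(c) = 1/|c|$.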
O
12.F.44. For each of the following formulas, decide whether it correctly defines a mapping p. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) p: OC2(R) -
ii) p: OC2(R) -
iii) p : OC2(R)
p(A) = \A
At:
' a b
d
= a2 + b2. = ac + bd.
o
12.F.45. For each of the following formulas, decide whether it correctly defines a mapping p. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
It follows directly from the above ideas that a homomorphism $f : G \to H$ with a trivial kernel is an isomorphism onto the image $f(G)$.
12.3.6. Examples. The additive group $\mathbb{Z}_k$ of residue classes modulo $k$ is isomorphic to the group of $k$-th roots of unity, and also to the group of rotations by integer multiples of $2\pi/k$. Draw a diagram; calculation with the complex units $e^{2\pi i/k}$ is very efficient.
The mapping $\exp : \mathbb{R} \to \mathbb{R}^+$ is an isomorphism of the additive group of the real numbers onto the multiplicative group of the positive real numbers.
This isomorphism extends naturally to a homomorphism $\exp : \mathbb{C} \to \mathbb{C} \setminus \{0\}$ of the additive group of the complex numbers onto the multiplicative group of the non-zero complex numbers. However, this homomorphism has a non-trivial kernel. The restriction of $\exp$ to purely imaginary numbers (which is a subgroup isomorphic to $\mathbb{R}$) is a homomorphism
$it \mapsto e^{it} = \cos t + i \sin t$.
This means that the numbers $2k\pi i$, $k \in \mathbb{Z}$, lie in the kernel. It can be shown that nothing else is in the kernel. If $e^{s+it} = 1$ for real numbers $s$ and $t$, then $e^s = 1$, i.e. $s = 0$, and then $t = 2k\pi$ for an integer $k$.
The determinant of a matrix is a mapping which assigns, to each square matrix of scalars in $\mathbb{K}$, a scalar in $\mathbb{K}$ (the cases $\mathbb{K} = \mathbb{Z}, \mathbb{Q}, \mathbb{R}, \mathbb{C}$ have already been worked with). The Cauchy theorem about the determinant of the product of square matrices
$\det(A \cdot B) = (\det A) \cdot (\det B)$
can be also seen as the fact that for the group $G = GL(n, \mathbb{K})$ of invertible matrices, the mapping $\det : G \to \mathbb{K} \setminus \{0\}$ is a group homomorphism.
12.3.7. Group product. Given any two groups, a more complicated group can be constructed using the following construction:
Group product
For any groups $G$, $H$ the group product $G \times H$ is defined as follows: The underlying set is the Cartesian product $G \times H$ and the operation is defined componentwise. That is,
$(a, x) \cdot (b, y) = (a \cdot b, x \cdot y)$,
where the left-hand operation is the one to be defined, while the right-hand operations are respectively those in $G$ and $H$.
The projections onto the components $G$ and $H$ in the product,
$p_G : G \times H \ni (a, b) \mapsto a \in G$, $p_H : G \times H \ni (a, b) \mapsto b \in H$,
are surjective homomorphisms, whose kernels are
$\ker p_G = \{(e_G, b);\ b \in H\} \cong H$,
$\ker p_H = \{(a, e_H);\ a \in G\} \cong G$.
The group $\mathbb{Z}_6$ is isomorphic to the product $\mathbb{Z}_2 \times \mathbb{Z}_3$. This can be seen easily in the multiplicative realization of
i) $\varphi : \mathbb{Z}_3 \to A_4$, $\varphi([a]_3) = (1,2,4) \circ (1,3,2)^a \circ (1,4,2)$,
ii) $\varphi : \mathbb{Z}_3 \to A_4$, $\varphi([a]_3) = (1,2) \circ (1,3,2)^a$.
O
12.F.46. For each of the following formulas, decide whether it correctly defines a mapping p. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
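i) $\varphi : \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 2a$,
ii) $\varphi : \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 3|a|$,
iii) $\varphi : \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = a + 1$,
iv) $\varphi : \mathbb{Z} \to \mathbb{Z}$, $\varphi(a) = 1$.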
o
12.F.47. For each of the following formulas, decide whether it correctly defines a mapping p. If so, decide whether it is a homomorphism. Moreover, decide whether it is surjective and injective:
i) $\varphi : \mathbb{Z} \times \mathbb{Z} \times \mathbb{Z} \to \mathbb{Q}^*$, $\varphi((a, b, c)) = 2^a 3^b 12^c$,
ii) p : Z% x Z5 -> Z5, y((a, 6)) = &a
iii) $\varphi : \mathbb{Z}_2 \times \mathbb{Z} \to \mathbb{Z}$, $\varphi(([a]_2, b)) = b$.
O
12.F.48. Prove that there exists no isomorphism of the multiplicative group of non-zero complex numbers onto the multiplicative group of non-zero real numbers.
Solution. Every homomorphism must map the identity of the domain to the identity of the codomain (see 12.3.5). Thus, 1 must be mapped to itself. And what about $-1$? We know that $f(-1)^2 = f((-1)^2) = f(1) = 1$. Therefore, the image of $-1$ is a square root of 1. Since we are interested in bijective homomorphisms only, we must have $f(-1) = -1$. However, then $f(i)^2 = f(i^2) = f(-1) = -1$, so that $f(i)$ is a square root of $-1$ in $\mathbb{R}$; however, no such real number exists. Therefore, no bijective homomorphism may exist. □
Remark. The mapping which assigns the absolute value to each non-zero complex number is a surjective homomorphism of $\mathbb{C}^*$ onto $\mathbb{R}^+$.
G. Burnside's lemma
12.G.1. How many necklaces can be created from 3 black and 6 white beads? Beads of one color are indistinguishable,
the groups $\mathbb{Z}_k$ as the complex $k$-th roots of unity. $\mathbb{Z}_6$ consists of the points of the unit circle that form the vertices of a regular hexagon. Then, $\mathbb{Z}_2$ corresponds to $\pm 1$, while $\mathbb{Z}_3$ corresponds to the equilateral triangle, one of whose vertices is the number 1. If each point is identified with the rotation that maps 1 to that point, then the composition of such rotations is always commutative. Composing a rotation from $\mathbb{Z}_2$ with a rotation from $\mathbb{Z}_3$ yields exactly all rotations from $\mathbb{Z}_6$. Draw a diagram! This leads to the following isomorphism (using additive notation, as is common for the residue classes):
$[0]_6 \mapsto ([0]_2, [0]_3)$,
$[1]_6 \mapsto ([1]_2, [2]_3)$,
$[2]_6 \mapsto ([0]_2, [1]_3)$,
$[3]_6 \mapsto ([1]_2, [0]_3)$,
$[4]_6 \mapsto ([0]_2, [2]_3)$,
$[5]_6 \mapsto ([1]_2, [1]_3)$.
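That the table above really is an isomorphism can be checked exhaustively. In this small sketch the dictionary `TABLE` simply transcribes the six assignments; the names `TABLE` and `add_pair` are illustrative assumptions.

```python
# The six assignments [k]_6 -> ([x]_2, [y]_3), transcribed from the table.
TABLE = {0: (0, 0), 1: (1, 2), 2: (0, 1), 3: (1, 0), 4: (0, 2), 5: (1, 1)}

def add_pair(u, v):
    """Componentwise addition in Z_2 x Z_3."""
    return ((u[0] + v[0]) % 2, (u[1] + v[1]) % 3)

is_bijection = len(set(TABLE.values())) == 6
is_homomorphism = all(
    TABLE[(j + k) % 6] == add_pair(TABLE[j], TABLE[k])
    for j in range(6) for k in range(6)
)
print(is_bijection, is_homomorphism)
```

The assignment agrees with $[k]_6 \mapsto ([k]_2, [2k]_3)$, which is one of several possible isomorphisms.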
Similar constructions are available for finite commutative groups in complete generality.
12.3.8. Commutative groups. Any element $a$ of a group $G$ is contained in the minimal subgroup $\{\ldots, a^{-2}, a^{-1}, e, a, a^2, a^3, \ldots\}$ which contains it. It is clear that this subgroup is commutative. If $G$ is finite, then $a^k = e$ must hold for some positive integer $k$. The least positive integer $k$ with this property is called the order of the element $a$ in $G$. A cyclic group $G$ is one which is generated by one of its elements $a$ in the above manner. If the order $k$ of the generator is finite, then it results in one of the groups $C_k$, known from the discussion of symmetries of plane figures.
It follows directly from the definition that every cyclic group is isomorphic either to the group of integers $\mathbb{Z}$ (if it is infinite) or to one of the groups of residue classes $\mathbb{Z}_k$ (if it is finite). These simple building blocks are sufficient to create all finite commutative groups.
Theorem. Every finite commutative group $G$ is isomorphic to a product of some cyclic groups $C_k$. The orders of the components $C_k$ are always powers of the prime divisors of the number of elements $n = |G|$. This product decomposition is unique, up to order.
If $n = p_1^{k_1} \cdots p_r^{k_r}$ is the prime factorization of $n$, then the group $C_n$ is isomorphic to the product
$C_n \cong C_{p_1^{k_1}} \times \cdots \times C_{p_r^{k_r}}$.
Proof. For a simpler case, suppose $n = pq$ with $p$ coprime to $q$. Fix a generator $a$ of the group $C_n$, a generator $b$ of $C_p$, and a generator $c$ of $C_q$. Define the mapping
$f : C_n \to C_p \times C_q$ by
$f(a^k) = (b^k, c^k)$. Since $a^k \cdot a^\ell = a^{k+\ell}$ and similarly for $b$ and $c$, it follows that
$f(a^k \cdot a^\ell) = (b^{k+\ell}, c^{k+\ell}) = (b^k, c^k) \cdot (b^\ell, c^\ell)$,
and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection.
Solution. Let us view the necklace as a coloring of the vertices of a regular 9-gon. Let $S$ denote the set of all such colorings. Since each coloring is determined by the positions of the 3 black beads, we get that $S$ has $\binom{9}{3} = 84$ elements.
We know that the group of symmetries is $D_9$, which contains 9 rotations (including the identity) and 9 reflections. Two colorings are the same if they lie in the same orbit of the action of $D_9$ on the set $S$. Thus, we are interested in the number of orbits (let us denote it $N$). In order to find $N$, it suffices to compute the sizes of $S_g$ for all elements $g$ of $D_9$:
The identity is the only element of order 1; we have $|S_{\mathrm{id}}| = 84$, so the contribution to the sum is 84.
There are 9 reflections $g$, each of order 2. Clearly, we have $|S_g| = 4$, so the total contribution is $4 \cdot 9 = 36$.
There are 2 rotations $g$ by $2\pi/3$ or $4\pi/3$, both of order 3, and $|S_g| = 3$. Their contribution is 6.
Finally, there are 6 rotations of order 9, and no coloring is kept unchanged by them, so they do not contribute to the sum.
Altogether, we get by the formula of Burnside's lemma that
$N = \frac{1}{|D_9|} \sum_{g \in D_9} |S_g| = \frac{126}{18} = 7$.
Draw the seven necklaces! □
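The count can be double-checked by brute force. In the sketch below, the beads sit at positions $0, \ldots, 8$, a coloring is the set of black positions, and the symmetries are the index maps $i \mapsto i + r$ and $i \mapsto r - i$ modulo 9; these encodings and the function names are assumptions made for illustration.

```python
from itertools import combinations

def dihedral_symmetries(n):
    """The 2n symmetries of a regular n-gon as permutations of vertex indices."""
    syms = []
    for r in range(n):
        syms.append(tuple((i + r) % n for i in range(n)))   # rotation
        syms.append(tuple((r - i) % n for i in range(n)))   # reflection
    return syms

def burnside_count(n, black):
    """Number of orbits and the total number of fixed colorings over all symmetries."""
    syms = dihedral_symmetries(n)
    colorings = [frozenset(c) for c in combinations(range(n), black)]
    fixed_total = sum(
        1 for g in syms for c in colorings if frozenset(g[i] for i in c) == c
    )
    return fixed_total // len(syms), fixed_total

def direct_orbit_count(n, black):
    """Count orbits directly by choosing a canonical representative of each."""
    syms = dihedral_symmetries(n)
    canonical = {
        min(tuple(sorted(g[i] for i in c)) for g in syms)
        for c in combinations(range(n), black)
    }
    return len(canonical)

print(burnside_count(9, 3), direct_orbit_count(9, 3))
```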
12.G.2. Find the number of colorings of a 3-by-3 grid with three colors. Two colorings are considered the same if they can be transformed to each other by rotation and/or reflection.
Solution. The group of symmetries of the table is the same as for a square, i.e., it is the dihedral group $D_4$. Without any identification, there are clearly $3^9$ colorings of the table. Now, the group $G = D_4$ acts on these colorings. For each symmetry $g \in G$, we find the number of colorings that $g$ keeps unchanged:
• $g = \mathrm{id}$: $|S_g| = 3^9$.
• $g$ is a rotation by 90° or 270° (= −90°): In this rotation, every corner tile is sent to an adjacent corner tile. This means that if the coloring is to be unchanged, all the corner tiles must be of the same color. Similarly, all the edge tiles must be of the same color. Then, the center tile may be of any color. Altogether, we get that there are $3^3$ colorings which are not changed by the considered rotations.
so the mapping $f$ is a homomorphism. If the image is the identity, then $k$ must be a multiple of $p$ as well as $q$. Since $p$ and $q$ are coprime, $k$ is a multiple of $n$, so $f$ is injective. Moreover, the group $C_n$ has the same number of elements as $C_p \times C_q$, so $f$ is an isomorphism.
Finally, the proposition about the decomposition of cyclic groups of order $n$ into smaller cyclic groups follows by induction on the number of different primes $p_i$ in the factorization of $n$. □
Note that, on the other hand, $C_{p^2}$ is never isomorphic to the product $C_p \times C_p$. While $C_{p^2}$ is generated by an element of order $p^2$, the highest order of an element in $C_p \times C_p$ is only $p$.
Since every finite commutative group is isomorphic to a product of cyclic groups, it is possible, for a given number of elements, to enumerate all commutative groups of that order up to isomorphism. For instance, there are only two commutative groups of order 12:
$C_{12} \cong C_4 \times C_3$,   $C_2 \times C_2 \times C_3 \cong C_2 \times C_6$.
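The enumeration can be sketched in code. By the theorem, a commutative group of order $n = p_1^{k_1} \cdots p_r^{k_r}$ is determined by splitting each exponent $k_i$ into a sum of positive integers (each summand $e$ gives a factor $C_{p_i^e}$), so it suffices to list the partitions of each exponent. The function names below are illustrative assumptions, and a group is represented only by the tuple of orders of its cyclic factors.

```python
from itertools import product

def prime_factorization(n):
    """Return {prime: exponent} for n >= 1 by trial division."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def partitions(k, largest=None):
    """Yield the partitions of k as non-increasing tuples of positive integers."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, largest), 0, -1):
        for rest in partitions(k - first, first):
            yield (first,) + rest

def commutative_groups(n):
    """List the commutative groups of order n as tuples of cyclic factor orders."""
    per_prime = [
        [tuple(p ** e for e in part) for part in partitions(k)]
        for p, k in prime_factorization(n).items()
    ]
    return [sum(choice, ()) for choice in product(*per_prime)]

print(commutative_groups(12))
```

For $n = 12$ this lists $C_4 \times C_3$ and $C_2 \times C_2 \times C_3$, the two groups named above.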
Notice similarly that if all elements (except the identity) of a finite commutative group $G$ have order 2, then $G$ has the form $(C_2)^n$ for an integer $n$. In particular, such a group $G$ has $2^n$ elements. If the decomposition of $G$ into cyclic groups contains a group $C_p$, $p > 2$, then the group contains elements of higher order.
12.3.9. Subgroups and cosets. Selecting any subgroup $H$ of a group $G$ gives further information about the structure of the whole group. A binary relation $\sim_H$ on $G$ can be defined as follows: $a \sim_H b$ if and only if $b^{-1} \cdot a \in H$. This relation expresses when two elements of $G$ are "the same" up to multiplication by an element of $H$ from the right. It is easily verified that this relation is an equivalence:
Clearly, $a^{-1} \cdot a = e \in H$, so it is reflexive. If $b^{-1} \cdot a = h \in H$, then $a^{-1} \cdot b = (b^{-1} \cdot a)^{-1} = h^{-1} \in H$, so it is symmetric as well. Finally, if $c^{-1} \cdot b \in H$ and $b^{-1} \cdot a \in H$, then $c^{-1} \cdot a = c^{-1} \cdot b \cdot b^{-1} \cdot a \in H$, so it is transitive, too.
It follows that $G$ partitions into the left cosets of mutually equivalent elements, with respect to the subgroup $H$. The coset corresponding to an element $a$ is denoted $a \cdot H$, and
$a \cdot H = \{a \cdot h;\ h \in H\}$,
since an element $b$ is equivalent to $a$ if and only if it can be expressed this way.
The corresponding partition of $G$ (i.e. the set of all left cosets) with respect to $H$ is denoted $G/H$.
Similarly, right cosets $H \cdot a$ can be defined. The corresponding equivalence relation is given by $a \sim b$ if and only if $a \cdot b^{-1} \in H$. Hence,
$H \backslash G = \{H \cdot a;\ a \in G\}$.
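A small computation makes the two partitions concrete. The sketch below works in the group of permutations of $\{0, 1, 2\}$, with a permutation stored as the tuple of its values; the choice of group, the two sample subgroups and all names are assumptions made for illustration.

```python
from itertools import permutations

def compose(p, q):
    """The permutation p o q, i.e. first apply q, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

G = list(permutations(range(3)))
H_REFLECTION = [(0, 1, 2), (1, 0, 2)]            # generated by one transposition
H_ROTATIONS = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # the even permutations

def left_cosets(group, subgroup):
    return {frozenset(compose(a, h) for h in subgroup) for a in group}

def right_cosets(group, subgroup):
    return {frozenset(compose(h, a) for h in subgroup) for a in group}

for name, H in (("transposition", H_REFLECTION), ("rotations", H_ROTATIONS)):
    L, R = left_cosets(G, H), right_cosets(G, H)
    print(name, len(L), len(R), L == R)
```

In both cases the cosets are disjoint, cover the group, and have as many elements as the subgroup; the left and right partitions coincide for the second subgroup but not for the first.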
Proposition. Let G be a group and H a subgroup of G. Then:
• $g$ is a rotation by 180°: There are four pairs of tiles that are sent to each other by this symmetry, which means that the two tiles of each pair must be of the same color. Then, the center tile may be of any color. Altogether, we have $|S_g| = 3^5$.
• $g$ is one of the four reflections: There are three pairs of tiles that are sent to each other by the reflection, so again the tiles within each pair must be of one color. The three tiles that are fixed by the reflection may be each of an arbitrary color. Altogether, we get $|S_g| = 3^6$.
By Burnside's lemma, the wanted number of colorings is equal to
$\frac{1}{8}\left(3^9 + 2 \cdot 3^3 + 3^5 + 4 \cdot 3^6\right) = 2862$. □
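The exponents in this formula are the numbers of cycles of each symmetry acting on the nine tiles, which suggests a quick check. In the sketch below, a tile is a pair (row, column), the quarter turn is taken as $(r, c) \mapsto (c, 2 - r)$ and the reflection as $(r, c) \mapsto (r, 2 - c)$; these coordinates and the function names are assumptions made for illustration.

```python
CELLS = [(r, c) for r in range(3) for c in range(3)]

def rot(cell):
    """Quarter turn of the 3-by-3 grid."""
    r, c = cell
    return (c, 2 - r)

def flip(cell):
    """Reflection in the vertical axis of the grid."""
    r, c = cell
    return (r, 2 - c)

def symmetries():
    """The eight symmetries of the square as permutations of the cells."""
    syms = []
    for flips in (0, 1):
        for turns in range(4):
            perm = {}
            for cell in CELLS:
                image = flip(cell) if flips else cell
                for _ in range(turns):
                    image = rot(image)
                perm[cell] = image
            syms.append(perm)
    return syms

def cycle_count(perm):
    """Number of cycles of a permutation given as a dictionary."""
    seen, cycles = set(), 0
    for start in perm:
        if start not in seen:
            cycles += 1
            cell = start
            while cell not in seen:
                seen.add(cell)
                cell = perm[cell]
    return cycles

def count_colorings(colors=3):
    """Burnside's lemma: a symmetry with m cycles fixes colors**m colorings."""
    syms = symmetries()
    return sum(colors ** cycle_count(g) for g in syms) // len(syms)

print(sorted(cycle_count(g) for g in symmetries()), count_colorings(3))
```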
12.G.3.
a) Find all rotational symmetries of a regular octahedron.
b) Find the number of colorings of its sides. Two colorings are considered the same if they can be transformed to each other by rotation.
Solution.
a) Placing the octahedron into the Cartesian coordinate system so that pairs of opposite vertices are on the axes and the center of the octahedron lies in the origin, every rotational symmetry is given by which of the six vertices is on the positive $x$-semiaxis and which of the four adjacent vertices is on the positive $y$-semiaxis. Thus, the group has 24 elements. These are (besides the identity) rotations by ±90° and 180° around axes going through opposite vertices, rotations by 180° around axes going through the centers of opposite edges, and finally rotations by ±120° around axes going through the centers of opposite sides.
b) Without any identifications, there are $3^8$ colorings. For each rotational symmetry $g$, we compute the number of colorings that are kept unchanged by it:
- $g$ is a rotation by ±90° around an axis going through opposite vertices. Then, $g$ fixes $3^2$ colorings, and there are 6 such rotations.
- $g$ is a rotation by 180° around an axis going through opposite vertices or the centers of opposite edges. Then, $g$ fixes $3^4$ colorings. There are $3 + 6 = 9$ of these.
- $g$ is a rotation by ±120°. Then, $g$ also fixes $3^4$ colorings, and there are 8 such rotations.
(1) The left cosets with respect to H coincide with the right cosets with respect to H if and only if for each $a \in G$, $h \in H$,
$$a \cdot h \cdot a^{-1} \in H.$$
(2) Each coset (left or right) has the same cardinality as the subgroup H.
Proof. Both properties are direct consequences of the definition. In the first case, for any $a \in G$, $h \in H$, an element $h' \in H$ is required so that $h \cdot a = a \cdot h'$. This occurs if and only if $a^{-1} \cdot h \cdot a = h' \in H$.
In the second case, if $a \cdot h = a \cdot h'$, then multiplication by $a^{-1}$ from the left yields $h = h'$. □
As an immediate corollary of the above statement, there are the following extremely useful results:
12.3.10. Theorem. Let G be a finite group with n elements and H a subgroup of G. Then:
(1) the cardinality $n = |G|$ is the product of the cardinality of H and the cardinality of G/H, i.e.
$$|G| = |G/H| \cdot |H|,$$
(2) the integer $|H|$ divides n,
(3) if $a \in G$ is of order k, then k divides n,
(4) for each $a \in G$, $a^n = e$,
(5) if n is prime, then G is isomorphic to the cyclic group $\mathbb{Z}_n$.
The second proposition is called Lagrange's theorem. The fourth proposition is called Fermat's little theorem. Special cases are discussed in the previous chapter on number theory.
Proof. Each left coset has exactly $|H|$ elements, and different cosets are disjoint. Hence the first proposition follows.
The second proposition is a direct corollary of the first one.
Each element $a \in G$ generates the cyclic subgroup $\{a, a^2, \dots, a^k = e\}$, and the order of this subgroup is exactly the order of a. Therefore, the order of a must divide the number of elements of G.
Since the order k of any element a divides n, we may write $n = k \cdot s$ for an integer s, and $a^k = e$ implies $a^n = (a^k)^s = e$.
If n is prime, then n > 1 and there exists an element $a \in G$ that is different from e. Its order k is an integer greater than one and it divides n. Therefore, k must be equal to n. This means that all the elements of G are of the form $a^\ell$ for $\ell = 1, \dots, n$. □
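Propositions (3) and (4) are easy to test numerically. The sketch below uses, as our own choice of example, the multiplicative group of residues coprime to 15 (it has 8 elements); the helper names are hypothetical.

```python
from math import gcd

def units(m):
    """Elements of the multiplicative group of residue classes coprime to m."""
    return [a for a in range(1, m) if gcd(a, m) == 1]

def order(a, m):
    """Smallest k >= 1 with a^k = 1 modulo m (a must be coprime to m)."""
    k, x = 1, a % m
    while x != 1:
        x = x * a % m
        k += 1
    return k

G = units(15)
n = len(G)
for a in G:
    # the order of a divides n, hence a^n is the identity
    print(a, order(a, 15), n % order(a, 15) == 0, pow(a, n, 15) == 1)
```

Note that no element of this group has order 8, so the group is not cyclic; this does not contradict (5), since 8 is not prime.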
12.3.11. Normal subgroups and quotient groups. A subgroup H which satisfies $a \cdot h \cdot a^{-1} \in H$ for all $a \in G$, $h \in H$, is called a normal subgroup. For normal subgroups, the operation on G/H can be defined by
$$(a \cdot H) \cdot (b \cdot H) = (a \cdot b) \cdot H.$$
Together with $3^8$ for the identity, we get that the number of colorings is
$$\frac{1}{24}\left(3^8 + 6 \cdot 3^2 + 17 \cdot 3^4\right) = 333.$$ □
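The result of 12.G.3 can be verified by enumeration. In this sketch we assume the coordinates from part a): the eight sides of the octahedron correspond to the sign vectors (±1, ±1, ±1), and the 24 rotations are the signed permutation matrices of determinant 1. All names are illustrative.

```python
from itertools import permutations, product

FACES = list(product((1, -1), repeat=3))          # one sign vector per side
INDEX = {f: i for i, f in enumerate(FACES)}

def rotations():
    """Signed coordinate permutations with determinant +1 (the 24 rotations)."""
    rots = []
    for perm in permutations(range(3)):
        inversions = sum(perm[i] > perm[j] for i in range(3) for j in range(i + 1, 3))
        for signs in product((1, -1), repeat=3):
            det = (-1) ** inversions * signs[0] * signs[1] * signs[2]
            if det == 1:
                rots.append((perm, signs))
    return rots

def face_permutations():
    """For every rotation, the induced permutation of the 8 sides."""
    perms = []
    for perm, signs in rotations():
        image = lambda f: tuple(signs[i] * f[perm[i]] for i in range(3))
        perms.append(tuple(INDEX[image(f)] for f in FACES))
    return perms

def count_colorings(colors=3):
    """Number of colorings of the sides up to rotation (orbit count by enumeration)."""
    perms = face_permutations()
    seen = set()
    for c in product(range(colors), repeat=8):
        seen.add(min(tuple(c[p[i]] for i in range(8)) for p in perms))
    return len(seen)

print(len(rotations()), count_colorings())
```

The enumeration reproduces both the order of the group and the number of colorings obtained from Burnside's lemma.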
12.G.4. How many necklaces can be created from 9 white, 6 red, and 3 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection.
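Solution. The group of symmetries of the necklace is the dihedral group $D_{18}$, which has 36 elements. It acts on the set of necklaces, where we can number each place (1 through 18), resulting in 18!/(9! 6! 3!) = 4084080 necklaces (without any identification). The only symmetries that fix a non-zero number of necklaces are the rotations by 120° and 240°, the nine reflections whose axis goes through two opposite beads, and of course the identity. A rotation by ±120° splits the places into 6 cycles of length 3, so it fixes $\binom{6}{3}\binom{3}{2} = 60$ necklaces. A reflection through two beads must have one white and one black bead on its axis (2 possibilities), and the remaining 8 pairs consist of 4 white, 3 red, and 1 black pair, so it fixes $2\binom{8}{4}\binom{4}{3} = 560$ necklaces. By Burnside's lemma, the wanted number of necklaces is equal to
$$\frac{1}{36}\left(4084080 + 2 \cdot \binom{6}{3}\binom{3}{2} + 9 \cdot 2\binom{8}{4}\binom{4}{3}\right) = \frac{4089240}{36} = 113590.$$
□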
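The same Burnside computation can be automated for any numbers of beads. The sketch below is one possible implementation under our own conventions: the places are numbered 0, ..., n-1, rotations are i ↦ i + k, and reflections are i ↦ k - i (all modulo n); the function names are hypothetical.

```python
from math import gcd
from functools import lru_cache

def cycle_lengths_dihedral(n):
    """Yield the cycle lengths of every element of the dihedral group acting on n places."""
    for k in range(n):                              # rotations i -> i + k
        d = gcd(k, n)
        yield tuple([n // d] * d)
    for k in range(n):                              # reflections i -> k - i
        fixed = sum(1 for i in range(n) if (k - i) % n == i)
        yield tuple([1] * fixed + [2] * ((n - fixed) // 2))

def fixed_colorings(cycles, counts):
    """Number of ways to color whole cycles so that color j is used counts[j] times."""
    @lru_cache(maxsize=None)
    def go(i, remaining):
        if i == len(cycles):
            return 1 if not any(remaining) else 0
        total = 0
        for j, r in enumerate(remaining):
            if r >= cycles[i]:
                total += go(i + 1, remaining[:j] + (r - cycles[i],) + remaining[j + 1:])
        return total
    return go(0, tuple(counts))

def necklaces(counts):
    """Burnside's lemma: average number of fixed colorings over the 2n symmetries."""
    n = sum(counts)
    total = sum(fixed_colorings(c, counts) for c in cycle_lengths_dihedral(n))
    return total // (2 * n)

print(necklaces((9, 6, 3)))
```

The function can be applied in the same way to the bead counts of the following exercises.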
12.G.5. How many necklaces can be created from 6 white, 6 red, and 6 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ○
12.G.6. How many necklaces can be created from 8 white, 8 red, and 8 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ○
12.G.7. How many necklaces can be created from 3 white and 6 black beads? Beads of one color are indistinguishable, and two necklaces are considered the same if they can be transformed to each other by rotation and/or reflection. ○
Choosing other representatives $a \cdot h$, $b \cdot h'$ leads to the same result:
$$(a \cdot h \cdot b \cdot h') \cdot H = \left((a \cdot b) \cdot (b^{-1} \cdot h \cdot b) \cdot h'\right) \cdot H = (a \cdot b) \cdot H.$$
Hence for a normal subgroup, it does not matter whether right or left cosets are used. Thus, cosets can be written as $H \cdot a \cdot H$, and the equation $(H \cdot a) \cdot (b \cdot H) = H \cdot (a \cdot b) \cdot H$ is straightforward.
Clearly, this new operation on G/H satisfies all group axioms: the identity is the group H itself (formally it is the coset $e \cdot H$ that corresponds to the identity e of G), the inverse of $a \cdot H$ is $a^{-1} \cdot H$, and the associativity is clear from the definition. This is called the quotient group G/H of G by the normal subgroup H.
Of course, in commutative groups, every subgroup is normal. The subset
$$n\mathbb{Z} = \{na;\ a \in \mathbb{Z}\} \subset \mathbb{Z}$$
is a subgroup of the integers, and the corresponding quotient group is the (additive) group $\mathbb{Z}_n$ of residue classes.
It is clear from the definition that the kernel of every homomorphism is a normal subgroup. On the other hand, if a subgroup $H \subset G$ is normal, then the mapping
$$p : G \to G/H, \quad a \mapsto a \cdot H$$
is a surjective homomorphism whose kernel is H. Indeed, it can be seen directly from the definition of the operation on G/H that p is a homomorphism, and it is clearly surjective. It follows that normal subgroups are exactly the kernels of homomorphisms.
Moreover, for any group homomorphism $f : G \to K$ with kernel $H = \ker f$, there is a well-defined homomorphism
$$\tilde f : G/\ker f \to K, \quad \tilde f(a \cdot H) = f(a),$$
which is injective.
There is a seemingly paradoxical example of a group homomorphism $\mathbb{C}^* \to \mathbb{C}^*$, defined on the non-zero complex numbers by $z \mapsto z^k$, where k is a fixed positive integer. Clearly, this is a surjective homomorphism, and its kernel is the set of k-th roots of unity, i.e. the cyclic subgroup $\mathbb{Z}_k$. Reasoning as above, there is an isomorphism
$$\tilde f : \mathbb{C}^*/\mathbb{Z}_k \to \mathbb{C}^*$$
for any positive integer k. This example illustrates that in the case of infinite groups, the calculations with cardinalities are not as intuitive as in the case of finite groups and theorem 12.3.10.
12.3.12. Exact sequences. A normal subgroup H of a group G yields the short exact sequence of groups
$$e \to H \to G \to G/H \to e,$$
where the arrows respectively correspond to the only homomorphism of the trivial group {e} into the group H, the inclusion ι of the subgroup $H \subset G$, the projection ν onto the quotient group G/H, and the only homomorphism of the group G/H onto the trivial group {e}. In each case, the image of
H. Codes
12.H.1. Consider the (5,3)-code over $\mathbb{Z}_2$ generated by the polynomial $x^2 + x + 1$. Find all codewords as well as the generating matrix and the check matrix.
Solution. Let $p(x) = x^2 + x + 1$. The codewords are precisely the multiples of the generating polynomial:
$$0 \cdot p,\ 1 \cdot p,\ x \cdot p,\ (x+1) \cdot p,\ x^2 \cdot p,\ (x^2+1) \cdot p,\ (x^2+x) \cdot p,\ (x^2+x+1) \cdot p,$$
that is,
$$0,\ x^2+x+1,\ x^3+x^2+x,\ x^3+1,\ x^4+x^3+x^2,\ x^4+x^3+x+1,\ x^4+x,\ x^4+x^2+1,$$
or, written as words,
00000, 11100, 01110, 10010, 00111, 11011, 01001, 10101.
The basis vectors multiplied by $x^{5-3} = x^2$ yield mod (p):
$$x^2 = x + 1, \quad x^3 = x(x+1) = x^2 + x = 1, \quad x^4 = x.$$
This means that the basis vectors are encoded as follows:
$$1 \mapsto x^2 + x + 1, \quad x \mapsto x^3 + 1, \quad x^2 \mapsto x^4 + x,$$
i.e.
100 ↦ 11100, 010 ↦ 10010, 001 ↦ 01001.
Thus, the generating matrix is
$$G = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
and the check matrix is
$$H = \begin{pmatrix} 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \end{pmatrix}.$$
□
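The encoding used in 12.H.1 can be sketched in a few lines of Python. This is only an illustration under our own conventions: a polynomial over $\mathbb{Z}_2$ is stored as an integer whose bit i is the coefficient of $x^i$, words are written as $a_0 a_1 \dots a_{n-1}$, and the helper names `poly_mod` and `encode` are hypothetical.

```python
from itertools import product

def poly_mod(a, p):
    """Remainder of the polynomial a modulo p over Z_2 (bit i = coefficient of x^i)."""
    dp = p.bit_length() - 1
    while a and a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)   # cancel the leading term of a
    return a

def encode(message, p, n):
    """Systematic encoding: x^(n-k) * m(x) plus its remainder modulo p."""
    k = len(message)
    m = sum(1 << i for i, bit in enumerate(message) if bit == "1")
    shifted = m << (n - k)
    v = shifted ^ poly_mod(shifted, p)
    return "".join("1" if (v >> i) & 1 else "0" for i in range(n))

P = 0b111                                      # x^2 + x + 1
codewords = sorted(encode("".join(bits), P, 5) for bits in product("01", repeat=3))
print(encode("100", P, 5), encode("010", P, 5), encode("001", P, 5))
print(codewords)
```

The three encoded basis vectors are the columns of the generating matrix G, and the eight codewords coincide with the list of multiples of p(x) above.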
12.H.2. Find the generating matrix and the check matrix of the (7,4)-code over $\mathbb{Z}_2$ generated by the polynomial $x^3 + x + 1$. ○
12.H.3. A 7-bit message $a_0 a_1 \dots a_6$, considered as the polynomial $a_0 + a_1 x + \dots + a_6 x^6$, is encoded using the polynomial code generated by $x^4 + x + 1$.
i) Encode the message 1100011.
ii) You have received the word 10111010001. What was the message if you assume that at most one bit was flipped?
one arrow is precisely the kernel of the following one. This is the definition of exactness of a sequence of homomorphisms.
If there exists a homomorphism $\sigma : G/H \to G$ such that $\nu \circ \sigma = \mathrm{id}_{G/H}$, it is said that the exact sequence splits.
Lemma. Every split short exact sequence of commutative groups defines an isomorphism $G \to H \times G/H$.
Proof. Define a mapping $f : H \times G/H \to G$ by $f(a, b) = a \cdot \sigma(b)$. Since the groups are commutative, f is a homomorphism:
$$f(aa', bb') = aa'\sigma(b)\sigma(b') = (a\sigma(b))(a'\sigma(b')).$$
If $f(a, b) = e$, then $\sigma(b) = a^{-1} \in H$, i.e. $b = \nu(\sigma(b))$ is the identity in G/H. However, its image is then $\sigma(b) = e$, so $a = e$. Since the left and right cosets of commutative groups coincide, the mapping f is surjective. Hence f is an isomorphism. □
Now, the main idea of the proof of theorem 12.3.8 can be indicated. If it is known that every short exact sequence created by choosing cyclic subgroups H of a finite commutative group G splits, then it is easy to proceed with the proof by induction. If G is a group of order n which is not cyclic, then an element a of order p, p < n, can be selected. The cyclic subgroup H generated by a can be found as well as the splitting of the corresponding short exact sequence. This expresses the group G as the product of the selected cyclic subgroup H and the group G/H of order n/p.
The main technical point of the proof is the verification that in each finite commutative group, there are elements of order $p^r$ with appropriate powers of the primes p and that the short exact sequences for these groups really split.
12.3.13. Return to finite Abelian groups. Below is a brief exposition of the complete proof of the classification theorem, broken into several steps. The following lemma suggests that cyclic subgroups with prime orders are required.
Lemma (Claim 1). Let G be a finite Abelian group with n elements. If p is a prime which divides n, then there is an element $g \in G$ of order p.¹⁰
Proof. The claim is obvious if n is prime, i.e. $G = \mathbb{Z}_p$ (as proved above). If n is not prime, proceed by induction on n. Clearly G must have a proper non-trivial subgroup H if n is not prime, $|H| = m < n$. Either $p \mid m$ or $p \mid (n/m)$. In the former case, the claim follows from the induction hypothesis directly.
Otherwise assume $p \mid (n/m)$. Then, by the induction hypothesis, there is an element $g \in G$ such that the order of $g \cdot H$ in the quotient group G/H is p. Since the order of $g \cdot H$ in G/H divides the order of g in G, the order of g is $\ell p$ for some integer ℓ. Hence the element $g^\ell$ has the required order p. □
¹⁰ This is a special version of the more general result valid for all finite groups, called the Cauchy theorem. The formulation remains the same, with the word Abelian omitted.
iii) What was the message in ii) if you assume that exactly two bits were flipped?
Solution. i) Modulo $x^4 + x + 1$, we have
$$x^4 = x + 1, \quad x^5 = x^2 + x, \quad x^9 = x^3 + x, \quad x^{10} = x^2 + x + 1,$$
whence
For any prime number p, G is called a p-group if each of its elements has order $p^k$ for some power k. Claim 1 has an obvious corollary:
Lemma (Claim 2). A finite group G is a p-group if and only if its number of elements n is a power of p.
Proof. One implication follows straight from Lagrange's theorem, since all proper divisors of a power of a prime p are just smaller powers of p.
On the other hand, if n is not a power of a prime, it has another prime divisor q and so there is an element of order q by Claim 1. □
Now it can be shown that a given finite Abelian group G can always be decomposed into a product of p-groups.
Lemma (Claim 3). If G is a finite Abelian group, then it is isomorphic to a product of p-groups. This decomposition is unique up to order.
$$1 + x + x^5 + x^6 \mapsto x^4 + x^5 + x^9 + x^{10} + (x + 1) + (x^2 + x) + (x^3 + x) + (x^2 + x + 1) = x^3 + x^4 + x^5 + x^9 + x^{10}.$$
Thus, the code is 00011100011.
ii) The polynomial $1 + x^2 + x^3 + x^4 + x^6 + x^{10}$ divided by $x^4 + x + 1$ gives the remainder $x^2 + 1 = x^8$. Thus, the ninth bit was flipped and the original message was 1010101.
iii) Either the first and third bits were flipped ($x^2 + 1$), or the fifth and sixth were ($x^4 + x^5 = x^2 + 1$). In the first case, the message was 1010001, while in the second case, it was 0110001. □
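Parts i) and ii) can be reproduced with a short program. As before, this is only a sketch under our own conventions (bit i of an integer is the coefficient of $x^i$, words are written as $a_0 a_1 \dots$); the function names are hypothetical, and the decoder assumes that at most one bit was flipped.

```python
def poly_mod(a, p):
    """Remainder of a modulo p over Z_2 (bit i = coefficient of x^i)."""
    dp = p.bit_length() - 1
    while a and a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

def to_int(word):
    return sum(1 << i for i, bit in enumerate(word) if bit == "1")

def to_word(value, n):
    return "".join("1" if (value >> i) & 1 else "0" for i in range(n))

def encode(message, p, n):
    """Redundancy bits (the remainder) followed by the message bits."""
    shifted = to_int(message) << (n - len(message))
    return to_word(shifted ^ poly_mod(shifted, p), n)

def decode_single_error(received, p, k):
    """Correct at most one flipped bit and return the k message bits."""
    n = len(received)
    word = to_int(received)
    syndrome = poly_mod(word, p)
    if syndrome:
        # find the power x^i whose remainder equals the syndrome and flip bit i
        for i in range(n):
            if poly_mod(1 << i, p) == syndrome:
                word ^= 1 << i
                break
    return to_word(word, n)[n - k:]

P = 0b10011                                   # x^4 + x + 1
print(encode("1100011", P, 11))
print(decode_single_error("10111010001", P, 7))
```

If the syndrome does not equal the remainder of any single power of x, the sketch returns the message bits unchanged; a fuller decoder would report such a word as uncorrectable.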
12.H.4. A 7-bit message $a_0 a_1 \dots a_6$, considered as the polynomial $a_0 + a_1 x + \dots + a_6 x^6$, is encoded using the polynomial code generated by $x^4 + x^3 + 1$.
i) Encode the message 1101011.
ii) You have received the word 01001011101. What was the message if you assume that at most one bit was flipped?
iii) What was the message in ii) if you assume that exactly two bits were flipped?
Proof. Consider a prime p dividing $n = |G|$. Define $G_p$ to be the subgroup of all elements whose orders are powers of p, while $G'_p$ is the subgroup of all elements whose orders are not divisible by p (check yourself that subgroups are obtained in this way). By the above Claim 1, the subgroup $G_p$ is not trivial.
Next, consider an element g of order $q p^\ell$ with q not divisible by p. Then $g^{p^\ell}$ has order q, so this element belongs
Solution. i) Modulo $x^4 + x^3 + 1$, we have
$$x^4 = x^3 + 1, \quad x^5 = x^3 + x + 1, \quad x^7 = x^2 + x + 1, \quad x^9 = x^2 + 1, \quad x^{10} = x^3 + x,$$
to $G'_p$, while $g^q \in G_p$. The Bezout equality guarantees there are integers r and s such that $r p^\ell + s q = 1$. Hence
$$g = g^{r p^\ell} \cdot g^{s q}$$
is a decomposition of g into a product of elements in $G'_p$ and $G_p$. This verifies $G \cong G_p \times G'_p$, and $G_p$ is a p-group.
This process can be repeated for the subgroup G'p and the remaining primes in the decomposition in order to complete the proof.
The uniqueness claim is obvious. □
It remains to consider the p-groups only. The next claim shows that the p-groups which are not cyclic must have more than one subgroup of order p.
Lemma (Claim 4). If a finite Abelian p-group G has just one subgroup H with $|H| = p$, then it is cyclic.
Proof. The case $p = n = |G|$ is obvious. Proceed by induction on n. Assume H is the only subgroup of order p, consider $\sigma : G \to G$, $\sigma(g) = g^p$, and write $K = \ker \sigma$. Then $H \subset K$ and, since p is prime, all non-identity elements in K have order p. For any such $g \in K$, the cyclic group generated by g has order p and so coincides with H; consequently $H = K$. If $G \neq K$, then $\sigma(G)$ is a non-trivial subgroup in G which must be isomorphic to G/K. By Claims 1 and 2, there is a subgroup in $\sigma(G)$ of order p. This yields a subgroup in G and by assumption it is again H. Finally, apply the induction hypothesis to the group $\sigma(G) \cong G/H$, which has to be cyclic. Choosing a generator $g \cdot H$ of the latter group, even in
thus we get
$$1 + x + x^3 + x^5 + x^6 \mapsto x^4 + x^5 + x^7 + x^9 + x^{10} + (x^3 + 1) + (x^3 + x + 1) + (x^2 + x + 1) + (x^2 + 1) + (x^3 + x) = x^4 + x^5 + x^7 + x^9 + x^{10} + x^3 + x.$$
Therefore, the code is 0101 1101011 (the redundancy bits 0101 followed by the message 1101011).
ii) The polynomial $x + x^4 + x^6 + x^7 + x^8 + x^{10}$ divided by $x^4 + x^3 + 1$ gives the remainder $x^2 + x + 1 = x^7$. Thus, the eighth bit was flipped, and the original message was 1010101.
iii) Either the second and tenth bits were flipped ($x + x^9 = x^2 + x + 1$), or the fourth and seventh ($x^3 + x^6 = x^2 + x + 1$), or the fifth and ninth ($x^4 + x^8 = x^2 + x + 1$). The respective sent codewords are 00001011111, 01011001101, and 01000011001, i.e., the messages are 1011111, 1001101, and 0011001. □
12.H.5. Consider the (15,11)-code generated by the polynomial $1 + x^3 + x^4$. We have received the word 011101110111001. Find the original 11-bit message, provided exactly one bit was flipped.
Solution. The word is a codeword if and only if it is divisible by the generating polynomial $1 + x^3 + x^4$. The received word corresponds to the polynomial $x + x^2 + x^3 + x^5 + x^6 + x^7 + x^9 + x^{10} + x^{11} + x^{14}$. When divided by $1 + x^3 + x^4$, it leaves the remainder $x + 1$. This means that an error has occurred. If we assume that only one bit was flipped, then there must be a power of x which is equal to this remainder modulo $1 + x^3 + x^4$. Thus, we compute $x^4 = x^3 + 1$, $x^5 = x^3 + x + 1$, ..., $x^{12} = x + 1$ and find out that the thirteenth bit was flipped, and the original message was 01110111101.
Let us look at the exercise more thoroughly. Computing all powers of x, we obtain
$$\begin{aligned}
x^4 &= x^3 + 1, \\
x^5 &= x^3 + x + 1, \\
x^6 &= x^3 + x^2 + x + 1, \\
x^7 &= x^2 + x + 1, \\
x^8 &= x^3 + x^2 + x, \\
x^9 &= x^2 + 1, \\
x^{10} &= x^3 + x, \\
x^{11} &= x^3 + x^2 + 1, \\
x^{12} &= x + 1, \\
x^{13} &= x^2 + x, \\
x^{14} &= x^3 + x^2,
\end{aligned}$$
G the cyclic subgroup generated by g must have a subgroup of order p (again Claim 1). The uniqueness assumption ensures that this is again the subgroup H. Hence the group G is cyclic. □
Finally, a splitting condition for the p-groups is proved, which provides the property discussed in the end of the previous paragraph on the exact sequences. This completes the entire proof of the classification theorem.
Lemma (Claim 5). Let G be a finite Abelian p-group and let C be a cyclic subgroup of maximal order in G. Then G = C x L for some subgroup L.
Proof. If G is cyclic, then C = G and of course $G = C \times L$ with $L = \{e\}$. Proceed by induction on $n = |G|$. Assume G is not cyclic. Then, by Claim 4, it contains more than one subgroup of order p. The cyclic subgroup C contains exactly one such subgroup. Choose H to be another subgroup of order p, which is therefore not a subgroup in C. Since p is prime, the intersection of H and C is trivial. Consequently, the quotient group $(C \times H)/H \subset G/H$ is isomorphic to C.
Now consider the induction step. The order of the cyclic subgroup $(C \times H)/H$ in G/H must be maximal, since the orders of the elements $g \cdot H$ in the quotient group are divisors of the orders of the elements g in the group G. By the induction hypothesis,
$$G/H = (C \times H)/H \times K$$
for some subgroup $K \subset G/H$. Clearly the preimage of K under the quotient projection is a group L satisfying $H \subset L \subset G$. Now, the latter identification of G/H with the product implies
$$G = (C \cdot H) \cdot L = C \cdot (H \cdot L) = C \cdot L.$$
At the same time, $L \cap (C \cdot H) = H$ and so $L \cap C = \{e\}$. So $G = C \times L$. □
The proof is complete up to the uniqueness claim. It is known already that the decomposition into p-groups is unique. Assume that a p-group G decomposes into two products of cyclic groups $H_1 \times \dots \times H_k$ and $H'_1 \times \dots \times H'_\ell$ with non-increasing orders of $H_i$ and $H'_j$. Then the orders of $H_1$ and $H'_1$ coincide, since these are the maximal orders in G. By induction all the orders coincide and the work is complete.
The classification theorem is a special case of a more general result on finitely generated Abelian groups. In additive notation, if $g_1, \dots, g_t$ are generators of the entire G, then all elements of G are of the form $a_1 g_1 + \dots + a_t g_t$ with integer coefficients $a_i$. The general theorem provides a severe restriction for possible relations between such combinations. In fact it says that all finitely generated Abelian groups are products of cyclic groups, hence $G = \mathbb{Z}^\ell \times \mathbb{Z}_{p_1} \times \dots \times \mathbb{Z}_{p_k}$. This means there is always a finite number of completely independent generators of G and each of them generates a cyclic subgroup in G. (Compare this to the description of finite dimensional vector spaces via their bases, as discussed in chapter 2.)
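so the generating matrix is
$$G = \begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
We can verify that multiplication by 01110111101 yields the codeword 011101110111101, which differs from the received word 011101110111001 exactly in the thirteenth bit. □
Now, we begin to use the check matrix efficiently.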
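12.H.6. Find the generating matrix and the check matrix of the (7,2)-code (i.e., there are 2 bits of the message and 5 redundant bits) generated by the polynomial $x^5 + x^4 + x^2 + 1$.
Decode the received word 0010111 (i.e., find the message that was sent) assuming that the least number of errors occurred.
Solution. The generating matrix of the code is
$$G = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
The generating matrix is of the form $G = \begin{pmatrix} P \\ I_2 \end{pmatrix}$, where in our case
$$P = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}.$$
The check matrix is of the form $H = (I_5 \mid P)$, i.e.,
$$H = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}.$$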
12.3.14. Group actions. Groups can be considered as sets of transformations of a fixed set. All the transformations are invertible, and the set of transformations must be closed under composition. The idea is to work with a fixed group whose elements are represented as mappings on a fixed set, but the mappings corresponding to different elements of the group need not be different. For instance, the rotations around the origin over all possible angles correspond to the group of real numbers. On the other hand, the rotation by 2π is the identity as a mapping.
Formally, this situation can be described as follows:
Group actions
A left action of a group G on a set S is a homomorphism of the group G to the subgroup of invertible mappings in the monoid $S^S$ of all mappings $S \to S$. Such a homomorphism can be viewed as a mapping $\varphi : G \times S \to S$ which satisfies
$$\varphi(a \cdot b, x) = \varphi(a, \varphi(b, x)),$$
hence the name "left action". Often, the notation $a \cdot x$ is used to refer to the result of an $a \in G$ applied to a point $x \in S$ (although this is a different dot than the operation inside the group). Then, the defining property can be expressed as
$$(a \cdot b) \cdot x = a \cdot (b \cdot x).$$
The image of a point $x \in S$ in the action of the entire group G is called the orbit $S_x$ of x, i.e.
$$S_x = \{y = \varphi(a, x);\ a \in G\}.$$
For each point $x \in S$, define the isotropy subgroup $G_x \subset G$ of the action φ:
$$G_x = \{a \in G;\ \varphi(a, x) = x\}.$$
If for every two points $x, y \in S$, there is an $a \in G$ such that $\varphi(a, x) = y$, then the action φ is said to be transitive.
Choosing any two points $x, y \in S$ and a $g \in G$ which maps x to $y = g \cdot x$, then the set $\{ghg^{-1};\ h \in G_x\}$ is clearly the isotropy subgroup $G_y$. In addition, the mapping $h \mapsto ghg^{-1}$ is a group isomorphism $G_x \to G_y$.
In the case of transitive actions, the entire space forms a single orbit and all isotropy subgroups have the same cardinality.
As an example of a transitive action of a finite group, consider the obvious action of the symmetric group of a fixed set X on X itself. The natural action of all invertible linear transformations on the non-zero elements of a vector space V is also transitive. However, if the entire space V is selected, then the zero vector forms its own orbit.
The mentioned example of the action of the additive group of real numbers that acts as rotations around a fixed center O in the plane is not transitive. The orbit of each point M is the circle which is centered at O and goes through M.
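Multiplying the received word by the check matrix, we get the syndrome (error) of the word:
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 0 & 1 & 1 & 1 & 1 \end{pmatrix}^T.$$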
The syndrome corresponding to the received word is 01111. Now, we find all words corresponding to this syndrome. This can be done by adding all codewords to the received word. There are four codewords, corresponding to the four possible messages. They are obtained by multiplying the messages (00, 01, 10, 11) by the generating matrix. Thus, we get the codewords
0000000,1111101,1010110,0101011.
The space of words corresponding to a given syndrome is an affine space whose direction is the vector space of all codewords (see 12.4.8). Thus, we get the words
0010111,1101010,1000001,0111100.
The least number of errors is equal to the least number of ones in the obtained words. In our case, this is the word 1000001, which contains only two ones and thus is the so-called leading representative of the class of words with syndrome 01111. The original message can be obtained by subtracting (or adding - this is equivalent in Z2) the received word and the leading representative of the class with the given syndrome. In our case, we get
0010111 - 1000001 = 1010110.
Therefore, assuming the least number of errors, the sent word was 1010110, where the last two bits are the original message, i. e., 10. □
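The method of leading representatives from this solution can be sketched as follows. The matrices are those of 12.H.6; the function names and the representation of words as tuples of bits are our own choices.

```python
from itertools import product

P = [(1, 1), (0, 1), (1, 1), (0, 1), (1, 1)]     # redundancy block of G
G = P + [(1, 0), (0, 1)]                          # generating matrix, P stacked on I_2
H = [tuple(int(i == j) for j in range(5)) + P[i] for i in range(5)]   # (I_5 | P)

def mat_vec(M, v):
    """Product of a matrix and a column vector over Z_2."""
    return tuple(sum(a * b for a, b in zip(row, v)) % 2 for row in M)

def add(u, v):
    return tuple((a + b) % 2 for a, b in zip(u, v))

def decode(received):
    """Return the syndrome, the leading representative, and the sent codeword."""
    z = tuple(int(c) for c in received)
    syndrome = mat_vec(H, z)
    # the class of z: the received word plus every codeword
    coset = [add(z, mat_vec(G, m)) for m in product((0, 1), repeat=2)]
    leader = min(coset, key=sum)                  # word with the fewest ones
    return syndrome, leader, add(z, leader)

syndrome, leader, sent = decode("0010111")
print(syndrome, leader, sent, sent[-2:])
```

For longer codes, one would precompute the leading representative of every syndrome once instead of listing the whole class for each received word.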
12.H.7. Consider the (7,3)-code generated by the polynomial $x^4 + x^3 + x + 1$. Find its generating matrix and check matrix. Using the method of leading representatives, decode the received word 1110010. ○
A typical example of a transitive action of a group G is the natural action on the set G/H of left cosets for any subgroup H. It is defined by
$$g \cdot (aH) = (ga)H.$$
This is the form of every transitive group action. For any transitive action $G \times S \to S$ and a fixed point $x \in S$, S can be identified with the set $G/G_x$ of left cosets by $gG_x \mapsto g \cdot x$. Clearly, this mapping is surjective, and the images $g \cdot x = h \cdot x$ coincide if and only if $h^{-1}g \in G_x$, which is equivalent to $gG_x = hG_x$. Finally, note that this identification transforms the original action of G on S just to the mentioned action of G on $G/G_x$.
12.3.15. Theorem. Let an action of a finite group G on a finite set S be given. Then:
(1) for each point $x \in S$,
$$|G| = |G_x| \cdot |S_x|,$$
(2) (Burnside's lemma) if N denotes the number of orbits, then
$$N = \frac{1}{|G|} \sum_{g \in G} |S_g|,$$
where $S_g = \{x \in S;\ g \cdot x = x\}$ denotes the set of fixed points of the action corresponding to an element g.
Proof. Consider any point $x \in S$ and its isotropy subgroup $G_x \subset G$. The same argument as the one at the end of the previous paragraph for transitive group actions can be applied to each action of the group G. This gives the mapping $G/G_x \to S_x$, $g \cdot G_x \mapsto g \cdot x$. If $g \cdot x = h \cdot x$, then clearly $g^{-1}h \in G_x$, so this mapping is injective. Clearly, it is also surjective, which means that the cardinalities of the finite sets must satisfy $|G/G_x| = |S_x|$. The first proposition follows, because $|G| = |G/G_x| \cdot |G_x|$.
The second proposition is proved by counting, in two different ways, the cardinality of the set of fixed points of individual group elements:
$$F = \{(x, g) \in S \times G;\ g(x) = x\} \subset S \times G.$$
Since these are finite sets, the elements of the Cartesian product $S \times G$ can be considered as entries of a matrix (columns are indexed by elements of S, rows are indexed by elements of G). Summing this matrix up, either by rows or by columns, yields
$$|F| = \sum_{g \in G} |S_g| = \sum_{x \in S} |G_x|.$$
For the sake of clarity, choose one representative for each orbit (denote them $x_1, \dots, x_N$), and recall that the cardinalities of the isotropy groups for points that lie in the same orbit are the same. Using the proved statement (1), it is obtained easily that
$$\sum_{g \in G} |S_g| = |F| = \sum_{j=1}^{N} \sum_{x \in S_{x_j}} |G_x| = \sum_{j=1}^{N} |S_{x_j}| \, |G_{x_j}| = N \cdot |G|,$$
which completes the proof. □
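Both statements of the theorem can be checked on a small example. The sketch below uses, as our own illustrative choice, the cyclic group of four rotations acting on the 16 binary words of length 4 by cyclic shifts.

```python
from itertools import product

n = 4
S = list(product((0, 1), repeat=n))       # the set acted upon: binary words of length 4
G = list(range(n))                        # the group Z_4, g acts as a cyclic shift by g

def act(g, x):
    return x[g:] + x[:g]

def orbit(x):
    return {act(g, x) for g in G}

def isotropy(x):
    return [g for g in G if act(g, x) == x]

# statement (1): |G| = |G_x| * |S_x| for every point x
print(all(len(G) == len(isotropy(x)) * len(orbit(x)) for x in S))

# statement (2): the number of orbits equals the average number of fixed points
orbits = {frozenset(orbit(x)) for x in S}
fixed_total = sum(sum(1 for x in S if act(g, x) == x) for g in G)
print(len(orbits), fixed_total // len(G))
```

Here the four rotations fix 16, 2, 4, and 2 words respectively, so the average is 24/4 = 6, which is the number of orbits found by direct enumeration.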
12.H.8. Consider the linear (7,4)-code (i.e., the message has length 4) over $\mathbb{Z}_2$ defined by the matrix
$$\begin{pmatrix}
0 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.$$
Decode the received word 1010001 (i.e., find the sent message) assuming that the least number of errors occurred.
Solution. There are $2^4 = 16$ possible messages. All codewords can be obtained by multiplying the possible messages (0000, 0001, ..., 1111) by the generating matrix of the code. Thus, we get:
It is recommended to think through how useful these propositions are for solving combinatorial problems, cf. 12.G.1 through 12.G.4.
4. Coding theory
Data often need to be transferred with a guarantee that they arrive correctly. In some cases, it suffices to recognize whether the data have been changed, since they can then be transferred again. In other cases, this might not be reasonable, so the data need to be recovered even after a small number of errors in the transfer. This is the goal of coding theory. Some of the algorithms are explored now.
Notice that coding is quite different from encrypting. If no one but the addressee is meant to be able to read the message, then it should be encrypted. This topic is discussed briefly at the end of the previous chapter.
0110001, 1010010, 1100100, 0111000
1100011, 1010101, 0001001, 1011100
1101010, 0110110, 0001110, 1101101
1011011, 0000111, 0111111, 0000000.
Now, we construct the check matrix of the given code:
$$H = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1
\end{pmatrix}$$
(we remove the block of the generating matrix that consists of the identity matrix and to the left of the remaining block, we write the identity matrix of fitting size). Now, multiplying the vector of the received word $z^T = (1010001)$ by H, we get the syndrome $s = Hz = (110)^T$. One word with this syndrome is 1100000 (we fill the syndrome with zeros to the appropriate length). All words with syndrome 110 are obtained by adding this word to all codewords. Thus, we get the words
1000001, 0110010, 0000100, 1011000,
0000011, 0110101, 1101001, 0111100,
0001010, 1010110, 1101110, 0001101,
0111011, 1100111, 1011111, 1100000
Out of these words with syndrome 110, only the word 0000100 contains a single one, so this is the leading representative of the class of words with syndrome 110. Subtracting the leading representative from the received word, we get the word that was sent, assuming the least number of bit flips (1 in this case), i. e., the word (101)0101, where the last four bits are the message, i. e. 0101. □
12.4.1. Codes. Data transfer is usually prone to errors. Since any information may be encoded as a sequence of bits (zeros or ones), the work is with $\mathbb{Z}_2$, although the theory may be developed even for a general finite field. Furthermore, the length of the message to be transferred is assumed to be known in advance. Thus, one transfers k-bit words, where $k \in \mathbb{N}$ is fixed.
It is desired to detect potential errors, and if possible, recover the original data. For this reason, further (n − k) bits are added to the k-bit word, where n is also fixed (and of course n > k). These are called (n, k)-codes.
There are $2^k$ binary words of length k and each should be mapped to one of the $2^n$ possible codewords. For an (n, k)-code, there remain
$$2^n - 2^k = 2^k\left(2^{n-k} - 1\right)$$
words which are not codewords (if such a word is received, then an error has occurred). Thus, even for a large value of k, only a few bits added provide much redundant information.
The simplest example is the parity check code. Having a message of length k, the codeword is created by adding a bit whose value is determined so that the total number of ones is even. This is an example of a (k + 1, k)-code.
If there occur an odd number of errors during the transfer, then it can be detected with this simple code. Every two codewords differ in at least two bits, but an error word differs from at least two codewords in only one bit. Therefore, this code is unable to recover the original message, even with the assumption that only one bit was changed.
The following diagram illustrates all 2-bit words with the parity bit added. The codewords are marked with a bold dot.
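12.H.9. Consider the linear (7,4)-code (i.e., the message has length 4) over $\mathbb{Z}_2$ defined by the matrix
$$\begin{pmatrix}
1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.$$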
Decode the received word 1101001 (i. e., find the sent message) assuming that the least number of errors occurred.
Solution. Syndrome 101, leading representative 0001000, sent message (110)0001 □
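12.H.10. Consider the linear (7,4)-code (i.e., the message has length 4) over $\mathbb{Z}_2$ defined by the matrix
$$\begin{pmatrix}
1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.$$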
Decode the received word 0000011 (i. e., find the sent message) assuming that the least number of errors occurred.
Solution. Syndrome 011, leading representative 0000100, sent message (000)0111. □
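12.H.11. Consider the linear (7,4)-code (i.e., the message has length 4) over $\mathbb{Z}_2$ defined by the matrix
$$\begin{pmatrix}
0 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.$$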
Decode the received word 0001100 (i. e., find the sent message) assuming that the least number of errors occurred.
Solution. Syndrome 110, leading representative 0000010, sent message (000) 1110. □
12.H.12. We want to transfer one of four possible messages with a binary code which should be able to correct all single errors. What is the minimum possible length of the codewords (all codewords have to be of the same length)? Why?
[Figure: the cube whose vertices are the eight 3-bit words; the codewords 000, 011, 101, and 110 are marked with bold dots.]
Moreover, the parity check code is unable to detect the error of interchanging a pair of adjacent bits, which often happens.
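The behaviour of the parity check code described above is easy to try out. In the following sketch, the function names are ours and the parity bit is appended at the end of the message, which is one possible convention.

```python
def add_parity(message):
    """Append a bit so that the total number of ones is even."""
    return message + str(message.count("1") % 2)

def is_codeword(word):
    """A word is a codeword exactly when it contains an even number of ones."""
    return word.count("1") % 2 == 0

codewords = [add_parity(m) for m in ("00", "01", "10", "11")]
print(codewords)

sent = add_parity("10")                      # '101'
one_error = "001"                            # first bit flipped: detected
swapped = sent[0] + sent[2] + sent[1]        # last two bits interchanged: '110'
print(is_codeword(one_error), is_codeword(swapped))
```

A single flipped bit changes the parity and is detected, whereas interchanging two adjacent bits keeps the number of ones and therefore goes unnoticed.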
12.4.2. Word distance. In the diagram of the parity check (3,2)-code, each error word is at the "same" distance from three codewords: those which differ from it in exactly 1 bit. The other words are farther. Formally, this observation can be described by the following definition of distance:
Word distance
The Hamming distance of a pair of words (of equal length) is the number of bits in which they differ.
Consider words x, y, z such that x and y differ in r bits, and y and z differ in s bits. Then, x and z differ in at most r + s bits, which verifies the triangle inequality for distances.
If the code is to detect errors in r bits, then the minimum distance between each pair of codewords must be at least r + 1. If the code is to recover errors in r bits, then there must exist only one codeword whose distance from the received word is at most r. Thus, the following propositions are verified:
Theorem. (1) A code reliably detects at most r errors if and only if the minimum Hamming distance of the codewords is r + 1.
(2) A code reliably detects and recovers at most r errors if and only if the minimum Hamming distance of the codewords is 2r + 1.
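The theorem can be tried out on small codes. The following is a minimal sketch, assuming that words are stored as strings of the characters 0 and 1; the function names are illustrative only. It computes the minimum distance d of the parity check (3,2)-code and of the code which repeats each bit three times, and prints the numbers d - 1 and (d - 1) // 2 of errors that the theorem says can be detected and recovered, respectively.

```python
from itertools import combinations

def hamming_distance(u: str, v: str) -> int:
    """Number of positions in which two words of equal length differ."""
    if len(u) != len(v):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(u, v))

def minimum_distance(codewords) -> int:
    """Least Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(u, v) for u, v in combinations(codewords, 2))

parity_code = ["000", "011", "101", "110"]   # parity check (3,2)-code
repetition_code = ["000", "111"]             # every bit sent three times

for name, code in [("parity", parity_code), ("repetition", repetition_code)]:
    d = minimum_distance(code)
    # minimum distance, number of detectable errors, number of recoverable errors
    print(name, d, d - 1, (d - 1) // 2)
```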
12.4.3. Construction of polynomial codes. For practical applications, the codewords should be constructed efficiently so that they can be easily recognized among all the words. The parity check code is one example; another trivial possibility is to simply repeat the bits. For instance, the (3,1)-code can be considered which triplicates each bit.
A systematic way for code construction is to use division of polynomials. A message $b_0 b_1 \dots b_{k-1}$ is understood as the polynomial
$$m(x) = b_0 + b_1 x + \dots + b_{k-1}x^{k-1}$$
over the field $\mathbb{Z}_2$. The encoded message should be another polynomial $v(x)$ of degree at most $n - 1$.
Solution. Let us denote the desired length as n. The minimum Hamming distance of any two codewords must be at least three. This means that if we take two different codewords and flip one bit in each, the resulting words must also be different (and also different from each codeword). There are n + 1 words that can be obtained from a given one by flipping at most one bit (this includes the original word itself as well). Thus, we must have at least 4(n + 1) possible words. On the other hand, there are $2^n$ words of length n, which means that $4(n + 1) \le 2^n$. This inequality is satisfied only for $n \ge 5$. Thus, the codewords must be at least 5 bits long. And indeed, there are four codewords of length 5 with minimum Hamming distance 3, for instance 00111, 01001, 10100, 11010. □
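The counting argument of this solution is easy to evaluate mechanically. The sketch below is only an illustration; the names ball_size and least_length are not taken from the text. It finds the least n for which the number of messages times the number of words within distance t of a codeword does not exceed $2^n$. Such an n is only a necessary length; whether a code of that length really exists has to be checked separately, as was done above by listing four codewords.

```python
from math import comb

def ball_size(n: int, t: int) -> int:
    """Number of words of length n obtained by flipping at most t bits of a given word."""
    return sum(comb(n, i) for i in range(t + 1))

def least_length(messages: int, t: int) -> int:
    """Least n with messages * ball_size(n, t) <= 2**n (necessary condition only)."""
    n = 1
    while messages * ball_size(n, t) > 2 ** n:
        n += 1
    return n

# four messages, single errors to be corrected
print(least_length(4, 1))
```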
12.H.13. We want to transfer 4-bit messages with a binary code which should be able to correct all single and double errors. What is the minimum possible length of the codewords (all codewords have to be of the same length)? Why?
Solution. We proceed similarly as in the above exercise. If the code is to correct double errors as well, then the minimum Hamming distance of any two codewords must be at least five. This means that if we take two different codewords and flip up to two bits in each, the resulting words must also be different. Denoting by n the length of the words, we get the inequality
$$2^4\left(1 + n + \binom{n}{2}\right) \le 2^n.$$
The least value of n for which it is satisfied is n = 10, so the codewords must be at least 10 bits long. □
Polynomial codes
I. Extension of the stereographic projection
Let us try to extend the definition of the stereographic projection so that a circle would be parametrized by points of $P_1(\mathbb{R})$. Let us look at the corresponding mapping $P_1(\mathbb{R}) \to P_2(\mathbb{R})$. The points in projective extensions will be defined in the so-called homogeneous coordinates, which are given up to a multiple. For instance, the points in $P_2(\mathbb{R})$ will be $(x : y : z)$.
A circle in the plane $z = 1$ is given as the intersection of the cone of directions defined by $x^2 + y^2 - z^2 = 0$ with this plane. The inversion of the stereographic projection (i.
Let $p(x) = a_0 + \dots + a_{n-k}x^{n-k} \in \mathbb{Z}_2[x]$ be a polynomial with coefficients $a_0 = 1$, $a_{n-k} = 1$. The polynomial code generated by a polynomial $p(x)$ is the $(n, k)$-code whose codewords are polynomials of degree less than $n$ divisible by $p(x)$. A message $m(x)$ is encoded as
$$v(x) = r(x) + x^{n-k}m(x),$$
where $r(x)$ is the remainder in the division of the polynomial $x^{n-k}m(x)$ by $p(x)$.
Check the claimed properties first. By the very definition of the codeword $v(x)$ of a message $m(x)$:
$$v(x) = r(x) + x^{n-k}m(x) = r(x) + q(x)p(x) + r(x) = q(x)p(x),$$
where $q(x)$ is the quotient in the division, since the sum of two identical polynomials is always zero over $\mathbb{Z}_2$. Therefore, all codewords are divisible by $p(x)$.
On the other hand, if $v(x)$ is divisible by $p(x)$, the above calculation can be read from right to left (setting $r(x) = x^{n-k}m(x) - q(x)p(x)$), and so it is a codeword created by the above procedure.
From the definition, the codeword is created by adding $n - k$ bits given by $r(x)$ at the beginning of the word (and simply shifting the message to the right by that). It follows that the original message is contained in the polynomial $v(x)$ and the decoding is easy.
Consider the two simple examples, already mentioned. First, note that $p(x) = 1 + x$ divides $v(x)$ if and only if $v(1) = 0$. This occurs if and only if $v(x)$ has an even number of nonzero coefficients. So the polynomial $p(x) = 1 + x$ generates the parity check $(n, n-1)$-code for any $n \ge 2$.
Similarly, it is easily verified that the polynomial
$$p(x) = 1 + x + \dots + x^{n-1}$$
generates the $(n, 1)$-code of $n$-fold bit repetition. Dividing the polynomial $b_0 x^{n-1}$ by $p(x)$ gives the remainder $b_0(1 + x + \dots + x^{n-2})$, so the corresponding codeword is $b_0\,p(x)$.
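The encoding rule $v(x) = r(x) + x^{n-k}m(x)$ can be sketched in a few lines. The representation below is an assumption made for this illustration: a polynomial over $\mathbb{Z}_2$ is stored as a Python integer whose bit i is the coefficient of $x^i$, so that addition is the XOR operation. The function names are hypothetical.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (bit i = coefficient of x^i)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # cancel the leading term of a by adding a shifted copy of p
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a

def encode(message: int, p: int, n: int) -> int:
    """Codeword v(x) = r(x) + x^(n-k) m(x) of the polynomial code generated by p(x)."""
    shift = p.bit_length() - 1               # n - k = deg p
    shifted = message << shift               # x^(n-k) m(x)
    if shifted.bit_length() > n:
        raise ValueError("message too long for this code")
    return shifted ^ poly_mod(shifted, p)

def bits(v: int, n: int) -> str:
    """Coefficients of 1, x, ..., x^(n-1) written as a string."""
    return "".join(str((v >> i) & 1) for i in range(n))

# p(x) = 1 + x: parity check (3,2)-code, all four messages
print([bits(encode(m, 0b11, 3), 3) for m in range(4)])
# p(x) = 1 + x + x^2: the (3,1)-code which repeats the bit three times
print([bits(encode(m, 0b111, 3), 3) for m in range(2)])
```

Every printed word of the first list has an even number of ones, and the second list consists of the two constant words, in agreement with the two examples above.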
12.4.4. Error detection. Let $e(x)$ denote the error vector. That is, the difference between the original message $v \in (\mathbb{Z}_2)^n$ and the received data $u$:
$$u(x) = v(x) + e(x).$$
The error is detected if and only if the generator of the code (i.e. the polynomial $p(x)$) does not divide $e(x)$. Therefore, polynomials in $\mathbb{Z}_2[x]$ which do not happen to be divisors too often are of interest.
Definition. An irreducible polynomial $p(x) \in \mathbb{Z}_2[x]$ of degree $m$ is said to be primitive if and only if $p(x)$ divides $(1 + x^k)$ for $k = 2^m - 1$, but not for any smaller value of $k$.
Theorem. Let $p(x)$ be a primitive polynomial of degree $m$ and $n \le 2^m - 1$. Then the polynomial $(n, n - m)$-code generated by $p(x)$ detects all single and double errors.
Proof. If exactly one error occurs, then $e(x) = x^i$ for some $i$, $0 \le i < n$. Since $p(x)$ is irreducible, it cannot have a root in $\mathbb{Z}_2$. In particular, it cannot divide $x^i$ since the factorization of $x^i$ is unique. It follows that every single error is detected.
If exactly two errors occur, then
$$e(x) = x^i + x^j = x^i(1 + x^{j-i})$$
for some $0 \le i < j < n$. The polynomial $p(x)$ does not divide any $x^i$, and since it is primitive, it does not divide $1 + x^{j-i}$ either, provided $j - i < 2^m - 1$. At the same time, $p(x)$ is irreducible, which means that it does not divide the product $e(x) = x^i(1 + x^{j-i})$, which completes the proof. □
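For polynomials of low degree, the definition of primitivity can be checked by brute force. The following sketch is not the method of the text (which refers to the theory of finite fields for efficient tools); it assumes the same integer representation of polynomials over $\mathbb{Z}_2$ as before (bit i is the coefficient of $x^i$), tests irreducibility by trial division, and finds the least k with $p(x)$ dividing $1 + x^k$ by repeated multiplication by x.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (bit i = coefficient of x^i)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a

def is_irreducible(p: int) -> bool:
    """Trial division by every polynomial of degree 1, ..., deg(p) // 2."""
    m = p.bit_length() - 1
    if m < 1:
        return False
    return all(poly_mod(p, d) != 0 for d in range(2, 1 << (m // 2 + 1)))

def order_of_x(p: int) -> int:
    """Least k >= 1 such that p(x) divides 1 + x^k; needs deg p >= 1 and p(0) = 1."""
    if p < 2 or p & 1 == 0:
        raise ValueError("need a polynomial of degree >= 1 with constant term 1")
    k, power = 1, poly_mod(0b10, p)          # power = x^k mod p
    while power != 1:
        power = poly_mod(power << 1, p)      # multiply by x and reduce
        k += 1
    return k

def is_primitive(p: int) -> bool:
    m = p.bit_length() - 1
    return p & 1 == 1 and is_irreducible(p) and order_of_x(p) == 2 ** m - 1

print(is_primitive(0b1011))    # 1 + x + x^3
print(is_primitive(0b10011))   # 1 + x + x^4
print(is_primitive(0b11111))   # 1 + x + x^2 + x^3 + x^4
```

The last polynomial is irreducible, but it divides $1 + x^5$, so it is not primitive; the first two are.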
12.4.5. Corollary. Let $q(x)$ be a primitive polynomial of degree $m$ and $n \le 2^m - 1$. Then the polynomial $(n, n - m - 1)$-code generated by the polynomial $p(x) = q(x)(1 + x)$ detects all double errors as well as all errors with an odd number of bit flips.
Proof. The codewords generated by the chosen polynomial p(x) are divisible by both x + 1 and the primitive polynomial q(x). As verified, the factor x + 1 is responsible for parity checking, i.e. all codewords have an even number of non-zero bits. This detects any odd number of errors. By the above theorem, the second factor is able to detect all double errors. □
The following table illustrates the power of the above theorems for several primitive polynomials of low degrees. For instance, the last row says that by adding only 11 redundant bits to a message of length 1012, and employing the polynomial $(x + 1)p(x)$, all single, double, triple, and odd-numbered errors in the transfer can be detected. The number $2^{1012}$ of such messages is already quite large, with over 300 decimal digits.
e., our parametrization of the circle) can be described as:
$$(t : 1) \mapsto \left(\frac{2t}{t^2+1} : \frac{t^2-1}{t^2+1} : 1\right) = (2t : t^2 - 1 : t^2 + 1).$$
For $t \ne 0$, we have $(t : 1) = (2t^2 : 2t)$, and the original stereographic projection (i.e., the inverse of the above mapping) can be written linearly as
$$(x : y : z) \mapsto (y + z : x),$$
which extends our parametrization to the improper point $(1 : 0) \mapsto (0 : 1 : 1)$. Then, the mapping of $P_1(\mathbb{R})$ onto the circle has the homogeneous form
$$P_1(\mathbb{R}) \ni (x : y) \mapsto (2xy : x^2 - y^2 : x^2 + y^2) \in P_2(\mathbb{R}).$$
Now, let us look at how simple it is to calculate the formula for the stereographic projection in the projective extensions directly (see 4.3.1): We include $P_1(\mathbb{R})$ as points with homogeneous coordinates $(t : 0 : 1)$, and among the linear combinations of the point $(0 : 1 : -1)$ (i.e., the pole from which we project) and $(x : y : z)$ (a general point of the circle), we must find the one whose coordinates are $(u : 0 : v)$. The only possibility is the point $(x : 0 : z + y)$, which is our previous formula.
J. Elliptic curve
A singular point of a hypersurface in $P_n$, defined by a homogeneous polynomial
$$F(x_0, x_1, \dots, x_n) = 0,$$
is a point which satisfies $\frac{\partial F}{\partial x_i} = 0$ for $i = 0, \dots, n$.
From the geometric point of view, "something weird" happens at the point. In the case of a curve in the projective space $P_2(\mathbb{R})$, the condition that the partial derivatives must be zero means that there is no tangent line to the curve at the given point. This means that the curve has the so-called cusp there or it intersects itself. A "nice" singularity can be seen in the "quatrefoil", i.e., the variety given by the zero points of the polynomial $(x^2 + y^2)^3 - 4x^2y^2$ in $\mathbb{R}^2$:
primitive polynomial p(x)	redundant bits	codeword length
1 + x	1	1
1 + x + x^2	2	3
1 + x + x^3	3	7
1 + x + x^4	4	15
1 + x^2 + x^5	5	31
1 + x + x^6	6	63
1 + x^3 + x^7	7	127
1 + x^2 + x^3 + x^4 + x^8	8	255
1 + x^4 + x^9	9	511
1 + x^3 + x^10	10	1023
Note that quite strong results on the divisibility are used for the decomposition of polynomials derived in the second part of this chapter. But tools which would assist in constructing primitive polynomials are not mentioned.
Such tools come from the theory of finite fields. The name "primitive" reflects the connection to the primitive elements in the Galois fields $GF(2^m)$. This theory also provides a convenient way of applying the Euclidean division, that is, of verifying whether or not the received word is a codeword,
The cusp can be found on the curve in $\mathbb{R}^2$ given by $x^3 - y^2 = 0$.
An elliptic curve $C$ is the set of points in $\mathbb{K}^2$, where $\mathbb{K}$ is a given field, which satisfy an equation of the form
$$y^2 = x^3 + ax + b,$$
where $a, b \in \mathbb{K}$. In addition, we require that there are no singularities, which means, over the field of real numbers, that
$$\Delta = -16(4a^3 + 27b^2) \ne 0.$$
This expression $\Delta$ is called the discriminant of the equation. Note that the right-hand side contains a cubic polynomial without the quadratic term. This form of the equation is called the Weierstrass equation of an elliptic curve.
using the delayed registers. This is a simple circuit with as many elements as the degree of the polynomial.¹
12.4.6. Linear codes. Polynomial codes can also be described using elementary matrix calculus. Recall that when working over the field $\mathbb{Z}_2$, caution is required when applying the results of elementary linear algebra, since there the property that $v = -v$ implies $v = 0$ is often used. This is not the case now.
However, the basic definition of vector spaces, existence of bases and descriptions of linear mappings by matrices are still valid. It is useful to recall the general theory and its applicability.
Start with a more general definition of codes, which only requires linear dependency of the codeword on the original message:
Linear codes
Any injective linear mapping $g : (\mathbb{Z}_2)^k \to (\mathbb{Z}_2)^n$ is a linear code. The matrix $G$ with $n$ rows and $k$ columns that corresponds to this mapping (in the canonical bases) is called the generating matrix of the code.
For each message $v$, the corresponding codeword is given by
$$u = G \cdot v.$$
Theorem. Every polynomial (n, k)-code is a linear code.
Proof. Use elementary properties of Euclidean division. Apply the assignment of the polynomial $v(x) = r(x) + x^{n-k}m(x)$ determined by the original message $m(x)$ to the sum of two messages $m(x) = m_1(x) + m_2(x)$. The remainder in the division of $x^{n-k}(m_1(x) + m_2(x))$ by $p(x)$ is, by uniqueness, given as the sum $r_1(x) + r_2(x)$ of the remainders for the individual messages. It follows that
$$v(x) = r_1(x) + r_2(x) + x^{n-k}(m_1(x) + m_2(x)),$$
which is the desired additivity. Since the only non-zero scalar in $\mathbb{Z}_2$ is 1, the linearity of the mapping of the message $m(x)$ to the longer codeword $v(x)$ is proved. Moreover, this mapping is clearly injective, since the original message $m(x)$ is simply copied beyond the redundant bits. □
For instance, consider the (6,3)-code generated by the polynomial $p(x) = 1 + x + x^3$, for encoding 3-bit words. Evaluate it on the individual basis vectors $m_i(x) = x^i$, $i = 0, 1, 2$, to get
$$v_0 = (1 + x) + x^3, \quad v_1 = (x + x^2) + x^4, \quad v_2 = (1 + x + x^2) + x^5.$$
¹ More about the beautiful theory and its connection with codes can be found in the book Gilbert, W., Nicholson, K., Modern Algebra and its Applications, John Wiley & Sons, 2nd edition, 2003, 330+xvii pp., ISBN 0-471-41451-4.
12.J.1. Prove that the curve $y^2 = x^3 + ax + b$ in $\mathbb{R}^2$ has a singularity if and only if $4a^3 + 27b^2 = 0$.
Solution. The equation of the curve in homogeneous coordinates (see 4.3.1) is $F(x, y, z) = 0$, where
$$F(x, y, z) = y^2z - x^3 - axz^2 - bz^3. \qquad (1)$$
We have
$$\frac{\partial F}{\partial x} = -3x^2 - az^2, \qquad \frac{\partial F}{\partial y} = 2yz, \qquad \frac{\partial F}{\partial z} = y^2 - 2axz - 3bz^2.$$
Let $[x, y, z]$ be a singular point of the given curve. If $z = 0$, then since the partial derivatives of $F$ with respect to $x$ and $z$ must be zero, we get $x = 0$ and $y = 0$, respectively. However, this is impossible, because the point $[0, 0, 0]$ does not lie in the considered projective space $P_2(\mathbb{R})$. Thus, every singular point has $z \ne 0$, so that $\frac{\partial F}{\partial y} = 0$ implies $y = 0$. Denoting $\gamma = \frac{x}{z}$, the equation $-3x^2 - az^2 = 0$ implies $3\gamma^2 = -a$, and $y^2 - 2axz - 3bz^2 = 0$ implies $2a\gamma = -3b$. We can see that the equality $a = 0$ implies that $b = 0$, i.e., the equality $4a^3 + 27b^2 = 0$ is satisfied trivially. If $a \ne 0$, then we can express $\gamma$ from the two obtained equations: the first one gives $\gamma^2 = -\frac{a}{3}$, and the second one gives $\gamma = -\frac{3b}{2a}$. Altogether,
$$-\frac{a}{3} = \gamma^2 = \frac{9b^2}{4a^2} \implies 4a^3 + 27b^2 = 0.$$
Thus, we have proved one of the implications. On the other hand, if $4a^3 + 27b^2 = 0$, then defining $\gamma = -\frac{3b}{2a}$ (and $\gamma = 0$ if $a = 0$), we can see that the point $[\gamma, 0, 1]$ satisfies the equation of the elliptic curve:
$$\gamma^3 + a\gamma + b = \gamma(\gamma^2 + a) + b = -\frac{3b}{2a}\left(\frac{9b^2}{4a^2} + a\right) + b = -\frac{3b}{2a}\cdot\frac{2a}{3} + b = 0.$$
Thanks to the choice of $\gamma$, all the three partial derivatives of $F$ at the point $[\gamma, 0, 1]$ are zero. □
In order to define a group operation on the points of an elliptic curve, it is useful to consider the curve in the projective extension of the plane (see 4.3.1), and we define a point $O \in C$ as the direction $(0, 1)$ (which is the point $[0, 1, 0]$ in the homogeneous coordinates). Then, the addition of two points $A, B \in C$ is geometrically defined as the point $-C$, where $C$ is the third intersection point of the line $AB$ with the elliptic
It follows that the generating matrix of this (6,3)-code is
$$G = \begin{pmatrix} 1&0&1\\ 1&1&1\\ 0&1&1\\ 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix}.$$
Polynomial codes always copy the original message beyond the redundant bits. So the generating matrix can be split into two blocks $P$ and $Q$ consisting respectively of $n - k$ and $k$ rows. Then $Q$ equals the identity matrix $I_k$.
12.4.7. Theorem. Let $g : (\mathbb{Z}_2)^k \to (\mathbb{Z}_2)^n$ be a linear code with generating matrix $G$, written in blocks as
$$G = \begin{pmatrix} P \\ I_k \end{pmatrix}.$$
Then, the mapping $h : (\mathbb{Z}_2)^n \to (\mathbb{Z}_2)^{n-k}$ with the matrix $H = \begin{pmatrix} I_{n-k} & P \end{pmatrix}$
has the following properties:
(1) Ker h = Im g,
(2) a received word u is a codeword if and only if H ■ u = 0.
Proof. The composition $h \circ g : (\mathbb{Z}_2)^k \to (\mathbb{Z}_2)^{n-k}$ is given by the product of matrices (computing over $\mathbb{Z}_2$):
$$H \cdot G = \begin{pmatrix} I_{n-k} & P \end{pmatrix}\begin{pmatrix} P \\ I_k \end{pmatrix} = P + P = 0.$$
Hence it is proved that $\operatorname{Im} g \subseteq \operatorname{Ker} h$. Since the first $n - k$ columns of $H$ are basis vectors of $(\mathbb{Z}_2)^{n-k}$, the image $\operatorname{Im} h$ has the maximum dimension $n - k$, which means that this image contains $2^{n-k}$ vectors. Vector spaces over $\mathbb{Z}_2$ are finite commutative groups, so the formula relating the orders of subgroups and quotient groups from subsection 12.3.10 can be used, thus obtaining
$$|\operatorname{Ker} h| \cdot |\operatorname{Im} h| = |(\mathbb{Z}_2)^n| = 2^n.$$
Therefore, the number of vectors in $\operatorname{Ker} h$ is equal to $2^n \cdot 2^{k-n} = 2^k$. In order to complete the proof of the first proposition, it suffices to note that the image $\operatorname{Im} g$ also has $2^k$ elements.
The second proposition is a trivial corollary of the first one. □
The matrix $H$ from the theorem is called the parity check matrix of the corresponding linear $(n, k)$-code.
For instance, the matrix $H = (1\ 1\ 1)$ is the parity check matrix for the parity check (3,2)-code, encoding 2-bit words. It is easily obtained from the matrix
$$G = \begin{pmatrix} 1&1\\ 1&0\\ 0&1 \end{pmatrix}$$
that generates this code.
curve. If A = B, then the result is given by the other intersection point with the tangent line of the elliptic curve that goes through A.
12.J.2. Prove that the above definition correctly defines an operation on the points of an elliptic curve.
Solution. The intersections of the line with the elliptic curve are obtained as the roots of a cubic equation. If it has two real roots, corresponding to the points $A$ and $B$, then it must have a third real root as well, i.e., the line $AB$ must have another intersection point with the curve. In the case of a tangent line, the point $A$ corresponds to a double root, so there also exists another intersection point. As for improper points (the last homogeneous coordinate is zero; they correspond to directions in the plane), the only improper point that belongs to the curve given by the equation (1) is the point $O = [0, 1, 0]$. Addition with the point $O$ means looking for a second intersection of the elliptic curve (besides the point $A$ itself) and the line which goes through $A$ and is parallel to the $y$-axis. The improper line $z = 0$ has the triple intersection point $O$ with the curve, i.e., $O + O = O$. □
Remark. Thus, the operation is well-defined. Moreover, it follows directly from the definition that it is commutative. It even follows from the above that O is a neutral element of the operation. However, the proof of associativity is far from trivial.
12.J.3.   Define the above operation algebraically.
Solution. For any point $A \in C$, we define $A + O = O + A = A$.
For a point $A \in C$, $A = [\alpha, \beta, 1]$, and the point $B = [\alpha, -\beta, 1]$, we clearly have $B \in C$, and we define $A + B = O$, i.e., $A = -B$.
For the (6,3)-code mentioned above, the parity check matrix is
$$H = \begin{pmatrix} 1&0&0&1&0&1\\ 0&1&0&1&1&1\\ 0&0&1&0&1&1 \end{pmatrix}.$$
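The relation between the two matrices can be checked numerically. In the sketch below, the generating matrix G is typed in by hand from the codewords $v_0, v_1, v_2$ of the (6,3)-code (its columns are their coefficient vectors), H is assembled as the block matrix consisting of the identity matrix followed by P, and all products are reduced modulo 2. The helper names mat_mul and mat_vec are illustrative only.

```python
# Generating matrix of the (6,3)-code: the columns are v_0, v_1, v_2.
G = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
n, k = len(G), len(G[0])
P = G[: n - k]                               # the upper block of G

# Parity check matrix H = (I_{n-k} | P).
H = [[int(i == j) for j in range(n - k)] + P[i] for i in range(n - k)]

def mat_vec(M, v):
    """Product of a matrix and a vector over Z_2."""
    return [sum(a * b for a, b in zip(row, v)) % 2 for row in M]

def mat_mul(A, B):
    """Product of two matrices over Z_2."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) % 2 for col in columns] for row in A]

print(H)
print(mat_mul(H, G))                         # the zero matrix: Im g lies in Ker h
print(mat_vec(H, mat_vec(G, [0, 0, 1])))     # a codeword has zero syndrome
```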
12.4.8. Error-correcting codes. As seen, transferring a message u gives the result
v = u + e.
Over Z2, this is equivalent to e = u + v.
It follows that if the error $e$ is fixed, all the received words determined by the correct codewords $u$ fill one of the cosets in the quotient space $(\mathbb{Z}_2)^n / V$, where $V \subseteq (\mathbb{Z}_2)^n$ is the vector subspace of the codewords. The mapping $h : (\mathbb{Z}_2)^n \to (\mathbb{Z}_2)^{n-k}$ corresponding to the parity check matrix has $V$ as its kernel. Therefore, it induces the injective linear mapping $\bar{h} : (\mathbb{Z}_2)^n / V \to (\mathbb{Z}_2)^{n-k}$. Clearly, the value $\bar{h}(v + V)$ on the coset generated by $v$ is determined uniquely by the value $H \cdot v$.
Syndromes
The expression H ■ v, where H is the parity check matrix for the considered linear code, is called the syndrome of v.
The following claim is a direct corollary of the construction and the above observations:
Theorem. Two words are in the same class u + V if and only if they have the same syndromes.
It follows that self-correcting codes can be constructed by choosing, for every syndrome, the element of the corresponding coset which is most likely to be the sent codeword. Naturally, when choosing the code, it is desirable to maximize the probability that it can correct single errors (and possibly even more errors).
Try it on the example of the (6,3)-code for which the matrices G and H are already computed. Build the table of all syndromes and the corresponding words.
The syndrome 000 is possessed exactly by the codewords. All words with a given different syndrome are obtained by choosing one of them and adding all the proper codewords.
The following two tables display the syndromes in the first rows; the second rows then display the vector which has the least number of ones among the vectors of the corresponding coset. In almost all cases, this vector contains a single one; only in the last column are there two ones, and the element is chosen where the ones are adjacent (because, for instance,
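For a point $A \ne -B$, $A = [\alpha, \beta, 1]$ and a point $B \in C$, $B = [\gamma, \delta, 1]$, we set
$$k = \begin{cases} \dfrac{\delta - \beta}{\gamma - \alpha} & \text{for } A \ne B, \\[1ex] \dfrac{3\alpha^2 + a}{2\beta} & \text{for } A = B, \end{cases} \qquad \sigma = k^2 - \alpha - \gamma, \qquad \tau = -\beta + k(\alpha - \sigma).$$
Then, we define $A + B = [\sigma, \tau, 1]$. We leave it for the reader to verify that this is indeed the operation that we have defined geometrically. □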
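The algebraic definition translates directly into a short program. The following sketch works over the rational numbers with exact fractions; the curve $y^2 = x^3 - 7x + 10$ and the points (1, 2) and (3, 4) are chosen only for illustration and do not come from the text. The improper point O is represented by None, and proper points by pairs (x, y).

```python
from fractions import Fraction

O = None  # the improper point [0, 1, 0]

def on_curve(point, a, b) -> bool:
    """Does the point satisfy y^2 = x^3 + a x + b?"""
    if point is O:
        return True
    x, y = point
    return y * y == x ** 3 + a * x + b

def add(A, B, a):
    """Sum of two points of the curve y^2 = x^3 + a x + b."""
    if A is O:
        return B
    if B is O:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and y1 == -y2:
        return O                                  # A = -B
    if A == B:
        k = Fraction(3 * x1 * x1 + a) / (2 * y1)  # slope of the tangent line
    else:
        k = Fraction(y2 - y1) / (x2 - x1)         # slope of the line AB
    x3 = k * k - x1 - x2
    y3 = -y1 + k * (x1 - x3)
    return (x3, y3)

a, b = -7, 10
A, B = (1, 2), (3, 4)
print(add(A, B, a), add(A, A, a))
# one instance of associativity: (A + A) + B versus A + (A + B)
print(add(add(A, A, a), B, a) == add(A, add(A, B, a), a))
```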
K. Gröbner bases
12.K.1. Is the basis $g_1 = x^2$, $g_2 = xy + y^2$ a Gröbner basis for the lexicographic ordering $x > y$? If not, find one.
Solution. Clearly, the leading monomials are $LT(g_1) = x^2$, $LT(g_2) = xy$, so the S-polynomial is equal to
$$S(g_1, g_2) = y g_1 - x g_2 = -xy^2.$$
By theorem 12.5.12, $g_1, g_2$ is a Gröbner basis if and only if the remainder in the multivariate division of this S-polynomial by the basis polynomials is zero. Performing this division (see 12.5.6), we obtain
$$S(g_1, g_2) = y g_1 - x g_2 = -y g_2 + y^3.$$
The remainder $y^3$ shows that $g_1, g_2$ do not form a Gröbner basis. By 12.5.13, in order to get one, we must add the remainder polynomial $g_3 = y^3$ to $g_1, g_2$. Now, we calculate that
$$S(g_1, g_3) = y^3 g_1 - x^2 g_3 = 0$$
and
$$S(g_2, g_3) = y^2 g_2 - x g_3 = y^4 = y g_3.$$
Hence it follows by theorem 12.5.12 that $g_1, g_2, g_3$ is already a Gröbner basis. □
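The computation of this exercise can be replayed with a small hand-written routine. This is a minimal sketch rather than an efficient implementation: a polynomial in x, y is assumed to be stored as a dictionary mapping exponent pairs (i, j) to the coefficients of $x^i y^j$, so that the ordinary comparison of tuples is exactly the lexicographic ordering $x > y$. All function names are illustrative.

```python
from fractions import Fraction

def add_scaled(p, q, coeff, mono):
    """Return p + coeff * (monomial with exponents mono) * q, dropping zero terms."""
    r = dict(p)
    for m, c in q.items():
        key = tuple(i + j for i, j in zip(m, mono))
        r[key] = r.get(key, 0) + coeff * c
    return {m: c for m, c in r.items() if c != 0}

def leading(p):
    """Exponents and coefficient of the leading term (lexicographic ordering)."""
    m = max(p)
    return m, p[m]

def s_polynomial(f, g):
    (mf, cf), (mg, cg) = leading(f), leading(g)
    lcm = tuple(max(i, j) for i, j in zip(mf, mg))
    first = add_scaled({}, f, Fraction(1) / cf, tuple(l - i for l, i in zip(lcm, mf)))
    return add_scaled(first, g, -Fraction(1) / cg, tuple(l - i for l, i in zip(lcm, mg)))

def remainder(f, basis):
    """Remainder in the multivariate division of f by the polynomials of the basis."""
    f, rem = dict(f), {}
    while f:
        m, c = leading(f)
        for g in basis:
            mg, cg = leading(g)
            if all(i >= j for i, j in zip(m, mg)):   # LT(g) divides the leading term
                f = add_scaled(f, g, -Fraction(c) / cg, tuple(i - j for i, j in zip(m, mg)))
                break
        else:                                        # no leading term divides it
            rem[m] = c
            del f[m]
    return rem

def is_groebner(basis):
    """Test used in the solution: all S-polynomials have zero remainder."""
    return all(not remainder(s_polynomial(f, g), basis)
               for i, f in enumerate(basis) for g in basis[i + 1:])

g1 = {(2, 0): 1}                 # x^2
g2 = {(1, 1): 1, (0, 2): 1}      # xy + y^2
g3 = {(0, 3): 1}                 # y^3
print(remainder(s_polynomial(g1, g2), [g1, g2]))
print(is_groebner([g1, g2]), is_groebner([g1, g2, g3]))
```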
12.K.2. Is the basis $g_1 = xy - 2y$, $g_2 = y^2 - x^2$ a Gröbner basis for the lexicographic ordering $y > x$? If not, find one.
Solution. Since $LT(g_1) = xy$ and $LT(g_2) = y^2$, the corresponding S-polynomial is $S(g_1, g_2) = y g_1 - x g_2 = x^3 - 2y^2 = -2g_2 + x^3 - 2x^2$. The leading term $x^3$ is a multiple of neither $xy$ nor $y^2$, which means that $g_1, g_2$ do not form a Gröbner basis. We can obtain one by adding the polynomial $g_3 = x^3 - 2x^2$. Then, we have
$$S(g_1, g_3) = x^2 g_1 - y g_3 = 0$$
and
multiple errors are more likely to be adjacent).
000	100	010	001
000000	100000	010000	001000
110100	010100	100100	111100
011010	111010	001010	010010
111001	011001	101001	110001
101110	001110	111110	100110
001101	101101	011101	000101
100011	000011	110011	101011
010111	110111	000111	011111
			
110	011	111	101
000100	000010	000001	000110
110000	110110	110101	110010
011110	011000	011011	011100
111101	111011	111000	111111
101010	101100	101111	101000
001001	001111	001100	001011
100111	100001	100010	100101
010011	010101	010110	010001
All the columns in the tables are affine subspaces whose modelling vector spaces are always the first column of the first table. This is because the code is linear, so that the set of all codewords forms a vector space, and the individual cosets of the quotient space are consequently affine subspaces.
In particular, the difference of each pair of words in the same column is a codeword. The words in bold are the leading representatives of the coset (affine space) that correspond to the given syndromes. These are the words with the least number of ones in their column. They correspond to the least number of bit flips which must be made to any word in the given column in order to get a codeword.
For instance, if a word 111101 is received, compute that its syndrome is 110. The leading representative of the coset of this syndrome is 000100. Subtract it from the received word, to obtain the codeword 111001. This is the codeword with the least Hamming distance from the received word. So the original message is most likely to be 001.
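The whole decoding procedure for the (6,3)-code fits into a few lines. The sketch below assumes the parity check matrix H of this code as computed above; the table of leading representatives is built by going through all words in the order of increasing number of ones. Ties between words with the same number of ones are broken by the enumeration order, not by the adjacency heuristic of the text, although for this code the two choices agree.

```python
from itertools import product

H = [
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1],
]

def syndrome(word):
    """The product H . word over Z_2, returned as a tuple."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)

# For every syndrome, remember the first word met when the words are
# sorted by their number of ones: a leading representative of the coset.
leaders = {}
for word in sorted(product((0, 1), repeat=6), key=sum):
    leaders.setdefault(syndrome(word), word)

def decode(word):
    """Return the nearest codeword and the message (its last three bits)."""
    error = leaders[syndrome(word)]
    codeword = tuple(w ^ e for w, e in zip(word, error))
    return codeword, codeword[3:]

received = (1, 1, 1, 1, 0, 1)
print(syndrome(received), decode(received))
```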
5. Systems of polynomial equations
In practical problems, objects or actions are often encountered which are described in terms of polynomials or systems of polynomial equations.
For instance, the set of points in $\mathbb{R}^3$ defined by the two equations $x^2 + y^2 - 1 = 0$ and $z = 0$ is the circle which is centered at $(0, 0, 0)$, has radius 1 and lies in the plane $xy$.
Similarly, equations $xz = 0$ and $yz = 0$ considered in $\mathbb{R}^3$ define the union of the line $x = 0$, $y = 0$ and the plane $z = 0$. Notice we have to specify the space carefully, since $x^2 + y^2 - 1 = 0$ defines a circle in $\mathbb{R}^2$, but it is a cylinder if viewed in $\mathbb{R}^3$.
$$S(g_2, g_3) = x^3 g_2 - y^2 g_3 = 2x^2y^2 - x^5 = (4y + 2xy)g_1 - (x^2 + 2x + 4)g_3 + 8g_2.$$ □
12.K.3. Eliminate variables in the ideal
$$I = (x^2 + y^2 + z^2 - 1,\ x^2 + y^2 + z^2 - 2x,\ 2x - y - z).$$
Solution. The variable elimination is obtained by finding a Gröbner basis with respect to the lexicographic monomial ordering. Let us denote the generating polynomials of $I$ as $g_1, g_2, g_3$, respectively. The reduction $g_2 = g_1 + 1 - 2x$ yields the reduced polynomial $f_1 = 2x - 1$. Now, we use this polynomial to reduce $g_3 = f_1 + 1 - y - z$ to $f_2 = y + z - 1$. Now, we reduce $g_1$, dividing it by $f_1$ and $f_2$, which leads to
$$g_1 = \left(\tfrac{1}{2}x + \tfrac{1}{4}\right)f_1 + y^2 + z^2 - \tfrac{3}{4}$$
and
$$y^2 + z^2 - \tfrac{3}{4} = (y - z + 1)f_2 + 2z^2 - 2z + \tfrac{1}{4}.$$
Hence, $f_3 = 8z^2 - 8z + 1$. We can see that polynomial reduction sufficed and we did not have to add any other polynomials. The basis of $I$ with eliminated variables is $I = (2x - 1,\ y + z - 1,\ 8z^2 - 8z + 1)$. □
12.K.4. Solve the following system of polynomial equations:
$$x^2y - z^3 = 0, \quad 2xy - 4z = 1, \quad z - y^2 = 0, \quad x^3 - 4yz = 0.$$
Solution. Using appropriate software, we can find out that the corresponding ideal
$$(x^2y - z^3,\ 2xy - 4z - 1,\ z - y^2,\ x^3 - 4yz)$$
has Gröbner basis $(1)$ with respect to the lexicographic monomial ordering, which means that the system has no solution.
□
12.K.5. Find the Gröbner basis of the variety in $\mathbb{R}^3$ defined parametrically by
$$x = 3u + 3uv^2 - u^3, \quad y = 3v + 3u^2v - v^3, \quad z = 3u^2 - 3v^2.$$
This is the so-called Enneper surface, and it is depicted in the picture on page 855.
Deciding whether or not a given point lies within a given body, finding extrema of algebraically defined subsets of multidimensional spaces, analyzing movements of parts of some machine, etc. are examples of such problems.
12.5.1. Affine varieties. For the sake of simplicity (existence of roots of polynomials), we work mainly over the field of complex numbers. Some ideas are extended to the case of a general field $\mathbb{K}$.
An affine $n$-dimensional space over a field $\mathbb{K}$ is understood to be $\mathbb{K}^n = \underbrace{\mathbb{K} \times \dots \times \mathbb{K}}_{n}$ with the standard affine structure (see the beginning of Chapter 4).
As seen, a polynomial $f = \sum_\alpha a_\alpha x^\alpha \in \mathbb{K}[x_1, \dots, x_n]$ can be viewed naturally as a mapping $f : \mathbb{K}^n \to \mathbb{K}$, defined by
$$f(u_1, \dots, u_n) := \sum_\alpha a_\alpha u^\alpha, \quad \text{where } u^\alpha = u_1^{\alpha_1} \cdots u_n^{\alpha_n}.$$
In dimension n = 1, the equality f(x) = 0 describes only finitely many points of K. In higher dimension, the similar equality f(x1,..., xn) = 0 describes subsets similar to curves in the plane or surfaces in the space. However, they may be of quite complicated and self-intersecting shapes.
For instance, the set given by the equation $(x^2 + y^2)^3 - 4x^2y^2 = 0$ looks like a quatrefoil (see the illustration in the beginning of the part J in the other column). Another illustration of a two-dimensional surface is given by Whitney's umbrella $x^2 - y^2z = 0$, which, besides the part shown in the diagram, also includes the line $\{x = 0, y = 0\}$.
The diagram was drawn using the parametric description $x = uv$, $y = v$, $z = u^2$, whence the implicit description $x^2 - y^2z = 0$ is easily guessed.
In the following illustration, there is the Enneper surface with parametrization $x = 3u + 3uv^2 - u^3$, $y = 3v + 3u^2v - v^3$, $z = 3u^2 - 3v^2$.
Solution. Applying the elimination procedure (e.g. in the MAPLE system, using gbasis with plex ordering), we obtain the corresponding implicit representation, i.e., an equation with a single polynomial of degree nine:
- 59049« - 104976z2 - 6561y2 - 72900z3 - 18954y2z-
- 23328z4 + 32805z2a;2 + 14580z3a;2 + 3645z4a;2 - 1296y4z-
- 16767y2z2 - 6156y2z3 - 783y2z4 + 39366za;2 + 19683a;2-
1296y4 - 2430z5 + 432z6 + 108z7 + 486z5a;2
432y4z2+
+ 54y2z5 + 27z6x2 - 48y4z3 + 15y2z6
64y6
□ As we illustrate in the following simple exercise, Gröbner bases can be used for solving integer optimization problems.
12.K.6. What is the minimum number of banknotes that are necessary to pay 77700 CZK? Solve this problem for three scenarios: First, assume that the banknotes at disposal are of values 100 CZK, 200 CZK, 500 CZK, 1000 CZK. Then, assume that there are also banknotes of value 2000 CZK. Finally, assume that there are no banknotes of value 2000 CZK, but there are banknotes of value 5000 CZK.
Solution. Let us denote the respective banknotes by variables $s, d, p, t, D, P$. The banknotes to be used will be represented as a polynomial in these variables so that the exponent of each variable determines the number of the corresponding banknotes. For instance, if we decide to use only the 100 CZK banknotes, then the polynomial will be $q = s^{777}$. If we pay with ten 1000 CZK banknotes, ten 500 CZK banknotes, and the remaining amount with 100 CZK banknotes, then the polynomial will be $q = t^{10}p^{10}s^{627}$. In the former case, the number of banknotes will be 777. In the latter case, it will be 10 + 10 + 627 = 647.
If we have only the banknotes $s, d, p, t$, then the ideal that describes the relation of the individual banknotes is
$$I_1 = (s^2 - d,\ s^5 - p,\ s^{10} - t).$$
In order to minimize the number of used banknotes, we compute the Gröbner basis with respect to the graded reverse lexicographic ordering (we want to eliminate the small banknotes):
$$G_1 = (p^2 - t,\ s^2 - d,\ d^3 - sp,\ sd^2 - p).$$
Now, we take any polynomial that represents a given choice of banknotes. Reducing this polynomial with respect to the basis $G_1$, we get a polynomial whose degree is minimal for our
[Illustration: the Enneper surface.]
It is hard to imagine how to obtain the implicit description from this parametrization by hand. Nevertheless, there is an algorithm to eliminate the variables $u$ and $v$ from these three equations. Quite a complex theory needs to be developed for that. As usual, begin by formalization of the objects of interest.
Affine varieties
Let $f_1, \dots, f_s \in \mathbb{K}[x_1, \dots, x_n]$. The affine variety in $\mathbb{K}^n$ corresponding to the set of polynomials $f_1, \dots, f_s$ is the set
$$\mathfrak{V}(f_1, \dots, f_s) = \{(a_1, \dots, a_n) \in \mathbb{K}^n;\ f_i(a_1, \dots, a_n) = 0,\ i = 1, \dots, s\}.$$
Affine varieties include conic sections, quadrics, and hyperquadrics, both singular and regular. Many curves and surfaces can be easily described as affine varieties.
The variety corresponding to a set of polynomials is the intersection of the varieties corresponding to the individual polynomials. For instance, $\mathfrak{V}(x^2 + y^2 - 1, z) \subset \mathbb{R}^3$ is the circle which is centered at $(0, 0, 0)$, has radius 1 and lies in the plane $xy$.
Similarly, $\mathfrak{V}(xz, yz) \subset \mathbb{R}^3$ is the union of the line $x = 0$, $y = 0$ and the plane $z = 0$, since it is exactly the points of these two objects where both the polynomials $xz$, $yz$ are zero.
These examples illustrate that it is not easy to deal with the concept of dimension. Is the mentioned line, added to the plane, enough for the variety to be considered three-dimensional, or should one keep considering it two-dimensional with a certain anomaly?
Verify the following straightforward proposition:
monomial ordering, and it is easy to show that it is the polynomial corresponding to the optimal choice. For instance, take $q = s^{777}$. Reduction with respect to $G_1$ yields $t^{77}pd$. This means that the optimal choice is seventy-seven 1000 CZK banknotes, one 500 CZK banknote, and one 200 CZK banknote. Altogether, it is 79 banknotes. In the second scenario, when we also have the banknote $D$, the ideal is $I_2 = (s^2 - d,\ s^5 - p,\ s^{10} - t,\ s^{20} - D)$ and its Gröbner basis is
$$G_2 = (t^2 - D,\ p^2 - t,\ s^2 - d,\ d^3 - sp,\ sd^2 - p).$$
Reduction of $q = s^{777}$ with respect to $G_2$ gives $D^{38}tpd$, so this time we pay with 41 banknotes. In the third scenario, we have $I_3 = (s^2 - d,\ s^5 - p,\ s^{10} - t,\ s^{50} - P)$ and
$$G_3 = (t^5 - P,\ p^2 - t,\ s^2 - d,\ d^3 - sp,\ sd^2 - p),$$
and the reduction is equal to $P^{15}t^2pd$. In this case, we need only 19 banknotes.
Of course, this simple problem can be solved quickly with common sense. However, the presented method of Grobner bases gives a universal algorithm which can be automatically used for higher amounts and other, more complicated cases.
□
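The three banknote counts can be cross-checked independently of Gröbner bases. The following Python sketch is not the method of the text: it uses ordinary dynamic programming, and it assumes (as the exponents in the ideals suggest) that s, d, p, t, D, P stand for banknotes of 100, 200, 500, 1000, 2000 and 5000 CZK, so that $q = s^{777}$ corresponds to 77 700 CZK. The name min_banknotes is hypothetical.

```python
def min_banknotes(amount, denominations):
    """Smallest number of banknotes that sum to `amount` (classic DP)."""
    INF = float("inf")
    best = [0] + [INF] * amount          # best[v] = fewest notes paying v
    for value in range(1, amount + 1):
        for d in denominations:
            if d <= value and best[value - d] + 1 < best[value]:
                best[value] = best[value - d] + 1
    return best[amount]

# Amounts are counted in units of 100 CZK, so 77 700 CZK is 777 units.
base = [1, 2, 5, 10]                         # s, d, p, t
print(min_banknotes(777, base))              # first scenario
print(min_banknotes(777, base + [20]))       # second scenario (with D)
print(min_banknotes(777, base + [50]))       # third scenario (with P)
```

Under these assumptions the three calls reproduce the counts 79, 41 and 19 found by the reductions.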
Gröbner bases have applications in robotics as well, in particular in inverse kinematics, where one must find how to set the individual joints of a robot so that it reaches a given position. This problem often leads to a system of nonlinear equations which can be solved by finding a Gröbner basis, as in the following problem.
12.K.7. Consider a simple robot, as shown in the picture, which consists of three straight parts of length 1 which are connected with independent joints that enable arbitrary angles $\alpha, \beta, \gamma$. We want this robot to grasp, from above, an object which lies on the ground at distance x. What values should the angles $\alpha, \beta, \gamma$ be set to? Draw the configuration of the robot for $x = 1$, $1.5$, and $\sqrt{3}$.
Solution. Consider a natural coordinate system, where the initial end of the robotic hand lies in the origin, and the ground corresponds to the x-axis. It follows from elementary trigonometry that the total x-range of the robot at angles $\alpha, \beta, \gamma$ is equal to
$x = \sin\alpha + \sin(\alpha + \beta) + \sin(\alpha + \beta + \gamma)$.
Similarly, the range of the robot in the vertical direction is
$y = \cos\alpha + \cos(\alpha + \beta) + \cos(\alpha + \beta + \gamma)$.
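Theorem. Let $V = \mathfrak{V}(f_1,\dots,f_s)$ and $W = \mathfrak{V}(g_1,\dots,g_t) \subseteq K^n$ be affine varieties. Then, $V \cup W$ and $V \cap W$ are also affine varieties, where
$V \cap W = \mathfrak{V}(f_1,\dots,f_s,g_1,\dots,g_t)$,
$V \cup W = \mathfrak{V}(f_ig_j)$ for $1 \le i \le s$, $1 \le j \le t$.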
In the following subsections, some questions which arise in the context of varieties are answered:
Q1. Is the set $\mathfrak{V}(f_1,\dots,f_s)$ empty?
Q2. Is the set $\mathfrak{V}(f_1,\dots,f_s)$ finite?
Q3. How to understand the concept of dimension for varieties?
All of these problems can be "reasonably" solved for varieties over the complex numbers (as well as over any algebraically closed field); it is more difficult for the real numbers and nearly impossible for general fields. For instance, over the rational numbers, the question whether $\mathfrak{V}(x^n + y^n - z^n)$ contains any non-trivial points leads to the well-known Fermat's last theorem, mentioned many times in Chapter 11.
12.5.2. Parametrization. For some purely practical operations with varieties, it is convenient to use the implicit representation (the one used so far). For instance, deciding whether a given point lies in a given variety, or inside the space enclosed by it, is quite easy using the implicit description. On the other hand, the parametric description may also come in handy in many situations (for example, it was used to draw the diagram).
The variety $\mathfrak{V}(x + y + z - 1,\ x + 2y - z - 3)$ is a line (the intersection of two planes). If the system
$x + y + z - 1 = 0$, $x + 2y - z - 3 = 0$
is solved, the parametric description of this line is immediate:
$x = -1 - 3t$, $y = 2 + 2t$, $z = t$,
as known from affine geometry. One needs to be more careful in general:
Rational parametrization
Definition. A rational parametric representation of a variety $\mathfrak{V}(f_1,\dots,f_r) \subseteq K^n$ is a set of rational functions $r_1,\dots,r_n \in K(t_1,\dots,t_s)$ such that:
• for each s-tuple $t_1,\dots,t_s$, the point $(x_1,\dots,x_n)$ defined by $x_i = r_i(t_1,\dots,t_s)$ for $i = 1,2,\dots,n$ is in $\mathfrak{V}(f_1,\dots,f_r)$;
• $\mathfrak{V}(f_1,\dots,f_r)$ is the minimal affine variety which contains these points $(x_1,\dots,x_n)$.
Note that the parametrization is not required to describe all the points of the variety. This is important, as seen by a
The condition of grasping the object from above is clearly equivalent to the condition $\alpha + \beta + \gamma = \pi$, so the statement of the problem leads to the system
$\sin\alpha + \sin(\alpha + \beta) = x$, $\cos\alpha + \cos(\alpha + \beta) - 1 = 0$.
In order to transform this system to a system of polynomial equations, we introduce new variables $s_1 = \sin\alpha$, $c_1 = \cos\alpha$, $s_2 = \sin\beta$, $c_2 = \cos\beta$, which of course satisfy $s_1^2 + c_1^2 = 1$ and $s_2^2 + c_2^2 = 1$. Using the addition formulas for sine and cosine, we obtain the following equivalent system of polynomial equations:
$s_1 + s_1c_2 + c_1s_2 - x = 0$, $c_1 + c_1c_2 - s_1s_2 - 1 = 0$,
$s_1^2 + c_1^2 - 1 = 0$,
$s_2^2 + c_2^2 - 1 = 0$.
The Gröbner basis of the corresponding ideal can be found quickly using appropriate software. For the graded reverse lexicographic ordering with $s_1 > c_1 > s_2 > c_2$, we get the basis
$(2c_2 + 1 - x^2,\ 2c_1(1 + x^2) - 2s_2x - 1 - x^2,\ 2s_1(1 + x^2) + 2s_2 - x - x^3,\ 4s_2^2 - 3 - 2x^2 + x^4)$,
and hence it is easy to calculate the values of the variables in dependence on x. For example, we can immediately see that $c_2 = \frac{x^2 - 1}{2}$, i.e., $\beta = \arccos\frac{x^2 - 1}{2}$. In particular, it is clear that the problem has no solution for $|x| > \sqrt{3}$. Specifically, for $|x| < \sqrt{3}$, there are 2 solutions, and for $|x| = \sqrt{3}$, there is one solution ($\alpha = \frac{\pi}{3}$, $\beta = 0$, $\gamma = \frac{2\pi}{3}$ for positive x; $\alpha = -\frac{\pi}{3}$, $\beta = 0$, $\gamma = \frac{4\pi}{3}$ for negative x). For $x = 1$, we get the solution $\alpha = 0$, $\beta = \frac{\pi}{2}$, $\gamma = \frac{\pi}{2}$ and the degenerate solution $\alpha = \frac{\pi}{2}$, $\beta = -\frac{\pi}{2}$, $\gamma = \pi$. The case $x = -1$ is similar. It is good to realize that for $|x| < 1$, one of the solutions will always correspond to a configuration of the robot where some parts intersect. For these values of x, there will be only one realizable configuration.
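The formulas read off the Gröbner basis can be checked numerically. The following Python sketch (the function names robot_angles and reach are hypothetical) solves the four basis polynomials for $c_2, s_2, c_1, s_1$, recovers the angles with atan2, and evaluates the two range formulas of the solution. The parameter sign, also an assumption of this sketch, selects one of the two square roots for $s_2$.

```python
import math

def robot_angles(x, sign=1):
    """Angles (alpha, beta, gamma) obtained from the four basis polynomials."""
    c2 = (x * x - 1) / 2                                   # 2*c2 + 1 - x^2 = 0
    s2 = sign * math.sqrt(max(0.0, 3 + 2 * x**2 - x**4)) / 2
    c1 = (2 * s2 * x + 1 + x * x) / (2 * (1 + x * x))
    s1 = (x + x**3 - 2 * s2) / (2 * (1 + x * x))
    alpha = math.atan2(s1, c1)
    beta = math.atan2(s2, c2)
    return alpha, beta, math.pi - alpha - beta             # grasp from above

def reach(alpha, beta, gamma):
    """Horizontal and vertical range of the robot at the given angles."""
    xs = math.sin(alpha) + math.sin(alpha + beta) + math.sin(alpha + beta + gamma)
    ys = math.cos(alpha) + math.cos(alpha + beta) + math.cos(alpha + beta + gamma)
    return xs, ys

for x in (1.0, 1.5, math.sqrt(3)):
    for sign in (1, -1):
        print(x, robot_angles(x, sign), reach(*robot_angles(x, sign)))
```

For each admissible x, the computed range should agree with (x, 0) up to rounding errors.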
simple example of parametrization of a circle in the plane:
$$x = \frac{2t}{1 + t^2}, \qquad y = \frac{-1 + t^2}{1 + t^2},$$
which can be obtained using the stereographic projection. (Verify this in detail!) Note that this parametrization describes all points except for the point (0,1), from which we project, since this point is not reachable for any value of the parameter t. This is nobody's fault; it follows from the different topological properties of the circle and the line that there exists no global bijective rational parametrization.
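The claim that every parameter value gives a point of the unit circle, and never the point (0,1), can be tested with exact rational arithmetic. The following Python sketch is only an illustration; the function name circle_point and the sample parameter values are chosen here and do not come from the text.

```python
from fractions import Fraction

def circle_point(t):
    """Point of the circle given by the rational parametrization above."""
    t = Fraction(t)
    return 2 * t / (1 + t * t), (t * t - 1) / (1 + t * t)

for t in [0, 1, -3, Fraction(2, 7), 100]:
    x, y = circle_point(t)
    assert x * x + y * y == 1        # the point lies on the circle
    assert (x, y) != (0, 1)          # the projection centre is never reached
    print(t, x, y)
```

The second coordinate equals 1 only if $t^2 - 1 = t^2 + 1$, which is impossible, in agreement with the remark about the point (0,1).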
In this connection, two more questions arise:
Q4. Does there exist a parametrization of a given variety and how to find it?
Q5. Is there an implicit description of a parametrically defined variety?
The general answer to question 4 is negative. In fact, most affine varieties cannot be parametrized; or at least there is no algorithm for parametrization of the implicit description.
It is clear at first sight that a given variety may admit many implicit and parametric descriptions. In the case of implicit descriptions, it is given by representation in terms of several "generating" polynomials, and there is clearly much freedom in their choice. Once a parametrization is found, it can be composed with any rational bijection on the parameters in order to obtain another one.
12.5.3. Ideals. In order to avoid the dependence on the chosen equations that define a variety, consider all consequences of the given equations. This leads to the following algebraic concept of subsets in rings (which is similar to normal subgroups):
Ideals
Definition. Let K be a commutative ring. A subset $I \subseteq K$ is called an ideal if and only if $0 \in I$ and
$f, g \in I \implies f + g \in I$, $\qquad f \in I,\ h \in K \implies f\cdot h \in I$.
Since the definition contains only universal quantifiers (the properties are required for all elements in K or I), the intersection of two ideals is also an ideal. Consequently, ideals can be considered generated by subsets. Use the notation $I = \langle a_1,\dots,a_n\rangle$. It is easy to prove that such an ideal is
$$I = \Big\{\sum_i a_ib_i;\ b_i \in K\Big\},$$
where only finite sums are considered. (Check that this is the intersection of all ideals containing the set of the generators!) The set of generators may be infinite, too. If there are only finitely many generators, the ideal is said to be finitely generated.
It is easy to verify that each variety defines an ideal in the ring of polynomials in the following way:
□
Gröbner bases can also be used in software engineering when looking for loop invariants, which are needed for verification of algorithms, as in the following problem.
12.K.8. Verify the correctness of the algorithm for the product of two integers a, b.
(x, y, z) := (a, b, 0);
while not (y = 0) do
  if y mod 2 = 0
    then (x, y, z) := (2*x, y/2, z)
    else (x, y, z) := (2*x, (y-1)/2, x+z)
  end if
end while
return z
Solution. Let X, Y, Z denote the initial values of the variables x, y, z, respectively. Then, by definition, a polynomial p is an invariant of the loop if and only if we have p(x, y, z, X, Y, Z) = 0 after each iteration. Such a polynomial can be found using a Gröbner basis as follows: Let $f_1$, $f_2$ denote the assignments of the then- and else-branches, respectively, i.e.,
$$f_1(x,y,z) = \Big(2x, \tfrac{1}{2}y, z\Big) \quad\text{and}\quad f_2(x,y,z) = \Big(2x, \tfrac{y-1}{2}, x + z\Big).$$
The ideal of a variety
For a variety $V = \mathfrak{V}(f_1,\dots,f_s)$, set
$$\mathfrak{I}(V) := \{f \in K[x_1,\dots,x_n];\ f(a_1,\dots,a_n) = 0\ \ \forall (a_1,\dots,a_n) \in V\}.$$
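Lemma. Let $f_1,\dots,f_s,g_1,\dots,g_t \in K[x_1,\dots,x_n]$ be polynomials. Then:
(1) if $\langle f_1,\dots,f_s\rangle = \langle g_1,\dots,g_t\rangle$, then $\mathfrak{V}(f_1,\dots,f_s) = \mathfrak{V}(g_1,\dots,g_t)$;
(2) $\mathfrak{I}(V)$ is an ideal, and $\langle f_1,\dots,f_s\rangle \subseteq \mathfrak{I}(V)$, where $V = \mathfrak{V}(f_1,\dots,f_s)$.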
Proof. If a point $a = (a_1,\dots,a_n)$ lies in the variety $\mathfrak{V}(f_1,\dots,f_s)$, then any polynomial of the form
$f = h_1f_1 + \dots + h_sf_s$
(i.e. any member of the ideal $I = \langle f_1,\dots,f_s\rangle$) takes zero at a. In particular, this means that all the polynomials $g_i$ take zero at a. Hence
$\mathfrak{V}(f_1,\dots,f_s) \subseteq \mathfrak{V}(g_1,\dots,g_t)$.
The other inclusion can be proved similarly.
In order to verify the second proposition, choose $g, g' \in \mathfrak{I}(V)$, $h \in K[x_1,\dots,x_n]$. Then, for any point $a \in V$,
$(gh)(a) = 0 \implies gh \in \mathfrak{I}(V)$, $\qquad (g + g')(a) = 0 \implies g + g' \in \mathfrak{I}(V)$.
Hence $\mathfrak{I}(V)$ is an ideal in $K[x_1,\dots,x_n]$.
For any polynomial $f = h_1f_1 + \dots + h_sf_s \in \langle f_1,\dots,f_s\rangle$ and a point $a \in V$, $f(a) = 0$, which proves the desired inclusion. □
The simplest examples are trivial varieties - a single point and the entire affine space:
$\mathfrak{I}(\{(0,0,\dots,0)\}) = \langle x_1,\dots,x_n\rangle$,
$\mathfrak{I}(K^n) = \{0\}$ for any infinite field K.
The other inclusion of the second part of the lemma does not hold in general. For instance, the variety $\mathfrak{V}(x^2, y^2)$ contains only the single point (0,0). This means that $\mathfrak{I}(V) = \langle x, y\rangle \supsetneq \langle x^2, y^2\rangle$.
If $V, W \subseteq K^n$ are varieties, then
$V \subseteq W \implies \mathfrak{I}(V) \supseteq \mathfrak{I}(W)$.
In other words, a polynomial which takes zero at each point of a given variety clearly takes zero at each point of any of the variety's subsets.
Now, further natural problems can be formulated:
Q6. Is every ideal $I \subseteq K[x_1,\dots,x_n]$ finitely generated?
Q7. Is there an algorithm which decides whether $f \in \langle f_1,\dots,f_s\rangle$?
Q8. What is the precise relation between $\langle f_1,\dots,f_s\rangle$ and $\mathfrak{I}(\mathfrak{V}(f_1,\dots,f_s))$?
For n iterations of the first one, we immediately obtain the explicit formula $f_1^n(x,y,z) = (2^nx, \frac{1}{2^n}y, z)$. In order to transform this iterated function to a polynomial mapping, we introduce new variables $u := 2^n$, $v := \frac{1}{2^n}$. Then, $f_1^n$ is given by the polynomial function
$F_1:\ x \mapsto ux,\quad y \mapsto vy,\quad z \mapsto z$,
where the new variables satisfy $uv = 1$. Clearly, the invariant polynomial must lie in the ideal
$I_1 = \langle ux - X,\ vy - Y,\ z - Z,\ uv - 1\rangle$.
In order to find such a polynomial, it suffices to eliminate the variables u and v, which can be done with the Gröbner basis with respect to the graded reverse lexicographic ordering with $u > v > x > y > z$. This basis is equal to
$(xy - XY,\ z - Z,\ x - vX,\ y - uY)$.
Hence $F_1(xy - XY) = xy - XY$ and $F_1(z - Z) = z - Z$, and all other polynomials which are invariant with respect to any number n of applications of $f_1$ are given by a polynomial in $xy - XY$ and $z - Z$. Now, we proceed similarly for $f_2$. For n iterations, we derive the formula
$f_2^n(x,y,z) = \big(2^nx,\ \tfrac{1}{2^n}(y + 1) - 1,\ (2^n - 1)x + z\big)$,
and introducing the variables u and v, we get an equivalent polynomial function
$F_2:\ x \mapsto ux,\quad y \mapsto v(y + 1) - 1,\quad z \mapsto (u - 1)x + z$.
The invariant polynomial for $F_2$ can be obtained similarly as above, thanks to the Gröbner basis of the corresponding ideal. However, we are interested in those polynomials which are invariant for both $F_1$ and $F_2$. Clearly, these must lie in the ideal
$I_2 = \langle F_2(xy - XY),\ F_2(z - Z),\ uv - 1\rangle$. Substituting for $F_2$, we obtain
$I_2 = \langle uxv(y + 1) - ux - XY,\ (u - 1)x + z - Z,\ uv - 1\rangle$,
and with the Gröbner basis of this ideal, we eliminate the variables u and v and thus find the polynomial $xy - XY + z - Z$, which is invariant for both $F_1$ and $F_2$, so it is an invariant of the given loop. Since at the beginning we have X = a, Y = b, Z = 0, we can see that it holds in every step of the algorithm that $xy - ab + z = 0$. Since the loop terminates only if y = 0, we get that indeed z = ab. □
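The invariant $xy + z = ab$ can also be observed experimentally. The following Python sketch transcribes the algorithm of 12.K.8 and asserts the invariant before every iteration and once more after the loop. It assumes that b is a non-negative integer (otherwise the loop need not terminate); the function name multiply is chosen here for illustration.

```python
def multiply(a, b):
    """Product of integers a and b >= 0 by repeated doubling and halving."""
    x, y, z = a, b, 0
    while y != 0:
        assert x * y + z == a * b          # loop invariant xy - ab + z = 0
        if y % 2 == 0:
            x, y, z = 2 * x, y // 2, z
        else:
            x, y, z = 2 * x, (y - 1) // 2, x + z
    assert x * y + z == a * b              # still holds, and now y = 0
    return z

print(multiply(7, 13))
print(multiply(-4, 6))
```

Since y = 0 on exit, the final assertion reduces to z = ab, which is exactly the argument concluding the solution.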
12.5.4. Dimension 1. Consider univariate polynomials first:
$f = a_0x^n + a_1x^{n-1} + \dots + a_n$, where $a_0 \ne 0$.
The leading term of a polynomial is defined to be $LT(f) := a_0x^n$. Clearly,
$\deg f \le \deg g \iff LT(f) \mid LT(g)$.
Let K be a field and g a non-zero polynomial. Every polynomial $f \in K[x]$ can be written in a unique way as
$f = q\cdot g + r$, where $r = 0$ or $\deg r < \deg g$.
In fact, the quotient q and the remainder r can be computed by the following algorithm:
(1) $q := 0$, $r := f$
(2) while $r \ne 0 \wedge LT(g) \mid LT(r)$
(a) $q := q + LT(r)/LT(g)$
(b) $r := r - (LT(r)/LT(g))\cdot g$
When checking the loop condition, the invariant $f = q\cdot g + r$ holds, so the algorithm answers correctly as soon as the loop condition becomes "false". Since the degree of r is decreasing, the algorithm eventually terminates.
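The division algorithm above can be sketched in Python as follows. The representation is an assumption of this sketch, not of the text: a polynomial is a list of coefficients with the highest degree first (so [1, 0, -2, 1] stands for $x^3 - 2x + 1$), the zero polynomial is the empty list, and the coefficient field is the rationals. The names strip and poly_divmod are hypothetical.

```python
from fractions import Fraction

def strip(p):
    """Drop leading zero coefficients; the zero polynomial becomes []."""
    i = 0
    while i < len(p) and p[i] == 0:
        i += 1
    return p[i:]

def poly_divmod(f, g):
    """Return (q, r) with f = q*g + r and r = [] or deg r < deg g."""
    f = strip([Fraction(c) for c in f])
    g = strip([Fraction(c) for c in g])
    if not g:
        raise ZeroDivisionError("division by the zero polynomial")
    q = [Fraction(0)] * max(len(f) - len(g) + 1, 1)
    r = f
    while r and len(r) >= len(g):            # r != 0 and LT(g) | LT(r)
        shift = len(r) - len(g)              # degree of LT(r)/LT(g)
        c = r[0] / g[0]                      # coefficient of LT(r)/LT(g)
        q[len(q) - 1 - shift] += c           # q := q + LT(r)/LT(g)
        padded = g + [Fraction(0)] * shift   # g multiplied by x^shift
        r = strip([rc - c * gc for rc, gc in zip(r, padded)])
    return strip(q), r

print(poly_divmod([1, 0, -2, 1], [1, -1]))   # (x^3 - 2x + 1) divided by (x - 1)
```

Each pass through the loop cancels the leading term of r, so the length of the list r strictly decreases, which mirrors the termination argument.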
Corollary. Let K be a field. Then, every ideal in the polynomial ring K[x] is of the form $\langle f\rangle$.
Proof. Consider an ideal $I \subseteq K[x]$. If $I = \{0\}$, then the ideal is generated by the zero polynomial. If I contains a non-zero polynomial, then choose any non-zero $f \in I$ with the lowest degree. Clearly $\langle f\rangle \subseteq I$.
For any polynomial $g \in I$, consider the Euclidean division of g by f, i.e. $g = qf + r$. Clearly, $qf \in I$, which means that $r \in I$ as well. However, the degree of f is as small as possible, so r = 0. Therefore, g is a multiple of f, and $I = \langle f\rangle$. □
An ideal which is generated by a single element is called a principal ideal. A ring which has the property of the last corollary is called a principal ideal domain.
Recall the Euclidean algorithm for the greatest common divisor h = GCD(f, g) of polynomials / and g (the variable h contains the desired greatest common divisor when the algorithm terminates):
(1) $h := f$, $s := g$
(2) while $s \ne 0$
(a) $r := h \bmod s$
(b) $h := s$
(c) $s := r$
Let $f = q\cdot g + r$ and $h = GCD(f,g)$. Then, $h \mid r, g$ and
$\forall p \in K[x]$: $p \mid r, g$, so $p \mid f$ and $p \mid h$.
Hence, h is GCD(r,g). Trivially, GCD(h,0) = h, so the algorithm computes GCD(f, g) correctly. Since the degree of r is strictly decreasing, the algorithm eventually terminates.
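The Euclidean algorithm can be sketched in the same spirit. The following Python code makes its own assumptions: polynomials are lists of rational coefficients with the highest degree first, the zero polynomial is the empty list, and the result is normalised to be monic (the greatest common divisor is determined only up to a scalar). The names strip, poly_mod and poly_gcd are hypothetical.

```python
from fractions import Fraction

def strip(p):
    """Drop leading zero coefficients; the zero polynomial becomes []."""
    while p and p[0] == 0:
        p = p[1:]
    return p

def poly_mod(f, g):
    """Remainder of f after division by a non-zero polynomial g."""
    r = strip([Fraction(c) for c in f])
    g = strip([Fraction(c) for c in g])
    while r and len(r) >= len(g):
        c = r[0] / g[0]
        padded = g + [Fraction(0)] * (len(r) - len(g))
        r = strip([rc - c * gc for rc, gc in zip(r, padded)])
    return r

def poly_gcd(f, g):
    """Monic greatest common divisor of f and g ([] if both are zero)."""
    h = strip([Fraction(c) for c in f])
    s = strip([Fraction(c) for c in g])
    while s:                         # while s != 0
        h, s = s, poly_mod(h, s)     # r := h mod s; h := s; s := r
    return [c / h[0] for c in h] if h else h

print(poly_gcd([1, 0, -1], [1, -2, 1]))   # GCD of x^2 - 1 and x^2 - 2x + 1
```

For the two sample polynomials the common factor is x - 1, so the printed list represents this polynomial.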
It follows that each pair of polynomials has a greatest common divisor. It is unique up to multiplication by a scalar. Indeed, if two polynomials are greatest common divisors of a given pair of polynomials, then they must divide each other,
Now, we present several exercises where we use Gröbner bases to solve various polynomial systems. The primary goal will not be to find the Gröbner basis, but rather to solve the given system.
12.K.9. Using Gröbner basis, solve the polynomial system
$x^3 - 2xy = 0$, $x^2y + x - 2y^2 = 0$.
Solution. Let us denote $f_1 := x^3 - 2xy$, $f_2 := x^2y + x - 2y^2$. The basis $(f_1, f_2)$ is not a Gröbner basis since, e.g.,
$LM(yf_1 - xf_2) = x^2 \notin \langle x^3, x^2y\rangle = \langle LM(f_1), LM(f_2)\rangle$. Thus, we have to add the polynomial
$yf_1 - xf_2 = -x^2$.
The resulting basis can be reduced using division of the polynomials $f_1$, $f_2$ by $x^2$. This leads to the basis
$(xy,\ x - 2y^2,\ x^2)$.
Now, we can divide the first polynomial by the second one, with remainder $2y^3$, and the third one by the second one, with remainder $4y^4$. Thus, we get the basis
$(x - 2y^2,\ y^3)$,
and this is a Gröbner basis: by the naive algorithm (see 12.5.13), it suffices to verify that the polynomial
$S(x - 2y^2, y^3) = y^3(x - 2y^2) - xy^3 = -2y^5$
gives zero remainder with respect to the basis $(x - 2y^2, y^3)$, even for any monomial ordering. Clearly, the solution of the system is the point (0,0). □
12.K.10. Consider the following system of polynomial equations:
$x^2yz^2 + x^2y^2 + yz - xyz^2 - z^2 = 0$, $x^2y + z = 0$, $xyz + z + 1 = 0$.
Sort the monomials of the polynomials with respect to the lexicographic ordering with $x > y > z$, then divide the first polynomial by the second one and the third one, and use the result in order to solve the system in the real numbers.
Solution.
$x^2y^2 + x^2yz^2 - xyz^2 + yz - z^2 = (y + z^2)(x^2y + z) - z(xyz + z + 1) - z^3 + z$.
which is, for polynomials, possible only in the case mentioned.
The greatest common divisor of more than two polynomials is defined recursively: if $s > 2$, then
$GCD(f_1,\dots,f_s) := GCD(f_1, GCD(f_2,\dots,f_s))$.
Lemma. Let $f_1,\dots,f_s$ be polynomials. Then, $\langle GCD(f_1,\dots,f_s)\rangle = \langle f_1,\dots,f_s\rangle$.
Proof. $GCD(f_1,\dots,f_s)$ divides all the polynomials $f_i$. Hence the ideal $\langle f_1,\dots,f_s\rangle$ is contained in the principal ideal $\langle GCD(f_1,\dots,f_s)\rangle$. The other inclusion follows immediately from Bezout's identity. □
Earlier, eight questions were formulated. Here are some answers for dimension 1:
• Since $\mathfrak{V}(f_1,\dots,f_s) = \mathfrak{V}(GCD(f_1,\dots,f_s))$, the problem of emptiness of a given variety reduces to the problem of existence of a root of a single polynomial.
• For the same reason, each variety is a finite set of isolated points - the roots of the polynomial $GCD(f_1,\dots,f_s)$, except for the case when $GCD(f_1,\dots,f_s) = 0$. This can happen only if $f_1 = f_2 = \dots = f_s = 0$, and then the variety is the entire K.
• The concept of dimension is not of much interest in this case; each variety has dimension zero, being a discrete set of points.
• Each ideal can be generated by a single polynomial.
• $f \in \langle f_1,\dots,f_s\rangle \iff GCD(f_1,\dots,f_s) \mid f$.
• Denoting $\langle f\rangle := \mathfrak{I}(\mathfrak{V}(f_1,\dots,f_s))$, then f and $GCD(f_1,\dots,f_s)$ may differ only in root multiplicities.
12.5.5. Monomial ordering. In order to generalize the Euclidean division of polynomials for more variables, one must first find an appropriate analogy for the degree of a polynomial and its leading term.
The Euclidean division of a polynomial $f \in K[x_1,\dots,x_n]$ by polynomials $g_1,\dots,g_s$ is to be an expression of the form
$f = a_1g_1 + \dots + a_sg_s + r$
where no term of the remainder r is divisible by the leading term of any of the polynomials $g_i$.
Try this with $f = x^2y + xy^2 + y^2$, $g_1 = xy - 1$, and $g_2 = y^2 - 1$. The first division yields
$f = (x + y)\cdot g_1 + (x + y^2 + y)$.
$LT(y^2 - 1)$ does not divide x (the leading term of the remainder), so, theoretically, continuation is not possible.
However, x can be moved into the remainder, thus obtaining the result
$f = (x + y)\cdot g_1 + g_2 + (x + y + 1)$.
No term of the remainder is divisible by either $LT(g_1)$ or $LT(g_2)$. How are the leading terms determined?
Hence $z = 0, \pm 1$. Then, e.g.,
$0 = z(x^2y + z) - x(xyz + z + 1) = z^2 - zx - x$.
Hence, $x = \frac{z^2}{z + 1}$, and we get from the third equation that $y = -\frac{z + 1}{xz}$. This is satisfied by the sole point $(\frac{1}{2}, -4, 1)$. □
12.K.11. Using Gröbner basis, solve the polynomial system
$x^2 + y + z = 1$, $x + y^2 + z = 1$, $x + y + z^2 = 1$.
Solution. Let us denote $f_1 := x + y + z^2 - 1$. The division of $x + y^2 + z - 1$ by $f_1$ gives
$f_2 = y^2 - y - z^2 + z$.
The division of $x^2 + y + z - 1$ by $f_1$ yields $y^2 + 2yz^2 - y + z^4 - 2z^2 + z$, and further division by $f_2$ produces the remainder
$f_3 = 2yz^2 + z^4 - z^2$.
However, $(f_1, f_2, f_3)$ is not a Gröbner basis yet. That will be constructed by the choice $g_1 := f_1$, $g_2 := f_2$ and replacing $f_3$ with the S-polynomial
$2z^2f_2 - yf_3 = -yz^4 - yz^2 - 2z^4 + 2z^3$.
Then, the division by the polynomial $f_3$ leads (up to a constant multiple) to
$g_4 = z^6 - 4z^4 + 4z^3 - z^2 = z^2(z - 1)^2(z^2 + 2z - 1)$.
Now, $(g_1, g_2, g_3)$ is a Gröbner basis, so we can solve the system by elimination. We get from $g_4 = 0$ that $z = 0, 1, -1 \pm \sqrt{2}$. Substituting this to $g_2 = 0$ and $g_1 = 0$ gives the solutions
$(1,0,0)$, $(0,1,0)$, $(0,0,1)$, $(-1 + \sqrt{2}, -1 + \sqrt{2}, -1 + \sqrt{2})$, $(-1 - \sqrt{2}, -1 - \sqrt{2}, -1 - \sqrt{2})$. □
12.K.12. Solve the following system of polynomial equations:
$x^2 - 2xz - 4$, $x^2y^2z + yz^3$, $2xy^2 - 3z^3$.
Solution. The basis suitable for variable elimination is the Gröbner basis for the lexicographic monomial ordering with
Monomial ordering
A monomial ordering on $K[x_1,\dots,x_n]$ is a well-ordering (every non-empty subset has a least element) < on $\mathbb{N}^n$ which satisfies
$\forall \alpha, \beta, \gamma \in \mathbb{N}^n:\ \alpha < \beta \implies \alpha + \gamma < \beta + \gamma$.
An ordering on $\mathbb{N}^n$ induces an ordering on monomials as soon as the order of the variables $x_1 < x_2 < \dots < x_n$ is fixed.
Each polynomial can be rearranged as a decreasing sequence of monomials (ignoring the coefficients for now).
The following three definitions introduce the most common monomial orderings. Each ordering assumes that the order of the variables is fixed, usually $x_1 > x_2 > \dots > x_n$.
Definition. Let $\alpha, \beta \in \mathbb{N}^n$. The lexicographic ordering $<_{lex}$ is defined by
$\alpha >_{lex} \beta \iff$ the left-most non-zero term of $\alpha - \beta$ is positive.
The graded lexicographic ordering $<_{grlex}$ is defined by
$\alpha >_{grlex} \beta \iff |\alpha| > |\beta|$, or $|\alpha| = |\beta|$ and $\alpha >_{lex} \beta$.
The graded reverse lexicographic ordering $<_{grevlex}$ is defined by
$\alpha >_{grevlex} \beta \iff |\alpha| > |\beta|$, or $|\alpha| = |\beta|$ and the right-most non-zero term of $\alpha - \beta$ is negative.
If $x > y > z$, then $x >_{grevlex} y >_{grevlex} z$, but $x^2yz^2 >_{grlex} xy^3z$, yet $x^2yz^2 <_{grevlex} xy^3z$.
Verify in detail that $>_{lex}$, $>_{grlex}$, $>_{grevlex}$ are monomial orderings!
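The three definitions translate directly into comparisons of exponent vectors. In the following Python sketch (the function names are hypothetical), a monomial is represented by the tuple of its exponents, listed in the fixed order of the variables, so $x^2yz^2$ is (2, 1, 2) and $xy^3z$ is (1, 3, 1).

```python
def lex_greater(a, b):
    """a >_lex b: the left-most non-zero entry of a - b is positive."""
    for x, y in zip(a, b):
        if x != y:
            return x > y
    return False

def grlex_greater(a, b):
    """a >_grlex b: compare total degrees first, break ties by lex."""
    if sum(a) != sum(b):
        return sum(a) > sum(b)
    return lex_greater(a, b)

def grevlex_greater(a, b):
    """a >_grevlex b: compare total degrees first; on a tie, the
    right-most non-zero entry of a - b must be negative."""
    if sum(a) != sum(b):
        return sum(a) > sum(b)
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            return x < y
    return False

m1, m2 = (2, 1, 2), (1, 3, 1)            # x^2 y z^2 and x y^3 z
print(grlex_greater(m1, m2), grevlex_greater(m1, m2), grevlex_greater(m2, m1))
```

The two monomials have the same total degree 5, so both graded orderings fall through to their tie-breaking rules, and these rules disagree, as stated in the example above.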
12.5.6. Multivariate division with remainder. Consider a non-zero polynomial $f = \sum_{\alpha \in \mathbb{N}^n} a_\alpha x^\alpha$ in $K[x_1,\dots,x_n]$ and < a monomial ordering. Then define the degree, leading coefficient, leading monomial, and leading term of f as follows:
• multideg $f := \max\{\alpha \in \mathbb{N}^n;\ a_\alpha \ne 0\}$,
• $LC f := a_{\text{multideg} f}$,
• $LM f := x^{\text{multideg} f}$,
• $LT f := LC f \cdot LM f$.
Of course, these concepts depend on the underlying monomial ordering.
Lemma. Let $f, g \in K[x_1,\dots,x_n]$ and < be a monomial ordering. Then,
(1) multideg$(f\cdot g)$ = multideg f + multideg g,
(2) $f + g \ne 0 \implies$ multideg$(f + g) \le \max\{$multideg f, multideg g$\}$.
Proof. Both claims are straightforward corollaries of the definitions. □
x > y > z. Using Maple, we can find the basis
UAz5 + 35z7 + 12z9, 23z6 + 12z8 + Uyz4, yz3 + 3z5 + Azy2, 9z4 + Ay3, -8y2 - 6z4 + 3xz3, 2xy2 - 3z3, x2 - 2xz - 4.
Since the discriminant of the first polynomial of the basis (divided by $z^5$ and considered as a quadratic polynomial in $z^2$) satisfies $35^2 - 4\cdot 144\cdot 12 < 0$, we must have $z = 0$. Substituting this into the other polynomials, we immediately obtain $y = 0$, $x = \pm 2$. □
12.K.13. Solve the following system of polynomial equations in $\mathbb{R}$:
$xy + yz - 1$, $yz + zw - 1$, $zw + wx - 1$, $wx + xy - 1$.
Solution. In this case, it is a good idea to take the graded lexicographic ordering with $w > x > y > z$. Using the algorithm 12.5.13 or appropriate software, we find the corresponding Gröbner basis
$(x - z,\ w - y,\ 2yz - 1)$.
Thus, the system is satisfied by exactly the points $(\frac{1}{2t}, t, \frac{1}{2t}, t)$ for an arbitrary non-zero $t \in \mathbb{R}$. □
12.K.14. Solve the following system of polynomial equations in $\mathbb{R}$:
$x^2 + yz + x$, $z^2 + xy + z$, $y^2 + xz + y$.
Solution. Using the algorithm 12.5.13 or appropriate software, we find the corresponding Gröbner basis for the lexicographic monomial ordering with $x > y > z$, consisting of six polynomials:
$z^2 + 3z^3 + 2z^4$,
$z^2 + z^3 + 2yz^2 + 2yz^3$,
$y - yz - z - z^2 - 2yz^2 + y^2$,
$yz + z + z^2 + 2yz^2 + xz$,
$z^2 + xy + z$, $x^2 + yz + x$.
The roots of the first polynomial are $z = 0, -1, -\frac{1}{2}$. Discussing the individual cases, we find out that the system is satisfied exactly by the points
Theorem. Let < be a monomial ordering and $F = (f_1,\dots,f_s)$ be an s-tuple of polynomials in $K[x_1,\dots,x_n]$. Then, every polynomial $f \in K[x_1,\dots,x_n]$ can be expressed as
$f = a_1f_1 + \dots + a_sf_s + r$,
where $a_i, r \in K[x_1,\dots,x_n]$ for all $i = 1,2,\dots,s$. Moreover, either r = 0 or r is a linear combination of monomials none of which is divisible by any of $LT f_1,\dots,LT f_s$, and if $a_if_i \ne 0$, then multideg $f \ge$ multideg $a_if_i$ for each i.
The polynomial r is called the remainder of the multivariate division f/F.
Proof. The theorem says nothing about uniqueness of the result. The following algorithm produces a possible solution and thus proves the theorem.
In the sequel, consider the output of this algorithm to be the result of the division.
(1) $a_1 := 0, \dots, a_s := 0$, $r := 0$, $p := f$
(2) while $p \ne 0$
  (a) $i := 1$
  (b) $d :=$ false
  (c) while $i \le s\ \wedge$ not d
    (i) if $LT f_i \mid LT p$
        $a_i := a_i + LT p/LT f_i$
        $p := p - (LT p/LT f_i)\cdot f_i$
        $d :=$ true
    (ii) else $i := i + 1$
  (d) if not d
    (i) $r := r + LT p$
    (ii) $p := p - LT p$
In every iteration of the outer loop, exactly one of the commands 2(c)i, 2(d)ii is executed, so the degree of p decreases. Therefore, the algorithm eventually terminates.
When checking the loop condition, the invariant $f = a_1f_1 + \dots + a_sf_s + p + r$ holds, and each term of each $a_i$ is the quotient $LT p/LT f_i$ from some moment. The degrees of these terms are less than the current degree of p, which is at most the degree of f. Altogether, the degree of each $a_if_i$ is at most the degree of f. □
In the ring $K[x_1,\dots,x_n]$, the following implication clearly holds:
$f = a_1f_1 + \dots + a_sf_s + 0 \implies f \in \langle f_1,\dots,f_s\rangle$.
However, the converse is generally not true for multivariate division:
Consider $f = xy^2 - x$, $f_1 = xy + 1$, $f_2 = y^2 - 1$. The algorithm outputs
$f = y(xy + 1) + 0\cdot(y^2 - 1) + (-x - y)$,
but $f = x(y^2 - 1)$, so that $f \in \langle f_1, f_2\rangle$.
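The division algorithm and the example can be replayed with a short Python sketch. Its representation is an assumption made here, not part of the text: a polynomial is a dictionary mapping exponent tuples to coefficients, the ordering is the lexicographic one with $x > y$ (which coincides with Python's comparison of tuples), and the names leading_term, add_term and divide are hypothetical.

```python
from fractions import Fraction

def leading_term(p):
    """(exponent, coefficient) of the leading term in the lex ordering."""
    alpha = max(p)                  # tuples compare lexicographically
    return alpha, p[alpha]

def add_term(p, alpha, c):
    """Add c * x^alpha to p in place, dropping zero coefficients."""
    c = p.get(alpha, 0) + c
    if c == 0:
        p.pop(alpha, None)
    else:
        p[alpha] = c

def divide(f, divisors):
    """Return ([a_1, ..., a_s], r) with f = a_1 f_1 + ... + a_s f_s + r."""
    quotients = [dict() for _ in divisors]
    remainder, p = {}, dict(f)
    while p:                                                # while p != 0
        alpha, c = leading_term(p)
        for a_i, f_i in zip(quotients, divisors):
            beta, b = leading_term(f_i)
            if all(x >= y for x, y in zip(alpha, beta)):    # LT f_i | LT p
                gamma = tuple(x - y for x, y in zip(alpha, beta))
                q = Fraction(c) / b
                add_term(a_i, gamma, q)                     # a_i += LT p / LT f_i
                for delta, d in f_i.items():                # p -= (LT p / LT f_i) f_i
                    add_term(p, tuple(g + e for g, e in zip(gamma, delta)), -q * d)
                break
        else:                                               # no LT f_i divides LT p
            add_term(remainder, alpha, c)
            del p[alpha]
    return quotients, remainder

f = {(1, 2): 1, (1, 0): -1}          # x y^2 - x
f1 = {(1, 1): 1, (0, 0): 1}          # x y + 1
f2 = {(0, 2): 1, (0, 0): -1}         # y^2 - 1
print(divide(f, [f1, f2]))
print(divide(f, [f2, f1]))
```

With the divisors in the order $(f_1, f_2)$ the remainder is $-x - y$, as in the example, whereas the order $(f_2, f_1)$ gives remainder zero. The result of the multivariate division therefore depends on the order of the divisors, which is one more reason to look for better generators.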
The next goal is to find some distinguished generators of the ideals $I = \langle f_1,\dots,f_s\rangle$ which would behave better. In a certain sense, this is a similar procedure to the Gaussian elimination of variables for systems of linear equations. Begin with some special assumptions about the ideals.
$(0,0,0)$, $(-1,0,0)$, $(0,-1,0)$, $(0,0,-1)$ and $(-\frac{1}{2}, -\frac{1}{2}, -\frac{1}{2})$. □
12.5.7. Monomial ideals. An ideal $I \subseteq K[x_1,\dots,x_n]$ is called monomial if and only if there is a set of multi-indices $A \subseteq \mathbb{N}^n$ such that I is generated by the monomials $x^\alpha$ with $\alpha \in A$. This means that all polynomials in I are of the form $\sum_{\alpha \in A} h_\alpha x^\alpha$, where $h_\alpha \in K[x_1,\dots,x_n]$.
Clearly, for a monomial ideal I, $x^\beta \in I$ if and only if there exists an $\alpha \in A$ such that $x^\alpha$ divides $x^\beta$.
Lemma. Let $I \subseteq K[x_1,\dots,x_n]$ be a monomial ideal and $f \in K[x_1,\dots,x_n]$ a polynomial. Then, the following propositions are equivalent:
(1) $f \in I$;
(2) each term of f lies in I;
(3) the polynomial f is a linear combination of monomials from I with coefficients from K.
Proof. The implications (3) ⟹ (2) ⟹ (1) are obvious. It remains to prove (1) ⟹ (3).
Write the polynomial f as $f = \sum_\alpha a_\alpha x^\alpha$, where $a_\alpha \in K$. It follows from the assumption $f \in I$ that $f = \sum_{\beta \in A} h_\beta x^\beta$, where $x^\beta \in I$ and $h_\beta \in K[x_1,\dots,x_n]$. Each term $a_\alpha x^\alpha$ must equal some term of the other equality. Hence each term $a_\alpha x^\alpha$ of the polynomial f can be expressed as the sum of expressions $d x^{\beta + \delta}$, where $d \in K$, $x^\beta \in I$. However, this means that $x^\alpha \in I$, so that (3) holds. □
Corollary. Two monomial ideals coincide if and only if they contain the same monomials.
The following theorem goes much further. It says that every monomial ideal is finitely generated and, moreover, the finite set of generators may be chosen from any given set of generators.
12.5.8. Theorem (Dickson's lemma). Every monomial ideal $I = \langle x^\alpha, \alpha \in A\rangle \subseteq K[x_1,\dots,x_n]$ can be written in the form $I = \langle x^{\alpha_1},\dots,x^{\alpha_s}\rangle$, where $\alpha_1,\dots,\alpha_s \in A$.
Proof. Proceed by induction on the number of variables. If n = 1, then $I \subseteq K[x]$, $I = \langle x^\alpha, \alpha \in A \subseteq \mathbb{N}\rangle$. The set of exponents A has a minimum, so denote it by $\beta := \min A$. Then, $x^\beta$ divides each monomial $x^\alpha$ with $\alpha \in A$, so $I = \langle x^\beta\rangle$.
Now suppose n > 1 and assume that the proposition is true for fewer variables. Denote the variables $x_1,\dots,x_{n-1}, y$, and write monomials in the form $x^\alpha y^m$, where $\alpha \in \mathbb{N}^{n-1}$, $m \in \mathbb{N}$. Suppose that $I \subseteq K[x_1,\dots,x_{n-1}, y]$ is monomial, and define $J \subseteq K[x_1,\dots,x_{n-1}]$ by
$J := \langle x^\alpha;\ \exists m \in \mathbb{N},\ x^\alpha y^m \in I\rangle$.
Clearly, J is a monomial ideal in n - 1 variables, so by the induction hypothesis, $J = \langle x^{\alpha_1},\dots,x^{\alpha_s}\rangle$. It follows from the definition of J that there are minimal integers $m_i \in \mathbb{N}$ such that $x^{\alpha_i}y^{m_i} \in I$. Denote $m := \max\{m_i\}$ and define an analogous system of ideals $J_k \subseteq K[x_1,\dots,x_{n-1}]$ for $0 \le k \le m - 1$:
$J_k := \langle x^\beta;\ x^\beta y^k \in I\rangle$.
Again, all the ideals $J_k$ satisfy the induction hypothesis, so they can be expressed as
$J_k = \langle x^{\alpha_{k,1}},\dots,x^{\alpha_{k,s_k}}\rangle$.
It remains to show that I is generated by the following finite set of monomials:
$x^{\alpha_1}y^m,\dots,x^{\alpha_s}y^m$,
$x^{\alpha_{k,1}}y^k,\dots,x^{\alpha_{k,s_k}}y^k$ for $k = 0,\dots,m - 1$.
Consider a monomial $x^\alpha y^p \in I$. Either of the following cases occurs:
• $p \ge m$. Then, $x^\alpha \in J$, so one of $x^{\alpha_1}y^m,\dots,x^{\alpha_s}y^m$ divides $x^\alpha y^p$.
• $p < m$. Then, analogously, $x^\alpha \in J_k$ with k = p, and one of $x^{\alpha_{k,1}}y^k,\dots,x^{\alpha_{k,s_k}}y^k$ divides $x^\alpha y^p$.
By the previous lemma, each polynomial $f \in I$ can be expressed as a linear combination of monomials from I. Each of these is divisible by one of the generators; hence f lies in the ideal generated. Therefore, I is a subset of that. The other inclusion is trivial, which completes the proof of Dickson's lemma. □
12.5.9. Hilbert's theorem. Everything is now at hand for the discussion of ideal bases in polynomial rings. The main idea is the maximal utilization of the information about the leading terms among the generating polynomials and in the ideal. For a nonzero ideal $I \subseteq K[x_1,\dots,x_n]$, denote
$LT I := \{ax^\alpha;\ \exists f \in I:\ LT f = ax^\alpha\}$.
Clearly, $\langle LT I\rangle$ is a monomial ideal, so by Dickson's lemma, $\langle LT I\rangle = \langle LT g_1,\dots,LT g_s\rangle$ for appropriate $g_1,\dots,g_s \in I$.
Theorem. Every ideal $I \subseteq K[x_1,\dots,x_n]$ is finitely generated.
Proof. The statement is trivial for $I = \{0\}$. So suppose $I \ne \{0\}$. By Dickson's lemma and the above note, there are $g_1,\dots,g_s \in I$ such that $\langle LT I\rangle = \langle LT g_1,\dots,LT g_s\rangle$.
Clearly, $\langle g_1,\dots,g_s\rangle \subseteq I$. Choose any polynomial $f \in I$ and divide it by the s-tuple $g_1,\dots,g_s$. Then
$f = a_1g_1 + \dots + a_sg_s + r$
is obtained, where no term of r is divisible by any of $LT g_1,\dots,LT g_s$.
Since $r = f - a_1g_1 - \dots - a_sg_s$, $r \in I$, and also $LT r \in LT I$. This means that $LT r \in \langle LT I\rangle$. Admit that $r \ne 0$. Since $\langle LT I\rangle$ is monomial, LT r must be divisible by one of its generators, i.e. $LT g_1,\dots,LT g_s$. This contradicts the result of the multivariate division algorithm. Therefore, r = 0 and I is generated by $g_1,\dots,g_s$. □
12.5.10. Gröbner bases. The basis used in the proof of Hilbert's theorem has the properties stated in the following definition:
Gröbner bases of ideals
Definition. A finite set of generators $g_1,\dots,g_s$ of an ideal $I \subseteq K[x_1,\dots,x_n]$ is called a Gröbner basis if and only if $\langle LT I\rangle = \langle LT g_1,\dots,LT g_s\rangle$.
Corollary. Every ideal $I \subseteq K[x_1,\dots,x_n]$ has a Gröbner basis. Every set of polynomials $g_1,\dots,g_s \in I$ such that $\langle LT I\rangle = \langle LT g_1,\dots,LT g_s\rangle$ is a Gröbner basis of I.
Example. Return to the remark on similarity with the Gaussian variable elimination for systems of linear equations. That is, illustrate the general results above on the simplest case of polynomials of degree one with the lexicographic ordering. Denote the generators $f_i = \sum_j a_{ij}x_j + a_{i0}$. Consider the matrix $A = (a_{ij})$, where $i = 1,\dots,s$ and $j = 0,\dots,n$, and apply the Gaussian elimination to it. This gives a matrix $B = (b_{ij})$ in echelon form. Zero rows can be omitted from it. Hence there is a new basis $g_1,\dots,g_t$, where $t \le s$.
Due to the performed steps, each $f_i$ can be expressed as a linear combination of $g_1,\dots,g_t$, which means that
$\langle f_1,\dots,f_s\rangle = \langle g_1,\dots,g_t\rangle$.
Now, verify that these polynomials $g_1,\dots,g_t$ form a Gröbner basis. Without loss of generality, assume that the variables are labeled so that $LM g_i = x_i$ for $i = 1,\dots,t$. Any polynomial $f \in I$ can be written as
$f = h_1f_1 + \dots + h_sf_s = h'_1g_1 + \dots + h'_tg_t$.
It is required that $LT f \in \langle LT g_1,\dots,LT g_t\rangle$. That is, LT f is divisible by one of $x_1,\dots,x_t$. Suppose that f contains only the variables $x_{t+1},\dots,x_n$. However, then $h'_1 = 0$, since $x_1$ is only in $g_1$ by the echelon form of B. Analogously, $h'_2 = \dots = h'_t = 0$, and so f = 0.
The existence of the very special bases is now proved. However, they cannot yet be constructed algorithmically. This is the goal of the following subsections.
12.5.11. Theorem. Let $G = \{g_1,\dots,g_t\}$ be a Gröbner basis of an ideal $I \subseteq K[x_1,\dots,x_n]$ and f a polynomial in $K[x_1,\dots,x_n]$. Then, there is a unique $r = \sum_\alpha a_\alpha x^\alpha \in K[x_1,\dots,x_n]$ such that:
(1) no term of r is divisible by any of $LT g_1,\dots,LT g_t$, i.e. $\forall \alpha\ \forall i:\ LT g_i \nmid a_\alpha x^\alpha$;
(2) $\exists g \in I:\ f = g + r$.
Proof. The algorithm for multivariate division produces
$f = a_1g_1 + \dots + a_tg_t + r$,
where r satisfies the condition (1). Select g as $a_1g_1 + \dots + a_tg_t$, which of course lies in I.
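It remains to prove uniqueness. Suppose that
$$f = g + r = g' + r',$$
where $r \ne r'$. Clearly, $r - r' = g' - g \in I$. Since $G$ is a Gröbner basis, $\operatorname{LT}(r - r')$ is divisible by one of $\operatorname{LT} g_1,\dots,\operatorname{LT} g_t$. There are two possibilities:
• $\operatorname{LM} r \ne \operatorname{LM} r'$. The greater of the two in the monomial ordering is then the leading monomial of $r - r'$, so it must be divisible by one of $\operatorname{LT} g_1,\dots,\operatorname{LT} g_t$, which contradicts condition (1).
• $\operatorname{LM} r = \operatorname{LM} r'$ and $\operatorname{LC} r \ne \operatorname{LC} r'$. Then the monomial $\operatorname{LM} r = \operatorname{LM} r'$ must be divisible by one of $\operatorname{LT} g_1,\dots,\operatorname{LT} g_t$, which is again a contradiction.
It follows that $\operatorname{LT} r = \operatorname{LT} r'$, and the inductive argument shows that $r = r'$. □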
The previous theorem generalizes Euclidean division, with an ideal instead of a divisor. In the univariate case, this is no generalization, since every ideal is generated by a single polynomial. If only the remainder is of interest, the order of the polynomials in the Gröbner basis does not matter. Hence it makes sense to define the notation $\overline{f}^G$ for the remainder in the division $f/G$, provided $G = (g_1,\dots,g_s)$ is a Gröbner basis.
Corollary. Let $G = \{g_1,\dots,g_t\}$ be a Gröbner basis of an ideal $I \subseteq \mathbb{K}[x_1,\dots,x_n]$ and $f$ a polynomial in $\mathbb{K}[x_1,\dots,x_n]$. Then, $f \in I$ if and only if the remainder $\overline{f}^G$ is zero.
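The membership test of the corollary can be tried out on a small case. The following is a minimal sketch, not a reference implementation: it assumes polynomials are stored as dictionaries mapping exponent tuples to rational coefficients, it uses the lexicographic ordering only, and the helper names `leading`, `sub_multiple`, `remainder` and `in_ideal` are hypothetical. The basis $\{x + y, y - z\}$ is the one checked to be a Gröbner basis in subsection 12.5.13.

```python
from fractions import Fraction

def leading(p):
    """Leading (monomial, coefficient) of p in the lexicographic ordering."""
    m = max(p)                       # exponent tuples compare lexicographically
    return m, p[m]

def sub_multiple(f, g, c, shift):
    """Return f - c * x^shift * g."""
    r = dict(f)
    for m, a in g.items():
        mm = tuple(i + j for i, j in zip(m, shift))
        v = r.get(mm, 0) - c * a
        if v:
            r[mm] = v
        else:
            r.pop(mm, None)
    return r

def remainder(f, G):
    """Remainder of the multivariate division of f by the list G."""
    f, r = dict(f), {}
    while f:
        m, c = leading(f)
        for g in G:
            gm, gc = leading(g)
            if all(a >= b for a, b in zip(m, gm)):        # LT g divides LT f
                shift = tuple(a - b for a, b in zip(m, gm))
                f = sub_multiple(f, g, c / gc, shift)
                break
        else:                                             # no LT g divides LT f
            r[m] = c
            del f[m]
    return r

def in_ideal(f, G):
    """Valid as a membership test only if G is a Groebner basis."""
    return not remainder(f, G)

one = Fraction(1)
# variables (x, y, z); G = {x + y, y - z}
G = [{(1, 0, 0): one, (0, 1, 0): one}, {(0, 1, 0): one, (0, 0, 1): -one}]
f1 = {(1, 0, 1): one, (0, 2, 0): one}     # xz + y^2
f2 = {(1, 0, 0): one}                     # x
print(in_ideal(f1, G), remainder(f2, G))
```

Here $xz + y^2$ has zero remainder, while $x$ leaves the remainder $-z$ and therefore does not lie in the ideal.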
12.5.12. Syzygies. The next step is to find a sufficient "testing set" of polynomials of a given ideal which allows us to verify whether the considered system of generators is a Gröbner basis. Again, we wish to test this by means of multivariate division only. For $\alpha = \operatorname{multideg} f$ and $\beta = \operatorname{multideg} g$, consider
$$\gamma := (\gamma_1,\dots,\gamma_n), \quad\text{where } \gamma_i = \max\{\alpha_i,\beta_i\}.$$
The monomial $x^\gamma$ is called the least common multiple of the monomials $\operatorname{LM} f$ and $\operatorname{LM} g$ and is denoted $\operatorname{LCM}(\operatorname{LM} f,\operatorname{LM} g) := x^\gamma$. The expression
$$S(f,g) := \frac{x^\gamma}{\operatorname{LT} f}\cdot f - \frac{x^\gamma}{\operatorname{LT} g}\cdot g$$
is called the S-polynomial (also syzygy, or pair) of the polynomials $f$, $g$.
This is a tool for the elimination of leading terms. The Gaussian elimination is a special case of this procedure for polynomials of degree one. However, during the general procedure, it may happen that the degrees of the resulting polynomials are higher even though the original leading terms are removed.
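For instance, consider the polynomials
$$f = x^3y^2 - x^2y^3 + x, \qquad g = 3x^4y + y^2$$
of degree 5 in $\mathbb{R}[x,y]$ with the $<_{\mathrm{grlex}}$ ordering. Then, $\gamma = (4,2)$ and
$$S(f,g) = \frac{x^4y^2}{x^3y^2}\cdot f - \frac{x^4y^2}{3x^4y}\cdot g = x\cdot f - \frac{1}{3}y\cdot g = -x^3y^3 + x^2 - \frac{1}{3}y^3,$$
which is a polynomial of degree 6.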
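The computation of this S-polynomial can be replayed mechanically. The sketch below assumes polynomials in $x, y$ are stored as dictionaries from exponent pairs to coefficients; the names `grlex`, `leading`, `scaled` and `s_polynomial` are illustrative only and do not come from the text.

```python
from fractions import Fraction

def grlex(m):
    """Sort key for the graded lexicographic ordering."""
    return (sum(m), m)

def leading(p, key=grlex):
    m = max(p, key=key)
    return m, p[m]

def scaled(p, c, shift):
    """Return c * x^shift * p."""
    return {tuple(i + j for i, j in zip(m, shift)): c * a for m, a in p.items()}

def s_polynomial(f, g, key=grlex):
    (fm, fc), (gm, gc) = leading(f, key), leading(g, key)
    gamma = tuple(max(a, b) for a, b in zip(fm, gm))      # exponent of the LCM
    left = scaled(f, 1 / Fraction(fc), tuple(a - b for a, b in zip(gamma, fm)))
    right = scaled(g, 1 / Fraction(gc), tuple(a - b for a, b in zip(gamma, gm)))
    diff = {m: left.get(m, 0) - right.get(m, 0) for m in set(left) | set(right)}
    return {m: c for m, c in diff.items() if c != 0}

f = {(3, 2): 1, (2, 3): -1, (1, 0): 1}    # x^3 y^2 - x^2 y^3 + x
g = {(4, 1): 3, (0, 2): 1}                # 3 x^4 y + y^2
S = s_polynomial(f, g)
print(S)
```

The resulting dictionary encodes $-x^3y^3 + x^2 - \frac{1}{3}y^3$: the leading terms of degree 6 in $x\cdot f$ and $\frac13 y\cdot g$ cancel, but the total degree still rises from 5 to 6.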
Theorem. Let $I \subseteq \mathbb{K}[x_1,\dots,x_n]$ be an ideal. Then, a set of generators $G = \{g_1,\dots,g_t\}$ of $I$ is a Gröbner basis of $I$ if and only if for each $i \ne j$, the remainder of the division $S(g_i,g_j)/G$ is zero.
Proof. Begin with a technical lemma which describes which cancellations may occur when expressing polynomials in terms of generators. More precisely, they can always be expressed in terms of S-polynomials.
Lemma. Consider a polynomial $f = \sum_{i=1}^t c_ix^{\alpha_i}g_i$, where $c_1,\dots,c_t \in \mathbb{K}$ and $\alpha_i + \operatorname{multideg} g_i = \delta$ for a fixed $\delta$ whenever $c_i \ne 0$. If $\operatorname{multideg} f < \delta$, then there are $c_{jk} \in \mathbb{K}$ such that
$$\sum_{i=1}^t c_ix^{\alpha_i}g_i = \sum_{j,k} c_{jk}x^{\delta-\gamma_{jk}}S(g_j,g_k),$$
where $x^{\gamma_{jk}} = \operatorname{LCM}(\operatorname{LM} g_j,\operatorname{LM} g_k)$, and the multidegree of each $x^{\delta-\gamma_{jk}}S(g_j,g_k)$ is less than $\delta$.
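Proof. Let $d_i := \operatorname{LC} g_i$ and $p_i := x^{\alpha_i}g_i/d_i$. Clearly, $c_id_i = \operatorname{LC}(c_ix^{\alpha_i}g_i)$ and $\operatorname{LC} p_i = 1$. Since $\operatorname{multideg}(c_ix^{\alpha_i}g_i) = \delta$ and $\operatorname{multideg} f < \delta$, it follows that $\sum_{i=1}^t c_id_i = 0$. Express $f$ as a combination of S-polynomials:
$$f = \sum_{i=1}^t c_id_ip_i = c_1d_1(p_1 - p_2) + (c_1d_1 + c_2d_2)(p_2 - p_3) + \dots + (c_1d_1 + \dots + c_{t-1}d_{t-1})(p_{t-1} - p_t) + \underbrace{(c_1d_1 + \dots + c_td_t)}_{=0}\,p_t.$$
Each difference $p_j - p_k$ can be expressed in terms of S-polynomials. Writing $\beta_j := \operatorname{multideg} g_j$, so that $\operatorname{LT} g_j = d_jx^{\beta_j}$ and $\alpha_j + \beta_j = \delta$,
$$p_j - p_k = \frac{x^\delta}{d_jx^{\beta_j}}g_j - \frac{x^\delta}{d_kx^{\beta_k}}g_k = x^{\delta-\gamma_{jk}}\left(\frac{x^{\gamma_{jk}}}{\operatorname{LT} g_j}g_j - \frac{x^{\gamma_{jk}}}{\operatorname{LT} g_k}g_k\right) = x^{\delta-\gamma_{jk}}S(g_j,g_k).$$
Since $\operatorname{LC} p_j = \operatorname{LC} p_k = 1$ and both polynomials have multidegree $\delta$, their leading terms cancel, so the multidegree of $p_j - p_k$ is less than $\delta$. Now the individual coefficients $c_{jk}$ can be derived easily from these equalities. □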
Now follows the proof of the theorem. The "$\Rightarrow$" implication follows directly from the corollary of subsection 12.5.11.
For the reverse implication, consider a non-zero polynomial $f \in I$. It must be shown that under the assumption of the implication, $\operatorname{LT} f \in \langle \operatorname{LT} g_1,\dots,\operatorname{LT} g_t\rangle$. If it is known that the polynomial can be expressed as $f = \sum_{i=1}^t h_ig_i$ with the property that
$$\operatorname{multideg} f = \max_i\{\operatorname{multideg}(h_ig_i)\},$$
then $\operatorname{LT} f$ is necessarily divisible by one of the leading terms $\operatorname{LT} g_i$, which means that $G$ is a Gröbner basis.
Denote $m_i := \operatorname{multideg}(h_ig_i)$, $\delta := \max\{m_1,\dots,m_t\}$. Clearly, $\operatorname{multideg} f \le \delta$. Let the polynomials $h_1,\dots,h_t$ be chosen so that $\delta$ is as small as possible. Since this is a monomial ordering, which, in particular, is a well-ordering, such a $\delta$ exists.
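It is necessary to prove that $\operatorname{multideg} f = \delta$. Write
$$f = \sum_{m_i=\delta} h_ig_i + \sum_{m_i<\delta} h_ig_i = \sum_{m_i=\delta} (\operatorname{LT} h_i)g_i + \sum_{m_i=\delta} (h_i - \operatorname{LT} h_i)g_i + \sum_{m_i<\delta} h_ig_i.$$
Clearly, the multidegrees of all the terms of the second and third sums are less than $\delta$. Admitting that $\operatorname{multideg} f < \delta$, it follows that
$$\operatorname{multideg}\Bigl(\sum_{m_i=\delta} (\operatorname{LT} h_i)g_i\Bigr) < \delta.$$
Denote $c_ix^{\alpha_i} := \operatorname{LT} h_i$ and apply the lemma:
$$\sum_{m_i=\delta} (\operatorname{LT} h_i)g_i = \sum_{m_i=\delta} c_ix^{\alpha_i}g_i = \sum_{j,k} c_{jk}x^{\delta-\gamma_{jk}}S(g_j,g_k).$$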
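It follows from the assumption of the theorem and the multivariate division algorithm that
$$S(g_j,g_k) = \sum_{i=1}^t p_{ijk}g_i$$
and, moreover, $\operatorname{multideg}(p_{ijk}g_i) \le \operatorname{multideg} S(g_j,g_k)$. Denote $q_{ijk} := x^{\delta-\gamma_{jk}}p_{ijk}$, to obtain
$$x^{\delta-\gamma_{jk}}S(g_j,g_k) = \sum_{i=1}^t q_{ijk}g_i.$$
By the second part of the lemma,
$$\operatorname{multideg}(q_{ijk}g_i) \le \operatorname{multideg}\bigl(x^{\delta-\gamma_{jk}}S(g_j,g_k)\bigr) < \delta.$$
Substitution yields
$$\sum_{m_i=\delta} (\operatorname{LT} h_i)g_i = \sum_{j,k} c_{jk}\Bigl(\sum_{i=1}^t q_{ijk}g_i\Bigr) = \sum_{i=1}^t \Bigl(\sum_{j,k} c_{jk}q_{ijk}\Bigr)g_i.$$
At the same time,
$$\operatorname{multideg}\Bigl(\sum_{j,k} c_{jk}q_{ijk}g_i\Bigr) < \delta \quad\text{for } i = 1,\dots,t.$$
Substitute this into the original equality, to get $f$ expressed as a combination of $g_1,\dots,g_t$, where the multidegrees of all terms are less than $\delta$. This contradicts the minimality of $\delta$, and so $\operatorname{multideg} f = \delta$, whence $\operatorname{LT} f \in \langle \operatorname{LT} g_1,\dots,\operatorname{LT} g_t\rangle$. So $G$ is a Gröbner basis. □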
12.5.13. A naive algorithm for Gröbner bases. The theorem just proved provides an efficient method for deciding whether a given basis is a Gröbner basis. For example, consider $I = \langle x + y, y - z\rangle$. The only relevant S-polynomial is
$$S(x + y, y - z) = \frac{xy}{x}(x + y) - \frac{xy}{y}(y - z) = xz + y^2.$$
The division yields $xz + y^2 = z(x + y) + y(y - z)$ with zero remainder, so it is a Gröbner basis.
The following algorithm utilizes just this method to find a Gröbner basis of an ideal that is generated by a given s-tuple of polynomials $F = (f_1,\dots,f_s)$.
(1) $G := F$, $G' := \emptyset$
(2) while $G \ne G'$
(a) $G' := G$
(b) for all $p, q \in G'$, $p \ne q$, do
(i) $s := \overline{S(p,q)}^{G'}$
(ii) if $s \ne 0$, then $G := G \cup \{s\}$
If the algorithm ever terminates, then $G$ is a Gröbner basis. Thus, it suffices to verify that it terminates. However, in each iteration of the loop (2), i.e. when a non-trivial remainder is added, either the monomial ideal generated by $\operatorname{LT} G$ extends or it remains unchanged. Consequently, there is a non-decreasing chain of (monomial) ideals $I_1 = \langle \operatorname{LT}(F)\rangle \subseteq I_2 \subseteq \dots \subseteq I_k \subseteq \dots$. Denoting $I = \bigcup_{k=1}^\infty I_k$, then $I$ is an ideal, and by Hilbert's theorem, it is finitely generated. However, this means that all generators of $I$ already lie in one of the $I_k$. Therefore, from this $k$ onwards, $I_k = I_{k+1} = \cdots$.¹²
Clearly, the stabilization of this chain of monomial ideals of the leading terms is equivalent to termination of the algorithm.
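The steps (1) and (2) of the naive algorithm can be sketched in code. This is only an illustration under stated assumptions: polynomials are dictionaries from exponent tuples to rational coefficients, the monomial ordering is passed as a sort key, and all function names (`naive_groebner`, `remainder`, `s_polynomial`, ...) are hypothetical. No reduction or pair-selection strategy is used, exactly as in the naive formulation.

```python
from fractions import Fraction

def grlex(m):
    return (sum(m), m)            # graded lexicographic ordering as a sort key

def leading(p, key):
    m = max(p, key=key)
    return m, p[m]

def sub_multiple(f, g, c, shift):
    """Return f - c * x^shift * g."""
    r = dict(f)
    for m, a in g.items():
        mm = tuple(i + j for i, j in zip(m, shift))
        v = r.get(mm, 0) - c * a
        if v:
            r[mm] = v
        else:
            r.pop(mm, None)
    return r

def remainder(f, G, key):
    f, r = dict(f), {}
    while f:
        m, c = leading(f, key)
        for g in G:
            gm, gc = leading(g, key)
            if all(a >= b for a, b in zip(m, gm)):
                f = sub_multiple(f, g, c / gc, tuple(a - b for a, b in zip(m, gm)))
                break
        else:
            r[m] = c
            del f[m]
    return r

def s_polynomial(f, g, key):
    (fm, fc), (gm, gc) = leading(f, key), leading(g, key)
    gamma = tuple(max(a, b) for a, b in zip(fm, gm))
    # 0 - (-1/fc) x^(gamma-fm) f  =  x^gamma / LT(f) * f
    left = sub_multiple({}, f, -1 / fc, tuple(a - b for a, b in zip(gamma, fm)))
    return sub_multiple(left, g, 1 / gc, tuple(a - b for a, b in zip(gamma, gm)))

def naive_groebner(F, key):
    G = [{m: Fraction(c) for m, c in f.items()} for f in F]
    while True:
        old = list(G)                                    # G' := G
        for i in range(len(old)):
            for j in range(i + 1, len(old)):
                s = remainder(s_polynomial(old[i], old[j], key), old, key)
                if s and s not in G:
                    G.append(s)                          # G := G ∪ {s}
        if len(G) == len(old):                           # G = G'
            return G

# I = <x^3 - 2xy, x^2 y - 2y^2 + x> in K[x, y] with the grlex ordering
F = [{(3, 0): 1, (1, 1): -2}, {(2, 1): 1, (0, 2): -2, (1, 0): 1}]
G = naive_groebner(F, grlex)
print(G)
```

For this input the sketch adds the three polynomials $-x^2$, $-2xy$ and $-2y^2 + x$ to the two generators. Passing `lambda m: m` as the key gives the lexicographic ordering instead.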
This algorithm is far from ideal. There are quite trivial inputs for which it returns wild results. Moreover, the output basis directly depends on the input, so the outputs for the same ideal defined by different bases may vary.
12.5.14. Reduction of bases. In order to recognize the generators which are needed in a Gröbner basis, it suffices to follow the leading terms. The first step of the discussion is simply to remove all elements which are not needed in this sense.
Lemma. Let $G$ be a Gröbner basis of an ideal $I$ and $p \in G$ such that $\operatorname{LT} p \in \langle \operatorname{LT}(G \setminus \{p\})\rangle$. Then, $G \setminus \{p\}$ is also a Gröbner basis of $I$.
Proof. From the definition of a Gröbner basis, $\langle \operatorname{LT} I\rangle = \langle \operatorname{LT} G\rangle$. But $\operatorname{LT} p \in \langle \operatorname{LT}(G \setminus \{p\})\rangle$, so $\langle \operatorname{LT}(G \setminus \{p\})\rangle = \langle \operatorname{LT} G\rangle$. Hence the proposition follows immediately. □
¹² The condition of stabilization of every non-decreasing chain of ideals is called the "ascending chain condition" (ACC). Rings which satisfy ACC are called Noetherian (in honour of Emmy Noether). Hilbert's theorem can also be formulated as "a polynomial ring over a Noetherian ring is again Noetherian".
Definition. A Gröbner basis $G$ of an ideal $I$ is said to be minimal if and only if $\operatorname{LC} p = 1$ and $\operatorname{LT} p \notin \langle \operatorname{LT}(G \setminus \{p\})\rangle$ for all $p \in G$.
For instance, consider $\mathbb{K}[x,y]$ and $<_{\mathrm{grlex}}$, $I = \langle f_1, f_2\rangle = \langle x^3 - 2xy,\; x^2y - 2y^2 + x\rangle$. The mentioned algorithm produces the following five polynomials $F = (f_1,\dots,f_5)$:
$$F = (x^3 - 2xy,\; x^2y - 2y^2 + x,\; -x^2,\; -2xy,\; -2y^2 + x).$$
Nevertheless, $\operatorname{LT} f_1 = x^3 = -x \operatorname{LT} f_3$ and $\operatorname{LT} f_2 = x^2y = -\frac{1}{2}x \operatorname{LT} f_4$, so neither $f_1$ nor $f_2$ is needed.
However, this reduction is still insufficient, since redundancy may occur at the level of individual terms of the basis elements. For example, for every $a$, the set $\{x^2 + axy,\; xy,\; y^2 - \frac{1}{2}x\}$ is a minimal Gröbner basis of the considered ideal.
That is why the following concept is introduced:
Reduced Gröbner basis
Let $G$ be a Gröbner basis of an ideal $I$. A polynomial $g \in G$ is said to be reduced for the basis $G$ if and only if none of its monomials lies in $\langle \operatorname{LT}(G \setminus \{g\})\rangle$. $G$ is said to be reduced if and only if for all $p \in G$, $\operatorname{LC} p = 1$ and $p$ is reduced for $G$.
Clearly, every reduced Gröbner basis is minimal. Moreover:
Proposition. If a polynomial $g$ is reduced for a minimal Gröbner basis $G$ of an ideal $I$, then it is reduced for every minimal Gröbner basis $G'$ of $I$ which contains it.
Proof. In order to arrive at a contradiction, let $G = \{g_1,\dots,g_s\}$, $G' = \{g'_1,\dots,g'_t\}$ be two minimal Gröbner bases containing $g$. Choose a term $m$ of the polynomial $g$ where $m \in \langle \operatorname{LT}(G' \setminus \{g\})\rangle$ (i.e. $g$ is not reduced for $G'$). Then, $m = a_1 \operatorname{LT} g'_1 + \dots + a_t \operatorname{LT} g'_t$ for appropriate polynomials $a_1,\dots,a_t$. Since both $G$ and $G'$ are Gröbner bases of the same ideal, $\langle \operatorname{LT} G\rangle = \langle \operatorname{LT} G'\rangle$, which means that each $\operatorname{LT} g'_i$ can be expressed as a combination of $\operatorname{LT} g_1,\dots,\operatorname{LT} g_s$. Hence $m \in \langle \operatorname{LT} G\rangle$. Since $G'$ is minimal, $m \in \langle \operatorname{LT}(G \setminus \{g\})\rangle$, which contradicts the assumption that $g$ is reduced for $G$. □
Everything is now available to prove the main result about the existence and uniqueness of a reduced Gröbner basis. This is the main achievement of this part of the chapter on algebra. It allows for a straightforward algorithm to eliminate variables in systems of polynomial equations.
12.5.15. Theorem. Let $I \subseteq \mathbb{K}[x_1,\dots,x_n]$ be a non-zero ideal. Then, for every monomial ordering, there is a unique reduced Gröbner basis of $I$. Moreover, every Gröbner basis can be reduced algorithmically.
Proof. Assume that $G$ is a Gröbner basis of $I$. By the lemma of the previous subsection, it can be assumed that $G$ is minimal. (The minimizing algorithm is clear: it suffices to check for divisibility of leading monomials in any order and to omit redundant elements of the basis.)
Assume that a polynomial $g \in G$ is not reduced. Then in the division $g/(G \setminus \{g\})$, the leading term of $g$ stays in the remainder, since it is not divisible by anything (the basis is minimal). Therefore, $\operatorname{LT}\bigl(\overline{g}^{G \setminus \{g\}}\bigr) = \operatorname{LT} g$, since nothing else can be the leading term of the remainder. Now, denote
$$g' := \overline{g}^{G \setminus \{g\}}, \qquad G' := (G \setminus \{g\}) \cup \{g'\}.$$
This new system of generators $G'$ is again a minimal Gröbner basis of $I$, because $\langle \operatorname{LT} G'\rangle = \langle \operatorname{LT} G\rangle$. That is, $\langle \operatorname{LT} G'\rangle = \langle \operatorname{LT} I\rangle$. By properties of the multivariate division algorithm, the polynomial $g'$ is reduced for $G'$. If a polynomial $h \ne g$ is reduced for $G$, then it is also reduced for $G'$ by the above proposition.
With every reduction of one of the elements, the total number of terms in all polynomials of the Gröbner basis being reduced decreases. Therefore, the algorithm terminates as soon as each element is reduced. Hence there is an algorithm for the construction of the reduced Gröbner basis.
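The two stages of the proof (first minimize, then reduce each element by the others) can be sketched as follows. The representation of polynomials as dictionaries and the names `reduce_basis`, `remainder`, `leading` are assumptions of this sketch, and the input is assumed to be a Gröbner basis already. The example uses the five polynomials from subsection 12.5.14 with the grlex ordering.

```python
from fractions import Fraction

def grlex(m):
    return (sum(m), m)

def leading(p, key):
    m = max(p, key=key)
    return m, p[m]

def sub_multiple(f, g, c, shift):
    """Return f - c * x^shift * g."""
    r = dict(f)
    for m, a in g.items():
        mm = tuple(i + j for i, j in zip(m, shift))
        v = r.get(mm, 0) - c * a
        if v:
            r[mm] = v
        else:
            r.pop(mm, None)
    return r

def remainder(f, G, key):
    f, r = dict(f), {}
    while f:
        m, c = leading(f, key)
        for g in G:
            gm, gc = leading(g, key)
            if all(a >= b for a, b in zip(m, gm)):
                f = sub_multiple(f, g, c / gc, tuple(a - b for a, b in zip(m, gm)))
                break
        else:
            r[m] = c
            del f[m]
    return r

def reduce_basis(G, key):
    """Turn a Groebner basis G into a reduced one (G is assumed to be a Groebner basis)."""
    # make all leading coefficients equal to 1
    G = [{m: Fraction(c) / Fraction(leading(g, key)[1]) for m, c in g.items()} for g in G]
    lms = [leading(g, key)[0] for g in G]
    divides = lambda a, b: all(i <= j for i, j in zip(a, b))
    # minimal basis: drop g whose leading monomial is divisible by another one
    minimal = [g for i, g in enumerate(G)
               if not any(divides(lms[j], lms[i]) and (lms[j] != lms[i] or j < i)
                          for j in range(len(G)) if j != i)]
    # replace each element by its remainder on division by the others
    for i in range(len(minimal)):
        minimal[i] = remainder(minimal[i], minimal[:i] + minimal[i + 1:], key)
    return minimal

F = [{(3, 0): 1, (1, 1): -2}, {(2, 1): 1, (0, 2): -2, (1, 0): 1},
     {(2, 0): -1}, {(1, 1): -2}, {(0, 2): -2, (1, 0): 1}]
R = reduce_basis(F, grlex)
R5 = reduce_basis([{(2, 0): 1, (1, 1): 5}, {(1, 1): 1}, {(0, 2): 2, (1, 0): -1}], grlex)
print(R, R5)
```

Both calls return the same reduced basis $\{x^2,\; xy,\; y^2 - \frac12 x\}$; the second input is the minimal basis $\{x^2 + axy, xy, y^2 - \frac12 x\}$ with $a = 5$ (and the last element scaled by 2), which illustrates the uniqueness claim.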
It remains to prove uniqueness. Let there be two reduced Gröbner bases $G$, $\tilde G$ of a non-zero ideal $I$. Then $\langle \operatorname{LT} G\rangle = \langle \operatorname{LT} I\rangle = \langle \operatorname{LT} \tilde G\rangle$. Since this ideal is monomial, Dickson's lemma can be applied.
Recalling the construction of the basis in the proof of Dickson's lemma, there exists a unique monomial basis of a monomial ideal such that the coefficients of its elements equal 1, and no element of the basis divides another one.
By the definition of minimality, both $\operatorname{LT} G$ and $\operatorname{LT} \tilde G$ must be such. This means that $\operatorname{LT} G = \operatorname{LT} \tilde G$. Consequently, for each $g \in G$, there is a unique $\tilde g \in \tilde G$ such that $\operatorname{LT} g = \operatorname{LT} \tilde g$.
Clearly, $g - \tilde g \in I$. Since $G$ is a Gröbner basis, $\overline{g - \tilde g}^G = 0$. The terms $\operatorname{LT} g$, $\operatorname{LT} \tilde g$ cancel out in $g - \tilde g$. Since both the bases are reduced, none of the remaining terms of $g - \tilde g$ may be divisible by any of $\operatorname{LT} G = \operatorname{LT} \tilde G$. Therefore, it must be in the remainder, which means that
$$g - \tilde g = \overline{g - \tilde g}^G = 0.$$
This proves the uniqueness. □
12.5.16. Remarks. Several of the previous questions are now answered: It can be decided efficiently whether or not a given polynomial lies in a given ideal by means of multivariate division and a Gröbner basis. Because of the reduced Gröbner bases, it can be decided whether or not two ideals coincide - they simply need to have the same reduced Gröbner basis.
This means that it can be decided whether or not a polynomial equation lies in the ideal generated by a given system. Moreover, it can be decided efficiently whether or not two given systems generate the same ideal of consequences.
The above algorithmic construction depends on an appropriate monomial ordering. The answers to the questions are, of course, independent of such ordering.
As mentioned at the beginning of this chapter, the technique of Gröbner bases is one of the fundamentals of computer algebra. Of course, this algorithm is usually implemented using various tricks to make it faster. One can use the reduction technique as early as when creating the Gröbner basis in the fundamental algorithm from subsection 12.5.13, etc.
In the literature, one may find miscellaneous variations for non-commutative algebraic objects (e.g. for formal manipulations with differential operators). The algorithm for finding a Gröbner basis can be viewed as a special case of the Knuth-Bendix algorithm for rewriting some rules. This solves the problem of word equivalence in monoids that are given by generators and a set of equalities.
Last, but not least, the technique of Gröbner bases can be used in a much more sophisticated way in commutative algebra. In the algorithm, the syzygies of all pairs of generators of a finite basis can be found sequentially. These syzygies are a basis of the submodule of all relations between $k$ elements $g_1,\dots,g_k$ of the basis, that is, subsets $S$ in the space $(\mathbb{K}[x_1,\dots,x_n])^k$. The algorithm can be extended to these subsets, to find distinguished generators of all relations between generators. As long as there are some non-trivial relations the procedure can continue. It can be proved that after at most $n$ such steps, there exist no non-trivial relations. The number of generators of relations in each step provides detailed information about the topological properties of the corresponding affine variety $V(g_1,\dots,g_k)$.
12.5.17. Elimination of variables. We finish this chapter by an application of the above algorithms. Consider the ring $\mathbb{K}[x_{p+1},\dots,x_n]$ to be a subring of $\mathbb{K}[x_1,\dots,x_n]$. These are polynomials with no occurrence of the variables $x_1,\dots,x_p$. It is a subring, but not an ideal.
Elimination ideals
Let $I = \langle f_1,\dots,f_s\rangle \subseteq \mathbb{K}[x_1,\dots,x_n]$. For $p = 1,\dots,n$, define
$$I_p := I \cap \mathbb{K}[x_{p+1},\dots,x_n].$$
This set is called the $p$-th elimination ideal. Note that $I_p$ is an ideal only in $\mathbb{K}[x_{p+1},\dots,x_n]$.
On the level of polynomial equations, $I_p$ contains all equations which are consequences of the system $f_1 = 0,\dots,f_s = 0$ and which contain only the variables $x_{p+1},\dots,x_n$.
Theorem (Elimination theorem). Let $I \subseteq \mathbb{K}[x_1,\dots,x_n]$ be an ideal and $G = \{g_1,\dots,g_m\}$ a Gröbner basis of $I$ with respect to $<_{\mathrm{lex}}$. Let the variables be ordered as $x_1 >_{\mathrm{lex}} x_2 >_{\mathrm{lex}} \dots >_{\mathrm{lex}} x_n$. Then, for each $p = 0,\dots,n$, $G_p := G \cap \mathbb{K}[x_{p+1},\dots,x_n]$ is a Gröbner basis for the ideal $I_p$.
If G is minimal or reduced, then Gp is again minimal or reduced, respectively.
Proof. Without loss of generality, assume that $G_p = \{g_1,\dots,g_r\}$. Since $G \subseteq I$, it follows that $G_p \subseteq I_p$. The inclusion $\langle G_p\rangle \subseteq I_p$ is then trivial.
It needs to be verified for each polynomial $f \in I_p$ that
$$f = h_1g_1 + \dots + h_rg_r.$$
To do this, perform multivariate division by the original Gröbner basis $G$. Since $f \in I$, it follows that $\overline{f}^G = 0$, i.e.
$$f = h_1g_1 + \dots + h_rg_r + h_{r+1}g_{r+1} + \dots + h_mg_m.$$
Each of the polynomials $g_{r+1},\dots,g_m$ must contain at least one of the variables $x_1,\dots,x_p$, otherwise it would lie in $G_p$. By the properties of the lexicographic ordering, this variable must also be contained in $\operatorname{LT} g_{r+1},\dots,\operatorname{LT} g_m$.
Recall the individual steps of the algorithm for multivariate division and the fact that $f$ contains no monomial with any of $x_1,\dots,x_p$. Then $h_{r+1} = \dots = h_m = 0$, thus verifying that $f \in \langle G_p\rangle$.
Not only the desired inclusion but also the fact that the division $f/G$ on $I_p$ gives the same result as $f/G_p$ is proved. For $1 \le i < j \le r$, consider the S-polynomials $S(g_i,g_j)$. They lie in $I_p$, and hence
$$\overline{S(g_i,g_j)}^{G_p} = \overline{S(g_i,g_j)}^{G} = 0,$$
so $G_p$ is a Gröbner basis of $I_p$.
It is clear that the property, for the basis, of being either minimal or reduced, is preserved. □
The only property of the lexicographic ordering used in the proof is that if a variable occurs in the polynomial /, then it occurs in its leading term as well. However, this condition is much weaker than that of the lexicographic ordering. Therefore, in actual implementations, one may use any ordering with the mentioned property. This usually leads to more efficient computations, since the pure lexicographic ordering usually leads to an unpleasant increase of the polynomials' degrees.
12.5.18. Back to parametrized varieties. The above theorem suggests an algorithm for finding an implicit representation of a variety defined in terms of a polynomial parametrization. The tools necessary for working with the smallest varieties that contain the points defined by a parametrization are not available here, so a detailed discussion is omitted.
When the parametrization of a variety is given by polynomial relations
$$x_1 = f_1(u_1,\dots,u_k),\; \dots,\; x_n = f_n(u_1,\dots,u_k),$$
the reduced Gröbner basis of the ideal
$$\langle x_1 - f_1,\dots,x_n - f_n\rangle$$
can be computed in the lexicographic ordering where $u_i > x_j$ for all $i, j$. From this basis, the reduced Gröbner basis of the elimination ideal $I_k$ is obtained. This is precisely the required ideal and its implicit representation.
It suffices to use an ordering which guarantees that each $u_i$ is above each $x_j$, so that the computation of the Gröbner basis would eliminate the $u_i$; otherwise the ordering may be arbitrary. There is a chance that there is a more efficient computation than with the pure lexicographic ordering.
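As an illustration of this recipe, the sketch below eliminates the parameter from the polynomial parametrization $x = t$, $y = t^2$, $z = t^3$ (an example chosen here, not taken from the text). It assumes the lexicographic ordering with $t > x > y > z$, polynomials stored as dictionaries from exponent tuples to rationals, and the naive algorithm of subsection 12.5.13 without any reduction; all function names are hypothetical.

```python
from fractions import Fraction
from math import prod

def leading(p):
    m = max(p)                    # lexicographic ordering on exponent tuples
    return m, p[m]

def sub_multiple(f, g, c, shift):
    """Return f - c * x^shift * g."""
    r = dict(f)
    for m, a in g.items():
        mm = tuple(i + j for i, j in zip(m, shift))
        v = r.get(mm, 0) - c * a
        if v:
            r[mm] = v
        else:
            r.pop(mm, None)
    return r

def remainder(f, G):
    f, r = dict(f), {}
    while f:
        m, c = leading(f)
        for g in G:
            gm, gc = leading(g)
            if all(a >= b for a, b in zip(m, gm)):
                f = sub_multiple(f, g, c / gc, tuple(a - b for a, b in zip(m, gm)))
                break
        else:
            r[m] = c
            del f[m]
    return r

def s_polynomial(f, g):
    (fm, fc), (gm, gc) = leading(f), leading(g)
    gamma = tuple(max(a, b) for a, b in zip(fm, gm))
    left = sub_multiple({}, f, -1 / fc, tuple(a - b for a, b in zip(gamma, fm)))
    return sub_multiple(left, g, 1 / gc, tuple(a - b for a, b in zip(gamma, gm)))

def naive_groebner(F):
    G = [{m: Fraction(c) for m, c in f.items()} for f in F]
    while True:
        old = list(G)
        for i in range(len(old)):
            for j in range(i + 1, len(old)):
                s = remainder(s_polynomial(old[i], old[j]), old)
                if s and s not in G:
                    G.append(s)
        if len(G) == len(old):
            return G

def eliminate(G, p):
    """G_p: the elements of G containing none of the first p variables."""
    return [g for g in G if all(m[:p] == (0,) * p for m in g)]

def evaluate(p, point):
    return sum(c * prod(v ** e for v, e in zip(point, m)) for m, c in p.items())

# variables ordered (t, x, y, z); generators x - t, y - t^2, z - t^3
F = [{(0, 1, 0, 0): 1, (1, 0, 0, 0): -1},
     {(0, 0, 1, 0): 1, (2, 0, 0, 0): -1},
     {(0, 0, 0, 1): 1, (3, 0, 0, 0): -1}]
G = naive_groebner(F)
G1 = eliminate(G, 1)              # generators of the first elimination ideal
print(G1)
```

The polynomials in `G1` involve only $x, y, z$ and vanish on every point $(t, t^2, t^3)$; since no reduction is performed, the list is a Gröbner basis of the elimination ideal but not the reduced one.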
When the parametrization is rational, i.e.
$$x_i = \frac{f_i(t_1,\dots,t_m)}{g_i(t_1,\dots,t_m)},$$
it is perhaps natural to think of substituting the ideal
$$\langle x_1g_1 - f_1,\dots,x_ng_n - f_n\rangle$$
into the above theorem. However, the result of this is usually not good. For instance, consider
$$x = \frac{u^2}{v}, \qquad y = \frac{v^2}{u}, \qquad z = u.$$
Here, $I = \langle vx - u^2,\; uy - v^2,\; z - u\rangle$, and the elimination yields $I_2 = \langle z(x^2y - z^3)\rangle$. However, the correct result is $V(x^2y - z^3)$. The computation has added an entire plane.
The problem is that the entire variety of zero points of the denominators in the parametrizations of individual variables is included in $W = V(g_1 \cdots g_n)$. Instead, perceive the parametrization $F$ as a mapping $F : (\mathbb{K}^m \setminus W) \to \mathbb{K}^n$. For the implicit situation, use the ideal
$$I = \langle g_1x_1 - f_1,\dots,g_nx_n - f_n,\; 1 - g_1 \cdots g_n\, y\rangle \subseteq \mathbb{K}[y,t_1,\dots,t_m,x_1,\dots,x_n],$$
where the additional variable $y$ enables avoidance of the zero spaces of the denominators. It can be shown that $V(I_{m+1})$ is the minimal affine variety which contains $F(\mathbb{K}^m \setminus W)$.
Key to the exercises
12.A.2. A' A C
12.A.3. (A NAND (B NAND B)).
12.A.8. E.g., {1,2,3,12,18}.
12.A.10. There are six isomorphism classes. In three of them ("Y, dual Y, and pentagon"), there is an element incomparable with two other ones, yielding 5! partial orders each. In the other three ("X, house, and dual house"), there are two pairs of different incomparable elements, thus yielding only 5!/4 partial orders each. Altogether, there are 3 · 5! + 3 · 5!/4 = 450 partial orders.
12.B.7. 31.
12.B.8. 45.
12.B.9. 63.
12.B.10. 33.
12.C.1. Suppose the contrary, i. e., that the polynomial is a product of two polynomials with integer coefficients. Use induction to prove that all the coefficients of one of these polynomials are divisible by p (begin with the absolute term). However, then the leading coefficient of f(x) is also divisible by p.
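12.C.12. Over $\mathbb{R}$: $(x - 1)(2x^2 - x + 1)^2$, over $\mathbb{C}$: $4(x - 1)\bigl(x - \tfrac{1 + i\sqrt7}{4}\bigr)^2\bigl(x - \tfrac{1 - i\sqrt7}{4}\bigr)^2$.
12.C.13. Over $\mathbb{R}$: $(x + 1)(x^2 + x + 2)^2$, over $\mathbb{C}$: $(x + 1)\bigl(x + \tfrac{1 + i\sqrt7}{2}\bigr)^2\bigl(x + \tfrac{1 - i\sqrt7}{2}\bigr)^2$.
12.C.14. Over $\mathbb{R}$: $(x^2 - 2x + 3)^2$, over $\mathbb{C}$: $(x - 1 + \sqrt2\, i)^2(x - 1 - \sqrt2\, i)^2$.
12.C.15. $x^5 + x^2 + 2x + 1 = (x^2 + 1)(x^3 + 2x + 1)$.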
12.C.16. x4 + 2x3 + 2 is irreducible. It has no roots and it cannot be written as a product of two quadratic polynomials (this must be verified!).
12.E.3. i) not even a groupoid (the operation is not closed on the given set), ii) a non-commutative monoid, iii) a commutative group, iv) a non-commutative group, v) a commutative group, vi) a commutative monoid.
12.E.4. Since multiplication of complex numbers is commutative, it must remain such on any subset as well. The particular cases are:
i) a monoid,
ii) a group,
iii) a group.
12.E.5.
i) a commutative semigroup, but not a monoid;
ii) a non-commutative groupoid, but not a semigroup;
iii) a non-commutative groupoid, but not a semigroup;
iv) a non-commutative semigroup, but not a monoid;
v) a commutative monoid, but not a group;
vi) a commutative monoid, but not a group;
vii) a non-commutative group.
12.E.8. It is a non-commutative group.
12.F.1.
i) (1,3,5,7,2,4,6)
ii) (1, 3, 2) o (4, 6, 5), (1,4, 2, 5, 3, 6), (1, 5,2, 6, 3,4), (1, 6, 2,4, 3, 5)
iii) none exists
12.F.4. Due to the parity, no such permutation exists.
12.F.14.
i) Yes
ii) No
iii) Yes
iv) No
v) Yes
vi) No
vii) No
viii) No
12.F.15.
i) Yes
ii) No
12.F.16. m = 10. (Note that for m = 8 and m = 12, the resulting groups have the desired number of elements but are not isomorphic to Z*.)
12.F.27. This is generally not true. Consider e. g. Sn/An ~ Z2, n > 3.
12.F.32. The subgroup has four elements; the remaining one is the reflection with respect to the plane that is perpendicular to the former
one and contains the axis of the rotation (it is isomorphic to the Klein group Z2 x Z2). It is not normal.
12.F.42.
i) An isomorphism.
ii) A homomorphism, neither surjective, nor injective.
iii) Not a homomorphism.
12.F.43.
i) A surjective homomorphism, not injective.
ii) A surjective homomorphism, not injective.
iii) A homomorphism, neither surjective, nor injective.
iv) Not a homomorphism.
v) A homomorphism, neither surjective, nor injective.
vi) A homomorphism, neither surjective, nor injective.
12.F.44.
i) A surjective homomorphism, not injective.
ii) Not a homomorphism.
iii) Not a homomorphism.
12.F.45.
i) An injective homomorphism, not surjective.
ii) Not a correct definition, since the result does not lie in the specified codomain A4.
12.F.46.
i) An injective homomorphism, not surjective.
ii) Not a homomorphism.
iii) Not a homomorphism.
iv) Not a homomorphism.
12.F.47.
i) A homomorphism, neither injective, nor surjective.
ii) Not a mapping.
iii) A surjective homomorphism, not injective.
12.G.5. 477368.
12.G.6. 197216213.
12.G.7. 7.
12.H.2.
$$G = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad H = [\text{matrix illegible in the source}].$$
12.H.7. 110.
CHAPTER 13
Combinatorial methods, graphs, and algorithms
Do we often prefer thinking in pictures?
- yes, but we can compute discrete things only...
A. Fundamental concepts
One of the motives for creating graph theory was the visualization of certain problems concerning relations. The human brain likes thinking about entities it can imagine. Therefore, we like representing a binary relation with a graph whose vertices correspond to the elements and whose edges (lines between the elements) correspond to the fact that the given pair is related. Optionally, we can encode a relation in a more complicated way, using a Hasse diagram (see 12.1.8), for instance. Partially ordered sets are almost always depicted this way. The relation of friendship or acquaintance between people can also be translated to graphs. This gives rise to a good deal of "relaxing" problems.
13.A.1. In a dormitory, there is a party held every night. Every time, the organizer of the party invites all of his/her acquaintances so that at the end of the party, all of the guests know each other. Suppose that each member of the dormitory has organized a party at least once, yet there are still two students who do not know each other. Show that they will not meet at the next party.
Solution. Consider the acquaintance graph of the students at the beginning (the vertices correspond to the students, and the edges to the acquaintances). We are going to show that if two students lie in the same connected component of this graph (i.e., there exists a chain of acquaintances beginning with one of the considered students and ending with the other one), see 13.1.10, then they will know each other as soon as each member of the dormitory has held a party. Indeed, consider the shortest path (acquaintance chain) between two students that lie in the same connected component. Every time someone from the interior of this path organizes a party, the path becomes shorter by one (the organizer falls out). Since we assume that each of the students on the path has organized a party, the students at the ends of the path must know each other as well. Therefore, if there are two students who do not know each other even after everyone has held a party, then they lie in different connected components of the graph, so they will never meet at a party (in particular, not at the upcoming one). □
In this chapter, we return to problems concerning properties or mutual relations of (mainly) finite sets of objects. Combinatorial problems were already introduced in the second and third parts of chapter one.
Like number theory, combinatorics is a field of mathematics where the problems can often be formulated very easily. On the other hand, solutions can be much more difficult to find.
We begin with graph theory, and display a collection of useful algorithms based on this theory. At the end of the chapter, methods of combinatorial computations are considered.
1. Elements of Graph theory
13.1.1. Two examples. Several people come to a party; some pairs of people know each other, while other people know nobody. (Acquaintance is assumed to be symmetric.) How many people must be there in order to guarantee that there are either three people who all know each other, or there are three people with no mutual acquaintance?
Such situations can be aptly illustrated by a diagram. The points (or vertices) stand for the particular people of the party, the full lines represent pairs who know one another, while the dashed lines stand for pairs who do not know one another. Note that every pair of vertices is connected by either a full or a dashed line. The question is now reformulated as: how many vertices must be there in order that either there is a triangle whose sides are all full or a triangle whose sides are all dashed?
There is no such triangle in the left-hand diagram with four vertices. The example of a regular pentagon, in which all its outside edges are full, while all its diagonals are dashed (draw a picture!), shows that at least six vertices are required.
Such a triangle always exists if the number of vertices is at least six. To show this, consider a set of six vertices, each pair of which is joined by either a dashed line or a full line.
Now, we are going to practice the fundamental concepts of graph theory on simple combinatorial problems.
13.A.2. Determine the number of edges of each of the graphs $K_6$, $K_{5,6}$, $C_8$.
Solution. The complete graph $K_6$ on 6 vertices has $\binom{6}{2} = 15$ edges. The complete bipartite graph $K_{5,6}$ (see 13.1.3) has $5 \cdot 6 = 30$ edges. Finally, the cycle graph $C_8$ has 8 edges.
□
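These counts can be double-checked by building the edge sets explicitly. The sketch below labels vertices by integers and uses helper names (`complete_graph`, `complete_bipartite`, `cycle`) that are chosen here for illustration only.

```python
from itertools import combinations

def complete_graph(n):
    """Edges of K_n on the vertices 0, ..., n-1."""
    return set(combinations(range(n), 2))

def complete_bipartite(m, n):
    """Edges of K_{m,n}; the parts are {0..m-1} and {m..m+n-1}."""
    return {(i, m + j) for i in range(m) for j in range(n)}

def cycle(n):
    """Edges of the cycle C_n for n >= 3."""
    return {tuple(sorted((i, (i + 1) % n))) for i in range(n)}

print(len(complete_graph(6)), len(complete_bipartite(5, 6)), len(cycle(8)))
```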
13.A.3. Degree sequence. Verify whether each of the following sequences is the degree sequence (see 13.1.7) of some graph. If so, draw one of the corresponding graphs.
i) (1,2,3,4,5,6,7,8,9),
ii) (1,1,1,2,2,3,4,5,5).
Solution. First of all, we should check the necessary condition from (1). In the former case, we have $1 + \dots + 9 = \frac{1}{2} \cdot 9 \cdot 10 = 45$, which is odd, so the condition is not satisfied. Therefore, the first sequence does not correspond to any graph.
As for the latter sequence, the sum of the wanted degrees equals 24, so the necessary condition is satisfied. Now, we proceed by the Havel-Hakimi theorem from subsection 13.1.7.
If $v$ is one of the vertices, then it is joined by five outgoing lines.
At least three of these lines are of one type, without loss of generality, full, joining $v$ to vertices $v_a$, $v_b$, $v_c$. Then either the triangle formed by the vertices $v_a$, $v_b$, $v_c$ contains only dashed lines, which is then the desired triangle, or one of its edges is full, in which case there is a full triangle.
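The claim can also be confirmed by exhaustive search, since six vertices admit only $2^{15}$ colourings of the connecting lines. The following sketch (with hypothetical function names) encodes a full line as 1 and a dashed line as 0.

```python
from itertools import combinations, product

def has_mono_triangle(n, colouring):
    """colouring maps each pair (i, j), i < j, to 1 (full) or 0 (dashed)."""
    return any(colouring[(a, b)] == colouring[(a, c)] == colouring[(b, c)]
               for a, b, c in combinations(range(n), 3))

def every_colouring_has_triangle(n):
    pairs = list(combinations(range(n), 2))
    return all(has_mono_triangle(n, dict(zip(pairs, bits)))
               for bits in product((0, 1), repeat=len(pairs)))

# the pentagon: outside edges full, diagonals dashed
pentagon = {(i, j): int((j - i) % 5 in (1, 4)) for i, j in combinations(range(5), 2)}

print(has_mono_triangle(5, pentagon))         # no triangle in the pentagon example
print(every_colouring_has_triangle(6))        # six vertices always suffice
```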
As another example, consider a black box which consumes one bit after another and shines in blue or in red according to whether the last bit is zero or one. Imagine this could be a light over the toilet door recognizing whether the last person came out (0) or in (1). Again, this scheme can be illustrated by a diagram:
The third vertex which has only two outgoing arrows represents the beginning of the system (before the first bit is sent).
Both situations share the same scheme: there is a finite set of objects represented by vertices. There is a set of their properties represented by connecting lines between particular vertices. The scheme can be modified by distinguishing the directions of the connecting lines by arrows.
Such a situation can be described in terms of relations; see the text from subsection 1.6.1 on in the sixth part of chapter one. But this is a complicated terminology for describing simple situations: In the first case, there is one set of people with two complementary symmetric and non-reflexive relations. In the second case, there are two antisymmetric relations on three elements.
13.1.2. Fundamental concepts of graphs. We use the terminology which corresponds to the latter diagrams.
Graphs and directed graphs
Definition. A graph (also an undirected graph) is a pair $G = (V,E)$, where $V$ is the set of its vertices and $E$ is a subset of the set $\binom{V}{2}$ of all 2-element subsets of $V$.
The elements of $E$ are called edges of the graph. The vertices of an edge $e = \{v,w\}$, $v \ne w$, are called the endpoints of $e$. An endpoint of an edge is said to be incident to that edge. Two edges which share a vertex are called adjacent. Any two vertices which are the endpoints of an edge are called adjacent.
(1,1,1,2,2,3,4,5, 5) <—► (1,1,1,1,1,2,3,4) <—► <—► (1,1,1,0, 0,1,2) <—► (0, 0,1,1,1,1, 2) <—► <—► (0, 0,1,1, 0, 0) <—► (0, 0, 0, 0,1,1) <—► «—►(0,0,0,0, 0).
Of course, it was not necessary to execute the procedure to the very end. We could have finished as soon as we saw that the obtained sequence indeed is the degree sequence of some graph. Now, we construct the corresponding graph "backwards" (however, we must take care to always add edges to vertices of appropriate degrees- it is this place where we have the option and can obtain non-isomorphic graphs with the same degree sequence). One of the possible outcomes is the following graph (the order in which each vertex was selected is written inside it):
□
13.A.4. Find the number of pairwise non-isomorphic complete bipartite graphs with 1001 edges.
Solution. A complete bipartite graph $K_{m,n}$ has $m \cdot n$ edges. Therefore, the problem can be stated as follows: In how many ways can we write the integer 1001 as the product of two positive integers, disregarding the order of the factors? Since $1001 = 7 \cdot 11 \cdot 13$, we get $1001 = 1 \cdot 1001 = 7 \cdot (11 \cdot 13) = 11 \cdot (7 \cdot 13) = 13 \cdot (7 \cdot 11)$.
Thus, there are four non-isomorphic complete bipartite graphs having 1001 edges:
$K_{1,1001}$, $K_{7,143}$, $K_{11,91}$, and $K_{13,77}$. □
13.A.5. Find the number of graph homomorphisms (see 13.1.4)
a) from P2 to K5,
b) from K3 to K5.
A directed graph is a pair G = (V,E), where V is as above, but now, E ⊆ V × V. The first of the vertices that define an edge e = (v, w) is called the tail of the edge and the other vertex is called its head. From the vertices' point of view, e is an outgoing edge of v and an ingoing edge of w. The directed edges are also called arcs or arrows. The head and the tail of a directed edge may be the same vertex; such an edge is called a loop.
Two directed edges are called consecutive if the tail of one of them is the head of the other one. Similarly, two vertices which are the head and the tail of an edge are called consecutive.
To every directed graph G = (V, E), its symmetrization can be assigned. This is an undirected graph with the same set of vertices as G. It contains an edge e = {v, w} if and only if at least one of the edges e' = (v, w) and e" = (w, v) belongs to E.
Graph theory provides an extraordinarily good language for thinking about procedures and deriving properties that concern finite sets of objects. Graphs are a good example of a compromise between the tendency to "think in diagrams" and precise mathematical formulations.
The language of graph theory allows the adding of information about the vertices or edges in particular problems. For instance, the vertices of a graph can be "coloured" according to membership of the corresponding objects to several (pairwise disjoint) classes. Or the edges can be labeled with several values, and so on. The existence of an edge between differently coloured vertices can indicate a "conflict". For example, if the vertices are coloured red and blue according to membership to two groups of people with different interests and the edges represent adjacency at a dining table, then an edge connecting two differently coloured vertices can mean a potential conflict. Our first example from the previous subsection can thus be perceived as a graph with coloured edges. The statement we have checked there reads as follows in the language of graph theory:
A graph $K_n = (V, \binom{V}{2})$ with n ≥ 6 vertices and all possible edges, which are labeled with two colours, always contains a triangle whose sides are of the same colour.
The directed graph in the second example above, whose edges are labeled with zero or one, represents a simple finite automaton. This name reflects the idea that the graph describes a process which is, at any moment, in a state represented by the corresponding vertex. It changes to another state, in a step represented by one of the outgoing edges of that vertex. The theory of finite automata is not considered here.
13.1.3. Examples of useful graphs. The simplest case of graphs are those which contain no edges. There is no special notation for them. At the other extreme is a graph which contains all possible edges. This is called a complete graph, denoted by Kn, where n is the number of vertices of the graph. The graphs K4 and
Solution. We can see from the definition of the graph homomorphism that the only condition which must be satisfied is that adjacent vertices must not be mapped to the same vertex.
a) $5 \cdot 4 \cdot 4 = 80$.
b) $5 \cdot 4 \cdot 3 = 60$. □
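The two counts can be cross-checked by brute force. The following is a minimal sketch, not part of the original solution: the function name `count_homomorphisms` and the labeling of the vertices by integers are assumptions made for illustration. It tries every mapping of vertices and keeps those that send each edge to an edge.

```python
from itertools import product

def count_homomorphisms(src_vertices, src_edges, dst_vertices, dst_edges):
    """Count vertex maps sending every edge of the source graph to an edge of the target."""
    dst = {frozenset(e) for e in dst_edges}
    count = 0
    for images in product(dst_vertices, repeat=len(src_vertices)):
        f = dict(zip(src_vertices, images))
        # If f(u) == f(v), the frozenset has one element and is never an edge.
        if all(frozenset((f[u], f[v])) in dst for u, v in src_edges):
            count += 1
    return count

# K5 on the vertices 0..4 with all possible edges.
k5_vertices = range(5)
k5_edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]

# P2 is the path with three vertices and two edges; K3 is the triangle.
p2_count = count_homomorphisms([0, 1, 2], [(0, 1), (1, 2)], k5_vertices, k5_edges)
k3_count = count_homomorphisms([0, 1, 2], [(0, 1), (1, 2), (0, 2)], k5_vertices, k5_edges)
print(p2_count, k3_count)
```

Exhaustive search is only feasible for very small graphs, but it is a convenient way to check a counting argument.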
13.A.6. Number of walks. Using the adjacency matrix (see 13.1.8), find the number of walks of length 4 from vertex 1 to vertex 2 in the following graph:
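Solution. The adjacency matrix of the given graph is
$$A_G = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 & 0 \end{pmatrix}.$$
The number of walks of length 4 from vertex 1 to vertex 2 is the element at [1, 2] in the matrix $A_G^4$. Since
$$A_G^2 = \begin{pmatrix} 2 & 1 & 1 & 2 & 2 \\ 1 & 4 & 3 & 2 & 2 \\ 1 & 3 & 4 & 2 & 2 \\ 2 & 2 & 2 & 3 & 2 \\ 2 & 2 & 2 & 2 & 3 \end{pmatrix},$$
we have $(A_G^4)_{1,2} = (2,1,1,2,2) \cdot (1,4,3,2,2)^T = 17$. Therefore, there are 17 walks of length 4 between the vertices 1 and 2. □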
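The computation can be repeated mechanically. The sketch below assumes that the adjacency matrix of the graph is the 5-by-5 matrix `A` written out in the code; this is the only symmetric 0/1 matrix with zero diagonal consistent with the row (2,1,1,2,2) and the column (1,4,3,2,2) of the square quoted in the solution. The helper names `mat_mul` and `mat_pow` are illustrative.

```python
def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(a, k):
    """k-th power of a square matrix by repeated multiplication (k >= 0)."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, a)
    return result

# Assumed adjacency matrix of the graph from the exercise (vertices 1..5).
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 1, 0],
]

A2 = mat_pow(A, 2)
A4 = mat_pow(A, 4)
walks_1_to_2 = A4[0][1]  # entry [1, 2] in the 1-based notation of the text
print(A2[0], walks_1_to_2)
```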
13.A.7. A cut edge (also called bridge) in a graph is such an edge that its removal increases the number of connected components of the graph. Similarly, a cut vertex (also called articulation point) is a vertex with this property, i. e., when removed (with the edges incident to it, of course), the graph splits up into more connected components.
Find all cut edges and vertices of the following graph:
K6 are presented in the introductory subsection. The graph K3 is called a triangle.
An important type of graph is a path. This is a graph whose vertices are ordered as $(v_0, \dots, v_n)$ so that $E = \{e_1, \dots, e_n\}$, where $e_i = \{v_{i-1}, v_i\}$ for all $i = 1, \dots, n$. A path graph of length n is denoted by $P_n$. If the first and last vertices coincide for the path graph ($n \ge 3$), it is called a cycle graph of length n, denoted by $C_n$. The graphs $K_3 = C_3$, $C_5$, and $P_5$ are shown in the following diagram.
Another type of graph is the complete bipartite graph. Its vertices can be coloured with two (distinct) colours. All possible edges between vertices of different colours are present, but no other edges. Such a graph is denoted by $K_{m,n}$, where m and n are the numbers of vertices of particular colours. The diagram below illustrates the graphs $K_{1,3}$, $K_{2,3}$, and $K_{3,3}$.
Another interesting example of a graph is the hypercube $H_n$ in dimension n, whose vertices are the integers $0, \dots, 2^n - 1$ and whose edges join those pairs of vertices whose binary expansions differ by exactly one bit. The following diagram depicts the hypercube $H_4$, with labels of the vertices indicated.
From the definition it follows that a hypercube of given dimension can always be composed from two hypercubes of dimension one lower, connecting them with edges in an appropriate way. These new edges between the two disjoint copies of H3 are the dashed ones in the diagram. Obviously, the hypercube H4 can be similarly decomposed in several ways (just looking at one fixed bit position, as is done with the very first position in the diagram).
[Figure: the hypercube H4 with vertices labeled by 4-bit binary strings from 0000 to 1111.]
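The definition of the hypercube translates directly into code. The following sketch (the function name `hypercube_edges` is an assumption for illustration) builds the edge set of $H_n$ by flipping single bits and then splits $H_4$ according to the highest bit into two copies of $H_3$ and the edges connecting them.

```python
def hypercube_edges(n):
    """Edges of H_n: pairs of integers in 0..2^n - 1 whose binary expansions differ in one bit."""
    edges = []
    for v in range(2 ** n):
        for bit in range(n):
            w = v ^ (1 << bit)  # flip exactly one bit
            if v < w:           # record each undirected edge once
                edges.append((v, w))
    return edges

h4 = hypercube_edges(4)

# Decompose H4 by the highest bit (value 8).
top = 1 << 3
connecting = [(v, w) for v, w in h4 if v ^ w == top]
lower_copy = [(v, w) for v, w in h4 if v ^ w != top and v < top]
upper_copy = [(v, w) for v, w in h4 if v ^ w != top and v >= top]
print(len(h4), len(connecting), len(lower_copy), len(upper_copy))
```

Replacing `top` with any other single-bit mask gives one of the other decompositions mentioned in the text.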
Here are two more examples. The first is the cycle ladder graph CLn with 2n vertices. This consists of two cycle graphs Cn whose vertices are connected by edges according to their order in the cycles. The second is the Petersen graph.
13.A.8. Prove that a Hamiltonian graph (see 13.1.13) must be 2-vertex-connected. Give an example of a graph which is 2-vertex-connected yet does not contain a Hamiltonian cycle.
Solution. Considering any pair of vertices in a Hamiltonian graph, there are two disjoint (except for the two vertices) paths between them (the "arcs" of the Hamiltonian cycle). Therefore, if we remove one of the vertices, the graph clearly remains connected (the vertex to be removed lies on one of the two paths only). As for the example of a non-Hamiltonian graph which is 2-vertex-connected, we can recall the Petersen graph (see the picture at the beginning of this chapter). □
13.A.9. Determine the number of cycles (see 13.1.3) in the graph K5.
Solution. We sort the cycles by their lengths, i. e., we count separately the numbers of cycles upon three, four, and five vertices. A cycle of length three is determined uniquely by its three vertices, and there are $\binom{5}{3}$ ways to choose them. A cycle of length four is determined by its vertices (which can be chosen in $\binom{5}{4}$ ways) and the pair of neighbors of a fixed vertex (the pair can be selected from the remaining three vertices in $\binom{3}{2}$ ways). Finally, a cycle of length five is given by the pair of neighbors of a fixed vertex (selected in $\binom{4}{2}$ ways) as well as the other neighbor (from the two remaining) of a fixed vertex of this pair. Altogether, there are
$$\binom{5}{3} + \binom{5}{4}\binom{3}{2} + \binom{4}{2} \cdot 2 = 10 + 15 + 12 = 37$$
cycles. □
13.A.10. Determine the number of subgraphs (see 13.1.4) of the graph K5.
Solution. Again, we count the number of subgraphs separately by the number v of their vertices:
• v = 0. There is a unique graph on 0 vertices, the empty graph.
• v = 1. There are 5 ways of selecting 1 vertex, resulting in 5 subgraphs.
• v = 2. Two vertices can be chosen in $\binom{5}{2}$ ways. Further, there may or may not be an edge between them. Altogether, we get $\binom{5}{2} \cdot 2$ such subgraphs.
This is somewhat similar to CL5, yet it is actually the simplest counterexample for many propositions about graphs.
13.1.4. Morphisms of graphs. Mappings between the sets of vertices or edges which respect the considered structure are of great importance in graph theory. It is enough to consider mappings between the vertices only.
Morphisms of graphs
Definition. Let G = (V,E) and G' = (V',E') be two given graphs. A morphism (or homomorphism) f : G → G' is a mapping $f_V : V \to V'$ between the sets of vertices such that if e = {v, w} is an edge in E, then $e' = \{f_V(v), f_V(w)\}$ is an edge in E'.
In practice, there is no need to distinguish between the morphism f and the mapping $f_V$.
The definition is the same for directed graphs, using ordered pairs e = (v, w) as edges.
In the case of undirected graphs, the definition implies that if f(v) = f(w) for distinct v, w ∈ V, then they are not connected by an edge. On the other hand, such an edge is admissible for directed graphs provided the common image of the vertices has a loop.
An important special case of a morphism of a graph G is one whose codomain is Km. Such a morphism is equivalent to a labeling of the vertices of G with m names of the vertices of Km so that vertices of one colour are not adjacent. In this case, it is a (vertex) colouring of the graph G with m colours.
If a morphism f : G → G' is a bijection between the sets of vertices such that the inverse mapping $f^{-1}$ is also a morphism, then f is called an isomorphism of graphs. Two graphs are isomorphic if they differ only in the labeling of the vertices.
Every morphism of directed graphs is also a morphism of their symmetrizations. The converse is not true in general.
There are simple and extraordinarily useful examples of graph morphisms: namely a path, a walk, and a cycle in a graph:
• v = 3. Three vertices can be selected in $\binom{5}{3}$ ways. For each pair of them, there may or may not be an edge. This results in $\binom{5}{3} \cdot 2^{\binom{3}{2}}$ subgraphs.
• v = 4. Here, we calculate $\binom{5}{4} \cdot 2^{\binom{4}{2}}$ subgraphs.
• v = 5. Finally, in this case, there are $\binom{5}{5} \cdot 2^{\binom{5}{2}}$ subgraphs.
Altogether, we have found 1 + 5 + 20 + 80 + 320 + 1024 = 1450 subgraphs of the graph K5.
□
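The case-by-case count in 13.A.10 can be summed mechanically. The sketch below is only a cross-check of the arithmetic: for each number v of vertices it multiplies the number of ways to choose the vertices by the number of subsets of the edges among them. The variable names are illustrative.

```python
from math import comb

# For each number v of chosen vertices of K5: comb(5, v) ways to pick the vertices,
# and each of the comb(v, 2) possible edges among them is either present or absent.
terms = [comb(5, v) * 2 ** comb(v, 2) for v in range(6)]
total_subgraphs = sum(terms)
print(terms, total_subgraphs)
```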
13.A.11. Determine the number of paths between a fixed pair of different vertices in the graph K7.
Solution. We sort the paths by their lengths. There is a unique path of length one (it consists of the edge that connects the selected vertices). There are five paths of length two (it may lead through any of the remaining vertices). There are $5 \cdot 4$ paths of length three (we select the two vertices through which it leads, and their order matters). Similarly, there are $5 \cdot 4 \cdot 3$ paths of length four, $5 \cdot 4 \cdot 3 \cdot 2$ paths of length five, and also 5! paths of length six. Clearly, there are no longer paths in K7. Altogether, we have $1 + 5 + 5 \cdot 4 + 5 \cdot 4 \cdot 3 + 5! + 5! = 326$ paths. □ At the end of this subsection, we present one more amusing problem.
13.A.12. The towns of a certain country are connected with roads. Each town is directly connected to exactly three other towns. Prove that there exists a town from which we can make a sightseeing tour such that the number of roads we use is not divisible by three.
Solution. First of all, we reformulate this problem in the language of graph theory. Our task is to prove that every 3-regular graph (i. e., such that the degree of every vertex equals three) contains a cycle whose length is not divisible by three. We will proceed by induction, and actually, we will prove a stronger proposition: every graph each of whose vertices has degree at least three contains a cycle whose length is not divisible by three. In fact, the original proposition could not be proved by induction since the induction hypothesis would be too weak. The induction will be carried out on the number k of vertices of the graph. Clearly, the statement holds for k = 4. Now, consider a graph where the degree of each vertex is at least three and suppose that the statement is true for any such graph on fewer vertices. The reader should be able to prove that there exists a cycle in the graph. If its length is not divisible by three, we are done. Thus, suppose from now on that the cycle is $C = v_1 v_2 \cdots v_{3n}$. Each vertex of this cycle is connected to at
Walks, Paths, Trails, and Cycles
A walk of length n in a graph G is a morphism s : Pn → G. Both vertices and edges may repeat in the image.
A trail is a walk in which vertices are allowed to repeat, but edges are not.
A path of length n in a graph G is any morphism p : Pn → G such that p is an injective mapping. The images of the vertices $v_0, \dots, v_n$ from Pn are pairwise distinct.
A cycle of length n in a graph G is any morphism c : Cn → G such that c is an injective mapping of the vertices.
For simplicity, the morphism is often identified with its image. Walks are often written explicitly in the form $(v_0, e_1, v_1, \dots, e_n, v_n)$, where $e_i = \{v_{i-1}, v_i\}$ for $i = 1, \dots, n$.
A walk can be thought of as the trajectory of a "pilgrim" moving from the vertex $f(v_0)$ to the vertex $f(v_n)$, not stopping at any vertex of an (undirected) graph. Pn always contains an edge connecting the adjacent vertices $v_{i-1}$ and $v_i$, while loops are not admitted in undirected graphs. The pilgrim can enter a vertex more than once or even go along an edge already visited. The pilgrim making a "trail" is a little wiser: he does not go along an edge already visited for the second time on his walk from the initial vertex $f(v_0)$ to the terminal vertex $f(v_n)$.
13.1.5. Subgraphs. The images of paths, walks, and cycles are examples of subgraphs, but not in the same way.
Subgraphs
Definition. A graph G' = (V', E') is a subgraph of a graph G = (V, E) if and only if V' ⊆ V and E' ⊆ E.
Consider a graph G = (V,E). Choose a subset V' ⊆ V. The largest subgraph (with respect to the number of edges) with V' as its set of vertices is called an induced subgraph. It is the graph G' = (V', E'), where an edge e ∈ E belongs to E' if and only if both of its endpoints lie in V'. Therefore, the set E' of edges of G' is given as the intersection $E \cap \binom{V'}{2}$.
A spanning subgraph (also factor) of a graph G = (V,E) is any graph G' = (V', E') with V' = V. Hence G' has the same vertex set as G, but the set of edges may be arbitrary. A clique is a subgraph of the graph G which is isomorphic to a complete graph.
Every subgraph can be constructed by a step-by-step application of these two cases: first, select V' ⊆ V, then choose the target edge set E' in the subgraph induced on V'.
Every image of a homomorphism (vertices as well as edges) forms a subgraph.
least one more vertex (different from its neighbors on the cycle) in the graph. If there is a vertex $v_i$ on the cycle that is connected to a vertex $v_j$ on the cycle ($j > i + 1$), then the lengths of the cycles $v_1 v_2 \cdots v_i v_j v_{j+1} \cdots v_{3n}$ and $v_i v_{i+1} \cdots v_j$ total up to 3n + 2, so the length of at least one of them is not divisible by three, as wanted. The situation is similar if there are two vertices $v_i$ and $v_j$, $1 \le i < j \le 3n$, which are connected to the same vertex outside the cycle.
Therefore, suppose that each vertex of the cycle is connected to some vertices outside C and no two vertices of the cycle are adjacent to the same vertex outside. Then, we can consider the graph which is obtained from the original one by replacing the vertices v1,v2,..., v3n with a single vertex V.
In this new graph, there are also at least three edges leading from each vertex (including V), so we can apply the induction hypothesis to it. Therefore, there is a cycle $w_1 w_2 \cdots w_k$ where $3 \nmid k$. If it does not contain the vertex V, then it is a cycle in the original graph as well. If it does, then we proceed analogously as above: we consider two cycles whose lengths sum up to 3n + 2k, so the length of at least one of them is not divisible by three. We have found the wanted cycle in every case, which finishes the proof. □
B. Fundamental algorithms
Let us begin with breadth-first search and depth-first search, which serve as a basis for more sophisticated algorithms. Their actual implementations may differ; therefore, the answers to the following problems may be ambiguous.
13.B.1. Consider a graph on six vertices which are labeled 1, 2,..., 6. A pair of vertices is connected with an edge if and only if the sum of their labels is odd. Describe the run of the breadth-first search algorithm on this graph. Which of
13.1.6. How many non-isomorphic graphs are there? It is easy to draw all graphs (up to isomorphism) with a predetermined small number of vertices (three or four for instance). Generally this is a complicated combinatorial problem. It is often difficult to decide whether or not two given graphs are isomorphic.
Remark. This problem, known as the Graph isomorphism problem, is a somewhat peculiar member of the class NP¹: it is known neither whether it is NP-complete nor whether it can be solved in polynomial time. This is a special case of the problem of deciding whether or not a given graph is isomorphic to a subgraph of another graph. This Subgraph isomorphism problem is known to be NP-complete.
It is difficult to answer precisely the question at the beginning of this subsection. There are the same number of graphs on a given set of n vertices as the number of subsets of the set $\binom{V}{2}$ of all possible edges. A k-element set has $2^k$ subsets. There are at most n! graphs isomorphic to a given one, since this is the number of bijections between n-element sets. It follows that there are at least
$$k(n) = \frac{2^{\binom{n}{2}}}{n!}$$
pairwise non-isomorphic graphs. From this,
$$\log_2 k(n) = \binom{n}{2} - \log_2 n! \ge \frac{n^2}{2}\left(1 - \frac{1}{n} - \frac{2\log_2 n}{n}\right),$$
since $n! \le n^n$. The asymptotic estimate for large n,
$$\log_2 k(n) \ge \tfrac{1}{2} n^2 - O(n \log_2 n),$$
follows. (See the notation for asymptotic bounds from subsection 6.1.16 on page 391).
13.1.7. Vertex degree and degree sequence. It is relatively easy to verify that two given graphs are not isomorphic. Since isomorphic graphs differ in the relabeling of the vertices only, they share all numerical and other characteristics which are not changed by the relabeling. Simple data of this type includes, for instance, the number of edges incident to particular vertices.
¹Wikipedia, NP (complexity), http://en.wikipedia.org/wiki/NP_(complexity) (as of Aug. 7, 2013, 13:44 GMT).
the edges is visited at the end provided the search is initiated at vertex 5 and the neighbors of a given vertex are visited in ascending order?
Solution. The algorithm starts at vertex 5 and goes along the edges (5,2), (5,4), (5,6), thereby visiting the vertices 2, 4, 6 (the queue of vertices to be processed is 2,4,6). The first vertex to have been visited is 2, so the algorithm continues the search from there, i. e., vertex 5 is processed and vertex 2 becomes active. The algorithm goes along the edges (2,1), (2,3), (2,5) (the last one has already been used), thereby visiting the vertices 1 and 3 (the queue of vertices to be processed is 4,6,1,3). Now, vertex 2 becomes processed and the first unprocessed vertex to have been visited becomes active. That is vertex 4. The algorithm discovers the edges (4,1) and (4,3), yet no new vertices. Vertex 4 becomes processed and vertex 6 becomes active. This leads to discovery of the edges (6,1) and (6,3). If the algorithm knows the number of edges in the graph, it terminates at this moment. Otherwise, it goes through the vertices 1 and 3, finding out that there are no new edges or vertices, and then it terminates. In either case, the last edge to have been discovered is (3,6). □
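The run described in this solution can be reproduced with a short program. This is a sketch of one possible implementation, consistent with the conventions of the solution (neighbors in ascending order, each edge recorded when it is first traversed); the function name `bfs_edge_order` is an assumption, and other implementations of breadth-first search may report edges in a different order.

```python
from collections import deque

def bfs_edge_order(adj, start):
    """Return the edges in the order in which breadth-first search first goes along them."""
    seen = {start}      # vertices already placed in the queue
    used = set()        # edges already traversed
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()            # v becomes the active vertex
        for w in sorted(adj[v]):       # neighbors in ascending order
            edge = frozenset((v, w))
            if edge in used:
                continue
            used.add(edge)
            order.append((v, w))
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

# Vertices 1..6; two vertices are adjacent iff the sum of their labels is odd.
adj = {v: [w for w in range(1, 7) if (v + w) % 2 == 1] for v in range(1, 7)}
bfs_order = bfs_edge_order(adj, 5)
print(bfs_order)
```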
13.B.2.   Consider a graph on six vertices which are labeled
1, 2,..., 6. A pair of vertices is connected with an edge if and only if the sum of their labels is odd. Describe the run of the depth-first search algorithm on this graph. Which of the edges is visited at the end provided the search is initiated at vertex 5 and the neighbors of a given vertex are visited in ascending order?
Solution. The algorithm starts at vertex 5 and goes along the edges (5,2), (5,4), (5,6), thereby visiting the vertices 2, 4, 6 in this order (the stack of vertices to be processed is 6,4,2). Vertex 5 becomes processed and the last vertex to have been visited (i. e., vertex 6) becomes active. The algorithm goes along the edges (6,1) and (6,3) (the edge (6,5) has already been used), thereby visiting the vertices 1 and 3 (the stack of vertices to be processed is 3,1,4,2). Now, vertex 6 becomes processed and the last unprocessed vertex to have been visited becomes active. This is vertex 3. The algorithm discovers the edges (3,2) and (3,4), so the stack becomes 4,2,1,4,2. Vertex 3 becomes processed and vertex 4 becomes active. This leads to discovery of the edge (4,1), leaving the stack at 1,2,1,2. The algorithm continues the search from vertex 1, visiting the last edge (1,2). (Note: only unprocessed vertices are pushed into the stack.) □
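The stack-based run of this solution can be sketched as follows. The code is an illustration under the conventions used in the solution (all edges of the active vertex are examined in ascending order of the neighbor, and only unprocessed vertices are pushed); the name `dfs_edge_order` is hypothetical, and a recursive implementation would visit the edges in a different order.

```python
def dfs_edge_order(adj, start):
    """Return the edges in the order in which the stack-based search first goes along them."""
    processed = set()   # vertices whose edges have all been examined
    used = set()        # edges already traversed
    order = []
    stack = [start]
    while stack:
        v = stack.pop()                # the last visited vertex becomes active
        if v in processed:
            continue                   # stale copy of an already processed vertex
        processed.add(v)
        for w in sorted(adj[v]):       # neighbors in ascending order
            edge = frozenset((v, w))
            if edge in used:
                continue
            used.add(edge)
            order.append((v, w))
            if w not in processed:
                stack.append(w)        # only unprocessed vertices are pushed
    return order

# Vertices 1..6; two vertices are adjacent iff the sum of their labels is odd.
adj = {v: [w for w in range(1, 7) if (v + w) % 2 == 1] for v in range(1, 7)}
dfs_order = dfs_edge_order(adj, 5)
print(dfs_order)
```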
Vertex degree and degree sequence
The degree of a vertex v ∈ V in a graph G = (V,E) is the number of edges from E incident to v. It is denoted by deg v.
The degree sequence of a graph G with vertices $V = (v_1, \dots, v_n)$ is the sequence
$$(\deg v_1, \deg v_2, \dots, \deg v_n).$$
Sometimes, it is required that the sequence be sorted in ascending or descending order rather than correspond to the selected order of vertices.
In the case of directed graphs, distinguish between the indegree $\deg_+ v$ of a vertex v and its outdegree $\deg_- v$. A directed graph is said to be balanced if and only if $\deg_- v = \deg_+ v$ for all vertices v.
The degree sequence of a graph (and its isomorphic copies) is unique up to permutation. Therefore, if the degree sequences of two graphs differ not merely by permutation, then the graphs are not isomorphic. The converse statement is not true in general. Two non-isomorphic graphs with the same degree sequence (2,2,2,2,2,2) are the graph $G = C_3 \cup C_3$ and $C_6$. Indeed, $C_6$ contains a path of length 5, but $C_3 \cup C_3$ does not contain a path of length 5. Therefore, these two graphs cannot be isomorphic.
Since every edge has two endpoints, it is counted twice in the sum of the degree sequence (this fact is sometimes known as the handshaking lemma). It follows that
(1) $\sum_{v \in V} \deg v = 2|E|.$
In particular, the sum of the degree sequence must be even. The following theorem² of Havel and Hakimi is a first result about operations with graphs. The proof is constructive. It describes an algorithm for constructing a graph with a given degree sequence if there is one, or shows that there is no such graph.
Deciding about a given degree sequence
Theorem. For any natural numbers $0 \le d_1 \le \cdots \le d_n$, there exists a graph G on n vertices with the above values as its degree sequence if and only if there exists a graph on n − 1 vertices with degree sequence
$$(d_1, d_2, \dots, d_{n-d_n} - 1, d_{n-d_n+1} - 1, \dots, d_{n-1} - 1).$$
Proof. If there exists a graph G' on n — 1 vertices with degree sequence as stated in the theorem, then a new vertex vn can be added to G'. Connect vn with edges to the last dn vertices of G', thereby obtaining a graph G with the desired degree sequence.
²Proved independently by Václav J. Havel in 1955 in the Časopis pro pěstování matematiky (in Czech) and S. L. Hakimi in 1962 in the Journal of the Society for Industrial and Applied Mathematics.
Remark. If we had chosen the opposite edge priority, the edges would have been visited in the following order: (5,2),
(2,1), (1,4), (4,3), (3,2), (3,6), (6,1), (6,5), (4,5). Intuitively, the depth-first search can be perceived so that the algorithm examines the first undiscovered edge in each step.
13.B.3. Let the vertices of the graph K6 be labeled 1, 2,..., 6. Write the order of edges of K6 in which they are visited by the depth-first search algorithm, supposing the search is initiated from vertex 3 and the neighbors of a given vertex are visited in ascending order. ○
13.B.4. Let the vertices of the graph K6 be labeled 1, 2,..., 6. Write the order of edges of K6 in which they are visited by the breadth-first search algorithm, supposing the search is initiated from vertex 3 and the neighbors of a given vertex are visited in ascending order. ○
13.B.5. Apply Dijkstra's algorithm to find the shortest path from vertex number 9 to each of the other vertices.
[Figure: the weighted graph for exercise 13.B.5.]
○
13.B.6.   Give an example of
i) a graph on at least 4 vertices which does not contain a negative cycle, yet Dijkstra's algorithm fails on it;
ii) a graph on at least 4 vertices which contains a negative edge, yet Dijkstra's algorithm succeeds on it.
Solution. In both cases, we must be well aware how Dijkstra's algorithm works. Then, it is easy to find the wanted examples (apparently, there are many more possibilities). As for the first problem, we can consider the following graph (where S is the initial vertex):
The reverse implication is more difficult. The following needs to be proved. Suppose a fixed degree sequence $(d_1, \dots, d_n)$ with $0 \le d_1 \le \cdots \le d_n$ is given. Then there exists a graph whose vertex $v_n$ is adjacent to exactly the last $d_n$ vertices $v_{n-d_n}, \dots, v_{n-1}$.
The idea is simple - if any of the last dn vertices is not adjacent to vn, then vn must be adjacent to one of the prior vertices. The idea is to interchange the endpoints of two edges so that the vertices vn and vk become adjacent and the degree sequence remains unchanged.
Technically, this can be done as follows: Consider all graphs G with a given degree sequence and let, for each G, ν(G) denote the greatest index of a vertex which is not adjacent to the vertex $v_n$. Fix G to be such that ν(G) is as small as possible. Then, either $\nu(G) = n - d_n - 1$ (and the graph is obtained) or $\nu(G) \ge n - d_n$.
If the latter is true, then $v_n$ is adjacent to one of the vertices $v_i$, $i < \nu(G)$. Since $\deg v_{\nu(G)} \ge \deg v_i$, there exists a vertex $v_\ell$ which is adjacent to $v_{\nu(G)}$, but not to $v_i$. Replace the edge $\{v_\ell, v_{\nu(G)}\}$ with $\{v_\ell, v_i\}$ as well as $\{v_i, v_n\}$ with $\{v_{\nu(G)}, v_n\}$, to get a graph G' with the same degree sequence, but with ν(G') < ν(G), which contradicts the choice of G. (Draw a diagram!)
Therefore, the former possibility is true. So the graph is created by adding the last vertex and connecting it to the last dn vertices with edges. □
The procedure reveals that the degree sequence of a graph falls far short of determining the graph.
The theorem describes an exact procedure for constructing a graph with a given degree sequence. If there is no such graph, the algorithm so indicates during the computation. Begin with the degree sequence in (say) ascending order. Then delete the largest value d and subtract one from d remaining values on the very right. Then sort the obtained degree sequence and continue with the above step until either there is an example of a graph with the current degree sequence or the degree sequence does not correspond to any graph. If, eventually, a graph is constructed after a number of steps, then one can reverse the procedure, adding one vertex in each step, connected to those vertices where ones were subtracted during the procedure. (Try examples by yourself!) The algorithm constructs only one of the many graphs which share the same degree sequence.
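The procedure just described can be sketched in a few lines. The code below is an illustration, not the authors' implementation: the function name `havel_hakimi` and the convention of returning an edge list (or `None` when no graph exists) are assumptions. Instead of reducing first and reversing afterwards, it records the edges of the removed vertex at each step, which yields the same graph.

```python
def havel_hakimi(degrees):
    """Return an edge list realising the degree sequence, or None if there is no such graph.

    Vertices are numbered 0..n-1 in the order of the input sequence.
    """
    remaining = [[d, v] for v, d in enumerate(degrees)]
    edges = []
    while remaining:
        remaining.sort()               # ascending order of the remaining degrees
        d, v = remaining.pop()         # delete the largest value d
        if d > len(remaining):
            return None                # not enough vertices left to connect to
        for item in remaining[len(remaining) - d:]:
            if item[0] == 0:
                return None            # a degree would become negative
            item[0] -= 1               # subtract one from the d values on the right
            edges.append((v, item[1]))
    return edges

example = [1, 1, 1, 2, 2, 3, 4, 5, 5]
graph_edges = havel_hakimi(example)
print(graph_edges)
```

As the text notes, this produces only one of the possibly many graphs with the given degree sequence; which one depends on how ties are broken when sorting.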
13.1.8. Matrix representation. The efficiency of graph representations is of importance for running algorithms. One of them is useful in theoretical considerations:
If Dijkstra's algorithm is run from S, then it visits the vertex A and fixes its distance from S to 1. However, there is a shorter path, namely the path (S, B, C, A) of length 0. As for the second problem, consider the following:
Adjacency matrix
□
Bellman-Ford algorithm. This algorithm is based on the same principle as Dijkstra's. However, instead of going through particular vertices, it processes them "simultaneously": the relaxation loop (i. e., finding out whether the temporary distances of the vertices can be improved using a given edge) is iterated (|V| − 1) times over all edges. The advantage is that this approach works even with negative edges, and it is able to detect negative cycles (if another iteration of the relaxation loop leads to a change, then there must be a negative cycle in the graph). However, we pay for that with increased time complexity.
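The relaxation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: edges are directed triples (tail, head, length), the function name `bellman_ford` is hypothetical, and the small example graph with vertices S, A, B, C and its edge lengths is invented for the demonstration (it is not the graph of any exercise in this chapter).

```python
import math

def bellman_ford(vertices, edges, source):
    """Return (distances from source, whether a reachable negative cycle was detected)."""
    dist = {v: math.inf for v in vertices}
    dist[source] = 0
    for _ in range(len(vertices) - 1):      # at most |V| - 1 rounds of relaxation
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:       # the edge (u, v) improves the distance of v
                dist[v] = dist[u] + w
                changed = True
        if not changed:                     # nothing improved, we may stop early
            break
    # One more round: any further improvement reveals a negative cycle.
    has_negative_cycle = any(dist[u] + w < dist[v] for u, v, w in edges)
    return dist, has_negative_cycle

vertices = ["S", "A", "B", "C"]
edges = [("S", "A", 1), ("S", "B", 2), ("B", "C", -1), ("C", "A", -1)]
dist, neg = bellman_ford(vertices, edges, "S")

# Adding the edge (A, B) of length -1 closes the cycle B, C, A, B of total length -3.
_, neg_with_cycle = bellman_ford(vertices, edges + [("A", "B", -1)], "S")
print(dist, neg, neg_with_cycle)
```

The early exit when no distance changes mirrors the observation, made in the worked example of this section, that the algorithm can be terminated once an iteration leaves all values untouched.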
13.B.7. Use the Bellman-Ford algorithm to find the shortest paths from the vertex S to all other vertices. Assume that the edges are ordered by the number of the tail (or head) and the initial vertex is the least one. Then, change the value of the edge (8, 6) from 18 to −18, execute the algorithm on this new graph, and show the detection of negative cycles.
Solution. According to the conditions, the edges are visited in the following order: (S,4), (S,7), (1,2), (1,5), (2,1), (2,3), (2,6), (3,7), (4,7), (4,8), (5,1), (5,6), (6,2), (6,5), (7,8), (8,6). The vertex distances (potential higher values computed earlier
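The adjacency matrix of the (undirected) graph G = (V, E) is defined as follows. Fix a (total) ordering of its vertices $V = (v_1, \dots, v_n)$. Define the matrix $A_G = (a_{ij})$ over $\mathbb{Z}_2$ (entries are zeros and ones) by
$$a_{ij} = \begin{cases} 1 & \text{if the edge } e_{ij} = \{v_i, v_j\} \in E, \\ 0 & \text{if the edge } e_{ij} = \{v_i, v_j\} \notin E. \end{cases}$$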
It is recommended to write explicitly the adjacency matrices of the graphs mentioned at the beginning of this chapter! By definition, adjacency matrices are symmetric.
There are straightforward generalizations of this concept for more general graphs. For oriented edges, their directions may be indicated by the sign, multiple edges might be encoded by appropriate integers, etc.
If the matrix is stored in a two-dimensional array, then this method of graph representation is very inefficient for sparse graphs: it consumes $O(n^2)$ memory, although if there are only a few edges, almost all of the entries of the matrix are zeros. There are many methods of storing such matrices more efficiently.
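To make the comparison concrete, the following sketch builds both the full adjacency matrix and one common sparse alternative, a list of neighbor sets, from the same edge list. The function names and the choice of neighbor sets are assumptions for illustration; the text does not prescribe a particular sparse format.

```python
def adjacency_matrix(n, edges):
    """Dense n-by-n 0/1 matrix; uses n*n entries regardless of the number of edges."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        a[j][i] = 1          # undirected graph: the matrix is symmetric
    return a

def adjacency_sets(n, edges):
    """Sparse representation: for each vertex, the set of its neighbors."""
    neighbors = [set() for _ in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return neighbors

# The cycle graph C5 on the vertices 0..4.
c5_edges = [(i, (i + 1) % 5) for i in range(5)]
matrix = adjacency_matrix(5, c5_edges)
neighbors = adjacency_sets(5, c5_edges)
stored_in_matrix = sum(len(row) for row in matrix)       # 25 entries
stored_in_sets = sum(len(s) for s in neighbors)          # twice the number of edges
print(stored_in_matrix, stored_in_sets)
```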
The matrix representation of graphs is suggestive of linear algebra considerations. For example, there is the following beautiful theorem:
Theorem. Let G = (V,E) be a graph with vertices ordered as $V = (v_1, \dots, v_n)$, and let $A_G$ be its adjacency matrix. Further, let $A_G^k = (a_{ij}^{(k)})$ denote the entries of the k-th power of the matrix $A_G = (a_{ij})$. Then, $a_{ij}^{(k)}$ is the number of walks of length k between the vertices $v_i$ and $v_j$.
Proof. The proof is by induction on the length of the walks. For k = 1, the statement is simply a reformulation of the definition of the adjacency matrix. Suppose the proposition holds for a fixed positive integer k. Examine the number of walks of length k + 1 between the vertices $v_i$ and $v_j$ for some fixed indices i and j. Each walk can be obtained by attaching an edge from $v_i$ to a vertex $v_\ell$ to a walk of length k between $v_\ell$ and $v_j$. Further, each walk of length k + 1 can be obtained uniquely in this way. Therefore, if $a_{\ell j}^{(k)}$ denotes the number of walks of length k from $v_\ell$ to $v_j$, then the number of walks of length k + 1 is
$$a_{ij}^{(k+1)} = \sum_{\ell=1}^{n} a_{i\ell} \, a_{\ell j}^{(k)}.$$
This is exactly the formula for the product of the matrix $A_G$ and the power $A_G^k$. It follows that the entries of the matrix $A_G^{k+1}$ are the integers $a_{ij}^{(k+1)}$.
□
Corollary. If G = (V,E) and $A = A_G$ are as above, then each pair of vertices in G is connected by a path if and only if the matrix $(A + E_n)^{n-1}$ has only positive entries ($E_n$ is the n-by-n identity matrix).
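The criterion of the corollary can be tested directly. The sketch below is an illustration with a hypothetical function name `is_connected`; it works with ordinary integer arithmetic (not modulo 2), since the statement concerns positivity of the entries.

```python
def is_connected(adjacency):
    """Check whether (A + E_n)^(n-1) has only positive entries."""
    n = len(adjacency)
    # m = A + E_n: add ones on the diagonal.
    m = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    power = [[int(i == j) for j in range(n)] for i in range(n)]   # identity matrix
    for _ in range(n - 1):
        power = [[sum(power[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return all(entry > 0 for row in power for entry in row)

# A path on three vertices is connected; two disjoint edges are not.
path3 = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
two_edges = [[0, 1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 1],
             [0, 0, 1, 0]]
print(is_connected(path3), is_connected(two_edges))
```

For large graphs a search such as breadth-first search is far cheaper; the matrix test is mainly of theoretical interest.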
CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS
during the same iteration are written in parentheses):
	S	1	2	3	4	5	6	7	8
1	0	oo	oo	oo	1	oo	22	3(6)	4
2	0	oo	23	oo	1	24	22	3	4
3	0	25(30)	23	26	1	24	22	3	3
4	0	25	23	26	1	24	22	3	3
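Proof. Since $A_G$ and $E_n$ commute, the binomial theorem gives
$$(A_G + E_n)^{n-1} = A_G^{n-1} + \binom{n-1}{1} A_G^{n-2} + \dots + \binom{n-1}{n-2} A_G + E_n.$$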
Since the fourth iteration does not lead to any change, we can terminate the algorithm at this moment.
In the changed graph, the execution is as follows (for the sake of clarity, we do not write the values of vertices that are untouched by the change):
	S	1	2	3 4	5	6	7	8
1	0	oo	oo	oo 1	oo	-14	3(6)	4
2			-13		-12			
3		-ll(-6)		-10		-19	-2	-1
4			-18		-17			
5		-16		-15		-24	-7	-6
6			-23		-22			
7		-21		-20		-29	-12	-11
8			-28		-27			
9		-26		-25		-34	-17	-16
The graph has 9 vertices, and since the ninth iteration changed the distance of one of the vertices, there is a negative cycle. Of course, we could have terminated the algorithm much earlier if we had noticed exactly what changes took place between the particular steps. Clearly, the values of the vertices 1, 2, 3, 5, 6, 7, 8 keep decreasing below all bounds. The algorithm can also be implemented so that it produces the tree of shortest paths and also finds the vertices lying on a negative cycle if there is one. □
Paths between all pairs of vertices. We often need to know the shortest paths between all pairs of vertices. Of course, we could apply the above algorithms to all initial vertices. However, there is a more effective method to do this. One of the possibilities is to use the similarity with matrix multiplication, which is the basis of the Floyd-Warshall algorithm (the best-known among algorithms of the all-pairs shortest paths type), which:
• computes the distances between all pairs of vertices in time $O(n^3)$;
• starts with the matrix $U_0 = A = (a_{ij})$ of edge lengths (setting $u_{ii} = 0$ for each vertex i) and then iteratively computes the matrices $U_0, U_1, \dots, U_{|V|}$, where $U_k(i,j)$ is the length of the shortest path from i to j such that all of its inner vertices are among {1, 2, ..., k};
• the matrices are computed using the formula
$$U_k(i,j) = \min\{U_{k-1}(i,j),\; U_{k-1}(i,k) + U_{k-1}(k,j)\}.$$
The entries of the resulting matrix are (using the notation as above)
$$a_{ij}^{(n-1)} + \binom{n-1}{1} a_{ij}^{(n-2)} + \dots + (n-1)\, a_{ij} + \delta_{ij},$$
where $\delta_{ii} = 1$ for all i, and $\delta_{ij} = 0$ for $i \ne j$.
This gives the sum of the numbers of walks of length 0, ..., n − 1 between the vertices $v_i$ and $v_j$, multiplied by positive constants. Therefore, it is non-zero if and only if there is a path between these vertices. □
Remark. Observe how permuting the vertices of V affects the adjacency matrix of the corresponding graph. It is not hard to see that each such permutation permutes both the rows and columns of the matrix $A_G$ in the same way. Such a permutation can be given uniquely by a permutation matrix, each of whose rows and columns contains only zeros except for one entry, which is 1. If P is a permutation matrix, then the new adjacency matrix of the isomorphic graph G' is
$$A_{G'} = P \cdot A_G \cdot P^T,$$
where the dots stand for matrix multiplication. The transposed matrix $P^T$ is also the inverse matrix to P, since permutation matrices are orthogonal. Every permutation can be written as a composition of transpositions; hence every permutation matrix can be obtained as the product of the matrices corresponding to the transpositions.
Of course, this is exactly how the matrices of linear mappings change under the change of basis. Understanding the adjacency matrix as a linear mapping is often useful. For example, the adjacency matrix may be thought of as acting on vectors of zeros and ones (imagine the ones indicating active vertices of interest) and yielding vectors of integers (showing how many times the given vertices are arrived at from all active vertices along the edges in one step).
This observation also shows that the question whether two adjacency matrices describe isomorphic graphs is equivalent to asking for the equivalence of the matrices via a permutation matrix P.
13.1.10. Connected components of a graph. Every graph G = (V,E) naturally partitions into disjoint subgraphs $G_i$ such that two vertices $v \in G_i$ and $w \in G_j$ are connected by a path if and only if i = j.
This procedure can be formalized as follows: Let G = (V,E) be an undirected graph. Define a relation ~ on the set V. Set v ~ w for vertices $v, w \in V$ if and only if there exists a path from v to w in G. This relation is clearly a well-defined equivalence relation. Every class [v] of this equivalence determines the induced subgraph $G_{[v]} \subseteq G$, and the (disjoint) union of these subgraphs actually gives the original graph G. According to the definition of an equivalence
In other words, considering the shortest path from i to j which can go only through the vertices 1, ..., k, we can ask whether it uses the vertex k. If so, then this path consists of the shortest path from i to k and the shortest path from k to j (and these two paths use only the vertices 1, ..., k − 1). Otherwise, the wanted path is also the shortest path from i to j which can go only through the vertices 1, ..., k − 1. Clearly, for k = |V|, we get the shortest paths between all pairs of vertices without any restrictions. Moreover, we can maintain the so-called predecessor matrix (i.e., the predecessor of each vertex on the shortest path from each vertex) and update it as follows:
• Initialization: $(P_0)_{ij} = i$ for $i \ne j$ and $a_{ij} < \infty$.
• In the k-th step, we update
$$(P_k)_{ij} = \begin{cases} (P_{k-1})_{kj}, & \text{if the path through } k \text{ is better,} \\ (P_{k-1})_{ij}, & \text{otherwise.} \end{cases}$$
As soon as the algorithm terminates, we can easily construct the shortest path between any pair of vertices u, v: we derive it from the matrix $P = P_n = (p_{ij})$ (in the reverse order) as
$$v,\; w = p_{uv},\; p_{uw},\; \dots$$
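The recurrence for the matrices $U_k$ and the predecessor update can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: vertices are numbered from 0 (the text numbers them from 1), the matrix is updated in place rather than stored for every k, and the names `floyd_warshall`, `has_negative_cycle`, and `shortest_path` are our own. The sample matrix `lengths` is the matrix of edge lengths used in exercise 13.B.8.

```python
import math

INF = math.inf


def floyd_warshall(lengths):
    """Return (dist, pred) for a matrix of edge lengths (INF = no edge).

    pred[i][j] is the predecessor of j on a shortest path from i to j,
    or None if no such path has been found.
    """
    n = len(lengths)
    dist = [row[:] for row in lengths]
    pred = [[i if i != j and lengths[i][j] < INF else None for j in range(n)]
            for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:      # the path through k is better
                    dist[i][j] = through_k
                    pred[i][j] = pred[k][j]
    return dist, pred


def has_negative_cycle(dist):
    """A negative entry on the diagonal signals a negative cycle."""
    return any(dist[i][i] < 0 for i in range(len(dist)))


def shortest_path(pred, u, v):
    """Rebuild the path from u to v by walking predecessors back from v."""
    path = [v]
    while path[-1] != u:
        step = pred[u][path[-1]]
        if step is None:
            return None                          # v is not reachable from u
        path.append(step)
    return path[::-1]


# Edge lengths of the graph from exercise 13.B.8 (vertex 1 becomes index 0).
lengths = [[0, 4, -3, INF],
           [-3, 0, -7, INF],
           [INF, 10, 0, 3],
           [5, 6, 6, 0]]
dist, pred = floyd_warshall(lengths)
```

With this input, `shortest_path(pred, 2, 0)` returns `[2, 3, 1, 0]`, that is, the path 3, 4, 2, 1 in the numbering of the text, and `dist[2][0]` is its length 6.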
13.B.8. Apply the Floyd-Warshall algorithm to the graph in the picture. Write the intermediate results into matrices. Show the detection of negative cycles. Maintain all information necessary for the construction of the shortest paths.
relation, no edge of the original graph can connect vertices that belong to different components. The subgraphs $G_{[v]}$ are called connected components of the graph G.
A graph G = (V, E) is said to be connected if and only if it has exactly one connected component.
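The equivalence classes of the relation ~ can be computed by a breadth-first search from each vertex that has not been reached yet. The sketch below assumes the graph is given as a list of vertices and a list of unordered pairs; the function name `connected_components` and this input format are illustrative assumptions.

```python
from collections import deque


def connected_components(vertices, edges):
    """Split an undirected graph into the vertex sets of its components."""
    neighbours = {v: set() for v in vertices}
    for v, w in edges:
        neighbours[v].add(w)
        neighbours[w].add(v)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Collect the class [start]: everything reachable from start.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in neighbours[v] - seen:
                seen.add(w)
                component.add(w)
                queue.append(w)
        components.append(component)
    return components


def is_connected_graph(vertices, edges):
    """A graph is connected iff it has exactly one connected component."""
    return len(connected_components(vertices, edges)) == 1
```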
If the graph G is directed, then the definition is analogous to the case of undirected graphs; it is only required that there exist paths both from v to w and from w to v in order for the pair (v, w) to be related. Using this definition, strongly connected components can be discussed. On the other hand, it may only be required that the symmetrization of the graph be connected (in the undirected sense); then weak connectedness can be discussed.
13.1.11. Multiply connected graphs. It is useful to consider the concept of connectedness in a much stronger sense, i.e. to enforce a certain redundancy in the number of paths between vertices.
Definition. An (undirected) graph G = (V, E) is said to be
• k-vertex-connected if and only if it has at least k + 1 vertices, and remains connected whenever any k—1 vertices are removed;
• k-edge-connected if and only if it has at least k edges, and remains connected whenever any k—1 edges are removed.
In the case k = 1, the definition simply says that the graph is connected (in both cases) since the condition is vacuously true. Stronger graph connectedness is desirable in any network supporting some service (roads, pipelines, internet connection, etc.), where the clients prefer considerable redundancy of the provided service in case several connections in the network (i.e. edges in a graph) or nodes in the network (vertices in a graph) break down.
In general, Menger's theorem³ holds. It says that for every pair of vertices v and w, the maximum number of pairwise edge-disjoint paths from v to w equals the minimum number of edges that must be removed so as to leave v and w in different components of the new graph. Similarly, the maximum number of pairwise vertex-disjoint paths from v to w equals the minimum number of vertices that must be removed in order to disconnect v from w.
We return to this topic in subsection 13.2.13. Right now, we consider the simplest interesting case in detail. These are graphs (on at least three vertices) such that deleting any one vertex does not destroy the connectedness.
Theorem. If G = (V,E) has at least three vertices, then the following conditions are equivalent:
• G is 2-vertex-connected;
• every pair of vertices v and w in G lie on a common cycle;
• the graph G can be constructed from the triangle K3 by repeatedly adding and splitting edges.
³ Karl Menger proved this as early as 1927; that is, before graph theory came into being.
Solution. We proceed according to the algorithm, obtaining the following shortest-length matrices and predecessor matrices:
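$$U_0 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ \infty & 10 & 0 & 3 \\ 5 & 6 & 6 & 0 \end{pmatrix}, \quad P_0 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ - & 3 & - & 3 \\ 4 & 4 & 4 & - \end{pmatrix}$$
$$U_1 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ \infty & 10 & 0 & 3 \\ 5 & 6 & 2 & 0 \end{pmatrix}, \quad P_1 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ - & 3 & - & 3 \\ 4 & 4 & 1 & - \end{pmatrix}$$
$$U_2 = \begin{pmatrix} 0 & 4 & -3 & \infty \\ -3 & 0 & -7 & \infty \\ 7 & 10 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_2 = \begin{pmatrix} - & 1 & 1 & - \\ 2 & - & 2 & - \\ 2 & 3 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix}$$
$$U_3 = \begin{pmatrix} 0 & 4 & -3 & 0 \\ -3 & 0 & -7 & -4 \\ 7 & 10 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_3 = \begin{pmatrix} - & 1 & 1 & 3 \\ 2 & - & 2 & 3 \\ 2 & 3 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix}$$
$$U_4 = \begin{pmatrix} 0 & 4 & -3 & 0 \\ -3 & 0 & -7 & -4 \\ 6 & 9 & 0 & 3 \\ 3 & 6 & -1 & 0 \end{pmatrix}, \quad P_4 = \begin{pmatrix} - & 1 & 1 & 3 \\ 2 & - & 2 & 3 \\ 2 & 4 & - & 3 \\ 2 & 4 & 2 & - \end{pmatrix}$$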
Since there is no negative number on the diagonal of $U_4$, there is no negative cycle in the graph. Suppose we would like to find the shortest path from vertex 3 to vertex 1, for instance: the predecessor of 1 is $P_4[3,1] = 2$ and the predecessor of 2 is $P_4[3,2] = 4$. Therefore, the wanted path is 3, 4, 2, 1 and its length is $U_4[3,1] = 6$.
□
Hamiltonian graphs. To decide whether a given graph is Hamiltonian is an NP-complete problem. Therefore, it might be useful to have some simpler necessary or sufficient conditions for this property at our disposal.
We mention three sufficient conditions: Dirac's, Ore's, and the Bondy-Chvátal theorem.
Dirac: Let a graph G with n ≥ 3 vertices be given. If each vertex of G has degree at least n/2, then G is Hamiltonian.
Ore: Let a graph G with n ≥ 3 vertices be given. If the sum of the degrees of each pair of non-adjacent vertices is at least n, then G is Hamiltonian.
The closure of a graph G is the graph cl(G) obtained from G by repeatedly adding an edge u, v such that u, v have
Proof. If the second proposition is true, there are at least two different paths between any two vertices. So deleting a vertex cannot destroy the connectedness and the first proposition follows.
Conversely, suppose the first proposition is true. Proceed by induction on the minimal length of a path between v and w. Suppose first that the vertices are the end-points of an edge e, and that the shortest path is of length 1. If removing the edge e splits the graph into two components, then this would also occur if the vertex v is removed or if the vertex w is removed. Therefore, the graph is connected even without the edge e, so there is a path between v and w. This path, together with the edge e, forms a cycle.
For the induction hypothesis, assume that such a shared cycle is constructed for all pairs of vertices connected by a path whose length does not exceed k. Consider vertices v, w and one of the shortest paths between them:
$$(v = v_0, e_1, v_1, \dots, v_{k+1} = w)$$
of length k + 1. Then, $v_1$ and w can be connected by a path of length at most k, hence they lie on a common cycle. Denote by $P_1$ and $P_2$ the corresponding two disjoint paths between $v_1$ and w. Now, the graph $G \setminus \{v_1\}$ is also connected, so there exists a path P from v to w which does not go through the vertex $v_1$, and this path must at some point meet one of the paths $P_1$, $P_2$. Without loss of generality, suppose that this occurs on the path $P_1$, at a vertex z. Now, the cycle can be built: it consists of the part of P from v to z, the part of $P_1$ from z to w, and $P_2$ (directed the other way) from w to v (draw a diagram!). It follows that the second proposition is a consequence of the first proposition, and hence the first condition is equivalent to the second one.
Suppose the third proposition is true. Neither splitting an edge nor adding a new one in a 2-vertex-connected graph destroys the 2-connectedness. So the first proposition follows from the third proposition.
It remains to prove that the third proposition follows from the first proposition. From the first proposition, G is 2-connected, so there exists a cycle, which can be obtained from K3 by splitting edges. Consider the subgraph $G' = (V', E')$ determined by this cycle, and consider an edge $e = \{v, w\} \notin E'$ such that one of its endpoints lies in V'. If both of its endpoints lie there, a new edge can simply be added to the graph G', which leads to the subgraph $(V', E' \cup \{e\})$ in G, which contains more vertices and edges than the graph G'. Consider the remaining possibility, i.e. $v \in V'$ while $w \notin V'$. Since G is 2-connected, it remains connected even if the vertex v is removed, and it contains a shortest path P between the vertex w and some vertex (denote it as v') in G' (apart from the removed vertex v) and containing no other vertex from V'. Adding this path to the graph G', together with the edge e (which can be done by adding the edge {v, v'} and splitting it into the desired number of "new" vertices and edges), a new subgraph is obtained which satisfies the requirements
not been adjacent and deg(u) + deg(v) ≥ n, until no such pair of vertices u, v exists.
Bondy-Chvátal: A graph G is Hamiltonian if and only if cl(G) is.
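The closure can be computed directly from its definition. The following sketch assumes the vertices are numbered 0, ..., n − 1 and the edges are given as pairs; the function name `closure` and this input format are our own assumptions for illustration.

```python
from itertools import combinations


def closure(n, edges):
    """Edge set of cl(G) for a graph on the vertices 0, ..., n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:                      # repeat until no pair qualifies
        changed = False
        for u, v in combinations(range(n), 2):
            # Join non-adjacent u, v whenever deg(u) + deg(v) >= n.
            if v not in adj[u] and len(adj[u]) + len(adj[v]) >= n:
                adj[u].add(v)
                adj[v].add(u)
                changed = True
    return {frozenset((u, v)) for u in adj for v in adj[u]}


# Hypothetical examples: the cycle graphs C_4 and C_5.
c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```

In C_4 every pair of non-adjacent vertices has degree sum 4, so the closure is the complete graph with 6 edges; in C_5 the degree sum 4 is less than 5, so nothing is added.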
13.B.9. Prove the Bondy-Chvátal theorem.
Solution. Clearly, it suffices to prove that if G is Hamiltonian after addition of an edge {u, v} such that u, v have not been adjacent and deg(u) + deg(v) ≥ n, then it is already Hamiltonian without this edge. Suppose that G + {u, v} is Hamiltonian, but G is not. Then, there exists a Hamiltonian path from u to v in G. It must hold for each vertex adjacent to u that its predecessor on this path is not adjacent to v (otherwise, there would be a Hamiltonian cycle in G). Therefore, deg(u) + deg(v) ≤ n − 1, which contradicts the assumption deg(u) + deg(v) ≥ n. □
13.B.10.
i) Prove that the Bondy-Chvátal theorem implies Ore's and Ore's implies Dirac's.
ii) Give an example of a Hamiltonian graph which satisfies Ore's condition but not Dirac's.
iii) Give an example of a Hamiltonian graph whose closure is not a complete graph.
Solution.
i) If a graph G satisfies Ore's condition, then its closure is a complete graph, which is Hamiltonian, of course. By the Bondy-Chvátal theorem, the original graph is Hamiltonian as well. Further, if G satisfies Dirac's condition, then it clearly satisfies Ore's as well and thus is Hamiltonian.
ii) Consider the following example:
[Figure: example graph on five vertices; one vertex is labeled 5.]
The degree of vertex 5 is 2, which is less than 5/2. The sum of the degrees of any pair of (not only non-adjacent) vertices is at least 5.
iii) The wanted conditions are satisfied by the cycle graphs $C_n$, n > 4, for which $cl(C_n) = C_n$.
Planar graphs.
and contains more vertices than the considered graph G'. After a finite number of these steps, the entire graph G is built from the triangle K3, as desired. The proof is complete. □
13.1.12. Eulerian graphs. There are problems of the type "draw a graph without removing the pencil from the paper". In the language of graph theory, this can be stated as follows:
Eulerian trails
Definition. A trail which visits every edge exactly once and whose initial and terminal vertices are the same is called an Eulerian trail. Connected graphs that admit such a trail are called Eulerian graphs.
Of course, an Eulerian trail goes through every vertex at least once, but it can visit a vertex more than once. To draw a graph without removing the pencil from the paper, while ending at the same point where one started, means to find an Eulerian trail. The terminology refers to the classical story about the seven bridges in Königsberg. There, the task was to go for a walk and visit each of the bridges exactly once. The first proof that this is impossible is by Leonhard Euler, in 1736.
The situation is depicted in the diagram. On the left, there is a sketch of the river with the islands and bridges. The corresponding multigraph is shown in the right-hand diagram. The vertices of this graph correspond to the "connected land", while the edges correspond to the bridges. If it is desired to do without the multiple edges (which have not been admitted so far), it would suffice to place another vertex inside each bridge (i.e. to split the edges with new vertices). Surprisingly, the general solution of this problem is quite simple, as shown by the following theorem. Of course, this also shows that Euler could not design the desired walk.
Eulerian graphs
□
Theorem. A graph G is Eulerian if and only if it is connected and all vertices of G have even degree.
Proof. If a graph is Eulerian, for every vertex entered there is an exit. Therefore, the degree of every vertex is even. More formally: consider a trail that begins and ends at a vertex v0 and passes through all edges. Every vertex occurs once or more on this trail and its degree equals twice the number of its occurrences.
Now suppose that all vertices of a graph G have even degree. Consider the longest possible trail (v0, e1,..., vk) in G
13.B.11.   Decide whether the graph in the picture is planar.
Solution. By the Kuratowski theorem (see page 899), this graph is not planar since one of its subgraphs is a subdivision of $K_{3,3}$. □
13.B.12. Decide whether there is a graph with degree sequence
(6, 6,6,7,7,7, 7,8,8,8). If so, is there a planar graph with this degree sequence? O
13.B.13. What is the minimum number of edges of a hexahedron?
Solution. In any polyhedron, every face is bounded by at least three edges. At the same time, every edge belongs to two faces. If f is the number of faces and e the number of edges of the polyhedron, then we have 3f ≤ 2e (see also 13.1.20). For a hexahedron, this bound yields 18 ≤ 2e, i.e., e ≥ 9. Indeed, there exists a hexahedron with nine edges. It can be obtained by "gluing" two identical regular tetrahedra together along one face. Therefore, the minimum number of edges of a hexahedron is nine.
□
13.B.14. Decide whether the given planar graph is maximal. Add as many edges as possible while keeping the graph planar.
Solution. The graph has 14 vertices and 20 edges, hence 3|V| − 6 − |E| = 16. Therefore, it is not maximal, and 16 edges can be added so that it is still planar.
[Figure: the graph with the added edges.]
where no edge occurs twice or more. First, suppose for a moment that $v_k \ne v_0$. This would mean that the number of edges of the trail that enter or leave the vertex $v_0$ is odd, so there must be an edge which is incident to $v_0$ and not contained in the trail. However, then the trail can be prolonged while still using every edge of the graph at most once, which is a contradiction. Therefore, $v_0 = v_k$.
Define a subgraph $G' = (V', E')$ of G as follows: It contains the vertices and edges of our fixed trail and nothing else. If $V' \ne V$, then (since the graph G is connected) there exists an edge e = {v, w} such that $v \in V'$ and $w \notin V'$. However, then the trail can be "rotated" so that it begins and ends at the vertex v. It can be prolonged with the edge e, which contradicts the assumption of the greatest possible length. Therefore, V' = V.
It remains to show that E' = E. So suppose there is an edge $e = \{v, w\} \notin E'$. As above, the trail can be rotated so that it begins and ends at the vertex v and then goes along the edge e, which is a contradiction. □
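The theorem also leads to a practical procedure. The sketch below first checks that all degrees are even and then builds a closed trail by repeatedly walking along unused edges and splicing the detours together. This is the standard technique usually attributed to Hierholzer, not the longest-trail argument used in the proof above. The function name `eulerian_trail` and the edge-list input are our own assumptions; repeated pairs model multiple edges, and isolated vertices are ignored.

```python
from collections import defaultdict


def eulerian_trail(edges):
    """Return a closed trail using every edge once, or None if none exists."""
    incident = defaultdict(list)          # vertex -> indices of incident edges
    for index, (v, w) in enumerate(edges):
        incident[v].append(index)
        incident[w].append(index)
    if not edges or any(len(ids) % 2 for ids in incident.values()):
        return None                       # some vertex has odd degree
    used = [False] * len(edges)
    stack, trail = [edges[0][0]], []
    while stack:
        v = stack[-1]
        # Discard edges at v that were already traversed from the other end.
        while incident[v] and used[incident[v][-1]]:
            incident[v].pop()
        if incident[v]:
            index = incident[v].pop()
            used[index] = True
            a, b = edges[index]
            stack.append(b if a == v else a)
        else:
            trail.append(stack.pop())     # dead end: v is final in the trail
    # Every edge is reached exactly when the graph is connected.
    return trail if len(trail) == len(edges) + 1 else None


# The multigraph of the seven bridges: land masses A, B, C, D.
koenigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
               ("A", "D"), ("B", "D"), ("C", "D")]
```

For `koenigsberg` the function returns `None`, since all four vertices have odd degree.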
Corollary. A graph can be drawn without removing the pencil from the paper if and only if there are either no vertices of odd degree or exactly two of them.
Proof. Let G be a graph with exactly two odd-degree vertices. Construct a new graph G' by attaching a new vertex w to the original graph G and connecting it to both the odd-degree vertices. This graph is Eulerian, and the Eulerian trail in it leads to the desired result.
On the other hand, if a graph G can be drawn in the desired way, then the graph G' is necessarily Eulerian, so the degrees of the vertices in G are as stated. □
The situation for directed graphs is similar. A directed graph is called balanced if and only if the outgoing and incoming degrees coincide, i.e. $\deg_+(v) = \deg_-(v)$ for all vertices v.
Proposition. A directed graph G is Eulerian if and only if it is balanced and its symmetrization is connected (i.e. the graph G is weakly connected).
Proof. The proof is analogous to the undirected case. (Work out the details yourself!) □
13.1.13. Hamiltonian cycles. Find a walk or cycle that visits every vertex of a graph G exactly once. Of necessity, such a walk can visit every edge at most once. Such a cycle is called a Hamiltonian cycle in the graph G. A graph is called Hamiltonian if and only if it contains a Hamiltonian cycle. This problem seems to be very similar to the above one of visiting every edge exactly once. But while the problem of finding an Eulerian trail is trivial, the problem of deciding whether a graph is Hamiltonian is NP-complete.
Of course, this problem can be solved by "brute force". Given a graph on n vertices, generate all n! possible orders of the n vertices, and for each of them, verify whether it is a cycle in G.
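The brute-force search can be written down almost literally. The sketch assumes a graph on the vertices 0, ..., n − 1 with n ≥ 3, given by a list of edges; the function name `hamiltonian_cycle` is an illustrative choice. It is usable only for very small n, since it inspects up to n! orders.

```python
from itertools import permutations


def hamiltonian_cycle(n, edges):
    """Return a Hamiltonian cycle of a graph on 0, ..., n-1 (n >= 3), or None."""
    edge_set = {frozenset(e) for e in edges}
    for order in permutations(range(n)):
        closed = order + (order[0],)      # return to the starting vertex
        if all(frozenset(pair) in edge_set
               for pair in zip(closed, closed[1:])):
            return list(order)
    return None
```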
Ten (dashed) have been added. For the sake of clarity, the other 6 edges that connect the vertices of the "outer" 9-gon are not drawn.
□
13.B.15. Prove or disprove each of the following propositions.
i) Every graph with fewer than 9 edges is planar.
ii) Every graph which is not planar is not Hamiltonian.
iii) Every graph which is not planar is Hamiltonian.
iv) Every graph which is not planar is not Eulerian (see 13.1.12).
v) Every graph which is not planar is Eulerian.
vi) Every Hamiltonian graph is planar.
vii) No Hamiltonian graph is planar.
viii) Every Eulerian graph is planar.
ix) No Eulerian graph is planar.
o
Trees.
13.B.16.   Determine the code of the following graph as a
i) plane tree,
ii) tree.
Solution.
i) Using the procedure from 13.1.18, we get the following code of the plane tree:
0 0 0001100100101111 10 01010000101011111 1
The highlighted vertex in the graph is indeed the appropriate candidate to be the root since it is the only element of the center of the tree.
ii) As for the unique construction of a plane tree, we sort the descendants lexicographically in ascending order. Thus, the wanted code is
00000010101111010110000010110110011111.
This problem forms a vital field of research. For instance, in 2010, A. Björklund published a randomized algorithm based on the Monte Carlo method, which counts the number of Hamiltonian cycles in a graph on n vertices in time $O(1.567^n)$.⁴
Finding Hamiltonian cycles is desired in many problems related to logistics, for example when finding optimal paths for goods delivery.
13.1.14. Trees. As mentioned earlier, it is often necessary to strengthen a network by adding redundant connections.
Sometimes, on the contrary, it is desired to minimize the number of edges in the graph while keeping it connected. Of course, an edge can be removed in this way as long as there is at least one cycle in the graph.
Forests, Trees, Leaves
A connected graph which does not contain a cycle is called a tree. A graph which does not contain a cycle is called a forest (a forest is not required to be connected).
Every vertex of degree one in any graph is called a leaf.
The definition suggests an easily memorable "theorem": A tree is a connected forest. Similarly, the following lemma proves "theorems": There are very few trees without leaves; and every tree can be built by adding enough leaves to its root.
Lemma. Every tree with at least two vertices contains at least two leaves.
For any graph G with a leaf v, the following propositions are equivalent:
• G is a tree;
• G\v is a tree.
Proof. Let $P = (v_0, \dots, v_k)$ be (any) longest possible path in a tree G. If the vertex $v_0$ is not a leaf, then there is an edge e incident to it whose other endpoint v is not in P, since otherwise a cycle would be formed in the tree. Then the path P could be prolonged with this edge, which contradicts the choice of P as a longest path. So the vertex $v_0$ is a leaf. The proof for the vertex $v_k$ is similar.
Now suppose that v is a leaf of a tree G. Consider any two other vertices w, z in G. There exists a path between them, and it does not pass through v, since every inner vertex of a path has degree at least two. Therefore, this path remains the same in G \ v. Hence the graph remains connected even after removing the vertex v. There is no cycle in G \ v, since it is constructed by removing a vertex from a tree.
Conversely, if G \ v is a tree, then adding a vertex with degree 1 cannot create a cycle. The resulting graph is evidently connected. □
Trees can be characterized by many equivalent and useful properties. Some of them appear in the following theorem which is more difficult to formulate than to prove.
⁴ Björklund, Andreas (2010), "Determinant sums for undirected Hamiltonicity", Proc. 51st IEEE Symposium on Foundations of Computer Science (FOCS '10), pp. 173-182, arXiv:1008.0541, doi:10.1109/FOCS.2010.24.
□
13.B.17. For each of the following codes, decide whether there exists a tree with this code. If so, draw the corresponding tree.
• 00011001111001,
• 00000110010010111110010100001010111111.
o
Huffman coding. We are working with plane binary trees where every edge is colored with a symbol of an output alphabet A (we often have A = {0,1}). The codewords C are those words over the alphabet A to which we translate the symbols of the input alphabet. Our task is to represent a given text using suitable codewords over the output alphabet.
We can easily see that it makes sense to require that the set of codewords be prefix (i.e., no codeword can be a prefix of another one); otherwise, we could get into trouble when decoding.
We will use binary trees for the construction of binary prefix codes (i. e. over the alphabet A = {0, 1}). We label the edges going from each vertex by 0 and 1. Further, we label the leaves of our tree with symbols of the input alphabet. This results in a prefix code over A for these symbols by concatenating the edge labels along the path from the root to the corresponding leaf.
Clearly, this code is prefix. Moreover, if we take into account the frequencies of particular symbols of the input alphabet in the input text, we obtain a lossless data compression.
Let M be the list of frequencies of the symbols of the input alphabet in the input text. The algorithm constructs the optimal binary tree (the so-called minimum-weight binary tree) and the assignment of the symbols to the leaves.
• Select the two least frequencies w1,w2 from M. Create a tree with two leaves labeled by the corresponding symbols and root labeled by w1+w2, then replace the values w1,w2 with the new value w1 + w2 in M.
• Repeat the above step; if the selected value from M is a sum, then simply "connect" the existing subtree.
• The code of each symbol is determined by the path from the root to the corresponding leaf (left edge = "0", right edge = "1", for instance).
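The three steps above can be sketched with a priority queue. In this illustrative version (the function name `huffman_code`, the dictionary input, and the use of `heapq` are our own choices), each entry of the queue carries the partial codewords of the symbols in its subtree, so that joining two subtrees simply prepends "0" on the left and "1" on the right.

```python
import heapq
from itertools import count


def huffman_code(frequencies):
    """Map each symbol to its binary codeword (a string of '0' and '1')."""
    tiebreak = count()      # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {s: ""}) for s, w in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # the two least frequencies
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def encoded_length(frequencies, code):
    """Total number of output symbols needed for a text with these frequencies."""
    return sum(frequencies[s] * len(code[s]) for s in frequencies)
```

Which of the two merged subtrees receives "0" is a matter of convention, so the codewords may differ from those obtained by hand while the codeword lengths agree. For the frequencies of exercise 13.B.18, `encoded_length` gives 224. With a single input symbol the sketch returns the empty codeword, a degenerate case that a real encoder would have to treat separately.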
13.1.15. Theorem. Let G = (V,E) be a graph. The following conditions are equivalent:
(1) G is a tree.
(2) For every pair of vertices v, w, there is exactly one path from v to w.
(3) G is connected but ceases to be such if any edge is removed.
(4) G does not contain a cycle, but the addition of any edge creates one.
(5) G is a connected graph, and the number of its vertices and edges satisfies
|V| = |E| + 1.
Proof. The properties 2-5 are satisfied in every tree. By the previous lemma, every tree which has at least two vertices has a leaf v. It continues to be a tree when this leaf v is removed. Therefore, it suffices to show that if any of the statements 2-5 is true for a given tree, then it holds when a leaf is added to the tree as well. This is clear.
In the case of properties 2 and 3, the graph is connected, and their formulation directly excludes the existence of cycles. As for the fourth property, it suffices to verify that G is connected. However, any two vertices v, w in G are either connected with an edge, or adding this edge to the graph creates a cycle. So there exists a path between them even without this edge.
The last implication can be proved by induction on the number of vertices. Suppose that all connected graphs on n vertices and n − 1 edges are trees. The sum of vertex degrees of any graph on n + 1 vertices and n edges is 2n, so the graph must contain a leaf. It follows from the induction hypothesis that this graph can be constructed by attaching a leaf to a tree; hence it is also a tree. □
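Condition (5) gives a convenient test: it is enough to count edges and check connectedness, with no explicit search for cycles. A minimal sketch, assuming the graph is given by a collection of vertices and a list of edges without repetitions (the name `is_tree` is ours):

```python
def is_tree(vertices, edges):
    """Condition (5): the graph is connected and |V| = |E| + 1."""
    vertices = list(vertices)
    if not vertices or len(vertices) != len(edges) + 1:
        return False
    neighbours = {v: [] for v in vertices}
    for v, w in edges:
        neighbours[v].append(w)
        neighbours[w].append(v)
    reached, stack = {vertices[0]}, [vertices[0]]
    while stack:                      # depth-first search from one vertex
        for w in neighbours[stack.pop()]:
            if w not in reached:
                reached.add(w)
                stack.append(w)
    return len(reached) == len(vertices)
```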
13.1.16. Rooted trees, binary trees, and heaps. Trees are often suitable structures for data storage. They permit basic database operations (e.g. finding a particular piece of information) efficiently. Since there is no cycle in a tree, fixing one vertex $v_r$ defines the orientation of all edges. For every vertex v, there is exactly one path from $v_r$ to v, so the orientation can be defined accordingly. Since there are no cycles, it is impossible for two such paths to force both orientations of a particular edge.
If one of the vertices of a tree is fixed, the situation is similar to a real tree in nature - there is a distinctive vertex which "grows from the ground". Trees with a fixed distinguished vertex vr are called rooted trees, and vr is said to be the root of the tree.
In a rooted tree, the terms successor and predecessor of a vertex are defined as follows: a vertex w is a successor of v (or v is a predecessor of w) if and only if the path from the root of the tree to the vertex w goes through v and $v \ne w$. If the vertices are directly connected with an edge, we can talk about a direct successor and a direct predecessor. More
13.B.18.   Find the Huffman code for the input alphabet with the frequencies
['A': 16, 'B': 13, 'C': 9, 'D': 12, 'E': 45, 'F': 5].
Solution. If we naively assign a 3-bit code to each letter of the alphabet, then a message of length 100 consumes 300 bits.
We show that Huffman code is more succinct. We build the tree according to the algorithm.
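[Figure: the Huffman tree. Root: edge 0 to the leaf E:45, edge 1 to ABCDF:55. ABCDF:55: edge 0 to BD:25, edge 1 to AFC:30. BD:25: edge 0 to the leaf B:13, edge 1 to the leaf D:12. AFC:30: edge 0 to FC:14, edge 1 to the leaf A:16. FC:14: edge 0 to the leaf F:5, edge 1 to the leaf C:9.]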
We have thus obtained the codes A : 111, B : 100, C : 1101, D : 101, E : 0, F : 1100. Multiplying the code lengths by the frequencies, we can see that a 100-letter message with the given distribution of letters is encoded into only
3 · 16 + 3 · 13 + 4 · 9 + 3 · 12 + 1 · 45 + 4 · 5 = 224 bits. □
C. Minimum spanning tree
13.C.1. How many spanning trees (see 13.2.6) of the graph K5 are there? And how many are there if we do not distinguish isomorphic ones?
Solution. There are three pairwise non-isomorphic spanning trees (with degree sequences (1, 2, 2, 2, 1), (1, 2, 3, 1, 1), (4, 1, 1, 1, 1)). The corresponding classes of isomorphic spanning trees have $5 \cdot \binom{4}{2} \cdot 2$, $5 \cdot 4 \cdot 3$, and 5 elements, respectively. Altogether, there are $125 = 5^3$ spanning trees, which is in accordance with Cayley's formula for the number of spanning trees of a complete graph (see 13.4.11). □
13.C.2. Let the vertices of $K_6$ be labeled 1, 2, ..., 6 and let every edge {i, j} be assigned the integer [(i + j) mod 3] + 1. How many minimum spanning trees are there in this graph?
Solution. There are five edges whose value is 1: four of them lie on the cycle 12451 and the remaining one is the edge 36. Therefore, they form a disconnected subgraph of the complete
often, they are called a child and a parent (motivated by the genealogical trees).
The most common data structures are the binary trees, which are special cases of rooted trees: there, every vertex has at most two children (sometimes, the term binary tree implies that every vertex is either a leaf, or has exactly two children; to avoid ambiguity, such trees are often called full binary trees). If the vertices are associated to keys from a totally ordered set (e.g. the integers), the search for the vertex with a given key is performed by searching the path from the root of the tree to that vertex. At every vertex, compare its key to the desired one. This decides whether one continues to the left or to the right, or stops the search if the key is found. If this algorithm is to be correct, one of the children with all its successors must have lower keys than the keys of the other child and all its successors.
In order for the search to be efficient, some effort must be made to keep the binary trees balanced, with the lengths of the paths from the root to the leaves differing by at most one. The most unfortunate example of a binary tree on n vertices is the path graph $P_n$ (which may be formally considered a binary tree), while the most desired case is the perfect complete binary tree, where every vertex that is not a leaf has exactly two children, and all leaves are at the same level. Such a tree can be constructed only when the number of vertices is of the form $n = 2^k - 1$, $k = 1, 2, \dots$. Therefore, in a balanced tree, finding the vertex with a given key value can be done in $O(\log_2 n)$ steps. Such trees are often called binary search trees. Think out as an exercise how to efficiently perform basic graph operations over binary search trees (additions and removals of the vertex with a given key as well as how to keep the tree balanced).
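The search described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names `Node`, `insert` and `search` are hypothetical, keys are distinct integers, and no rebalancing is attempted, so the number of steps depends on the order of insertion.

```python
class Node:
    """A vertex of a binary search tree (illustrative helper)."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key below root without rebalancing; return the root."""
    if root is None:
        return Node(key)
    node = root
    while key != node.key:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            break
        node = child
    return root

def search(root, key):
    """Return (node or None, number of vertices visited)."""
    steps, node = 0, root
    while node is not None:
        steps += 1
        if key == node.key:
            return node, steps
        node = node.left if key < node.key else node.right
    return None, steps

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

balanced = build([4, 2, 6, 1, 3, 5, 7])   # perfect tree on 2^3 - 1 vertices
path = build([1, 2, 3, 4, 5, 6, 7])       # degenerates to the path graph
print(search(balanced, 7)[1], search(path, 7)[1])
```

On the perfect tree the key 7 is reached after visiting three vertices, while on the path-shaped tree all seven vertices are visited.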
An extraordinarily useful example of binary trees is the structure of a heap. It is a full balanced binary tree, where the keys are either strictly decreasing along each path from the root (the so called max heap), or they are increasing (the min heap). Because of this ordering along the paths in a max heap, the maximum key value of the heap can be found in constant time and removed in logarithmic time (similarly with the minimum in the min heap). The desired maximum is just at the root, and the heap needs to be restored to the desired balanced shape once the root has been removed. Prove this is possible in logarithmic time yourself!
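A minimal sketch of removing the maximum from a max heap follows. It assumes the common array layout (the children of position i sit at positions 2i+1 and 2i+2), which is a choice of this example and is not prescribed by the text; the names `sift_down`, `build_max_heap` and `pop_max` are hypothetical.

```python
def sift_down(heap, i):
    """Move heap[i] down until both children are smaller."""
    n = len(heap)
    while True:
        largest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and heap[child] > heap[largest]:
                largest = child
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def build_max_heap(keys):
    """Arrange the keys into a max heap stored in a list."""
    heap = list(keys)
    for i in range(len(heap) // 2 - 1, -1, -1):
        sift_down(heap, i)
    return heap

def pop_max(heap):
    """Remove and return the root; one pass along a single root-to-leaf path."""
    top = heap[0]
    last = heap.pop()
    if heap:
        heap[0] = last
        sift_down(heap, 0)
    return top

heap = build_max_heap([3, 9, 1, 7, 5])
print([pop_max(heap) for _ in range(5)])
```

The last leaf is moved to the root and then swapped downwards along one path only, which is why the removal takes a number of steps proportional to the depth of the tree.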
The left-hand diagram shows a binary search tree. In the right-hand diagram, there is a max heap.
Much literature is devoted to trees, their applications and miscellaneous variations.
CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS
graph, so the spanning tree must contain at least one edge of value 2. Thus, the total weight of a minimum spanning tree is at least $4 \cdot 1 + 2 = 6$. And indeed, there exist spanning trees with this weight. We select all the edges of value 1 except for one that lies on the mentioned cycle and connect the resulting components 1245 and 36 with any edge of value 2. There are four such edges. Altogether, there are $4 \cdot 4 = 16$ minimum spanning trees. □
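For such a small graph, the count can also be cross-checked by brute force. The following sketch (all names are illustrative) simply tries every set of n - 1 edges of the complete graph, keeps those that form a spanning tree, and counts the ones of minimum weight; it is meant as a sanity check, not as an efficient method.

```python
from itertools import combinations

def is_spanning_tree(n, edges):
    """n - 1 edges on vertices 1..n form a spanning tree iff they are acyclic."""
    parent = list(range(n + 1))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            return False
        parent[ri] = rj
    return len(edges) == n - 1

def count_minimum_spanning_trees(n, weight):
    """Return (minimum weight, number of spanning trees of K_n attaining it)."""
    all_edges = list(combinations(range(1, n + 1), 2))
    best, count = None, 0
    for subset in combinations(all_edges, n - 1):
        if not is_spanning_tree(n, subset):
            continue
        total = sum(weight(i, j) for i, j in subset)
        if best is None or total < best:
            best, count = total, 1
        elif total == best:
            count += 1
    return best, count

print(count_minimum_spanning_trees(6, lambda i, j: (i + j) % 3 + 1))
```

With all weights equal, the same function counts all spanning trees, which gives a quick check of the number 125 for $K_5$ as well.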
13.1.17. Remarks on sorting. Suppose it is required to distinguish all different sortings of n elements, thus distinguishing among $n!$ different objects. If there is no information other than comparing the order of two single elements, then the tree of all possible decision paths can be written down. The sorting provides a path through this binary tree. As seen, any binary tree of depth h has at most $2^h - 1$ leaves. It follows that a tree of height h satisfying $2^h - 1 \ge n!$ is needed.
Consequently the depth h satisfies $h \log 2 > \log n!$.
13.C.3. Find a minimum spanning tree of the following graph using
i) Kruskal's algorithm,
ii) Jarnik's (Prim's) algorithm.
Explain why we cannot apply Borůvka's algorithm directly.
[Figure: the weighted graph for this exercise]
Solution. The spanning tree is
[Figure: the resulting minimum spanning tree]
Borůvka's algorithm cannot be applied directly since the mapping of the weights to the edges is not injective. However, this can be fixed easily by slight modifications of the weights.
□
13.C.4. Consider the following procedure for finding the shortest path between a given pair of vertices in an undirected weighted graph: First, we find a minimum spanning tree. Then, we proclaim that the only path between the pair of vertices in the obtained spanning tree is the shortest one. Prove the correctness of this method, or disprove it by providing a counterexample. O
13.C.5. We are given the following table of distances of world metropolises: London, Mexico City, New York, Paris,
$$\log n! = \log 1 + \log 2 + \dots + \log n > \int_1^n \log x \, dx = n \log n - (n - 1),$$
$$h > \frac{n \log n - n}{\log 2} > n \log n - n.$$
It is proved that the depth of the necessary binary tree is bounded from below by an expression of size $n \log n$. Hence no algorithm based only on the comparison of two elements of the ordered set can have a better worst case run than $O(n \log n)$.
The latter claim is not true if there is further relevant information. For example, if it is known that only a finite number k of values may appear among our n elements, then one may simply run through the list, counting how many occurrences of the individual values there are, and then write the correctly ordered list from scratch. This all happens in linear time!
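The counting idea can be sketched as follows, assuming for simplicity that the k possible values are the integers 0, 1, ..., k - 1 (the function name `counting_sort` is illustrative).

```python
def counting_sort(items, k):
    """Sort a list whose entries all lie in range(k) without comparing entries."""
    counts = [0] * k
    for x in items:                 # one pass over the n elements
        counts[x] += 1
    result = []
    for value, c in enumerate(counts):
        result.extend([value] * c)  # write the ordered list from scratch
    return result

print(counting_sort([2, 0, 1, 2, 0], 3))
```

No two elements are ever compared with each other, so the comparison-based lower bound does not apply; the work is proportional to n + k.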
13.1.18. Tree isomorphisms. Simple features of trees are exploited in order to illustrate the (generally difficult) problem of graph isomorphisms on this special class of graphs.
First, strengthen the structure to be preserved by the isomorphisms. Then show that the obtained procedure is also applicable to the most general trees.
In order to keep more information about the structure of rooted trees, remember the relations parent-child. Also have the children of every node sorted in a specific order (for instance, from left to right if drawn on a diagram). Such trees are called ordered trees or plane trees. They are formally defined as a tuple $T = (V, E, v_r, \nu)$, where $\nu$ is a partial order on the edges such that a pair of edges is comparable if and only if they have the same tail (i.e. they all go from one parent vertex to all its children).
A homomorphism of rooted trees $T = (V, E, v_r)$ and $T' = (V', E', v'_r)$ is a graph morphism $\varphi : T \to T'$ such that $v_r$ is mapped to $v'_r$; similarly for isomorphisms. For plane trees, it is further required that the morphism preserves the partial orders $\nu$ and $\nu'$.
Peking, and Tokyo:
	MC	NY	P	Pe	T
L	5558	3469	214	5074	5959
MC		2090	5725	7753	7035
NY			3636	6844	6757
P				5120	6053
Pe					1307
What is the least total length of wire needed for interconnecting these cities (assuming the length necessary to connect a given pair of cities is equal to the distance in the table)? O
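An answer to this exercise can be checked with Kruskal's algorithm: sort the edges by length and add an edge whenever it joins two different components. The sketch below uses the distances from the table; the names `kruskal`, `cities` and `rows` are illustrative only.

```python
def kruskal(vertices, weighted_edges):
    """Return the edges (a, b, w) of a minimum spanning tree."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for w, a, b in sorted(weighted_edges):
        ra, rb = find(a), find(b)
        if ra != rb:                # the edge joins two components
            parent[ra] = rb
            tree.append((a, b, w))
    return tree

cities = ["L", "MC", "NY", "P", "Pe", "T"]
rows = {                            # upper triangle of the distance table
    "L": [5558, 3469, 214, 5074, 5959],
    "MC": [2090, 5725, 7753, 7035],
    "NY": [3636, 6844, 6757],
    "P": [5120, 6053],
    "Pe": [1307],
}
edges = [(w, a, b)
         for i, a in enumerate(cities[:-1])
         for b, w in zip(cities[i + 1:], rows[a])]
tree = kruskal(cities, edges)
print(tree, sum(w for _, _, w in tree))
```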
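13.C.6. Using the matrix variation of the Jarník-Prim algorithm, find a minimum spanning tree of the graph given by the following matrix:
$$\begin{pmatrix}
- & 12 & - & 16 & - & - & - & 13 \\
12 & - & 16 & - & - & - & 14 & - \\
- & 16 & - & 12 & - & 14 & - & - \\
16 & - & 12 & - & 13 & - & - & - \\
- & - & - & 13 & - & 14 & - & 15 \\
- & - & 14 & - & 14 & - & 15 & - \\
- & 14 & - & - & - & 15 & - & 14 \\
13 & - & - & - & 15 & - & 14 & -
\end{pmatrix}$$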
o
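A possible way to check a hand computation is the following sketch of the Jarník-Prim algorithm working directly on the weight matrix. Missing edges are written as `None`, vertices are numbered from 1 in the output, and the names `prim` and `M` are assumptions of this example.

```python
INF = float("inf")

def prim(matrix, start=0):
    """Grow a spanning tree from `start`; return its edges (u, v, weight)."""
    n = len(matrix)
    in_tree = [False] * n
    dist = [INF] * n        # cheapest known edge from the tree to each vertex
    link = [None] * n       # tree endpoint of that cheapest edge
    dist[start] = 0
    tree = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: dist[v])
        in_tree[u] = True
        if link[u] is not None:
            tree.append((link[u] + 1, u + 1, matrix[link[u]][u]))
        for v in range(n):
            w = matrix[u][v]
            if w is not None and not in_tree[v] and w < dist[v]:
                dist[v], link[v] = w, u
    return tree

_ = None
M = [
    [_, 12, _, 16, _, _, _, 13],
    [12, _, 16, _, _, _, 14, _],
    [_, 16, _, 12, _, 14, _, _],
    [16, _, 12, _, 13, _, _, _],
    [_, _, _, 13, _, 14, _, 15],
    [_, _, 14, _, 14, _, 15, _],
    [_, 14, _, _, _, 15, _, 14],
    [13, _, _, _, 15, _, 14, _],
]
tree = prim(M)
print(tree, sum(w for _, _, w in tree))
```

Since several edges share the same weight, the tree found depends on how ties are broken; only the total weight is determined uniquely.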
D. Flow networks
13.D.1. An example of bad behavior of the Ford-Fulkerson algorithm. The worst-case time complexity of the Ford-Fulkerson algorithm is $O(E \cdot |f|)$, where $|f|$ is the size of a maximum flow. Consider the following network:
[Figure: the flow network; four of its edges are labelled 0/100]
The bad behavior of the algorithm is due to the fact that it uses depth-first search to find unsaturated paths.
Solution. We proceed strictly by depth-first search (examining the vertices from left to right and then top down):
[Figure: the first augmentation steps; the flow along the edges of capacity 100 grows by one unit per step]
Coding the plane trees
A plane tree $T = (V, E, v_r, \nu)$ is assigned a code W, a string of ones and zeros, defined recursively as follows:
Start with the word 01 for the root $v_0$ and write $W = 0 W_1 \dots W_\ell 1$, where $W_i$ are the still unknown words for the subtrees rooted at the children of $v_0$. In particular, the code of the tree with just one vertex is W = 01.
Applying the same procedure recursively over the children and concatenating the results defines the code.
The tree in the left-hand diagram above is encoded as follows (the children of a vertex are ordered from left to right; each W stands for the still unknown code of the subtree rooted at the respective child):
$$0\,W\,W\,1 \;\to\; 0\,0\,W\,W\,1\,0\,W\,1\,1 \;\to\; 0\,0\,01\,0\,W\,W\,1\,1\,0\,0\,W\,1\,1\,1 \;\to\; 000100101110001111.$$
Imagine drawing the entire plane tree with one move of the pencil, starting with an arrow ending in the root and going downwards with arrows towards the leaves and then upwards to the predecessors, reaching consecutively all the leaves from left to right and writing 0 when going down and 1 when going up. The very last arrow then leaves the root upwards.
Theorem. Two plane trees are isomorphic if and only if their codes are the same.
Proof. By construction, two isomorphic trees are assigned the same code. It remains to show that different codes lead to non-isomorphic plane trees.
This is proved by induction on the length of the code (i.e. the number of zeros and ones). This length is $2(|E| + 1)$ (twice the number of vertices); therefore, the proof can be viewed as an induction on the number of vertices of the tree T. The shortest code corresponds to the smallest tree on one vertex. Assume that the proposition holds for all trees up to n vertices, i.e. for codes of length up to $k = 2n$, and consider a code of the form $0W1$, where W is a code of length $2n$. Find the shortest prefix $W_1$ of W which contains the same number of zeros and ones (when drawing a diagram of the tree, this is the first moment when we return to the root of the tree that corresponds to the code $0W1$). Similarly, find the next part of the code W that contains the same number of zeros and ones, etc. Hence the code W can be written as $W = W_1 W_2 \cdots W_\ell$. By the induction hypothesis, the codes $W_i$ correspond uniquely (up to isomorphism) to plane trees, and the order of their roots, being the children of our tree T, is given uniquely by the order in the code. Therefore, the tree T is determined uniquely by the code $0W1$ up to isomorphism. □
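Both directions of the theorem can be illustrated by a short sketch. Here a plane tree is represented, as an assumption of this example, by the nested list of its subtrees in left-to-right order (a leaf is the empty list); `encode` follows the recursive definition of the code and `decode` follows the proof, cutting the inner word into the shortest blocks with equally many zeros and ones.

```python
def encode(tree):
    """Code of a plane tree given as the list of its children's subtrees."""
    return "0" + "".join(encode(child) for child in tree) + "1"

def decode(code):
    """Rebuild the nested-list tree from its code."""
    inner = code[1:-1]              # strip the root's 0 ... 1
    children, depth, start = [], 0, 0
    for pos, bit in enumerate(inner):
        depth += 1 if bit == "0" else -1
        if depth == 0:              # a balanced block W_i ends here
            children.append(decode(inner[start:pos + 1]))
            start = pos + 1
    return children

# Shape read off from the code 000100101110001111 derived above:
# the root has two children; the first has a leaf and a vertex with two leaves,
# the second is the top of a path with two further vertices.
example = [[[], [[], []]], [[[]]]]
print(encode(example))
print(decode(encode(example)) == example)
```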
Use the encoding of plane trees to encode any tree. Deal first with the case of rooted trees. Determine the order of the children of every vertex uniquely up to isomorphism. The order is unimportant if and only if the subgraphs determined by the respective children are isomorphic.
[Figure: the final steps; the edges of capacity 100 end up saturated (100/100)]
We can see that 200 iterations were needed in order to find the maximum flow. □
13.D.2. Find the size of a maximum flow in the network given by the following matrix A, where vertex 1 is the source and vertex 8 is the sink. Further, find the corresponding minimum cut.
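$$A = \begin{pmatrix}
- & 16 & 24 & 12 & - & - & - & - \\
- & - & - & - & 30 & - & - & - \\
- & - & - & - & 9 & 6 & 12 & - \\
- & - & - & - & - & - & 21 & - \\
- & - & - & - & - & 9 & - & 15 \\
- & - & - & - & - & - & - & 9 \\
- & - & - & - & - & - & - & 18 \\
- & - & - & - & - & - & - & -
\end{pmatrix}$$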
Solution. The following augmenting semipaths are found:
1-2-5-8 with residual capacity 15.
1-2-5-6-8 with residual capacity 1.
1-3-5-6-8 with residual capacity 8.
1-4-7-8 with residual capacity 12.
1-3-7-8 with residual capacity 6.
The total size of the found flow is 42. We can see that it is indeed of maximum size from the fact that the cut consisting of edges (5, 8), (6, 8), and (7, 8) also has size 42 (and it is thus a minimum cut). □
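The value 42 can be confirmed by a small program. The sketch below looks for augmenting paths by breadth-first search (the Edmonds-Karp variant), so the paths it finds may differ from those listed in the solution, and the minimum cut it reports may be a different one of the same capacity; the names `max_flow` and `A` and the 0-based vertex numbering are assumptions of this example.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Return (flow value, set of vertices reachable from the source at the end)."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        parent = [None] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] is None:
            u = queue.popleft()
            for v in range(n):
                if parent[v] is None and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] is None:    # no augmenting path is left
            return flow, {v for v in range(n) if parent[v] is not None}
        bottleneck, v = float("inf"), sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:          # push the bottleneck along the path
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

A = [
    [0, 16, 24, 12, 0, 0, 0, 0],
    [0, 0, 0, 0, 30, 0, 0, 0],
    [0, 0, 0, 0, 9, 6, 12, 0],
    [0, 0, 0, 0, 0, 0, 21, 0],
    [0, 0, 0, 0, 0, 9, 0, 15],
    [0, 0, 0, 0, 0, 0, 0, 9],
    [0, 0, 0, 0, 0, 0, 0, 18],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
value, side = max_flow(A, 0, 7)
cut = [(u + 1, v + 1) for u in sorted(side) for v in range(8)
       if v not in side and A[u][v] > 0]
print(value, cut)
```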
13.D.3. The following picture depicts a flow network (the numbers f/c define the actual flow and the capacity of a given edge, respectively). Decide whether the given flow is maximum. If so, justify your answer. If not, find a maximum flow and describe the used procedure in detail. Find a minimum cut in the network.
[Figure: the flow network for this exercise]
The same construction can be used as for the plane trees, ordering the vertices lexicographically with respect to their codes. This means that codes $W_1$, $W_2$ satisfy $W_1 > W_2$ if and only if $W_1$ contains a one at an earlier position than $W_2$ or $W_2$ is a prefix of $W_1$. The rooted tree as a whole is described by the same recursive procedure: if the children of a vertex v are coded by $W_1, \dots, W_\ell$, then the code of the vertex v is
$$0 W_1 \dots W_\ell 1,$$
where the order is selected so that $W_1 \le W_2 \le \dots \le W_\ell$.
If no vertex is designated to be the root of a tree, the root can be chosen so that it is almost "in the middle" of the tree. This can be realized by assigning an integer to every vertex of the tree which describes its eccentricity. The eccentricity $ex_T(v)$ of a vertex v in a graph T is defined to be the greatest possible distance between v and some vertex w in T. This concept is meaningful for all graphs; however, by the absence of cycles in trees, it is guaranteed that there are at most two vertices with the minimal eccentricity.
Lemma. Let C(T) be the set of those vertices of a tree T whose eccentricity is minimal. Then, C(T) contains either a single vertex, or exactly two vertices, which are connected with an edge in T.
Proof. The claim is proved by induction, using the trivial fact that the most distant vertex from any vertex v must be a leaf. Therefore, the center of T coincides with the center of the tree $T'$ which is created from the tree T by removing all its leaves and the corresponding edges. After a finite number of such steps, there remains either just one vertex, or a subtree with two vertices. □
C(T) determined by the latter lemma is called the center of the graph, and the minimal eccentricity is called the radius of the graph.
A unique (up to isomorphism) code can now be assigned to every tree. If the center of T contains only one vertex, use it as the root. Otherwise, create the codes for the two rooted subtrees of the original tree without the edge that connects the vertices of the center, and the code of T is the code of the rooted tree (T, x), where x is the vertex of the center whose subtree has lexicographically smaller code.
Corollary. Trees T and $T'$ are isomorphic if and only if they are assigned the same code.
The above ideas imply that the algorithm for verifying tree isomorphism can be implemented in linear time with respect to the number of vertices of the trees.
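The whole procedure (find the center by stripping leaves, root the tree there, sort the codes of the children) can be sketched as follows. This is a simple illustration rather than the linear-time implementation mentioned above, since it sorts strings at every vertex; trees are given as adjacency dictionaries and all function names are hypothetical.

```python
def adjacency(edges):
    """Adjacency dictionary of an undirected graph given by its edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def center(adj):
    """Strip leaves layer by layer until one or two vertices remain."""
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    leaves = [v for v in adj if degree[v] <= 1]
    while len(remaining) > 2:
        next_leaves = []
        for leaf in leaves:
            remaining.discard(leaf)
            for w in adj[leaf]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] == 1:
                        next_leaves.append(w)
        leaves = next_leaves
    return sorted(remaining)

def rooted_code(adj, root, parent=None):
    """Code of the rooted tree: children's codes in lexicographic order."""
    codes = sorted(rooted_code(adj, c, root) for c in adj[root] if c != parent)
    return "0" + "".join(codes) + "1"

def tree_code(adj):
    """Code of an unrooted tree, rooted at a suitable vertex of its center."""
    c = center(adj)
    if len(c) == 1:
        return rooted_code(adj, c[0])
    x, y = c
    # codes of the two subtrees obtained by deleting the central edge
    root = x if rooted_code(adj, x, y) <= rooted_code(adj, y, x) else y
    return rooted_code(adj, root)

t1 = adjacency([(1, 2), (2, 3), (3, 4), (3, 5)])
t2 = adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")])
p5 = adjacency([(1, 2), (2, 3), (3, 4), (4, 5)])
print(tree_code(t1) == tree_code(t2), tree_code(t1) == tree_code(p5))
```

Python's built-in comparison of strings of zeros and ones agrees with the order described above: the code with a one at the first differing position is the larger one, and a prefix is smaller than its extension.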
The trees form a special class of graphs. They are often used in miscellaneous variations and with additional requirements. We return to them later, in connection with practical applications.
Now follows another extraordinarily important class of graphs.
Solution. In the given network, there exists an augmenting (semi)path 1-2-3^1-8 with residual capacity 4. Its saturation results in a flow of size 32. Since the cut (3,8), (5,8), (2,4), (6,4) is of the same size, we have found a maximum flow. □
13.D.4. Find a maximum flow and a minimum cut in the following flow network (source = 1, sink = 14).
[Figure: the flow network for this exercise, with edge labels f/c]
Solution. The paths are saturated in the following order:
1-2-7-10-5-8-11-13-14
1-2-7-10-5-8-13-14
We have found a flow of size 50. And indeed, it is a maximum flow since there is no further unsaturated path. If we look for the reachable vertices, we can also find a cut with capacity 50, consisting of edges
[2,4] : 4, [7,9] : 2, [7,12] : 4, [10,12] : 2, [10,14] : 6, [13,14] : 32. □
13.D.5. Find a maximum flow in the following network on the vertex set {1, 2, ..., 9} with source 1 and sink 9 using the Ford-Fulkerson algorithm (during the depth-first search, choose the vertices in ascending order). Find a minimum cut in this network. Describe the steps of the procedure in detail. The edges $e \in E$ as well as the lower and upper bounds on the flow ($l(e)$ and $u(e)$) and the current flow $f(e)$ are given in the table:
13.1.19. Planar graphs. Some graphs are drawn in the plane in such a way that their edges do not "cross" one another. This means that every vertex of the graph is identified with a point of the plane, and an edge between vertices v, w corresponds to a continuous curve $c : [0,1] \to \mathbb{R}^2$ that connects the vertices $c(0) = v$ and $c(1) = w$. Furthermore, suppose that edges may intersect only at their endpoints. This describes a planar graph G.
The question whether a given graph admits a realization as a planar graph often emerges in practical problems. Here is an example:
Providers of water, electricity, and gas have their connection spots near three houses (each provider has one spot). Each house wants to be connected to each resource so that the connections would not cross (they might want not to dig too deep, for instance). Is it possible to do this? The answer is: "no".
In this particular case, it is clear from the diagram. There is a complete bipartite graph $K_{3,3}$, where three of the vertices correspond to the connection spots, while the other three represent the houses. The edges are the connections between the spots and the houses. All edges can be placed except the last one - see the diagram, where the dashed edge cannot be drawn without crossing any more:
For a complete proof, more mathematical tools are needed. A complete explanation is not provided here, but an indication of the reasoning follows.
One of the basic results from the topology of the plane (the Jordan curve theorem) states that every closed continuous curve c in the plane that is not self-intersecting (i.e. it is a "crooked circle") divides the plane into two distinct parts. In other words, every other continuous curve which connects a point inside the curve c and a point outside c must intersect c. If the edges are realized as piecewise linear curves (every edge composed of finitely many adjacent line segments), then it is quite easy to prove the Jordan curve theorem (you might do it yourself!). The general theorem can be proved by approximating the continuous curves by piecewise linear ones (quite difficult to do, but it is much easier if the curve is assumed to be piecewise differentiable).
Consider the graph $K_{3,3}$. The triples of vertices that are not connected with edges are indistinguishable up to order. Therefore the thick cycle can be considered the general case of a cycle with four points in the graph. The position of the remaining two vertices can then be discussed. In order for the graph to be planar, either both of the vertices must lie inside the cycle, or both outside. Again, these possibilities are equivalent, so it can be assumed without loss of generality
e	l(e)	u(e)	f(e)
(1,2)	0	6	0
(1,3)	0	6	0
(1,6)	0	4	0
(2,3)	0	2	0
(2,4)	0	3	0
(3,4)	0	4	0
(3,5)	0	4	0
(4,5)	3	5	4
(4,8)	0	3	0
(5,1)	0	3	0
(5,6)	0	6	0
(5,7)	0	5	4
(5,8)	0	5	0
(6,9)	0	5	0
(7,4)	1	6	4
(7,9)	0	3	0
(8,9)	0	9	0
o
13.D.6. A cut in a network (V, E, s, t, w) can also be viewed as a set $C \subseteq E$ of edges such that in the network $(V, E \setminus C, s, t, w)$, there is no path from the source s to the sink t, but if any edge e is removed from C, then the resulting set does not satisfy this property, i.e., there is a path from s to t in $(V, (E \setminus C) \cup \{e\}, s, t, w)$. Find all cuts (and their sizes) in the following network:
Solution. Let us fix the following edge labeling:
that they are in opposite sides, as are the black vertices on the diagram. Now, their position with respect to a suitable cycle with two thick edges and two thin black edges can be discussed (i.e. through three gray vertices and one black one). Then, we can discuss the position of the remaining black vertex with respect to this cycle. This leads to the impossibility of drawing the last (dashed) edge without crossing the thick cycle.
It can be shown similarly that the complete graph $K_5$ is not planar either. We provide a purely combinatorial argument why $K_5$ and $K_{3,3}$ cannot be planar graphs below, see the Corollary at the end of the next subsection.
Notice that if a graph G is "expanded" by dividing some of its edges (i.e. adding new vertices in the edges), then the new graph is planar if and only if G is planar. The new graph is a subdivision of G. Planar graphs must not contain any subdivision of $K_{3,3}$ or $K_5$. The reverse implication is also true:
KURATOWSKI THEOREM
Then, there are cuts: {f,i}, {f,h,j,a}, {f,j,c,a,d,e}, {f,j,c,a,d,g}, {b,j,c}, {b,j,h}, {b,i}. Their capacities are 12, 9, 20, 18, 15, 10, and 15, respectively. □
Theorem. A graph G is planar if and only if none of its subgraphs is isomorphic to a subdivision of $K_{3,3}$ or $K_5$.
The proof is complicated, so it is not discussed here.
Much attention is devoted to planar graphs both in research and practical applications.
There are algorithms which are capable of deciding whether or not a given graph is planar in linear time. Direct application of the Kuratowski theorem would lead to a worse time complexity.
13.1.20. Faces of planar graphs. Consider a planar graph G embedded in the plane $\mathbb{R}^2$. Let S be the set of those points $x \in \mathbb{R}^2$ which do not belong to any edge of the graph (nor are vertices). In this way, the set $\mathbb{R}^2 \setminus G$ is partitioned into connected subsets $S_i$, called the faces of the planar graph G. Since the graphs are finite, there is exactly one unbounded face $S_0$. The set of all faces is denoted by $S = \{S_0, S_1, \dots, S_k\}$, and the planar graph by $G = (V, E, S)$.
The simplest case of a planar graph is a tree. Every tree is a planar graph since it can be constructed by step-by-step addition of leaves, starting with a single vertex. Of course, the Kuratowski theorem can also be applied: when there is no cycle in a graph G, then there cannot be a subdivision of $K_{3,3}$ or $K_5$, either. Since a tree G cannot contain a cycle, there is only one face $S_0$ there (the unbounded face). Since the number of edges of a tree is related to the number of its vertices, cf. the formula 13.1.15(5), it follows that
$$|V| - |E| + |S| = 2$$
for all trees.
Surprisingly, the latter formula linking the number of edges, faces, and vertices can be derived for all planar graphs. The formula is named after Leonhard Euler. In particular, the
13.D.7. Find a maximum flow in the network given in the above exercise. O Further exercises on maximum flows and minimum cuts can be found on page 957.
E. Classical probability and combinatorics
In this section, we recall the methods we learned as early as in the first chapter.
13.E.1. We throw n dice. What is the probability that none of the values 1, 3, 6 is cast?
Solution. We can also see the problem as throwing one die n times. The probability that none of the values 1, 3, 6 is cast in the first throw is 1/2. The probability that they are cast neither in the first throw nor in the second one is clearly 1/4 (the result of the first throw has no impact on the result of the second one). Since this holds generally, i.e., the results of different throws are (stochastically) independent, the wanted probability is $1/2^n$. □
13.E.2. We have a pack of ten playing cards, exactly one of which is an Ace. Each time, we randomly draw one of the ten cards and then put it back. How many times do we have to repeat this experiment if we require the probability of getting the Ace at least once to be at least 0.9?
Solution. Let $A_i$ be the event "the Ace is picked in the i-th draw". The events $A_i$ are (stochastically) independent. Hence, we know that
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = 1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_n))$$
for every $n \in \mathbb{N}$. We are looking for an $n \in \mathbb{N}$ such that
$$1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_n)) \ge 0.9.$$
Apparently, we have $P(A_i) = 1/10$ for any $i \in \mathbb{N}$. Therefore, it suffices to solve the inequality
$$1 - \left(\frac{9}{10}\right)^n \ge 0.9,$$
whence
$$n \ge \frac{\log_a 0.1}{\log_a 0.9}, \quad \text{where } a > 1.$$
Evaluating this, we find out that we have to repeat the experiment at least 22 times. □
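The bound can be double-checked by simply multiplying out the probabilities of missing the Ace. The sketch below uses exact fractions to avoid rounding issues; the function name `draws_needed` is illustrative.

```python
from fractions import Fraction

def draws_needed(p_single, target):
    """Smallest n with 1 - (1 - p_single)^n >= target."""
    miss = Fraction(1)          # probability of no success so far
    n = 0
    while 1 - miss < target:
        miss *= 1 - p_single
        n += 1
    return n

print(draws_needed(Fraction(1, 10), Fraction(9, 10)))
```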
13.E.3. We randomly draw six cards from a pack of 32 cards (containing four Kings). Calculate the probability that the sixth card is a King and, at the same time, it is the only King drawn.
number of faces is independent of the particular embedding of the graph in the plane:
Euler's formula
Theorem. Let $G = (V, E, S)$ be a connected planar graph. Then,
$$|V| - |E| + |S| = 2.$$
Proof. The proof is by induction on the number of edges. The graph with zero or one edge satisfies the formula. Consider a graph G with $|E| > 1$. If G does not contain a cycle, then it is a tree, and the formula is already proved for this case.
Suppose that there is an edge e of G that is contained in a cycle. Then, the graph $G' = G \setminus e$ is connected, and it follows from the induction hypothesis that $G'$ satisfies Euler's formula:
$$|V| - (|E| - 1) + (|S| - 1) = 2,$$
since removing an edge necessarily leads to merging two faces of G into one face in G'. Hence Euler's formula is valid for the graph G. □
Corollary. Let $G = (V, E, S)$ be a planar graph with $|V| = n \ge 3$ and $|E| = e$. Then
• There is the inequality $e \le 3n - 6$, which becomes equality if and only if G is a maximal planar graph (adding any edge to G would violate planarity).
• If G does not contain a triangle (i.e. the graph $K_3$ is not a subgraph), then $e \le 2n - 4$.
Proof. Continue adding edges to a given graph until it is maximal. If the obtained maximal graph G satisfies the inequality with equality, then the inequality holds for the original graph as well.
Similarly, if the graph G is not connected, two of its components can be connected with a new edge, so such a graph cannot be maximal. Even if it were connected but not 2-connected, there would exist a vertex $v \in V$ such that when it is removed, the graph G collapses into several components $G_1, \dots, G_k$, $k \ge 2$. However, then an edge can be added between these components without destroying the planarity of the original graph G (draw a diagram!). Therefore, it can be assumed from the beginning that the original graph G is a maximal planar 2-connected graph.
As shown in theorem 13.1.11, every 2-connected graph can be constructed from the triangle $K_3$ by splitting edges and attaching new ones. It is easily proved by induction that every face of a planar graph must be bounded by a cycle (which seems intuitively apparent).
However, if there is a face of our maximal planar graph G that is not bounded by a triangle, then this face can be split with another edge (a "diagonal" in geometrical terminology), so G would not be maximal. It follows that all faces of G are bounded by triangles $K_3$. Hence $3|S| = 2|E|$.
Solution. By the theorem on product of probabilities, the result is
$$\frac{28}{32} \cdot \frac{27}{31} \cdot \frac{26}{30} \cdot \frac{25}{29} \cdot \frac{24}{28} \cdot \frac{4}{27} \approx 0.0723.$$ □
13.E.4. We randomly draw two cards from a pack of 32 cards (containing four Aces). Calculate the probability that the second card drawn is an Ace if: a) the first card is put back; b) the first card is not put back.
Solution. If the first card is put back in the pack, then we clearly repeat an experiment with 32 possible outcomes (with the same probability), 4 of which are favorable. Therefore, the wanted probability is 1/8. However, even if we do not put the first card back, the probability is the same. Clearly, the probability of a given card being drawn the first time is the same as for the second time. Of course, we can also apply the conditional probability. This leads to
$$\frac{4}{32} \cdot \frac{3}{31} + \frac{28}{32} \cdot \frac{4}{31} = \frac{1}{8}.$$
13.E.5. Combinatorial identities. Use combinatorial means to derive the following important identities (in particular, do not use induction):
Arithmetic series: $\sum_{k=0}^{n} k = \frac{n(n+1)}{2}$
Geometric series: $\sum_{k=0}^{n} x^k = \frac{x^{n+1} - 1}{x - 1}$
Binomial theorem: $(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$
Upper binomial theorem: $\sum_{k=0}^{n} \binom{k}{m} = \binom{n+1}{m+1}$
Vandermonde's convolution¹: $\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{k} \binom{n}{r-k}$
O
13.E.6. Texas Hold 'em Poker. Now, we solve several simple problems about one of the most popular card games-Texas Hold'em Poker. We do not present its rules; they can be easily found on the Internet. What is the probability that:
i) we are dealt a pair?
ii) we are dealt an Ace?
iii) we have one of the six best poker combinations at the end?
iv) we win if we are holding an Ace and a Three and there are three Twos and the differently-suited Ace on the table (the river has not been dealt yet)?
Solution.
i) There are 4 cards of each of 13 ranks. Therefore, there are $13\binom{4}{2} = 78$ pairs. The total number of two-card hands is $\binom{52}{2} = 1326$. Thus, the wanted probability is $\frac{78}{1326} \approx 0.06$.
It suffices to substitute $|S| = \frac{2}{3}|E|$ for the number of faces in Euler's formula.
The second proposition is analogous; now, the faces of the maximal planar graph without triangles are bounded by either four or five edges, whence it follows that $4|S| \le 2|E|$ with the equality if and only if there are just quadrangles there.
□
The corollary implies (even without the Kuratowski theorem) that neither $K_5$ nor $K_{3,3}$ is planar: in the former case, $|V| = 5$ and $|E| = 10 > 3|V| - 6$, while in the latter, $|V| = 6$, $|E| = 9 > 2|V| - 4$, which is again a contradiction since $K_{3,3}$ does not contain a triangle.
13.1.21. Convex polyhedra in the space. Planar graphs can be imagined as drawn on the sphere instead of in the plane. The sphere can be constructed from the plane by attaching one point "at infinity".
Again, faces of such graphs can be discussed, and the faces are now equivalent to one another (even the face $S_0$ is bounded).
Conversely, every convex polyhedron $P \subset \mathbb{R}^3$ can be imagined as a graph drawn on the sphere (project the vertices and edges of the polyhedron onto a sufficiently large sphere from any point inside P). Dropping a point inside one of the faces (that face becomes the unbounded face $S_0$) then leads to the planar graph as above - the sphere with the hole is spread in the plane.
The planar graphs that are formed of convex polyhedra are clearly 2-connected since every pair of vertices of a convex polyhedron lies on a common cycle. Moreover, every face is interior to its boundary cycle and the graphs of convex polyhedra are always 3-connected.
In fact, they are exactly such graphs, as the following theorem (called Steinitz's theorem) says (we omit the proof):
Steinitz's polyhedra theorem
Theorem. A graph G is the graph of a convex polyhedron if and only if it is planar and 3-vertex-connected.
Also called hockey identity.
13.1.22. Platonic solids. As an illustration of the combinatorial approach to polyhedral graphs, we classify all the regular polyhedra. These are those built up from one type of regular polygons so that the same number of them touch at every vertex. It was known as early as in the epoch of the ancient philosopher Plato that there are only five of them:
ii) One of the cards is an Ace (there are four possibilities) and the other one is arbitrary (51 possibilities). However, this includes the $\binom{4}{2} = 6$ pairs of Aces twice. Therefore, the number of favorable cases is only $4 \cdot 51 - 6 = 198$ and the wanted probability is $\frac{198}{1326} \approx 0.15$.
iii) We compute the probabilities of the particular combinations when dealt five cards at random:
ROYAL FLUSH: There is exactly one such combination for each suit, four in total. Further, there are $\binom{52}{5} = 2598960$ possibilities for a hand of five cards. Thus, the probability is approximately $1.5 \cdot 10^{-6}$, very low indeed.
STRAIGHT FLUSH: The highest card of the straight must be between 5 and K, i.e., there are 9 possibilities for each suit. Altogether, the probability is $\frac{36}{2598960} \approx 1.4 \cdot 10^{-5}$.
POKER (FOUR OF A KIND): There are 13 possibilities for the quad and the fifth card can be arbitrary (48 possibilities). Hence: $\frac{624}{2598960} \approx 2.4 \cdot 10^{-4}$.
FULL HOUSE: There are $13\binom{4}{3} = 52$ possibilities for the triple and $12\binom{4}{2} = 72$ possibilities for the remaining pair. Altogether, $\frac{52 \cdot 72}{2598960} \approx 1.4 \cdot 10^{-3}$.
FLUSH: There are 4 suits and $\binom{13}{5}$ hands for each suit, i.e., $4 \cdot \binom{13}{5} = 5148$ possibilities in total. However, we must not count the straights again. There are 40 of them, so the resulting probability is $\frac{5108}{2598960} \approx 2 \cdot 10^{-3}$.
STRAIGHT: The highest card of the straight is between 5 and A, so there are 10 possibilities. Selecting the suit of each card arbitrarily, this gives $10 \cdot 4^5 = 10240$ possibilities. However, we must exclude flushes, so the total probability is $\frac{10200}{2598960} \approx 3.9 \cdot 10^{-3}$.
Altogether, the probability of one of the best six combinations is approximately $3.9 \cdot 10^{-3} + 2 \cdot 10^{-3} + 1.4 \cdot 10^{-3} + 0.24 \cdot 10^{-3} = 7.54 \cdot 10^{-3}$, i.e., about 0.75%.
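The counts used above are easy to recompute with binomial coefficients. The following sketch only restates the counting arguments of this solution in code; the dictionary `counts` and its keys are illustrative names.

```python
from math import comb

total = comb(52, 5)                      # all five-card hands
counts = {
    "royal flush": 4,
    "straight flush": 4 * 9,             # highest card 5..K in each suit
    "four of a kind": 13 * 48,
    "full house": 13 * comb(4, 3) * 12 * comb(4, 2),
    "flush": 4 * comb(13, 5) - 40,       # the 40 straight flushes removed
    "straight": 10 * 4**5 - 40,          # the 40 straight flushes removed
}
for name, count in counts.items():
    print(f"{name}: {count} hands, probability {count / total:.2e}")
```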
In the Texas Hold 'em variation, the best 5-card hand of the seven cards is always considered. We have computed the number of favorable 5-card hands and there are $\binom{52-5}{2}$ possibilities for the remaining two cards. The total number of 7-card hands is $\binom{52}{7}$. We can thus approximate the probability for Texas Hold 'em from the classic Poker by multiplying by the coefficient $\binom{52}{5}\binom{47}{2} / \binom{52}{7} = 21$.
However, note that this is indeed only an approximation of the actual probability since some favorable combinations are counted more than once this way. For instance, we have a full house in the considered 5-card
Translate the condition of regularity to the properties of the corresponding graphs: Every vertex needs the same degree $d \ge 3$, and the boundary of every face must contain the same number $k \ge 3$ of vertices. Let $n$, $e$, $s$ denote the total number of vertices, edges, and faces, respectively.
Firstly, the relation of the vertex degrees and the number of edges requires
$$dn = 2e.$$
Secondly, every edge lies in the boundary of exactly two faces, so
$$2e = ks.$$
Thirdly, Euler's formula states that
$$2 = n - e + s = \frac{2e}{d} - e + \frac{2e}{k}.$$
Put this together. The constants $d$ and $k$ must satisfy
$$\frac{1}{d} - \frac{1}{2} + \frac{1}{k} = \frac{1}{e}.$$
Since $d$, $k$, $e$, $n$ are positive integers (in particular, $\frac{1}{e} > 0$), this equality restricts the possibilities. Especially, the left-hand side is maximal for $d = 3$. Substitute this value, to obtain the inequality
$$-\frac{1}{6} + \frac{1}{k} \ge \frac{1}{e} > 0.$$
It follows that $k \in \{3, 4, 5\}$ for a general $d$. The roles of $k$ and $d$ are symmetric in the original equality, so also $d \in \{3, 4, 5\}$. Checking each of the remaining possibilities yields all the solutions:
d	k	n	e	s
3	3	4	6	4
3	4	8	12	6
4	3	6	12	8
3	5	20	30	12
5	3	12	30	20
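The table can be reproduced by a direct search over small values of $d$ and $k$. This is a minimal sketch; the function name `platonic_parameters` and the search bound `limit` are our own assumptions (the argument above shows that nothing is lost by bounding $d$ and $k$).

```python
from fractions import Fraction

def platonic_parameters(limit=20):
    """Return all (d, k, n, e, s) with 3 <= d, k < limit and 1/d - 1/2 + 1/k = 1/e."""
    solutions = []
    for d in range(3, limit):
        for k in range(3, limit):
            lhs = Fraction(1, d) - Fraction(1, 2) + Fraction(1, k)
            if lhs <= 0:
                continue              # 1/e must be positive
            inverse = 1 / lhs         # candidate for the number of edges e
            if inverse.denominator != 1:
                continue              # e must be an integer
            e = inverse.numerator
            n, s = 2 * e // d, 2 * e // k   # from dn = 2e and ks = 2e
            solutions.append((d, k, n, e, s))
    return solutions

for row in platonic_parameters():
    print(row)
```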
It remains to show that the corresponding regular polyhedra exist. This is already seen in the above diagrams, but that is not a mathematical proof. The existence of the first three is apparent. Concentrate on the geometrical construction of the regular dodecahedron (draw a diagram!).
Begin with a cube, building "A-tents" on all its sides simultaneously. The upper horizontal poles are set on the level of the cube's sides so that those of adjacent sides are perpendicular to each other. Their length is chosen so that the trapezoids of the lateral sides would have three sides of the same length. Now, simultaneously
CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS
hand and if the arbitrary pair contains the fourth card of the triple, then we actually have a poker (four of a kind), so this combination has been counted more times. On the other hand, the true result differs only slightly from the computed approximation, so the probability of one of the best six poker combinations is about twenty times higher than in classic Poker. This may be the reason why this variation is so popular.
iv) Clearly, our situation is very good. Hence, it will be easier to count the unfavorable cases when the other player has a better combination. We have Twos full of Aces, so we lose only if the opponent has Aces full of Twos or a poker of Twos, i. e., he must hold the remaining Two or an Ace. In the former case, he surely wins, and this happens in $0 + 3 + 4 + \dots + 4 + 2 = 45$ cases (we can see the remaining Twos, a Three, and two Aces) out of all $\binom{46}{2} = 1035$, so the probability of this loss is about 0.043. In the latter case, there are more possibilities. If he holds a pair of Aces and the river card is not the remaining Two, then we lose; otherwise (i. e., if he has only one Ace or the Two appears on the river), it is a tie. Thus, the probability that we lose in this case is about $10^{-3}$. Altogether, the probability that we win or draw is almost 96 %. □
13.E.7. Four players are given two cards each from a pack consisting of four Aces and four Kings. What is the probability that at least one of the players is given a pair of Aces? Express the result as a ratio of two-digit integers. O
13.E.8. Alex owns two special dice: one of them has only 6's on its sides. The other one has two 4's, two 5's, and two 6's. Martin has two ordinary dice. Each of the players throws his dice and the one whose sum is higher wins. What is the probability that Alex wins? Express the result as a ratio of two-digit integers. O
13.E.9. In how many ways can we place $n$ rooks on an $n \times n$ chessboard so that every unoccupied square is guarded by at least one of the rooks?
Solution. Clearly, the condition is satisfied if and only if at least one of the following holds: There is at least one rook in each rank (which implies that there must be exactly one rook in each rank; there are $n^n$ such placements since the particular squares can be selected independently for each rank); or there is at least one rook in each file (again resulting in $n^n$
raise all tents while keeping the ratio of the three sides of the trapezoids. There is a position at which the adjacent trapezoid and triangle sides are coplanar. At that position, the regular dodecahedron is created.
Now, the regular icosahedron can be constructed via the so-called dual graph construction. The dual graph $G'$ to a planar graph $G = (V, E, S)$ has vertices defined as the faces in $S$, and there is an edge between faces $S_1$ and $S_2$ if and only if they share an edge (i.e. are neighbours) in $G$. Clearly, the dual graph to the dodecahedron is the icosahedron, exactly as the cube and the octahedron are dual, while the tetrahedron is dual to itself.
2. A few graph algorithms
In this part, we consider several applications of graph concepts and the algorithms built upon them.
13.2.1. Algorithms and graph representations. As already indicated, algorithms are often formulated with the help of the language of graphs.
The concept of an algorithm can be formalized as a procedure dealing with a (directed) graph whose vertices and/or edges are equipped with further information. The procedure consists in walking through the graph along its edges, while processing the information associated to the visited vertices and edges. Of course, processing the information also includes the decision which outgoing edges must be investigated in a further walk, and in which order.
In the case of an undirected graph, each (undirected) edge can be replaced with a pair of directed edges.
The graph may also be changed during the run of the algorithm, i.e. vertices and/or edges may be added or removed.
In order to execute such algorithms efficiently (usually on a computer), it is necessary to represent the graph in a suitable way. The adjacency matrix representation is one possibility, cf. 13.1.8. There are many other options based on various lists with suitable pointers.
The edge list (also the adjacency list) of the graph $G = (V, E)$ consists of two lists $V$ and $E$ that are interconnected by pointers so that every vertex points to the edges it is incident to, and every edge points to its endpoints.
The memory necessary to represent the graph as an edge list is $O(|V| + |E|)$ since every edge is pointed at twice and every vertex is pointed at $d$ times, where $d$ is its degree, and the sum of the degrees of all vertices equals twice the number of edges. Therefore, up to a constant multiple, this is an optimal way of graph representation in memory. It is of interest how the basic graph operations are processed in both representations. By the basic operations, we mean:
• removal of an edge,
• addition of an edge,
• removal of a vertex,
• addition of a vertex,
• splitting an edge with a new vertex.
placements). However, there are $n!$ placements which satisfy both (we have $n$ squares where to put a rook in the first rank, $n - 1$ squares for the second rank since one of the files is already occupied, etc.). By the inclusion-exclusion principle, the wanted number of placements is:
$$2n^n - n!.$$ □
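For small boards, the formula can be compared with a brute-force count over all placements of $n$ rooks. The sketch below is only an illustration; the function name `guarded_placements` is our own.

```python
from itertools import combinations
from math import factorial

def guarded_placements(n):
    """Count placements of n rooks on an n x n board guarding every free square."""
    squares = [(r, c) for r in range(n) for c in range(n)]
    count = 0
    for rooks in combinations(squares, n):
        occupied = set(rooks)
        rows = {r for r, _ in rooks}
        cols = {c for _, c in rooks}
        # A free square is guarded if some rook shares its rank or its file.
        if all(sq in occupied or sq[0] in rows or sq[1] in cols for sq in squares):
            count += 1
    return count

for n in range(1, 5):
    print(n, guarded_placements(n), 2 * n**n - factorial(n))
```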
13.E.10. We flip a coin five times. Every time it comes up heads, we put a white ball into a hat. Every time it comes up tails, we put a black ball into the hat. Express the probability that there are more black balls than white ones provided there is at least one black ball.
Solution. Let us define the events
$A$: there are more black balls than white ones, $H$: there is at least one black ball.
We want to compute $P(A \mid H)$. Clearly, the probability $P(H^c)$ of the complementary event to $H$ is $2^{-5}$. Further, the probability of $A$ is the same as the probability $P(A^c)$. Therefore, $P(H) = 1 - 2^{-5}$ and $P(A) = 1/2$. Further, $P(A \cap H) = P(A)$ since the event $H$ contains the event $A$ (the event $A$ implies the event $H$). Altogether, we have obtained
$$P(A \mid H) = \frac{P(A \cap H)}{P(H)} = \frac{\frac{1}{2}}{1 - \left(\frac{1}{2}\right)^5} = \frac{16}{31}.$$
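Since there are only $2^5$ equally likely sequences of flips, the conditional probability can also be obtained by listing them all. A short sketch (the variable names are ours):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=5))       # 32 equally likely sequences
black = [seq.count("T") for seq in outcomes]   # tails put black balls into the hat

h = sum(1 for b in black if b >= 1)            # event H: at least one black ball
a_and_h = sum(1 for b in black if b >= 3)      # event A: more black than white (implies H)

p_a_given_h = Fraction(a_and_h, h)
print(p_a_given_h)
```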
F. More advanced problems from combinatorics
In the first chapter, we met the fundamental methods used in combinatorics. Even using merely these ideas, we are able to solve relatively complicated problems.
13.F.1. There are $n$ ($n > 3$) fortresses positioned on a circle, numbered 1 through $n$. At a given moment, every fortress shoots at one of its neighbors (i. e., fortress 1 shoots at $n$ or 2, fortress 2 shoots at 1 or 3, etc.). We will refer to the set of hit fortresses as a result (i. e., we are only interested in whether each fortress was hit or not; it does not matter whether it was hit once or twice). Let $P(n)$ denote the number of possible results. Prove that the integers $P(n)$ and $P(n + 1)$ are coprime.
It is apparent that if the adjacency matrix is represented by an array of zeros and ones, then the first and second operations can be executed in $O(1)$ (constant time), while the others are in $O(n)$ (linear time).
In the case of the adjacency list, implementation of the data structures is crucial for the time complexity. However, all of the operations should be proportional to the number of edited data units provided the corresponding item(s) are already found. For instance, if a vertex is removed, then all of the edges that are incident to it must also be removed.
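To make the basic operations concrete, here is a minimal sketch of one possible adjacency-list style representation of a simple undirected graph. It uses a dictionary of neighbour sets instead of explicit pointers; the class `Graph` and its method names are our own illustration, not a structure prescribed by the text.

```python
class Graph:
    """Simple undirected graph: each vertex is mapped to the set of its neighbours."""

    def __init__(self):
        self.adj = {}

    def add_vertex(self, v):
        self.adj.setdefault(v, set())

    def add_edge(self, u, v):
        self.add_vertex(u)
        self.add_vertex(v)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def remove_vertex(self, v):
        # All edges incident to v must be removed as well; the work done
        # is proportional to the degree of v.
        for u in self.adj.pop(v):
            self.adj[u].discard(v)

    def split_edge(self, u, v, w):
        """Replace the edge {u, v} by the path u - w - v with a new vertex w."""
        self.remove_edge(u, v)
        self.add_edge(u, w)
        self.add_edge(w, v)
```

With hashed sets, each of these operations touches only the edited data units (on average), which is the behaviour the paragraph above asks for.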
The matrix representation is also useful in theoretical discussions about graphs, using matrix calculus:
13.2.2. Searching in graphs. Many useful algorithms are based on going through all vertices of a given graph step by step. Usually, the vertex to start with is given, or it is selected at the beginning of the procedure. At every stage of the search, each vertex is in (exactly) one of the following states:
• processed - it has been visited and completely processed;
• active - it has been visited and is prepared to be processed;
• sleeping - it has not been visited yet.
At the same time, information about processed edges is retained. At every stage, the sets of vertices and/or edges in these groups must form a partition of the sets V and E while one of the active vertices is being processed.
The general principle on searching through the vertices is illustrated first. In the subsequent subsections, such procedures are used to build algorithms solving particular problems.
At the beginning of the algorithm, there is just one active vertex and all the others are sleeping. At the first step, traverse all edges incident to the active vertex and change the status of their other endpoints from sleeping to active. Then, the vertex we started with may be marked as processed, and another active vertex may be chosen. In the following steps, always go through those adjacent edges not yet met, marking their other endpoints as active. This algorithm can be applied to both directed and undirected graphs.
In practical problems, the search is often restricted to only some edges going from the current vertex. This is an insignificant change to the algorithm.
To specify the algorithm completely, a decision must be made in which order to process the active vertices and in which order to process the edges going from the current vertex. In general, the two simplest possibilities of processing the vertices are:
(1) they are processed in the same order as they were visited (queue),
(2) they are processed in the reverse order to that in which they were visited (stack).
The former case is called a breadth-first search. The latter case is called a depth-first search.
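The general scheme with the three vertex states can be sketched as follows. The input format (a dictionary `adj` mapping each vertex to the list of its neighbours) and the function name `search` are our assumptions. Note that the stack variant follows the description above literally, so its order may differ from the order produced by the usual recursive formulation of the depth-first search.

```python
from collections import deque

def search(adj, start, breadth_first=True):
    """Return the vertices reachable from start in the order they are processed."""
    state = {v: "sleeping" for v in adj}
    state[start] = "active"
    active = deque([start])
    order = []
    while active:
        # queue (first in, first out) -> breadth-first
        # stack (last in, first out)  -> depth-first
        v = active.popleft() if breadth_first else active.pop()
        for w in adj[v]:
            if state[w] == "sleeping":
                state[w] = "active"
                active.append(w)
        state[v] = "processed"
        order.append(v)
    return order

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
print(search(adj, 1, breadth_first=True))
print(search(adj, 1, breadth_first=False))
```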
Solution. First of all, note that a set of hit fortresses is a possible result if and only if no pair of adjacent-but-one fortresses (i. e., whose numbers differ by 2 modulo $n$) is unhit. Therefore, if $n$ is odd, then $P(n)$ is equal to the number $K(n)$ of results where no pair of adjacent fortresses is unhit (consider the order $1, 3, 5, \dots, n, 2, 4, \dots, n - 1$). If $n$ is even, then $P(n)$ equals $K(n/2)^2$ since fortresses at even positions and those at odd positions can be considered independently.
We can easily derive the following recurrent formula for $K(n)$: $K(n) = K(n - 1) + K(n - 2)$. (Well, on the other hand, it is not so trivial... It is left as an exercise for the reader.) Further, we can easily calculate that $K(2) = 3$, $K(3) = 4$, $K(4) = 7$, so $K(2) = F(4) - F(0)$, $K(3) = F(5) - F(1)$, $K(4) = F(6) - F(2)$, and a simple induction argument shows that $K(n) = F(n + 2) - F(n - 2)$, where $F(n)$ denotes the $n$-th term of the Fibonacci sequence ($F(0) = 0$, $F(1) = F(2) = 1$). Moreover, since $(K(2), K(3)) = 1$, we have for $n > 3$ that (similarly as with the Fibonacci sequence)
$$(K(n), K(n-1)) = (K(n) - K(n-1), K(n-1)) = (K(n-2), K(n-1)) = \dots = 1.$$
Now, we are going to show that, for every $n = 2a$, $P(n) = K(a)^2$ is coprime to both $P(n + 1) = K(2a + 1)$ and $P(n - 1) = K(2a - 1)$. It suffices to realize that for $a > 2$, we have
$$\begin{aligned}
(K(a), K(2a+1)) &= (K(a), F(2)K(2a) + F(1)K(2a-1)) \\
&= (K(a), F(3)K(2a-1) + F(2)K(2a-2)) = \dots \\
&= (K(a), F(a+1)K(a+1) + F(a)K(a)) \\
&= (K(a), F(a+1)) = (F(a+2) - F(a-2), F(a+1)) \\
&= (F(a+2) - F(a+1) - F(a-2), F(a+1)) \\
&= (F(a) - F(a-2), F(a+1)) \\
&= (F(a-1), F(a+1)) = (F(a-1), F(a)) = 1.
\end{aligned}$$
The role of the data structures used for representing the graph is immediately apparent: The adjacency list allows passage through all edges going from a given vertex in a time proportional to the number of them. Each edge is visited at most twice since it has only two endpoints. Hence the following result:
Theorem. Both the breadth-first and depth-first searches run in $O((n + m)K)$ time, where $n$ and $m$ are the number of vertices and edges of the graph, respectively, and $K$ is the time needed for processing an edge or a vertex.
The following diagram illustrates the breadth-first search through the Petersen graph:
[Figure: the first eight steps of the breadth-first search through the Petersen graph.]
The first 8 steps are shown here. The circled vertex is the one to be processed, the bold vertices are the already processed ones, while the dashed edges are those that have been processed, and the small vertices adjacent to some dashed edges are the active ones. At the given vertex, the edges are processed counterclockwise, beginning with the direction "straight down".
The diagram below illustrates the depth-first search applied to the same graph. Note that the first step is the same as above.
As a simple example of graph searching, consider an algorithm for finding all connected components of a given graph. The only information that must be processed during the search (no matter whether breadth-first or depth-first) is which component is being examined.
The search, as here presented, passes exactly the vertices of a single component. Hence, one can start with all vertices in the sleeping state and choose any one of them. During the search, whenever there are no more active vertices to be processed, the search of one component is finished. One can then choose an arbitrary sleeping vertex and continue likewise. The algorithm terminates as soon as there are no more sleeping vertices remaining.
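The component search just described can be sketched in a few lines. We assume, for illustration only, that the undirected graph is given as a dictionary `adj` mapping each vertex to the list of its neighbours; the function name is ours. A vertex without a label plays the role of a sleeping vertex.

```python
from collections import deque

def connected_components(adj):
    """Label every vertex of an undirected graph with the index of its component."""
    component = {}                 # vertex -> component index; unlabelled = sleeping
    current = 0
    for start in adj:
        if start in component:
            continue               # already reached by an earlier search
        component[start] = current
        active = deque([start])
        while active:              # breadth-first search of one component
            v = active.popleft()
            for w in adj[v]:
                if w not in component:
                    component[w] = current
                    active.append(w)
        current += 1               # no active vertices left: component finished
    return component

print(connected_components({1: [2], 2: [1], 3: [], 4: [5], 5: [4]}))
```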
$$\begin{aligned}
(K(a), K(2a-1)) &= (K(a), F(2)K(2a-2) + F(1)K(2a-3)) \\
&= (K(a), F(3)K(2a-3) + F(2)K(2a-4)) \\
&= \dots = (K(a), F(a)K(a) + F(a-1)K(a-1)) \\
&= (K(a), F(a-1)) = (F(a+2) - F(a-2), F(a-1)) \\
&= (F(a+2) - F(a), F(a-1)) \\
&= (F(a+2) - F(a+1), F(a-1)) = (F(a), F(a-1)) = 1.
\end{aligned}$$
This proves the proposition.
□
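For small $n$, the statement can also be checked by brute force: we let every fortress choose one of its two neighbors and collect the distinct sets of hit fortresses. The sketch below is only a sanity check, not a replacement for the proof; the function name `count_results` and the 0-based numbering of fortresses are our own.

```python
from itertools import product
from math import gcd

def count_results(n):
    """Number P(n) of distinct sets of hit fortresses on a circle of n fortresses."""
    results = set()
    for shots in product((-1, 1), repeat=n):   # each fortress shoots left or right
        hit = frozenset((i + s) % n for i, s in enumerate(shots))
        results.add(hit)
    return len(results)

P = {n: count_results(n) for n in range(3, 13)}
coprime = all(gcd(P[n], P[n + 1]) == 1 for n in range(3, 12))
print(P, coprime)
```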
G. Probability in combinatorics
Classical probability is tightly connected to combinatorics, as we have already seen in the first chapter. Now, we present another example, which is a bit more complicated.
Combinatorics is hidden even in the following "probabilistic" problem.
13.G.1. There are 100 prisoners in a prison, numbered 1 through 100. The chief guard has placed 100 chests (also numbered 1 through 100) into a closed room and randomly put balls with numbers 1 through 100 on them into the chests so that each chest contains exactly one ball. He has decided to play the following game with the prisoners: He calls them one by one into the room and the invited prisoner is allowed to gradually open 50 chests. Then, he leaves without any possibility to talk to the other prisoners, the guard closes all the chests, and another prisoner is let in. The guard has promised to free all the prisoners provided each of them finds the ball with his number in one of the 50 opened chests. However, if any of the prisoners fails to find his ball, all will be executed. Before the game begins, the prisoners are allowed to agree on a strategy. Does there exist a strategy that gives the prisoners a "reasonable" chance of winning?
Solution. Clearly, if the prisoners choose to open the chests randomly (where the choices of the particular prisoners are independent), the chance for one prisoner to find his ball is 1/2, so the total probability of success is merely $1/2^{100}$. Therefore, it is necessary to look for a strategy where the successes of the prisoners are as dependent as possible. First of all, we should realize that the invited prisoner has no information from other prisoners and does not know the positions of particular balls in the chests. However, once he opens a chest, he knows the ball number it contains and may choose
13.2.3. Natural metrics on graphs. The concept of "path length" was used earlier. It recalls the general idea of distance, and the concept of distance in graphs can indeed be built mathematically in this manner.
For an (undirected) graph $G$, define the distance between vertices $v$ and $w$ to be the number $d_G(v, w)$ of edges on the shortest path from $v$ to $w$. If there is no such path, write $d_G(v, w) = \infty$.
For the sake of simplicity, consider only connected graphs $G$. The function $d_G : V \times V \to \mathbb{N}$ defined as above satisfies the usual three properties of a distance (it is recommended to compare this to the issues from the relevant part of chapter seven, see 7.3.1 on page 483):
• $d_G(v, w) \ge 0$, and $d_G(v, w) = 0$ if and only if $v = w$;
• the distance is symmetric, i.e. $d_G(v, w) = d_G(w, v)$;
• the triangle inequality holds; i.e. for every triple of vertices $v$, $w$, $z$,
$$d_G(v, z) \le d_G(v, w) + d_G(w, z).$$
Thus, $d_G$ is a metric on the graph $G$.
Besides these three properties, every metric on a graph apparently satisfies the following:
• $d_G(v, w)$ is always a non-negative integer;
• if $d_G(v, w) > 1$, then there exists a vertex $z$ distinct from $v$ and $w$ such that $d_G(v, w) = d_G(v, z) + d_G(z, w)$.
The following is true:
Every function $d_G$ on $V \times V$ (for a finite set $V$) satisfying the five properties listed above allows us to define the edges $E$ so that $G = (V, E)$ is a graph with metric $d_G$. Prove this yourself as an exercise! (It is quite clear how to consecutively construct the corresponding graph. It remains "merely" to show that the given function $d_G$ could be achieved as the metric on the constructed graph.)
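The construction hinted at in the exercise can be tried out on an example. In the sketch below, we join two vertices by an edge exactly when their prescribed distance is 1 and then compare the breadth-first distances in the resulting graph with the prescribed ones. The function names and the sample metric (the cyclic distance on six points) are our own choices; the code illustrates the claim on one instance and does not prove it.

```python
from collections import deque

def graph_from_metric(vertices, d):
    """Join v and w by an edge exactly when d(v, w) == 1."""
    return {v: [w for w in vertices if d(v, w) == 1] for v in vertices}

def bfs_distances(adj, start):
    """Number of edges on a shortest path from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

vertices = list(range(6))
d = lambda v, w: min(abs(v - w), 6 - abs(v - w))   # distance along a 6-cycle
adj = graph_from_metric(vertices, d)
recovered = all(bfs_distances(adj, v)[w] == d(v, w) for v in vertices for w in vertices)
print(recovered)
```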
13.2.4. Dijkstra's shortest-path algorithm. One may suspect that the shortest path between a given vertex $v$ and another given vertex $w$ can be found by breadth-first searching the graph. With this approach, discuss first the vertices which are reachable with one edge from the initial vertex $v$, then those which are two edges distant, and so on. This is the fundamental idea of one of the most often used graph algorithms, Dijkstra's algorithm.⁵
This algorithm is able to find the shortest paths even in problems from practice, where each edge e is assigned a weight w(e), which is a positive real number. When looking for shortest paths, the weights are to represent lengths of the
⁵ Edsger Wybe Dijkstra (1930-2002) was a famous Dutch computer scientist, one of the fathers of this discipline. Among other things, he is credited as one of the founders of concurrent computing. He published the above algorithm in 1959.
the next chest accordingly. This suggests the following simple strategy: Every prisoner starts with the chest that bears his number. If it contains the corresponding ball, the prisoner succeeds and can open the remaining chests at random. If not, he opens the chest with the number of the found ball. He continues this way until he eventually finds his ball or opens the fiftieth chest. Since every chest "points" at another chest according to the described procedure, let us call this strategy the pointer strategy.
Probability of success. The guard's possible placements of the balls bijectively correspond to permutations of the numbers 1 through 100. In order to find the probability of success, we must realize for which permutations the pointer strategy works. Recall that every permutation can be expressed as the composition of pairwise disjoint cycles. If each prisoner were allowed to open an arbitrary number of chests, he would find his ball as the last one of the corresponding cycle since he begins with the chest with his number, which is pointed at just by the chest with his ball. It follows that the strategy fails if and only if there is a cycle of length greater than 50 because then no prisoner of this cycle finds his ball in time. Thus, we must count the number of such permutations. In general, the probability that a random permutation of length n contains a cycle of length r > n/2 (there could be more occurrences of shorter cycles; however, there can be at most one cycle of length greater than n/2, which simplifies the calculation) is as follows: We must choose the r elements of the cycle, order them, and then choose an arbitrary permutation of the remaining n — r numbers. This leads to
$$\binom{n}{r}\,(r-1)!\,(n-r)! = \frac{n!}{r}.$$
Therefore, the probability that such permutation is selected (among all the n! permutations) is 1/r. Thus, the probability that our 100 prisoners succeed is
edges. However, in general, the weights may have other meanings: they may stand for profits or costs, network flows, and so on.
The input of the algorithm consists of an edge-weighted graph $G = (V, E)$ and an initial vertex $v_0$. The output consists of the numbers $d_w(v)$, which give the least possible sum of the weights of edges along a path from the vertex $v_0$ to the vertex $v$. This procedure works in undirected graphs as well as in directed ones.
In order to ensure the termination and the correctness of the algorithm, it is important that all of the weights are positive, see example 13.B.6. Dijkstra's algorithm needs only a little modification of the general breadth-first search:
• For every vertex $v$, keep the information $d(v)$, which is an upper bound for the actual distance of $v$ from the initial vertex $v_0$.
• At every stage, the already processed vertices are those for which the shortest path is already known. Then, $d(v) = d_w(v)$.
• When some sleeping vertices are to be made active, choose exactly those vertices $y$ from the set $Z$ of sleeping vertices for which $d(y) = \min\{d(z);\ z \in Z\}$.
Suppose that the graph $G$ has at least two vertices. More formally, Dijkstra's algorithm can be described as follows:
Dijkstra's Algorithm
Input: vertex $v_0$ in the graph $G = (V, E)$ with weights on all edges.
Output: the distances from $v_0$ within $G$ associated to all vertices.
(1) Initialization step: Set the values for all $v \in V$:
$$d(v) = \begin{cases} 0 & \text{for } v = v_0, \\ \infty & \text{for } v \ne v_0. \end{cases}$$
Set $Z = V$, $W = \emptyset$.
(2) Cycle condition: If every vertex $y \in Z$ is assigned $\infty$, the algorithm terminates; otherwise the algorithm continues with another iteration. (In particular, the algorithm terminates if $Z = \emptyset$.)
(3) Update of the vertex statuses:
• Find the set $N$ of those vertices $v \in Z$ for which $d(v) = S$ is as small as possible:
$$S = \min\{d(y);\ y \in Z\}.$$
• All vertices which have been in $W$ are removed and marked as processed; the new set of active vertices is $W = N$, while all these vertices are removed from $Z$, i.e. they are no longer sleeping.
(4) Cycle body: For each edge $e \in E_{WZ}$ (i.e. whose tail is an active vertex $v$ and head is a sleeping vertex $y$):
• if $d(v) + w(e) < d(y)$, then update $d(y)$ to $d(v) + w(e)$.
Move back to check the cycle condition (step 2).
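The steps above can be sketched in code as follows. This is a common heap-based variant rather than a literal transcription: it makes one vertex final at a time instead of the whole set $N$, and with a binary heap its running time is $O((n + m) \log n)$. The input format (a dictionary `adj` mapping each vertex to a list of (neighbour, weight) pairs with positive weights) and the function name are our assumptions.

```python
import heapq
from math import inf

def dijkstra(adj, source):
    """Return the weighted distances d_w(v) from source to every vertex of adj."""
    dist = {v: inf for v in adj}          # d(v): upper bounds, initially infinite
    dist[source] = 0
    processed = set()
    heap = [(0, source)]                  # sleeping vertices with finite d(v)
    while heap:
        d, v = heapq.heappop(heap)        # vertex with the smallest bound
        if v in processed:
            continue                      # outdated heap entry, skip it
        processed.add(v)                  # d(v) is final from now on
        for y, weight in adj[v]:
            if d + weight < dist[y]:      # step 4: improve the bound of y
                dist[y] = d + weight
                heapq.heappush(heap, (dist[y], y))
    return dist

adj = {
    "a": [("b", 7), ("c", 2)],
    "b": [("a", 7), ("c", 3), ("d", 1)],
    "c": [("a", 2), ("b", 3), ("d", 8)],
    "d": [("b", 1), ("c", 8)],
    "e": [],
}
print(dijkstra(adj, "a"))
```

Vertices of other connected components (here the vertex "e") keep the value infinity.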
$$1 - \sum_{k=51}^{100} \frac{1}{k} \approx 0.311828.$$
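The value can be recomputed exactly, and for a small number of prisoners the formula can be compared with an exhaustive run of the pointer strategy over all placements of the balls. The sketch below uses 0-based numbering; all function names are our own.

```python
from fractions import Fraction
from itertools import permutations

def success_probability(n):
    """Exact value of 1 - sum_{k=n/2+1}^{n} 1/k for an even number n of prisoners."""
    return 1 - sum(Fraction(1, k) for k in range(n // 2 + 1, n + 1))

def pointer_strategy_wins(perm):
    """perm[i] is the ball in chest i; each prisoner may open n/2 chests."""
    n = len(perm)
    for prisoner in range(n):
        chest = prisoner                  # start with the chest bearing his number
        for _ in range(n // 2):
            if perm[chest] == prisoner:
                break                     # ball found in time
            chest = perm[chest]           # follow the pointer
        else:
            return False                  # this prisoner ran out of attempts
    return True

def brute_force_probability(n):
    perms = list(permutations(range(n)))
    wins = sum(pointer_strategy_wins(p) for p in perms)
    return Fraction(wins, len(perms))

print(float(success_probability(100)))
print(brute_force_probability(6), success_probability(6))
```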
As we can see, this is much higher than the original $1/2^{100}$. Now, let us look at the behavior of this probability for a general number $n$ of the prisoners (then, each prisoner is allowed to open at most $n/2$ chests). In general, the probability that a random permutation of length $n$ contains a cycle of length greater than $n/2$ is equal to
$$p = \sum_{k=n/2+1}^{n} \frac{1}{k}.$$
Recall that $\sum_{k=1}^{n} \frac{1}{k} - \ln(n) \to \gamma$ for $n \to \infty$, where $\gamma$ is Euler's constant. Thus, we have:
$$p = \sum_{k=1}^{n} \frac{1}{k} - \sum_{k=1}^{n/2} \frac{1}{k} \to \ln(n) + \gamma - \ln\left(\frac{n}{2}\right) - \gamma = \ln 2 \quad \text{for } n \to \infty.$$
Hence it follows that, for large values of $n$, the probability of success approaches $1 - p \approx 1 - \ln 2 = 0.30685\ldots$. Now, we are going to show that the pointer strategy is optimal.
Optimality of the pointer strategy. In order to prove the optimality of the pointer strategy, we merely modify the rules of the game and further define another game.
Consider the following rules: Every prisoner keeps opening the chests until he finds his ball. The prisoners win iff each opens at most 50 chests. Clearly, this modification does not change the probability of success, but it will help us prove the optimality. We will refer to this game as game A.
Now, consider another game (game B) with the following rules: First, prisoner number 1 enters the chest room and keeps opening the chests until he finds his ball. Then, the guard leaves the opened chests as they are and immediately invites the prisoner with the least undiscovered number. The game proceeds this way until all chests are opened. The prisoners win iff none of them opened more than 50 (n/2 in the general case) chests.
Suppose that the guard notes the ball numbers in the order they were discovered by the prisoners. This results in a permutation of the numbers 1 through 100, from which he can see whether the prisoners won or not. The probability of discovering a particular ball is at every moment independent of the selected strategy. There are 100! permutations which correspond to some strategies (no matter whether they
13.2.5. Theorem. For a given vertex $v_0$, Dijkstra's algorithm finds the distance $d_w(v)$ of each vertex $v$ in $G$ that lies in the connected component of the vertex $v_0$. For the vertices $v$ of other connected components, $d(v) = \infty$ remains.
The algorithm can be implemented in such a way that it terminates in time $O(n \log n + m)$, where $n$ is the number of vertices and $m$ is the number of edges in $G$.
Proof. The algorithm is correct, since
• it terminates after a finite number of steps;
• when it does, its output has the desired properties.
The cycle condition guarantees that in each iteration, the number of sleeping vertices decreases by at least one since $N$ is always non-empty. Therefore, the algorithm necessarily terminates after a finite number of steps.
After going through the initialization cycle,
$$d_w(v) \le d(v) \tag{1}$$
for all vertices $v$ of the graph. Now assume that this property holds when the algorithm enters the main cycle and show that it holds when it leaves the cycle as well. Indeed, if $d(y)$ is changed during step 4, then it is caused by finding a vertex $v$ such that
$$d_w(y) \le d_w(v) + w(\{v, y\}) \le d(v) + w(\{v, y\}) = d(y),$$
where the new value is on the right-hand side.
The inequality (1) is satisfied when the algorithm terminates. It remains to verify that the other inequality holds as well. For this purpose, consider what is actually done in steps 3 and 4 of the algorithm.
Let $0 = d_0 < \dots < d_k$ denote all (distinct) finite distances $d_w(v)$ of the vertices in $G$ from the initial vertex $v_0$. At the same time, this partitions the vertex set of the graph $G$ into clusters $V_i$ of vertices whose distance from $v_0$ is exactly $d_i$. During the first iteration of the main cycle, $N = V_0 = \{v_0\}$, the number $S$ is just $d_0$, and the set of sleeping vertices is changed to $V \setminus V_0$.
Suppose this holds up to the $j$-th iteration (inclusive), i.e. the algorithm enters the cycle with $N = V_j$, $S = d_j$, and $\bigcup_{i=0}^{j} V_i = V \setminus Z$. Consider a vertex $y \in V_{j+1}$, i.e. $d_w(y) = d_{j+1} < \infty$, and there exists a path $(v_0, e_1, v_1, \dots, v_\ell, e_{\ell+1}, y)$ with total length $d_{j+1}$. However, then
$$d_w(v_\ell) \le d_{j+1} - w(\{v_\ell, y\}) < d_{j+1}. \tag{2}$$
It follows from the assumption that the vertex $v_\ell$ was active during an earlier iteration of the main cycle, and $d_w(v_\ell) = d(v_\ell) = d_i$ for some $i \le j$ then. Therefore, after the current iteration of the main cycle has been finished,
$$d(y) = d_w(v_\ell) + w(\{v_\ell, y\}) = d_{j+1}$$
and this does not change any more. It follows that the inequality (1) holds with equality when the algorithm terminates.
are random or sophisticated) since they are merely the orders in which the ball numbers are discovered. In order to compute the probability of success in game B, we should note that any order can be written as the composition of cycles where each cycle contains the ball numbers discovered by a given prisoner. For the sake of clarity, consider a game with 8 prisoners. If the guard has noted the permutation (2,5, 7,1,6,8,3,4), then we can see that the prisoners win since prisoner 1 discovered numbers (2,5,7,1), then prisoner 3 discovered (6,8,3), and finally prisoner 4 discovered only his number (4). In this case, we can write: (2,5,7,1,6,8,3,4) -> (2, 5,7,1)(6,8, 3)(4). Further, any such permutation corresponds to a unique order of the numbers 1 through 8. Having any permutation in the cyclic notation, we first rearrange each cycle so that its least number is the last one and then sort the cycles by their last numbers in ascending order. For instance, we have:
(7,5,8)(2,4)(1,6,3) -> (6,3,1)(4,2)(8,7,5) -> (6,3,1,4,2,8,7,5).
We have thus constructed a bijection between the winning orders of discovered numbers and the permutations of the numbers 1 through 8 that do not contain a cycle of length greater than 4. It follows that the probability of success in game B is the same as the probability that a random permutation does not contain a cycle of length greater than 4 ($n/2$ in the general case). This corresponds to the probability of success in the original game using the pointer strategy. Now, this implies an important conclusion for game A. Indeed, the prisoners may apply any strategy from game A to game B as follows: each prisoner behaves like in game A, but he considers open chests to be closed, i. e., if he wants to open a chest which has already been opened, he just passes this move and further behaves as if he had just discovered the ball number in the considered chest. Therefore, any strategy that succeeds for a given placement of the balls in game A must succeed for the same placement in game B as well. Therefore, if there existed a better strategy for game A, we could apply it to game B and obtain a higher chance of winning there. However, this is impossible since all strategies in game B lead to the same probability of success. Therefore, the pointer strategy is better than or equally good as any other strategy. □
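The two directions of the bijection used in the argument can be written down explicitly. The following sketch reproduces the two examples with 8 prisoners from the text; the function names are our own.

```python
def order_to_cycles(order):
    """Split a discovery order into cycles, one per prisoner who entered the room."""
    cycles, current, seen = [], [], set()
    target = 1                         # prisoner 1 enters first
    for ball in order:
        current.append(ball)
        seen.add(ball)
        if ball == target:             # the prisoner has found his own ball
            cycles.append(current)
            current = []
            while target in seen:      # next comes the least undiscovered number
                target += 1
    return cycles

def cycles_to_order(cycles):
    """Rotate each cycle so that its least number is last, sort, and concatenate."""
    rotated = []
    for cycle in cycles:
        i = cycle.index(min(cycle))
        rotated.append(cycle[i + 1:] + cycle[:i + 1])
    rotated.sort(key=lambda c: c[-1])  # ascending by the last (least) number
    return [x for c in rotated for x in c]

print(order_to_cycles([2, 5, 7, 1, 6, 8, 3, 4]))
print(cycles_to_order([[7, 5, 8], [2, 4], [1, 6, 3]]))
```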
13.G.2. In a competition, there are m contestants and n officials, where n is an odd integer greater than two. Each official
The analysis of the main cycle just made also determines a bound for the running time of the algorithm (i.e. the number of elementary operations with the graph and other corresponding objects). The main cycle is iterated as many times as there are (distinct) distances $d_i$ in the graph. Every vertex, when processed during step 3, is considered exactly once. The vertices that are still sleeping must be sorted. This gives the bound $O(n \log n)$ for this part of the algorithm provided the graph is stored as a list of vertices and weighted edges, and the sleeping vertices are kept in a suitable data structure that allows the finding of the set $N$ of active vertices in time $O(\log n + |N|)$. This can be achieved if a heap is used. Every edge is processed exactly once in step 4 since the vertices are active only during one iteration of the cycle. □
Note that the inequality (2), essential for the analysis of the algorithm, need not hold if the weights of the edges are allowed to be negative.
In practice, many heuristic improvements of the algorithm are applied. For instance, it is not necessary to compute the distances between all vertices if only the distance between a given pair of vertices is of interest. When a vertex is excluded from the active ones, its distance is final.
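The early termination just described can be sketched as follows. This is the usual heap-based formulation of the shortest-path algorithm, not a transcription of the steps above; the adjacency-list format and all names are our assumptions, and the edge weights are assumed to be non-negative. For an undirected graph, each edge is listed in both directions.

```python
import heapq


def shortest_distance(adj, source, target):
    """adj: dict vertex -> list of (neighbour, non-negative weight).

    Returns the distance from source to target, or None if target is unreachable.
    """
    dist = {source: 0}            # only vertices seen so far are stored, no infinity needed
    final = set()                 # vertices whose distance can no longer change
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in final:
            continue              # outdated heap entry
        final.add(v)
        if v == target:
            return d              # the distance of the target is final: stop early
        for u, w in adj.get(v, []):
            if u not in dist or d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return None


# A small hypothetical network.
roads = {"a": [("b", 4), ("c", 1)], "b": [("d", 1)], "c": [("b", 2), ("d", 7)], "d": []}
print(shortest_distance(roads, "a", "d"))  # 4
```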
Further, it is not necessary to initialize the distances with the value of infinity. Of course, this is technically impossible, and a sufficiently large constant would be needed in the implementation. However, there is a better solution than that. For instance, if the shortest path in a road network is required, the known air distances can be used as the initialization values. In general, bounds $d^0_w(v)$ for the distances between the vertices $v$ and $v_0$ can be used, provided that for any edge $e = \{v, y\}$,
$$|d^0_w(v) - d^0_w(y)| \le w(e).$$
This is sufficient for the proof of the correctness of the algorithm. (Check this yourself!)
13.2.6. Spanning trees. In practical applications, graphs often encode all possibilities of connections between particular objects, as in road or electrical networks. If it is only required that each pair of vertices is connected by a path, using as few edges as possible, then what is needed is a subgraph T which is a tree. This corresponds to the problem of finding a type of minimal network.
Spanning tree of a graph
Definition. Any tree $T = (V, E')$ in a graph $G = (V, E)$, $E' \subseteq E$, is called a spanning tree of the graph G.
A graph can have a spanning tree if and only if it is connected.
A spanning tree is connected since all trees are. Conversely, the following algorithm finds a spanning tree for any given connected graph.
CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS
judges each contestant as either good or bad. Suppose that any pair of officials agree on at most k contestants. Prove that
$$\frac{k}{m} \ge \frac{n-1}{2n}.$$
Let us look at two possible approaches to this problem.
Solution. Let us count the number N of pairs ({official, official}, contestant) where the officials are distinct and agree on the contestant. Altogether, there are $\binom{n}{2}$ pairs of officials, and each pair can agree on at most k contestants. Therefore, $N \le k\binom{n}{2}$.
Now, let us fix a contestant X and count the number of pairs of officials who agree on X. Say that x officials said X was good. Then, there are $\binom{x}{2}$ pairs who agree that X is good and $\binom{n-x}{2}$ pairs who agree that X is bad. Altogether, there are
$$\binom{x}{2} + \binom{n-x}{2} = \frac{x(x-1)}{2} + \frac{(n-x)(n-x-1)}{2}$$
pairs that agree on X. We have:
$$\frac{x(x-1)}{2} + \frac{(n-x)(n-x-1)}{2} = \frac{2x^2 - 2nx + n^2 - n}{2} = \left(x - \frac{n}{2}\right)^2 + \frac{n^2}{4} - \frac{n}{2} \ge \frac{n^2}{4} - \frac{n}{2} = \frac{(n-1)^2}{4} - \frac{1}{4}.$$
Since n is odd, the expression $(n-1)^2/4$ is an integer. Thus, the number of pairs that agree on X, being an integer as well, is at least $(n-1)^2/4$. Hence $N \ge m(n-1)^2/4$. Combining these two inequalities together, we get
$$\frac{k}{m} \ge \frac{n-1}{2n}.$$
An alternative solution using probabilities. Let us choose a pair of officials at random. Let X be the random variable which gives the number of contestants on which this pair agrees. We are going to prove the contrapositive implication, i.e., if $\frac{k}{m} < \frac{n-1}{2n}$, then X is greater than k with probability greater than zero, which will be denoted $P(X > k) > 0$.
Consider the random variables $X_i$ for $i = 1, 2, \dots, m$ with codomain $\{0, 1\}$, denoting whether the pair agrees on the i-th contestant. Let $X_i = 1$ when they agree, and let $X_i = 0$ otherwise. Hence we have:
$$X = X_1 + X_2 + \dots + X_m.$$
Using the linearity of expectation, we obtain:
$$E[X] = E[X_1] + E[X_2] + \dots + E[X_m].$$
Now, let us calculate $E[X_i] = \sum_{x_i \in \{0,1\}} x_i \cdot P(X_i = x_i)$. Since $X_i$ can be only 0 or 1, we have directly $E[X_i] = P(X_i = 1)$. Let us examine the probability $P(X_i = 1)$, i.e., that the officials agree on the i-th contestant. There are
Spanning forest algorithm
Input: Graph $G = (V, E)$.
Output: A forest $T = (V, E')$ consisting of spanning trees of the components of G.
(1) Sort all edges $e_0, \dots, e_m \in E$ in any order.
(2) Start with $E_0 = \{e_0\}$ and gradually build the sets of edges $E_i$ so that in the i-th step, the edge $e_i$ is added to $E_{i-1}$ unless this creates a cycle in the graph $G_i = (V, E_{i-1} \cup \{e_i\})$. If this edge creates a cycle, leave $E_i = E_{i-1}$ unchanged.
(3) The algorithm terminates if the graph $G_i = (V, E_i)$ has exactly $n - 1$ edges at some step i (where n is the number of vertices) or if $i = m$, and produces the graph $T = (V, E_i)$.
If the algorithm terminates for the latter reason, then the graph is not connected and no spanning tree exists (but there are still the spanning trees of all individual components).
Proof. It follows from the rules of the algorithm that the resulting subgraph T of G never contains a cycle. Therefore, it is a forest. If the resulting number of edges is $n - 1$, then it must be a tree; see theorem 13.1.15.
It remains to show that the connected components of the graph T have the same sets of vertices as the connected components of the original graph G: Every path in T is also a path in G; therefore, all vertices that lie in one tree of T must lie in the same component of G. If there is a path in G from v to w such that its endpoints lie in different trees of T, then one of its vertices $v_i$ is the last one that is in the component determined by v (in particular, $v_{i+1}$ does not lie in this component). The corresponding edge $\{v_i, v_{i+1}\}$ creates a cycle when examined by the algorithm since otherwise, it would be in T. Since the edges are never removed from T, there is a path between $v_i$ and $v_{i+1}$ in T, which contradicts the assumptions. Therefore, v and w cannot lie in different trees of T. The number of components of T is given by the fact that the number of vertices and edges differs by one in every tree. The difference increases by one with every component, so if there are n vertices and k edges in the forest, then there are $n - k$ components. □
Remark. As always, the time complexity of the algorithm is of interest. The addition of an edge creates a cycle if and only if its endpoints lie in the same connected component of the forest T under construction.
Knowledge of the connected components of the current forest T is helpful. To implement the algorithm, one needs to unite two equivalence classes of a given equivalence relation on a given set (the vertex set) and to find out whether two vertices are in the same class or not. The union requires time $O(k)$, where k is the number of elements to be united; k can be bounded from above by n, the total number of vertices.
However, for each equivalence class it can be noted how many vertices it contains. If, for each vertex, the information to which class it belongs is kept, then the union operation
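$\binom{n}{2}$ pairs of officials. Let $t_i$ denote the number of officials who say the i-th contestant is good and $n - t_i$ be the number of those who do not. Then, there are $\binom{t_i}{2}$ pairs who agree that the i-th contestant is good and $\binom{n-t_i}{2}$ pairs who agree on the contrary. Altogether, there are $\binom{t_i}{2} + \binom{n-t_i}{2}$ pairs that agree on the i-th contestant. Therefore,
$$E[X_i] = P(X_i = 1) = \frac{\binom{t_i}{2} + \binom{n-t_i}{2}}{\binom{n}{2}}.$$
Hence,
$$E[X] = \sum_{i=1}^{m} \frac{\binom{t_i}{2} + \binom{n-t_i}{2}}{\binom{n}{2}}.$$
We are going to show that for odd values of n, we have $\binom{t_i}{2} + \binom{n-t_i}{2} \ge \frac{(n-1)^2}{4}$. Rearranging this leads to
$$(n - 2t_i)^2 \ge 1 \iff t_i \le \frac{n-1}{2} \ \text{ or } \ t_i \ge \frac{n+1}{2},$$
which is clearly true since $\frac{n-1}{2}$ and $\frac{n+1}{2}$ are adjacent integers.
Using the inequality $\binom{t_i}{2} + \binom{n-t_i}{2} \ge \frac{(n-1)^2}{4}$, we obtain:
$$E[X] \ge m \cdot \frac{\left(\frac{n-1}{2}\right)^2}{\frac{n(n-1)}{2}} = \frac{m(n-1)}{2n}.$$
Thanks to the assumption $\frac{m(n-1)}{2n} > k$, we have $E[X] > k$, and thus $P(X > k) > 0$, which finishes the proof. □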
Next, we demonstrate an application of probability to an interesting problem.
13.G.3. Let S be a finite set of points in the plane which are in general position (i. e., no three of them lie on a straight line). For any convex polygon P all of whose vertices lie in S, let a(P) denote the number of its vertices and b(P) the number of points from S which are outside P. Prove that for any real number x, we have
$$\sum_{P} x^{a(P)} (1 - x)^{b(P)} = 1,$$
where the sum runs over all convex polygons P with vertices in S. (A line segment, a singleton, and the empty set are considered to be a convex 2-gon, 1-gon, and 0-gon, respectively.)
Solution. First of all, we prove the wanted equality for $x \in [0, 1]$. Let us color each point of S, independently of the others, so that it is white with probability x and black with probability $1 - x$ (in other words, we consider a random choice of the size $|S|$ with the binomial probability distribution Bi(n, x), where $n = |S|$, and let us say that success corresponds to white and failure corresponds to black). We
means to relabel the vertices of one of the united classes. If the smaller class is always selected to be relabeled, then the total number of operations of the algorithm is $O(n \log n + m)$. (As an exercise, complete the details of these considerations yourself!)
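The bookkeeping described in this remark can be sketched in a few lines. This is a minimal illustration under our own conventions: the vertices are assumed to be numbered 0, ..., n − 1, the edges are given as pairs in the order in which they are to be examined, and the names `label` and `members` are hypothetical.

```python
def spanning_forest(n, edges):
    """Return the edges of a spanning forest of the graph on vertices 0..n-1."""
    label = list(range(n))              # label[v] = class (component) of the vertex v
    members = [[v] for v in range(n)]   # members[c] = vertices currently in the class c
    forest = []
    for u, v in edges:
        cu, cv = label[u], label[v]
        if cu == cv:
            continue                    # both endpoints in one component: the edge closes a cycle
        if len(members[cu]) < len(members[cv]):
            cu, cv = cv, cu             # make cv the smaller class
        for x in members[cv]:           # relabel the smaller class only
            label[x] = cu
        members[cu].extend(members[cv])
        members[cv] = []
        forest.append((u, v))
        if len(forest) == n - 1:
            break                       # a spanning tree has been completed
    return forest


edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
forest = spanning_forest(5, edges)
print(forest)            # [(0, 1), (1, 2), (3, 4)]
print(5 - len(forest))   # 2 components
```

Each vertex is relabeled at most about log n times since its class at least doubles with every relabeling, which is where the n log n term in the bound comes from.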
The above reasoning shows that a slightly better running time can be achieved if only the spanning tree of the connected component of a given starting vertex is of interest:
Another spanning tree algorithm
Input: $G = (V, E)$ with n vertices and m edges, vertex $v \in V$.
Output: The tree T spanning the connected component of v.
(1) Initialize $T_0 = (\{v\}, \emptyset)$.
(2) In the i-th step, look for edges e which are not in $T_{i-1}$ but whose tail vertex $v_e$ is (and whose head vertex is not). Take one of them and add it to $T_{i-1}$, i.e. add the head vertex to $V_{i-1}$ and e to $E_{i-1}$.
(3) The algorithm terminates as soon as no such edge exists.
Apparently, the resulting graph T is connected. The count of its vertices and edges shows that it is a tree.
Proof. The vertices of T coincide with the vertices of the connected component of the graph G containing the starting vertex v.
Suppose there is a path from v to a given vertex w. If w does not lie in T, then let $v_i$ denote the last vertex of this path that lies in T (just like in the proof of the previous lemma). However, the subsequent edge of the path would have had to be added to T by the algorithm before it terminated, which is a contradiction.
Consequently, this algorithm finds a spanning tree of the connected component that contains a given initial vertex v in time $O(n + m)$. □
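The algorithm above leaves open which admissible edge is taken in each step. One possible sketch keeps the vertices whose edges are still to be examined on a stack; the adjacency-list format (each undirected edge listed at both endpoints) and the names are our assumptions.

```python
def component_spanning_tree(adj, v):
    """adj: dict vertex -> list of neighbours. Returns (vertices, edges) of a spanning tree
    of the connected component containing v."""
    in_tree = {v}
    tree_edges = []
    stack = [v]
    while stack:
        tail = stack.pop()
        for head in adj[tail]:
            if head not in in_tree:     # tail lies in the tree, head does not
                in_tree.add(head)
                tree_edges.append((tail, head))
                stack.append(head)
    return in_tree, tree_edges


graph = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4]}
print(component_spanning_tree(graph, 1))  # ({1, 2, 3}, [(1, 2), (1, 3)])
```

Every vertex is put on the stack at most once and every edge is inspected at most twice, in agreement with the bound O(n + m).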
13.2.7. Minimum spanning tree. All spanning trees of a given graph G have the same number of edges since this is a general property of trees. Just as the shortest path in graphs with weighted edges was found, now a spanning tree with the minimum sum of its edges' weights is desired.
Definition. Let $G = (V, E, w)$ be a connected graph whose edges e are labeled by non-negative weights w(e). A minimum spanning tree of G is a spanning tree of G whose total weight does not exceed that of any other spanning tree.
This problem has many applications in practice, for instance in the design of electricity, gas, or water networks.
Surprisingly, it is quite simple to find a minimum spanning tree (supposing all edge weights w(e) of G are non-negative) by the following procedure6:
Joseph Bernard Kruskal (1928 - 2010) was a famous American mathematician, statistician, computer scientist, and psychometrician. There are other famous mathematicians of the same surname - his two brothers and one nephew. Martin David co-invented solitons and surreal numbers, William
can note that for any such coloring, there must exist a polygon such that all of its vertices are white and all points outside are black (this polygon is the convex hull of the white points). The above shows that the probability that the random choice realizes a polygon with all vertices white and all exterior points black is equal to one. However, we can compute this probability in a different way. The event of a polygon having this property is the union of k disjoint events, where k is the number of convex polygons, namely that a given polygon has the desired property (note that the property cannot be shared by different convex polygons). For every given convex polygon P, the probability that its vertices are white and the points outside it are black is equal to $x^{a(P)}(1 - x)^{b(P)}$, where a(P) is the number of vertices of P and b(P) is the number of points from S outside P. Since the probability of a union of disjoint events is equal to the sum of the particular events' probabilities, we get
$$\sum_{P} x^{a(P)} (1 - x)^{b(P)} = 1.$$
This proves the equality for all numbers in the interval [0, 1]. However, we can also perceive this fact as follows: any real number from the interval [0, 1] is a root of the polynomial $\sum_{P} x^{a(P)}(1 - x)^{b(P)} - 1$. As we know, a nonzero polynomial over the (infinite) field of real numbers can have only finitely many roots (see 12.2.6). Therefore, $\sum_{P} x^{a(P)}(1 - x)^{b(P)} - 1$ is the zero polynomial and the equality $\sum_{P} x^{a(P)}(1 - x)^{b(P)} = 1$ thus holds for all real numbers x. □
Remark. This equality holds even if we define the numbers a(P) and b(P) in another way: The definition of a(P) is the same, but now let b(P) denote the number of points from S which are not the vertices of P. (Thus, we always have $a(P) + b(P) = |S|$.) Then, the given equality is a corollary of the binomial theorem for $(x + (1 - x))^{|S|}$.
13.G.4. A competition with n players is called an (n, k)-tournament iff it has k rounds and satisfies the following:
i) every player competes in every round and any pair of players competes at most once,
ii) if A plays with B in the i-th round, C plays with D in the i-th round, and A plays with C in the j-th round, then B plays with D in the j-th round.
Find all pairs (n, k) for which there exists an (n, k)-tournament.
Kruskal's algorithm
Input: A graph G = (V,E,w) with non-negative weights over edges.
Output: The minimum spanning trees for all components of G.
(1) Sort the m edges in E so that $w(e_1) \le w(e_2) \le \dots \le w(e_m)$.
(2) For this order of the edges, call the "Spanning forest algorithm" from the previous subsection.
This is a typical example of the "greedy approach", when maximizing profits (or minimizing expenses) is achieved by always choosing the option which is the most advantageous at each stage.
In many problems, this approach fails since low expenses at the beginning may be the cause of much higher ones at the end. Therefore, greedy algorithms are often a base for very useful heuristic algorithms but seldom yield optimal solutions. However, in the case of minimum spanning tree, this approach works:
Theorem. Kruskal's algorithm finds a minimum spanning tree for every connected graph G with non-negative edge weights. The algorithm runs in $O(m \log m)$ time where m is the number of edges of G.
Proof. Let $T = (V, E(T))$ denote the spanning tree generated by Kruskal's algorithm and, further, let $\tilde{T} = (V, E(\tilde{T}))$ be an arbitrary minimum spanning tree. The minimality implies that $\sum_{e \in E(\tilde{T})} w(e) \le \sum_{e \in E(T)} w(e)$, so the goal is to show that also $\sum_{e \in E(T)} w(e) = \sum_{e \in E(\tilde{T})} w(e)$.
If $E(T) = E(\tilde{T})$, then nothing further is needed. So assume there exists an edge $e \in E(T)$ such that $e \notin E(\tilde{T})$. From all such edges, choose one, call it e, with weight w(e) as small as possible.
The addition of e into $\tilde{T}$ creates a cycle $e e_1 e_2 \cdots e_k$ in $\tilde{T}$, and at least one of its edges $e_i$ is not in $E(T)$. The choice of the edge e implies that if $w(e_i) < w(e)$, then the edge $e_i$ would be among the candidate edges in Kruskal's algorithm after a certain subtree $T' \subseteq T \cap \tilde{T}$ had been created, so its addition to the gradually constructed tree T would not create a cycle. Therefore, if $w(e_i) < w(e)$, the edge $e_i$ would be chosen in the algorithm. It follows that $w(e_i) \ge w(e)$.
However, now the edge $e_i$ can be replaced with e in $\tilde{T}$ (by the choice of $e_i$, this results in a spanning tree again) without increasing the total weight. So the resulting $\tilde{T}$ is a minimum spanning tree. It differs from T in fewer edges than before. Therefore, in a finite number of steps, $\tilde{T}$ is changed to T without increasing the total weight. □
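Kruskal's algorithm can be sketched as follows. Note one substitution: instead of relabeling equivalence classes, this sketch keeps the components in a union-find structure with parent pointers, which is a common alternative. The edge format (weight, u, v), the vertex numbering 0, ..., n − 1, and all names are our assumptions.

```python
def kruskal(n, weighted_edges):
    """weighted_edges: list of (weight, u, v). Returns (tree edges, total weight)."""
    parent = list(range(n))

    def find(x):
        # Representative of the component of x (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree, total = [], 0
    for w, u, v in sorted(weighted_edges):   # step (1): edges by non-decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                         # step (2): accept the edge unless it closes a cycle
            parent[ru] = rv
            tree.append((u, v, w))
            total += w
    return tree, total


edges = [(3, 0, 2), (1, 0, 1), (5, 1, 3), (2, 1, 2), (4, 2, 3)]
print(kruskal(4, edges))  # ([(0, 1, 1), (1, 2, 2), (2, 3, 4)], 7)
```

The sorting dominates the running time, in accordance with the bound stated in the theorem.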
13.2.8. Two more algorithms. The second algorithm for finding a spanning tree, presented in 13.2.6 also leads to a minimum spanning tree:
was active in statistics, Clyde was a computer scientist too. The above algorithm dates from 1956.
Solution. There exists an (n, k)-tournament if and only if $2^{\lceil \log_2 (k+1) \rceil}$ divides the integer n. First of all, we are going to show the if-part. We construct a $(2^t, k)$-tournament where $k \le 2^t - 1$ (then, the general case $2^t \mid n$ can be easily derived from that). There are thus $2^t$ players in the tournament, so we assign to each player a (unique) number from the set $\{0, 1, \dots, 2^t - 1\}$. In the i-th round, player $\alpha$ competes with player $\alpha \oplus i$ (where $\oplus$ is the binary XOR operation, i.e., the j-th bit of $a \oplus b$ is one if and only if the j-th bit of a is different from the j-th bit of b). This schedule is correct since every player is engaged in every round and different players have different opponents (for $\alpha \ne \beta$ we have $\alpha \oplus i \ne \beta \oplus i$). Further, the opponent of the opponent of $\alpha$ is indeed $\alpha$ (since $(\alpha \oplus i) \oplus i = \alpha$). Moreover, the second tournament rule is also satisfied: if $\alpha$ plays with $\beta$ and $\gamma$ plays with $\delta$ in the i-th round (i.e., $\beta = \alpha \oplus i$ and $\delta = \gamma \oplus i$) and if j is such that $\alpha$ plays with $\gamma$ in the j-th round, then we have $\beta \oplus j = (\alpha \oplus i) \oplus j = (\alpha \oplus j) \oplus i = \gamma \oplus i = \delta$, so $\beta$ indeed plays with $\delta$ in the j-th round. Any $(2^t \cdot s, k)$-tournament where s is odd can be obtained as s parallel $(2^t, k)$-tournaments.
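The XOR schedule can be verified mechanically for small cases. The following sketch builds the schedule and checks both tournament rules by brute force; the representation of a schedule (for each round, a table of opponents) and the function names are our assumptions.

```python
def xor_schedule(t, k):
    """Rounds 1..k for 2**t players; in round i, player a meets player a ^ i."""
    n = 2 ** t
    assert 1 <= k <= n - 1
    return {i: [a ^ i for a in range(n)] for i in range(1, k + 1)}


def is_tournament(schedule):
    """Check rules i) and ii) for a schedule {round: table of opponents}."""
    rounds = list(schedule.values())
    n = len(rounds[0])
    met = set()
    for opp in rounds:
        for a in range(n):
            b = opp[a]
            if b == a or opp[b] != a:
                return False              # the round is not a proper pairing of the players
            if a < b:
                if (a, b) in met:
                    return False          # this pair has already competed
                met.add((a, b))
    for opp_i in rounds:                  # rule ii): B = opp_i[A], C = opp_j[A], D = opp_i[C]
        for opp_j in rounds:
            for a in range(n):
                if opp_j[opp_i[a]] != opp_i[opp_j[a]]:
                    return False
    return True


print(is_tournament(xor_schedule(3, 7)))  # True
# Six players and two rounds: the second rule must fail, since 4 does not divide 6.
print(is_tournament({1: [1, 0, 3, 2, 5, 4], 2: [2, 4, 0, 5, 1, 3]}))  # False
```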
Now, we are going to show that the condition $2^{\lceil \log_2 (k+1) \rceil} \mid n$ is necessary as well. Consider the graph $G_i$ whose vertices correspond to the players and edges are between the pairs who have played in or before the i-th round. Consider players A and B who play together in round i + 1. We want to show that we must have $|\Gamma| = |\Delta|$, where $\Gamma$ is the component of A and $\Delta$ is the component of B in $G_i$. Actually, we show that any player of $\Gamma$ competes with a player of $\Delta$ in round i + 1. Thus, let $C \in \Gamma$, i.e., in $G_i$, there exists a path $A = X_1, X_2, \dots, X_m = C$ such that $X_j$ has played with $X_{j+1}$, $j = 1, \dots, m - 1$, in or before the i-th round. Consider the sequence $Y_1, Y_2, \dots, Y_m$, where $Y_j$ is the opponent of $X_j$ in round i + 1, $j = 1, \dots, m$ (thus $Y_1 = B$). Then, for any $1 \le j \le m - 1$, we have that $X_j$ competes with $Y_j$ and $X_{j+1}$ competes with $Y_{j+1}$ in round i + 1 (by the definition of the sequence $Y_1, \dots, Y_m$), and in a certain r-th round ($1 \le r \le i$), $X_j$ played with $X_{j+1}$ (by the definition of the sequence $X_1, \dots, X_m$). However, by the second tournament rule, this means that $Y_j$ also played with $Y_{j+1}$ in the r-th round, so the edge $Y_j Y_{j+1}$ is contained in $G_i$ for any $1 \le j \le m - 1$. Thus, $Y_1, Y_2, \dots, Y_m$ is a path in $G_i$, so $B = Y_1$ and $Y_m$ lie in the same component ($\Delta$). It can be deduced analogously that any player from $\Delta$
Jarnik-Prim's algorithm7
Input: A connected graph $G = (V, E, w)$ with n vertices and m edges, with non-negative weights over the edges.
Output: The minimum spanning tree T of G.
(1) Initialize $T_0 = (\{v\}, \emptyset)$ with some vertex $v \in V$.
(2) In the i-th step, look for all edges e which are not in $T_{i-1}$ but whose tail vertex $v_e$ is (and whose head vertex is not). Take the one of them with minimal weight and add it to $T_{i-1}$, i.e. add the head vertex to $V_{i-1}$ and e to $E_{i-1}$.
(3) The algorithm terminates when the number of added edges reaches $n - 1$.
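A common way to find the lightest admissible edge in step (2) is to keep the candidate edges in a heap. The sketch below does so; the adjacency-list format (vertex mapped to a list of (weight, neighbour) pairs, each undirected edge listed at both endpoints) and the names are our assumptions, and the graph is assumed to be connected.

```python
import heapq


def jarnik_prim(adj, start):
    """adj: dict vertex -> list of (weight, neighbour). Returns the tree edges (tail, head, weight)."""
    in_tree = {start}
    tree = []
    heap = [(w, start, u) for w, u in adj[start]]   # candidate edges leaving the tree
    heapq.heapify(heap)
    while heap and len(tree) < len(adj) - 1:
        w, tail, head = heapq.heappop(heap)         # lightest candidate edge
        if head in in_tree:
            continue                                # both endpoints already in the tree
        in_tree.add(head)
        tree.append((tail, head, w))
        for w2, u in adj[head]:
            if u not in in_tree:
                heapq.heappush(heap, (w2, head, u))
    return tree


adj = {
    0: [(1, 1), (3, 2)],
    1: [(1, 0), (2, 2), (5, 3)],
    2: [(3, 0), (2, 1), (4, 3)],
    3: [(5, 1), (4, 2)],
}
print(jarnik_prim(adj, 0))  # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]
```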
Borůvka's algorithm is similar. It constructs as many connected components as possible simultaneously: It begins with the singleton components in the graph $T_0 = (V, \emptyset)$. In each step, it connects every component to another component with the shortest edge possible. It is easy to prove that (provided the edge weights are pairwise distinct) this results in a minimum spanning tree.
Boruvka's algorithm
Input: A connected graph G = (V,E,w) with non-negative weights on the edges.
Output: The minimum spanning tree for G.
(1) Initialization. Create the graph S with the same vertex set as G and no edges;
(2) The main loop. While S contains more than one component, do:
• for every tree T in S, find the shortest edge that connects T to G \ T, and add this edge into E,
• add all edges of E into the graph S and clear E.
Note that Boruvka's algorithm can be executed using parallel computation, which is why it is used in many practical modifications.
The proofs that both of these algorithms are correct, are similar to that of Kruskal's. The details are omitted.
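The main loop of Borůvka's algorithm can be sketched sequentially as follows (the parallel aspect is not modelled). The components are kept in a union-find structure, which is our choice of data structure; the edge format (weight, u, v) and the names are assumptions as well. Ties between equal weights are broken by comparing the whole triples, so the sketch does not rely on distinct weights to avoid cycles.

```python
def boruvka(n, weighted_edges):
    """weighted_edges: list of (weight, u, v) on vertices 0..n-1, graph assumed connected.
    Returns the sorted list of the chosen edges."""
    comp = list(range(n))

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    tree = set()
    components = n
    while components > 1:
        cheapest = {}                          # component -> its shortest outgoing edge
        for e in weighted_edges:
            ru, rv = find(e[1]), find(e[2])
            if ru == rv:
                continue                       # the edge lies inside one component
            for r in (ru, rv):
                if r not in cheapest or e < cheapest[r]:
                    cheapest[r] = e
        if not cheapest:
            break                              # no outgoing edges: the graph is not connected
        for w, u, v in cheapest.values():      # add all selected edges at once
            ru, rv = find(u), find(v)
            if ru != rv:
                comp[ru] = rv
                tree.add((w, u, v))
                components -= 1
    return sorted(tree)


edges = [(3, 0, 2), (1, 0, 1), (5, 1, 3), (2, 1, 2), (4, 2, 3)]
print(boruvka(4, edges))  # [(1, 0, 1), (2, 1, 2), (4, 2, 3)]
```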
13.2.9. Traveling salesman problem.
So far, our short excursion through graph-based algorithms could give the feeling that simple and straightforward algorithms for the considered problems can always be found. So far, however, only the easy problems have been considered. In all but very few cases, the contrary is true: mostly, no algorithms running in polynomial time are known, so one needs to use algorithms which do not always find the optimal solution but give
Robert Clay Prim (born 1921) is an American mathematician and computer scientist. Since he published his work within the realm of computer science, the most common name of the algorithm is "Prim's"; however, works by Otakar Borůvka (1899-1995) and Vojtěch Jarník (1897-1970) appeared before those by Prim. Borůvka designed his algorithm in 1926 when consulting on the construction of a new electricity network in Moravia, a region in central Europe, and Jarník published the algorithm (rediscovered much later by Prim) in 1930, motivated by Borůvka.
competes with a player of $\Gamma$ in round i + 1, and since every player plays exactly once in a given round, we must have $|\Gamma| = |\Delta|$. By the definition of a component, the component of A in $G_{i+1}$ is equal to $\Gamma \cup \Delta$. Then, we have either $\Gamma = \Delta$ (then, the component of A in $G_{i+1}$ is $\Gamma$) or $\Gamma \cap \Delta = \emptyset$ (in this case, the component of A in $G_{i+1}$ is the disjoint union $\Gamma \cup \Delta$). Altogether, the component of A in $G_{i+1}$ is either the same or twice as great as in $G_i$. Now, consider the components $\Gamma_1, \Gamma_2, \dots, \Gamma_k$ of A in the respective graphs $G_1, G_2, \dots, G_k$. We have $|\Gamma_1| = 2$ (since A had exactly one opponent in the first round) and for $1 \le i \le k - 1$, we have either $|\Gamma_i| = |\Gamma_{i+1}|$ or $2|\Gamma_i| = |\Gamma_{i+1}|$. Therefore, the number of vertices (players) of every component is a power of two, i.e., $|\Gamma_k| = 2^l$ for some l, and $|\Gamma_k| \ge k + 1$ (A had different opponents in the k rounds). Hence, $2^l \ge k + 1$, i.e., $2^l$ is at least $2^{\lceil \log_2 (k+1) \rceil}$; so the number of players in each component is divisible by $2^{\lceil \log_2 (k+1) \rceil}$. Thus, so must be the total number n. □
H. Combinatorial games
13.H.1. Consider the following game for two players: On the table, there are four piles of 9, 10, 11, and 14 tokens, respectively. Players alternate moves where the move consists of selecting one of the piles and removing an arbitrary (positive) number of tokens from that pile. The player who takes the last token wins. Is there a winning strategy for one of the players?
Solution. Note that this game is the sum of four games which correspond to one-pile games where an arbitrary (positive) number of tokens can be removed (the sum of combinatorial games is both commutative and associative, so we can talk just about the sum of those games without having to specify the order). A simple induction argument shows that the value of the Sprague-Grundy function (the SG-value) of such one-pile game is equal to the number of tokens: Suppose that a natural number n is such that for all k < n, the SG-value of the game with k tokens is k. According to the rules of the game, we can remove an arbitrary (positive) number of tokens, i. e., we can leave there an arbitrary number from 0 to n — 1. By the induction hypothesis, this means that for any number k < n, we can reach a position whose SG-value is k, and we cannot reach a position whose SG-value would be n. By the definition of the SG-function, the value of the game with n tokens is n. It follows from the theorem of subsection
one which is as good as possible. This is called a heuristic approach.
One of the most important combinatorial problems of this class is the problem of finding a minimum Hamiltonian cycle. This is a Hamiltonian cycle with the minimum sum of the weights of its edges among all Hamiltonian cycles.
This problem arises in many practical applications. For instance:
• goods or post delivery (via a given network)
• network maintenance (electricity, water pipelines, IT, etc.)
• request processing (parallel requests for reading from a hard disk, for instance),
• measuring several parts of a system (for example, when studying the structure of a protein crystal using X-rays, the main expenses are due to the movements and focusing for particular measurements),
• material division (for instance, when covering a wall with wallpaper, one tries to keep the pattern continuous while minimizing the amount of unused material).
The greedy approach can be applied in the case of looking for a minimum Hamiltonian cycle as well. The algorithm begins in an arbitrary vertex $v_1$, which is set active, and the other vertices are labeled as sleeping. In each step, it examines the sleeping vertices adjacent to the active one and selects the one which is connected by the shortest edge. The active vertex is labeled as processed, and the selected vertex becomes active. The algorithm terminates either with a failure, when there is no edge going from the active vertex to a sleeping one, or it successfully finds a Hamiltonian path. In the latter case, if there exists an edge from the last vertex $v_n$ to $v_1$, a Hamiltonian cycle is obtained.
This algorithm seldom produces a minimal Hamiltonian cycle. At least, it always finds some (and relatively small) Hamiltonian cycle in a complete graph.
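The greedy procedure can be sketched as follows; the small weighted graph is a made-up example chosen so that the greedy cycle is visibly worse than the optimal one. The dictionary-of-dictionaries format and the names are our assumptions.

```python
def greedy_hamiltonian_cycle(w, start):
    """w[u][v] = weight of the edge {u, v} (a missing key means no edge).
    Returns (cycle as a list of vertices, total weight), or None on failure."""
    path = [start]
    sleeping = set(w) - {start}
    active = start
    while sleeping:
        candidates = [(w[active][u], u) for u in sleeping if u in w[active]]
        if not candidates:
            return None                   # no edge from the active vertex to a sleeping one
        _, nxt = min(candidates)          # the nearest sleeping neighbour
        sleeping.remove(nxt)
        path.append(nxt)
        active = nxt
    if start not in w[active]:
        return None                       # a Hamiltonian path was found, but it cannot be closed
    total = sum(w[a][b] for a, b in zip(path, path[1:] + [start]))
    return path, total


w = {
    "A": {"B": 1, "C": 2, "D": 10},
    "B": {"A": 1, "C": 1, "D": 2},
    "C": {"A": 2, "B": 1, "D": 1},
    "D": {"A": 10, "B": 2, "C": 1},
}
print(greedy_hamiltonian_cycle(w, "A"))  # (['A', 'B', 'C', 'D'], 13)
```

In this example, the cycle A, B, D, C has total weight 1 + 2 + 1 + 2 = 6, so the greedy result 13 is far from minimal.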
13.2.10. Flow networks. Another group of applications of the language of graph theory concerns moving some amount of a measurable material in a fixed network. The vertices of a directed graph represent places between which one transports material up to predetermined limits which are given as assessments of the edges (called capacities). There are two important types of vertices: the sources and sinks of the network. A network is a directed graph with valued edges, where some of the vertices are labeled as sources or sinks.
Without loss of generality, assume that the graph is directed and has only one source and one sink: In the general case, an artificial source and a sink can always be added, connected with directed edges to the original sources and sinks. Then the capacities of the added edges would cover all maximum capacities of the particular sources and sinks. The situation is depicted in the diagram. There, the black vertices on the left correspond to the given sources, while the black vertices on the right stand for the given sinks. On the left,
13.2.16 that the SG-value of the initial position of our game is equal to the xor of the SG-values of the initial positions of the particular games, namely
$$9 \oplus 10 \oplus 11 \oplus 14 = 6.$$
Since this value is non-zero, there exists a winning strategy for the first player: he always moves to a position whose SG-value is zero; such a position must exist by the definition of the SG-function. For instance, the first move would be to remove 6 tokens from the pile containing 14. (We look at the highest one in the binary expansion of the SG-value and find a pile where the corresponding bit is also one. Then, we set this bit to zero, thereby surely decreasing the number of tokens, and adjust the lower bits so that there would be an even number of ones in each position, resulting in zero SG-value.) □
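The computation of the nim-sum and of the winning moves can be reproduced in a few lines. This is a sketch with names of our choosing; a move is reported as a pair (index of the pile, number of tokens to remove).

```python
from functools import reduce
from operator import xor


def nim_value(piles):
    """SG-value of the position: the xor of the pile sizes."""
    return reduce(xor, piles, 0)


def winning_moves(piles):
    """All moves (pile index, tokens to remove) that lead to a position with SG-value zero."""
    s = nim_value(piles)
    # Pile p must be reduced to p ^ s, which is a legal move only if p ^ s < p.
    return [(i, p - (p ^ s)) for i, p in enumerate(piles) if (p ^ s) < p]


print(nim_value([9, 10, 11, 14]))      # 6
print(winning_moves([9, 10, 11, 14]))  # [(3, 6)]: remove 6 tokens from the pile of 14
```

In this position, the move found in the solution is the only winning one, since 9 ⊕ 6, 10 ⊕ 6 and 11 ⊕ 6 all exceed the respective pile sizes.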
13.H.2. Consider the following game for two players: On the table, there is a pile of tokens. Players alternate moves where the move consists of either splitting one pile into two (non-empty) piles or removing an arbitrary (positive) number of tokens from a pile. The player who takes the last token wins. Find the SG-value of the initial position of this game if the pile contains n tokens.
Solution. We are going to prove by induction that any non-negative integer k satisfies:
$$g(4k + 1) = 4k + 1,$$
$$g(4k + 2) = 4k + 2,$$
$$g(4k + 3) = 4k + 4,$$
$$g(4k + 4) = 4k + 3.$$
Clearly, we have g(0) = 0. The following picture shows how we can deduce the value of the SG-function for one-, two-, and three-token piles. However, it is apparent that this would be much harder for a general number of tokens.
there is an artificial source (a white vertex), and there is an artificial sink on the right. The edge values are not shown in the diagram.
Flow networks
A network is a directed graph $G = (V, E)$ with a distinctive vertex z, called the source, and another distinctive vertex s, called the sink, together with a non-negative assessment of the edges $w : E \to \mathbb{R}$, which represents their capacities. A flow in a network $S = (V, E, z, s, w)$ is an assessment of the edges $f : E \to \mathbb{R}$ such that, for each vertex v except for the source and the sink, the total input is equal to the total output, i.e.
$$\sum_{e \in IN(v)} f(e) = \sum_{e \in OUT(v)} f(e).$$
This rule is often called Kirchhoff's law (referring to the terminology used in physics).
The size of a flow f is given by the total balance of the source values
$$\sum_{e \in OUT(z)} f(e) - \sum_{e \in IN(z)} f(e).$$
It follows directly from the definition that the size of a flow f can also be computed as
$$\sum_{e \in IN(s)} f(e) - \sum_{e \in OUT(s)} f(e).$$
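The two expressions for the size of a flow can be compared on a small example. The network below is hypothetical (it is not the one from the diagram), and the function names are ours; capacities are not checked here, only Kirchhoff's law.

```python
def balance(edges, flow, x):
    """Total output minus total input of the vertex x."""
    out = sum(flow[e] for e in edges if e[0] == x)
    inn = sum(flow[e] for e in edges if e[1] == x)
    return out - inn


def flow_size(edges, flow, source, sink):
    """Check Kirchhoff's law at all inner vertices and return the size of the flow."""
    vertices = {x for e in edges for x in e}
    for v in vertices - {source, sink}:
        if balance(edges, flow, v) != 0:
            raise ValueError(f"Kirchhoff's law is violated at the vertex {v!r}")
    return balance(edges, flow, source)


# Directed edges (tail, head) of a made-up network with the source z and the sink s.
edges = [("z", "a"), ("z", "b"), ("a", "b"), ("a", "s"), ("b", "s")]
flow = {("z", "a"): 3, ("z", "b"): 2, ("a", "b"): 1, ("a", "s"): 2, ("b", "s"): 3}
print(flow_size(edges, flow, "z", "s"))  # 5, computed at the source
print(-balance(edges, flow, "s"))        # 5, computed at the sink
```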
The left hand part of the following diagram shows a simple network with the source in the white circled vertex and the sink in the black bold vertex. The labels over the edges determine the maximal capacities. Looking at the sum of the capacities that enter the sink, the size of a flow in this network is at most 5 (the sum of the capacities leaving the source is larger).
13.2.11. Maximum flow problem. The next task is to find the maximum possible flow for a given network on a graph G. The right hand side of the above diagram shows a flow of size five, and the size of any flow cannot exceed this. The fundamental principle is that the capacities of a set of edges
Now, assume that the above is satisfied for all positive integers below 4k + 1 and let us prove that we indeed have g(4k + 1) = 4k + 1. By the definition, the SG-value is the least non-negative integer l such that there is no move to a position with SG-value l. Moreover, this property (including that the terminal positions have zero value) determines the Sprague-Grundy function uniquely. Therefore, it suffices to prove that, for each l < 4k + 1, we can move to a position with SG-value l, and that we cannot get into a position with SG-value 4k + 1. The former is clear since by the induction hypothesis, the SG-values of one-pile games of 0, 1, ..., 4k tokens take all the integers 0, 1, ..., 4k (although not in this order), so we can just remove the corresponding number of tokens from the pile. Now, we are going to show that we cannot reach a position with SG-value 4k + 1: We already know that the only moves that could possibly lead to this SG-value are to split the pile into two. If we examine the resulting amounts modulo 4, there are two possibilities: either the number of tokens in one of the resulting piles is divisible by 4 (4a) and the other one leaves remainder 1 (4b + 1), or the numbers leave remainders 2 and 3, respectively. As for the former case, the SG-values of the resulting piles are, by the induction hypothesis, 4a − 1 and 4b + 1 (the numbers of tokens in the particular piles are non-zero and less than 4k + 1, so we may use the induction hypothesis). In the latter case, i.e., if we split the pile into 4a + 2 and 4b + 3 tokens, we get that their SG-values are 4a + 2 and 4b + 4. Furthermore, a two-pile game is the sum of the two corresponding one-pile games, so the SG-value of the two-pile game is the xor (nim-sum) of the SG-values of the two piles. In both cases, the SG-value leaves remainder 2 upon division
are added up through which each path from z to s must go. In the diagram, there are three such choices providing the limits 12, 8,5 (from left to right). At the same time, in such a simple case the flow that realizes the maximal possible value is easily found. This idea can be formalized as follows:
Cut in a network
A cut in a network S = (V, E, z, s, w) is a set of edges C ⊆ E such that when these edges are removed, there remains no path from the source z to the sink s in the graph G = (V, E \ C). The number
\[ |C| = \sum_{e \in C} w(e) \]
is called the capacity of the cut C.
Clearly, there is no flow whose size is greater than the capacity of a cut. We present the Ford-Fulkerson algorithm⁸, which finds a cut with the minimum possible capacity as well as a flow which realizes this value. This proves the following theorem:
Theorem. In any network S = (V,E, z, s, w), the maximum size of a flow equals the minimum capacity of a cut in S.
The idea of the algorithm is quite simple. It looks for paths between the vertices of the graph, trying to "saturate" them with the flow. For this purpose, define the following terminology:
An undirected path from the vertex v to the vertex v' in a network S = (V, E, z, s, w) with a flow f is called unsaturated if and only if all edges e directed along the path from v to v' satisfy f(e) < w(e) and the edges e in the other direction satisfy f(e) > 0 (sometimes, one tries to saturate the flow in the other direction, yielding a semipath, or the augmenting semipath). The residual capacity of an edge e is the number w(e) - f(e) if the edge is directed from v to v', and it is the number f(e) otherwise. The residual capacity of a path is defined to be the minimum residual capacity of its edges. For the sake of simplicity, assume that all the edge capacities are rational numbers.
Ford, L. R.; Fulkerson, D. R. (1956). "Maximal flow through a network". Canadian Journal of Mathematics 8: 399-404.
CHAPTER 13. COMBINATORIAL METHODS, GRAPHS, AND ALGORITHMS
by 4 (consider the last two bits). In particular, it is surely not equal to 4k + 1. This proves the induction step for positive integers of the form 4k + 1.
The proof for integers of the form 4k + 2 is analogous. The situation is more amazing in the 4k +3 case: Similarly as above, it follows from the induction hypothesis that the SG-values of the one-pile positions we can move to exhaust all the non-negative integers up to 4k + 2. However, note that if we split the pile into two containing 1 and 4k + 2 tokens, respectively, then their SG-values are also 1 and 4k+2 by the induction hypothesis, and the xor of these integers is 4k+ 3. It remains to prove that there is no move into a position with SG-value 4k + 4: Again, the only remaining possibility is to split the existing pile. Then, the resulting remainders modulo 4 are either 0 and 3, or 1 and 2. By the induction hypothesis the remainders of the corresponding SG-values are respectively 3 and 0 in the former case, and 1 and 2 in the latter. In either case, the xor of these integers (and thus the SG-value of the resulting position) leaves remainder 3, so it is not equal to 4k + 4. This proves the induction step for positive integers of the form 4k + 3. The proof for integers of the form 4k + 4 is analogous. □
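The pattern established by this induction can be checked numerically for small piles. The following is only a minimal sketch; it assumes the rules implied by the proof above (a move either removes a positive number of tokens from the pile or splits the pile into two non-empty piles), and the names `mex`, `sg` and `predicted` are illustrative, not taken from the text.

```python
from functools import lru_cache

def mex(values):
    """Least non-negative integer not contained in values."""
    m = 0
    while m in values:
        m += 1
    return m

@lru_cache(maxsize=None)
def sg(n):
    """SG-value of one pile of n tokens (assumed rules: remove tokens or split)."""
    # Moves that remove tokens lead to a single smaller pile of m tokens.
    reachable = {sg(m) for m in range(n)}
    # Moves that split the pile lead to a sum of two piles: xor of their SG-values.
    reachable |= {sg(a) ^ sg(n - a) for a in range(1, n // 2 + 1)}
    return mex(reachable)

def predicted(n):
    """Pattern from the proof: 4k+1 -> 4k+1, 4k+2 -> 4k+2, 4k+3 -> 4k+4, 4k+4 -> 4k+3."""
    if n == 0:
        return 0
    r = n % 4
    if r == 3:
        return n + 1
    if r == 0:
        return n - 1
    return n

print([sg(n) for n in range(1, 13)])
print(all(sg(n) == predicted(n) for n in range(60)))
```

Memoization keeps the computation quadratic in the pile size, so a few dozen values are obtained instantly.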
I. Generating functions
13.1.1. In how many ways can we buy 12 packs of coffee if we can choose from 5 kinds?
Further, solve this problem with the following modifications:
i) we want to buy at least 2 packs of each kind;
ii) we want to buy an even number of packs of each kind;
iii) there are only 3 packs of one of the kinds.
Solution. The basic problem is a classical example of a combinatorial problem on the number of combinations with repetition (12 packs chosen from 5 kinds); the answer is $\binom{12+5-1}{5-1} = \binom{16}{4}$. The modifications can also be solved by combinatorial reasoning with a bit of invention. However, we want to demonstrate how these problems can be solved (almost without any need to think) using generating functions.
The wanted number corresponds to the coefficient at $x^{12}$ in the expansion of the function
\[ (1 + x + x^2 + \cdots)^5 = (1 + x + x^2 + \cdots)(1 + x + x^2 + \cdots) \cdots (1 + x + x^2 + \cdots) \]
Ford-Fulkerson algorithm
Input: A network S = (V, E, z, s, w).
Output: A maximal possible flow f : E → R and a minimal cut C, which is given by those edges which lead from U to V \ U.
(1) Initialization: Set f(e) = 0 for each edge e ∈ E, and using depth-first search from z, find the set U ⊆ V of those vertices to which there exists an unsaturated path.
(2) The main loop: While s ∈ U, do
• select an unsaturated path P from the source z to the sink s; then increase the flow f along all edges of the path P by the value of the residual capacity of P.
• update U.
Proof. As seen, the size of any flow cannot exceed the capacity of any cut. Therefore, it suffices to show that when the algorithm terminates, the capacity of the generated cut equals the size of the constructed flow.
The algorithm terminates at the first moment when there is no unsaturated path from the source z to the sink s. This means that U does not contain s, and every edge e leading from U to a vertex outside of U satisfies f(e) = w(e) (otherwise, the other endpoint of e would be added to U).
For the same reason, all edges e leading from V \ U to U must have f(e) = 0.
Clearly, the total size of the flow satisfies
\[ |f| = \sum_{e \text{ from } U \text{ to } V \setminus U} f(e) \; - \sum_{e \text{ from } V \setminus U \text{ to } U} f(e). \]
However, when the algorithm terminates, the second sum vanishes and f(e) = w(e) for every edge in the first sum, so this expression equals
\[ \sum_{e \text{ from } U \text{ to } V \setminus U} w(e) = |C|, \]
which is the desired result.
It remains to show that the algorithm always terminates. Since the edge capacities are assumed to be rational numbers, it can be assumed by rescaling that the capacities are integers. Then every flow constructed during the run of the algorithm has integer size. In addition, every iteration of the main loop increases the size of the flow by at least one. However, since any cut bounds the maximum size of any flow from above, the algorithm must terminate after a finite number of steps. □
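The steps of the algorithm can be sketched in code as follows. This is only an illustrative sketch, not the authors' implementation: the network is assumed to be given as a dictionary of integer edge capacities without parallel edges, and the function name `max_flow_min_cut` as well as the small example network are hypothetical.

```python
from collections import defaultdict

def max_flow_min_cut(edges, z, s):
    """edges: dict {(u, v): capacity}. Returns (flow size, flow, cut edges)."""
    flow = {e: 0 for e in edges}
    adj = defaultdict(list)
    for (u, v) in edges:
        adj[u].append((v, (u, v), +1))   # edge traversed along its direction
        adj[v].append((u, (u, v), -1))   # edge traversed against its direction

    def residual(e, direction):
        return edges[e] - flow[e] if direction == +1 else flow[e]

    def search():
        """Depth-first search from z along edges with positive residual capacity."""
        pred = {z: None}
        stack = [z]
        while stack:
            u = stack.pop()
            for v, e, d in adj[u]:
                if v not in pred and residual(e, d) > 0:
                    pred[v] = (u, e, d)
                    stack.append(v)
        return pred                       # the keys of pred form the set U

    pred = search()
    while s in pred:
        # Reconstruct the unsaturated path from s back to z.
        path, v = [], s
        while pred[v] is not None:
            u, e, d = pred[v]
            path.append((e, d))
            v = u
        delta = min(residual(e, d) for e, d in path)   # residual capacity of the path
        for e, d in path:
            flow[e] += d * delta          # increase forwards, decrease backwards
        pred = search()                   # update U

    U = set(pred)
    cut = [e for e in edges if e[0] in U and e[1] not in U]
    back = sum(flow[e] for e in edges if e[0] not in U and e[1] in U)
    return sum(flow[e] for e in cut) - back, flow, cut

# Hypothetical example network with source 'z' and sink 's'.
edges = {('z', 'a'): 3, ('z', 'b'): 2, ('a', 'b'): 1, ('a', 's'): 2, ('b', 's'): 3}
value, flow, cut = max_flow_min_cut(edges, 'z', 's')
print(value, cut)
```

In this example the size of the flow and the capacity of the returned cut coincide, as the theorem states.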
into a power series. The number of packs of the first kind determines which term is selected from the first parenthesis, and similarly for the other kinds. (Note that we need not pay special attention to the fact that there cannot be more than 12 packs of a given kind - it turns out that infinite series are usually simpler to work with than finite polynomials.) Since
\[ \frac{1}{1-x} = 1 + x + x^2 + \cdots \]
(see 13.4.3), the function we are considering is $(1-x)^{-5}$. Our task is thus to expand $(1-x)^{-5}$ into a power series. By the generalized binomial theorem from 13.4.3, the coefficient at $x^k$ is the number $\binom{k+4}{4}$, which is $\binom{16}{4}$ in our case. Note that using generating functions, we have answered the question not only for 12, but rather for an arbitrary number of packs of coffee.
The modifications can be solved analogously:
i) The generating function is
\[ (x^2 + x^3 + \cdots)^5 = \frac{x^{10}}{(1-x)^5}, \]
hence the coefficient at $x^{12}$ is equal to $\binom{2+4}{4} = \binom{6}{4}$.
ii) An even number of each kind corresponds to the generating function
\[ (1 + x^2 + x^4 + \cdots)^5 = \frac{1}{(1-x^2)^5}. \]
The coefficient at $x^{12}$ can be found by many means; the easiest one seems to be the substitution $y = x^2$ and looking for the coefficient at $y^6$ (which can be perceived as joining the packs into pairs in the shop). This leads to the answer $\binom{6+4}{4} = \binom{10}{4}$.
iii) In this case, the generating function equals
\[ (1 + x + x^2 + x^3)(1 + x + x^2 + \cdots)^4 = \frac{1-x^4}{(1-x)^5}, \]
and the wanted result is thus $\binom{16}{4} - \binom{12}{4}$. □
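The coefficients at $x^{12}$ for the basic problem and its three modifications can also be checked by multiplying truncated series directly. The following is a small sketch; the helper names `poly_mul` and `coefficient` are illustrative.

```python
def poly_mul(p, q, deg):
    """Product of the coefficient lists p and q, truncated to degree deg."""
    r = [0] * (deg + 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if a and b and i + j <= deg:
                r[i + j] += a * b
    return r

def coefficient(factors, deg):
    """Coefficient at x^deg in the product of the given (truncated) series."""
    result = [1] + [0] * deg
    for f in factors:
        result = poly_mul(result, f, deg)
    return result[deg]

N = 12
any_number = [1] * (N + 1)                              # 1 + x + x^2 + ...
at_least_two = [0, 0] + [1] * (N - 1)                   # x^2 + x^3 + ...
even = [1 if k % 2 == 0 else 0 for k in range(N + 1)]   # 1 + x^2 + x^4 + ...
at_most_three = [1, 1, 1, 1]                            # 1 + x + x^2 + x^3

basic = coefficient([any_number] * 5, N)
case_i = coefficient([at_least_two] * 5, N)
case_ii = coefficient([even] * 5, N)
case_iii = coefficient([at_most_three] + [any_number] * 4, N)
print(basic, case_i, case_ii, case_iii)   # prints 1820 15 210 1325
```

Truncating every series at degree 12 is harmless, since higher powers cannot contribute to the coefficient at $x^{12}$.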
13.1.2. In how many ways can we use the coins of values 1, 2, 5, 10, 20, and 50 crowns to pay exactly 100 crowns?
Solution. We are looking for non-negative integers $a_1, a_2, a_5, a_{10}, a_{20}$, and $a_{50}$ such that $a_i$ is a multiple of $i$ for all $i \in \{1, 2, 5, 10, 20, 50\}$ and, at the same time, $a_1 + a_2 + a_5 + a_{10} + a_{20} + a_{50} = 100$. We can see that the wanted number of ways can be obtained as the coefficient
The run of the algorithm is illustrated in two diagrams. On the left, there are two shortest unsaturated paths from the source to the sink in gray (the upper one has two edges, while the lower one has three). On the right, another path is saturated (taking the first turn in the upper path), also drawn in gray. Now, it is apparent that there can be no other unsaturated path from the source to the sink. Therefore, the algorithm terminates at this moment.
13.2.12. Further remarks. The algorithm allows for further conditions to be incorporated in the problem. For instance, capacity limits can be set for the vertices of the network as well. Also, there may be not only upper limits for the flows along particular edges or through vertices, but also lower ones.
It is easy to add vertex capacities - just double every vertex (one for the incoming edges, the other for the outgoing edges), connecting each pair with an edge of the corresponding capacity.
The lower limits for the flow can be included in the initialization part of our algorithm. However, one needs to check whether such a flow exists at all. Many other variations can be found in literature.
On the other hand, the algorithm does not necessarily terminate if the edge capacities are irrational. Moreover, the flows that are constructed during the run may not even converge to the optimal solution in such a case. However, it still holds that if the algorithm terminates, then a maximum flow is found.
[Margin note: Put some nice example in the other column; cf. the Ford-Fulkerson "pathological" demo from the course COS 423 (spring 2013) at www.cs.princeton.edu.]
If the capacities are integers (equivalently, rational numbers), the running time of the algorithm can be bounded by O(f|E|), where f is the size of a maximum flow in the network and |E| is the number of edges. The worst case occurs if every iteration increases the size of the flow by one.
In the proof of correctness, no explicit way of searching the graph when looking for an unsaturated path is used. Another variation of the Ford-Fulkerson algorithm is to use breadth-first search. The resulting algorithm is called Edmonds-Karp, and its running time is O(|V||E|²).⁹ We mention Dinic's algorithm, which simplifies the search for an unsaturated path by constructing the level graph, where augmenting edges are considered only if they lead between
Edmonds, Jack; Karp, Richard M. (1972). "Theoretical improvements in algorithmic efficiency for network flow problems". Journal of the ACM (Association for Computing Machinery) 19 (2): 248-264. doi: 10.1145/321694.321699.
at $x^{100}$ in the product
\[ (1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^5 + x^{10} + \cdots)(1 + x^{10} + x^{20} + \cdots)(1 + x^{20} + x^{40} + \cdots)(1 + x^{50} + x^{100} + \cdots) = \frac{1}{1-x} \cdot \frac{1}{1-x^2} \cdot \frac{1}{1-x^5} \cdot \frac{1}{1-x^{10}} \cdot \frac{1}{1-x^{20}} \cdot \frac{1}{1-x^{50}}. \]
The result can be obtained using the software SAGE, for instance (the names of the used commands are pretty self-descriptive, aren't they?):
sage: f = 1/(1-x) * 1/(1-x^2) * 1/(1-x^5) \
....:     * 1/(1-x^10) * 1/(1-x^20) * 1/(1-x^50)
sage: r = taylor(f, x, 0, 100)
sage: r.coefficient(x, 100)
4562
□
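Without a computer algebra system, the same coefficient can be obtained by multiplying the series one factor at a time. The following sketch (the function name `count_payments` is illustrative) uses the fact that multiplying a series by $1/(1-x^c)$ amounts to the update $w_m \leftarrow w_m + w_{m-c}$ performed for increasing m.

```python
def count_payments(total, coins):
    """Coefficient at x^total in the product of 1/(1 - x^c) over all coins c."""
    ways = [1] + [0] * total              # the constant series 1
    for c in coins:
        # Multiply the current series by 1/(1 - x^c) = 1 + x^c + x^(2c) + ...
        for amount in range(c, total + 1):
            ways[amount] += ways[amount - c]
    return ways[total]

print(count_payments(100, [1, 2, 5, 10, 20, 50]))   # prints 4562
```

The inner loop runs over increasing amounts, so each coin may be used arbitrarily many times, exactly as in the geometric series.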
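13.1.3. Expand the following functions into power series:
i) $\frac{x}{x+2}$, ii) $\frac{x^2+x+1}{2x^3-3x^2+1}$.
Solution.
i)
\[ \frac{x}{x+2} = \frac{x}{2-(-x)} = \frac{x/2}{1-(-x/2)} = \frac{x}{2} - \frac{x^2}{4} + \frac{x^3}{8} - \cdots = \sum_{n=1}^{\infty} (-1)^{n-1} \frac{x^n}{2^n}. \]
ii) We perform partial fraction decomposition:
\[ \frac{x^2+x+1}{2x^3-3x^2+1} = \frac{x^2+x+1}{(x-1)^2(2x+1)} = \frac{A}{2x+1} + \frac{B}{x-1} + \frac{C}{(x-1)^2}, \]
finding out that $A = B = \frac{1}{3}$ and $C = 1$; hence
\[ \frac{x^2+x+1}{2x^3-3x^2+1} = \frac{1/3}{1+2x} - \frac{1/3}{1-x} + \frac{1}{(1-x)^2} = \sum_{n=0}^{\infty} \left( \frac{1}{3}\big((-2)^n - 1\big) + (n+1) \right) x^n. \]
□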
13.1.4. Find the generating functions of the following sequences:
i) (1,2,3,4,5,...),
ii) (1,4,9,16,...),
iii) (1,1,2,2,4,4,8,8,...),
vertices whose distances from the source differ. The time complexity of this algorithm is O(|V|²|E|), which is much better for dense graphs than the Edmonds-Karp algorithm.
13.2.13. Problems related to flow networks. A good application of flow networks is the problem of bipartite matching. The task is to find a maximum matching in a bipartite graph, i.e. a set of as many edges as possible so that each vertex of the graph is the endpoint of at most one of the selected edges.
This is an abstract variation of a quite common problem. For instance, it may be needed to match boys and girls in dancing lessons, provided information about which pairs would be willing to dance together is given.
This problem is easily reduced to the problem of maximum flow. Add an artificial source to the graph and connect it with edges to all vertices of one part of the bipartite graph, while the vertices of the other part are connected to an artificial sink. The capacity of each edge is set to one, and the resulting graph is searched for the maximum flow. Then, the edges that are used in the flow correspond to the selected pairs. Of course, the information on which pairs may be put together is included simply by leaving out the edges of the pairs that are not admissible.
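With unit capacities, looking for an unsaturated path from the artificial source to the artificial sink amounts to looking for an alternating path that re-matches some already matched vertices. The sketch below follows this idea directly instead of building the flow network explicitly, so it is a specialization of the reduction rather than the reduction itself; the names and the data are hypothetical.

```python
def maximum_matching(left, willing):
    """left: vertices of one part; willing[u]: acceptable partners of u in the other part."""
    partner = {}                          # partner[v] = u for matched vertices v

    def augment(u, visited):
        """Try to match u, possibly re-matching previously matched vertices."""
        for v in willing.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Either v is free, or its current partner can be matched elsewhere.
            if v not in partner or augment(partner[v], visited):
                partner[v] = u
                return True
        return False

    for u in left:
        augment(u, set())
    return {(u, v) for v, u in partner.items()}

# Hypothetical data: who is willing to dance with whom.
willing = {
    "Adam": ["Eva", "Hana"],
    "Boris": ["Eva"],
    "Cyril": ["Hana", "Iva"],
}
pairs = maximum_matching(list(willing), willing)
print(len(pairs), sorted(pairs))
```

Each successful call of `augment` corresponds to one iteration of the main loop of the flow algorithm, increasing the size of the matching by one.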
Another important application of flow networks is the proof of Menger's theorem (mentioned as a theorem in 13.1.10). It can be understood as follows: Given a directed graph, set the capacity of each edge as well as each vertex to one. Further, select an arbitrary pair of vertices v and w, which are considered to be the source and the sink, respectively. Then, the size of a maximum flow in this graph equals the maximum number of disjoint paths from v to w (the paths may share only the source and the sink). Every cut divides v and w into different connected components of the remaining graph (since they are chosen to be the source and sink). The desired statements then follow from the fact that the size of a maximum flow equals the capacity of a minimum cut.
13.2.14. Game trees. We turn our attention to a very broadly used application of tree structures when analyzing possible strategies or procedures. They can be encountered in the theory of artificial intelligence as well as in game theory. They play an important role in economics and many other social fields.
This is about games. In the mathematical sense, game theory examines models in which one or more players take turns in playing moves according to predetermined and generally known rules. Usually, the moves are assessed with profits or losses for the given player. Then, the task is to find a strategy for each player, i.e. an algorithmic procedure which maximizes the profits or minimizes the losses.
We use an extensive description of the games. This means that a complete and finite analysis of all possible states of the game is given, and the resulting analysis gives an exact account about the profits and losses. This is supposing that the other players also play the best moves for them.
iv) (9, 0, 0, 2 · 16, 0, 0, 4 · 25, 0, 0, 8 · 36, ...),
v) (9, 1, -9, 32, 1, -32, 100, 1, -100, ...).
○
13.1.5. In how many ways can we buy n pieces of the following five kinds of fruit if we do not distinguish between particular pieces of a given kind, we need not buy all kinds, and:
• there is no restriction on the number of apples we buy,
• we want to buy an even number of bananas,
• the number of pears we buy must be a multiple of 4,
• we can buy at most 3 oranges, and
• we can buy at most 1 pomelo.
Solution. The generating function for the sequence $(a_n)$, where $a_n$ is the wanted number of ways to buy n pieces of fruit, is
\[ (1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^4 + x^8 + \cdots)(1 + x + x^2 + x^3)(1 + x) = \frac{1}{1-x} \cdot \frac{1}{1-x^2} \cdot \frac{1}{1-x^4} \cdot \frac{1-x^4}{1-x} \cdot (1+x) = \frac{1}{(1-x)^3}. \]
By the generalized binomial theorem, we have $(1-x)^{-3} = \sum_{n=0}^{\infty} \binom{n+2}{2} x^n$. Therefore, the wanted number of ways satisfies $a_n = \binom{n+2}{2}$.
□
13.1.6. Using the generalized binomial theorem, prove again the following combinatorial identities:
• $\sum_{k=0}^{n} \binom{n}{k} = 2^n$,
• $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$,
• $\sum_{k=0}^{n} k \binom{n}{k} = n 2^{n-1}$.
A game tree is a rooted tree whose vertices are the possible states of the game and they are labeled according to whose turn it is. The outgoing edges of a vertex correspond to the possible moves of the player from that state. This complete description of a game using the game tree may be used for common games like chess, naughts and crosses (known also as tic-tac-toe), etc.
As a simple example, consider a simple variation of the game known as Nim.¹⁰
There are k tokens on the table (the tokens may be sticks or matches), where k ≥ 1 is an integer, and players take turns at removing one or two tokens. The player who manages to take the last token(s) wins. There is a variation of the game in which the player who is forced to take the last token loses. The tree of this game, including all necessary information about the game, can be constructed as follows:
• The state with ℓ tokens on the table and the first player to move corresponds to the subtree rooted at $F_\ell$. The state with the same number of tokens but the second player to move is represented by the subtree rooted at $S_\ell$.
• The vertex $F_\ell$ has $S_{\ell-1}$ as its left-hand son and $S_{\ell-2}$ as its right-hand son. Similarly, the sons of the vertex $S_\ell$ are $F_{\ell-1}$ and $F_{\ell-2}$.
• The leaves are always $F_0$ or $S_0$. (In the variation when the player to take the last token loses, these would be the states $F_1$ and $S_1$.)
Every run of the game starting at the root $F_k$ corresponds to exactly one leaf of the resulting tree. Therefore, the total number p(k) of possible runs for $F_k$ satisfies
\[ p(k) = p(k-1) + p(k-2) \]
for k ≥ 3, and clearly p(1) = 1 and p(2) = 2. This difference equation has already been considered. It is satisfied by the Fibonacci numbers, which can be computed by an explicit formula (see the subsection on generating functions at the end of this chapter, or the corresponding part about difference equations in chapter three, cf. 3.B.1). Hence, a formula is known for the number of possible runs of the game. The number of possible states equals the number of all vertices of the tree. The game always ends in a win of one of the players. We can also consider games where a tie is possible.
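The number of runs can be checked by counting the leaves of the game tree recursively. The sketch below assumes the usual indexing $F_0 = 0$, $F_1 = 1$ of the Fibonacci numbers, under which the initial values p(1) = 1 and p(2) = 2 give $p(k) = F_{k+1}$; the function names are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def runs(k):
    """Number of leaves of the game tree with k tokens (a move takes 1 or 2 tokens)."""
    if k == 0:
        return 1                 # no tokens left: this vertex is a leaf
    return sum(runs(k - t) for t in (1, 2) if k - t >= 0)

def fibonacci(n):
    """n-th Fibonacci number with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([runs(k) for k in range(1, 8)])   # prints [1, 2, 3, 5, 8, 13, 21]
```

The recursion does not distinguish which player is to move, since the shape of the subtree depends only on the number of tokens left.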
13.2.15. Game analysis. The tree structure allows an analysis of the game so that an algorithmic strategy for each player can be built. This is done with a simple recursive procedure for assessing the root of a subtree. Each vertex is given a label: W for vertices where the first player can force a win, L for those where the first player loses if the other one plays optimally, and, optionally, T for vertices where optimal play of both players results in a tie. The procedure is as follows:
¹⁰ The game was given this name by Charles Bouton in his analysis of this type of games from 1901. It refers to the German word "Nimm!", meaning "Take!".
Solution. Substituting into the binomial theorem
\[ (1+x)^n = \binom{n}{0} + \binom{n}{1} x + \binom{n}{2} x^2 + \cdots + \binom{n}{n} x^n \]
the numbers x = 1 and x = -1, we obtain the first and second identities, respectively. Then, the third one can be obtained by viewing both sides of the binomial theorem "continuously" and using the properties of derivatives: differentiate both sides with respect to x and substitute x = 1. □
13.1.7. In a box, there are 30 red, 40 blue, and 50 white balls. Balls of one color are indistinguishable. In how many ways can we select 70 balls?
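Solution. The wanted number is equal to the coefficient at $x^{70}$ in the product
\[ (1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}). \]
This product can be rearranged to
\[ (1-x)^{-3}(1-x^{31})(1-x^{41})(1-x^{51}), \]
whence, using the generalized binomial theorem, we obtain
\[ \left( \binom{2}{2} + \binom{3}{2} x + \binom{4}{2} x^2 + \cdots \right) \left( 1 - x^{31} - x^{41} - x^{51} + x^{72} + \cdots \right). \]
Hence, the coefficient at $x^{70}$ is clearly
\[ \binom{70+2}{2} - \binom{70+2-31}{2} - \binom{70+2-41}{2} - \binom{70+2-51}{2} = 1061. \]
□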
13.1.8. Prove that
\[ \sum_{k=1}^{n} H_k = (n+1)(H_{n+1} - 1). \]
Solution. The necessary convolution can be obtained as the product of the series $\frac{1}{1-x}$ and $\frac{1}{1-x} \ln \frac{1}{1-x}$. Hence:
\[ \frac{1}{1-x} \cdot \frac{1}{1-x} \ln \frac{1}{1-x} = \sum_{n=1}^{\infty} \Big( \sum_{k=1}^{n} H_k \Big) x^n, \]
whence the wanted identity follows easily. □
13.1.9. Solve the recurrence
\[ a_0 = a_1 = 1, \qquad a_n = a_{n-1} + 2a_{n-2} + (-1)^n. \]
Solution. As always, it may be a good idea to write out a few terms of the sequence (however, this will not help us much in this case; still, it can serve as verification of the result).²
Step 1: $a_n = a_{n-1} + 2a_{n-2} + (-1)^n [n \ge 0] + [n = 1]$.
Step 2: $A(x) = xA(x) + 2x^2 A(x) + \frac{1}{1+x} + x$.
(1) The leaves are labeled directly according to the rules of the game (in the case of our Nim, the leaves $S_0$ are labeled by W, and the leaves $F_0$ by L).
(2) Consider a vertex $F_\ell$. Label it W if it has a son who is labeled by W. If there is no such son, but there is a son labeled by T, then $F_\ell$ is given the label T. Otherwise, i.e. if all sons are labeled by L, then $F_\ell$ also gets L.
(3) Similarly, a vertex $S_\ell$ is labeled L if it has a son labeled by L. Otherwise, if it has a son labeled by T, it receives T. Otherwise (i.e. if it has only W-sons), it is labeled by W.
Calling this procedure on the root of the tree gives the labeling of each vertex as well as an optimal strategy for each player:
• The first player tries to move to a vertex labeled by W; if this cannot be achieved, he moves to a T-vertex at least.
• Similarly, the second player tries to move to a vertex labeled by L; if this cannot be achieved, he moves to a T-vertex at least.
The depth of the recursion is given by the depth of the tree. For instance, having a Nim game with k tokens, the depth is k.
This analysis is not very useful yet. In order to use it in the mentioned form, the entire game tree must be at our disposal. This can be a great amount of data (for instance, in the case of naughts and crosses on a 3 x 3 playground, the corresponding tree has tens of thousands of vertices). Usually, the analysis with a game tree is used when only a small part of the whole tree is examined, applying appropriate heuristic methods, and the corresponding part is created dynamically during the game. This is a fascinating field of the modern theory of artificial intelligence. The details are omitted.
There is a more compact representation of the tree structure for our purposes of complete formal analysis. If the game tree for Nim is drawn, then one state of the game is represented by many vertices which correspond to different histories of the game. However, the strategies depend only on the actual state (i.e. the number of tokens and the player to move) rather than on the history of the game. Therefore, the same game can be described by a graph where for each number of tokens, there is only one vertex, and the whole strategy is determined by identifying who is winning (whether this is the player on move or the other one). Directed edges are used for the description of possible moves, and the result is always an acyclic graph.
Despite the statement in Concrete mathematics, this sequence can already be found in The On-Line Encyclopedia of Integer Sequences.
Step 3:
\[ A(x) = \frac{1 + x + x^2}{(1-2x)(1+x)^2}. \]
Step 4: $a_n = \frac{7}{9} 2^n + \left( \frac{1}{3} n + \frac{2}{9} \right) (-1)^n$.
□
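As suggested at the beginning of the solution, the first few terms can serve as a verification. A short sketch comparing the recurrence with the closed form $a_n = \frac{7}{9} 2^n + \left(\frac{1}{3} n + \frac{2}{9}\right)(-1)^n$ in exact rational arithmetic (the function names are illustrative):

```python
from fractions import Fraction

def by_recurrence(count):
    """First `count` terms of a_0 = a_1 = 1, a_n = a_(n-1) + 2 a_(n-2) + (-1)^n."""
    a = [1, 1]
    for n in range(2, count):
        a.append(a[n - 1] + 2 * a[n - 2] + (-1) ** n)
    return a[:count]

def closed_form(n):
    """a_n = (7/9) 2^n + (n/3 + 2/9) (-1)^n, evaluated exactly."""
    return Fraction(7, 9) * 2 ** n + (Fraction(n, 3) + Fraction(2, 9)) * (-1) ** n

terms = by_recurrence(8)
print(terms)                                   # prints [1, 1, 4, 5, 14, 23, 52, 97]
print(all(closed_form(n) == terms[n] for n in range(8)))
```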
13.1.10. Quicksort - analysis of the average case. Our task is to determine the expected number of comparisons made by Quicksort, a well-known algorithm for sorting a (finite) sequence of elements.
An example of a simple divide-and-conquer implementation:
def qsort(L):
    if L == []: return []
    return qsort([x for x in L[1:] if x < L[0]]) + L[0:1] \
        + qsort([x for x in L[1:] if x >= L[0]])
It is not too difficult to construct a formula for the number of comparisons (we assume that the particular orders of the sequence to be sorted are distributed uniformly). The following parameters are important for the analysis of the algorithm:
i) The number of comparisons in the divide phase: n - 1.
ii) The uniformness assumption: the probability that L[0] is the k-th greatest element of the sequence is 1/n.
iii) The sizes of the sorted subsequences in the conquer phase: k - 1 and n - k.
We thus get the following recurrent formula for the expected number of comparisons:
\[ C_n = n - 1 + \sum_{k=1}^{n} \frac{1}{n} (C_{k-1} + C_{n-k}). \]
It is possible to solve this recurrence (using certain tricks which can be learned to some extent) even without using generating functions. By the symmetry of both sums,
\[ C_n = n - 1 + \frac{2}{n} \sum_{k=1}^{n} C_{k-1}. \]
The example of the game Nim is displayed on the diagram. On the left, there is a complete game tree corresponding to three tokens. The directed graph on the right represents the game with seven tokens. A complete tree for this game would already have 21 leaves, and the number of leaves grows exponentially with the number of tokens.
The individual vertices in the directed acyclic graph on the right-hand side of the diagram indicate the number of tokens left and the information whether the game at that state is won by the player who is to move (letter N as "next") or the other one (letter P as "previous"). Altogether, considering a game with k tokens, this graph always has only k + 1 vertices. At the same time, there is the complete strategy encoded in the graph: The players always try to move from the current state into a vertex labeled by P if such one exists.
In fact, every directed acyclic graph can be seen as a description of a game. The initial situations are represented by vertices which have no incoming edges (there can be one or more of them), and the game ends in leaves, i.e. vertices with no outgoing edges (again, there can be one or more of them).
The strategy for each player can be obtained by a simple recursive procedure as above (for the sake of simplicity, it is assumed that there is no tie):
• The leaves are labeled by P (the player who is to move from a leaf loses).
• A non-leaf vertex of the graph is labeled by N if there is an edge leading to a P-vertex. Otherwise, it is labeled P.
In the case of our variation of Nim, the situation is very simple. It follows from the strategy described that the player who is to move loses if and only if the number of tokens is divisible by three.
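The labeling by N and P, and the resulting rule about divisibility by three, can be reproduced by a few lines of code. This is a minimal sketch for the one-pile game in which one or two tokens are removed and the last token wins; the function name `label_positions` is illustrative.

```python
def label_positions(k, moves=(1, 2)):
    """Label the states 0..k of the one-pile game by 'N' or 'P'."""
    label = {}
    for tokens in range(k + 1):
        successors = [tokens - m for m in moves if tokens - m >= 0]
        # A leaf (no move available) is P; a non-leaf vertex is N exactly when
        # some edge leads to a P-vertex.
        if any(label[t] == "P" for t in successors):
            label[tokens] = "N"
        else:
            label[tokens] = "P"
    return label

labels = label_positions(7)
print(labels)
```

The states are processed in increasing order of the number of tokens, so every successor has already been labeled when it is needed.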
The games that can be represented by a directed acyclic graph are called impartial. These are exactly those games which satisfy:
• in every state, both players choose from the same set of moves;
• the number of possible states is finite;
• the game has "zero sum", i.e. the better the outcome for one of the players, the worse for the other one.
An example of an impartial game is tic-tac-toe. Although the players use different symbols in this game, they can place them in any of the unoccupied squares. On the other hand, chess is not an impartial game in this sense, since the set of possible moves in every situation depends on the number of pieces the players have at their disposal.
13.2.16. Sums of combinatorial games. The rules of the real classical game Nim are somewhat more complicated: There are three piles of tokens. The move consists of selecting one of the piles and removing an arbitrary (positive) number of tokens from that pile. The player who manages to take the last token wins. There is a variation of the game in which the
Multiplying the last recurrence for $C_n$ by n, we get
\[ nC_n = n(n-1) + 2\sum_{k=1}^{n} C_{k-1}; \]
the same expression for n - 1 reads
\[ (n-1)C_{n-1} = (n-1)(n-2) + 2\sum_{k=1}^{n-1} C_{k-1}. \]
Subtracting these two equalities and rearranging, we have thus obtained a much simpler recurrence:
\[ nC_n = (n+1)C_{n-1} + 2(n-1). \]
On the other hand, this equation contains non-constant coefficients as well.
We can also note that the recurrence has been simplified to the extent that the values $C_n$ can be computed iteratively. Nevertheless, it is advantageous to express these values explicitly as a function of n (or at least to approximate them).
First, we use a slight trick: dividing both sides by n(n + 1):
\[ \frac{C_n}{n+1} = \frac{C_{n-1}}{n} + \frac{2(n-1)}{n(n+1)}. \]
Now, we "expand" this expression (telescope; we can also use the substitution $B_n = C_n/(n+1)$):
\[ \frac{C_n}{n+1} = \frac{2(n-1)}{n(n+1)} + \frac{2(n-2)}{(n-1)n} + \cdots + \frac{2 \cdot 1}{2 \cdot 3} + \frac{C_1}{2}. \]
Hence
\[ \frac{C_n}{n+1} = 2\sum_{k=1}^{n-1} \frac{k}{(k+1)(k+2)}. \]
This can be summed up using partial fraction decomposition, for instance:
\[ \frac{k}{(k+1)(k+2)} = \frac{2}{k+2} - \frac{1}{k+1}. \]
This leads to
\[ \frac{C_n}{n+1} = 2\left( H_{n+1} - 2 + \frac{1}{n+1} \right), \]
whence
\[ C_n = 2(n+1)H_{n+1} - 4(n+1) + 2 \]
($H_n = \sum_{k=1}^{n} \frac{1}{k}$ is the sum of the first n terms of the harmonic progression). At the same time, we can give the bound $H_n \sim \ln n + \gamma$, whence
\[ C_n \sim 2(n+1)\big(\ln(n+1) + \gamma - 2\big) + 2. \]
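The explicit formula can be compared with the values obtained directly from the original recurrence $C_n = n - 1 + \frac{1}{n}\sum_{k=1}^{n}(C_{k-1} + C_{n-k})$ with $C_0 = 0$. The following sketch uses exact rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction

def expected_comparisons(max_n):
    """C_0..C_max_n from C_n = n - 1 + (1/n) * sum_{k=1}^{n} (C_(k-1) + C_(n-k))."""
    C = [Fraction(0)]
    for n in range(1, max_n + 1):
        total = sum(C[k - 1] + C[n - k] for k in range(1, n + 1))
        C.append(n - 1 + total / n)
    return C

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

def closed_form(n):
    """C_n = 2 (n + 1) H_(n+1) - 4 (n + 1) + 2."""
    return 2 * (n + 1) * harmonic(n + 1) - 4 * (n + 1) + 2

C = expected_comparisons(10)
print([str(c) for c in C[:5]])
print(all(C[n] == closed_form(n) for n in range(11)))
```

For instance, $C_3 = 8/3$: two comparisons in the divide phase, plus one further comparison in two of the three equally likely positions of the pivot.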
13.1.11. Using the generating function $F(x) = x/(1 - x - x^2)$ for the Fibonacci sequence, find the generating function for the "semi-Fibonacci" sequence $(F_0, F_2, F_4, \ldots)$. ○
13.1.12. The fan of order n is a graph on n + 1 vertices, which are labeled 0, 1, ..., n, with the following edges: vertex 0 is connected to all other vertices, and for each k satisfying 1 ≤ k < n, vertex k is connected to vertex k + 1. How many spanning trees does this graph have?
Solution. Denoting by $v_n$ the wanted number of spanning trees, we clearly have $v_1 = 1$, and since the fan of order 2 is the triangle graph $K_3$, we have $v_2 = 3$. Further, we are going to show that for n ≥ 1, the following recurrence³ holds:
\[ v_n = v_{n-1} + \sum_{k=0}^{n-1} v_k + 1, \qquad v_0 = 0. \]
player who is forced to take the last token loses. If this game is considered with one pile, the situation is easy: The first player takes all the tokens and wins immediately. However, it is not that easy with three piles. Whether the analysis of the one-pile game is of any use for this more complicated game is a good question.
For this purpose, introduce a new concept, the sum of impartial games: A situation in the game composed of two simpler games is a pair of possible situations in the particular games. Then, a move consists of selecting one of the two games and performing a move in that game (the other game is left unchanged). Therefore, the sum of impartial games is an operation which assigns to a pair of directed acyclic graphs a new one.
Considering graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, their sum $G_1 + G_2$ is the graph G = (V, E), where $V = V_1 \times V_2$ and
\[ E = \{(v_1v_2, w_1v_2);\ (v_1, w_1) \in E_1\} \cup \{(v_1v_2, v_1w_2);\ (v_2, w_2) \in E_2\}. \]
In the case of one game, the vertices can be labeled step-by-step by the letters N and P in an upwards manner, according to whether one can get to a P-vertex along some of the edges. However, in the sum of games, movement along the edges is needed in a much more complicated way. Therefore, finer tools are needed for expressing the reachability of vertices labeled by P from other vertices.
This needs some preparation which might seem like a strange magic (but the proof of the theorem below shows that all this is quite natural). Define the Sprague-Grundy function g : V → N recursively on a directed acyclic graph G = (V, E) as follows:¹¹
(1) for a leaf v, set g(v) = 0;
(2) for a non-leaf vertex v ∈ V, define
g(v) = mex{g(w); (v, w) is an edge},
where the minimum excluded value function mex is defined on subsets S of the natural numbers N = {0, 1, ...} by
mex S = min(N \ S).
The function g(v) is just the mex S operation for the set S of the values g(w) over those vertices w where one can get along an edge from v.
Note that this definition is correct since, clearly, the formula uniquely defines a function that assigns a natural number to any vertex in the acyclic graph in question.
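The recursive definition translates almost literally into code. The following is a minimal sketch, assuming that the acyclic graph is given as a dictionary mapping each vertex to the list of its successors; the names `mex` and `sprague_grundy` are illustrative. As an example, it is applied to the one-pile game in which one or two tokens are removed.

```python
def mex(values):
    """Minimum excluded value: the least natural number not contained in values."""
    m = 0
    while m in values:
        m += 1
    return m

def sprague_grundy(graph):
    """graph: dict vertex -> list of successors in a directed acyclic graph."""
    g = {}

    def value(v):
        if v not in g:
            # For a leaf, the set of successor values is empty and its mex is 0.
            g[v] = mex({value(w) for w in graph[v]})
        return g[v]

    for v in graph:
        value(v)
    return g

# The game with one pile of at most 7 tokens, where a move removes 1 or 2 tokens.
graph = {k: [k - m for m in (1, 2) if k - m >= 0] for k in range(8)}
g = sprague_grundy(graph)
print(g)
```

In this example the values are the remainders of the number of tokens modulo three, so the vertices with value 0 are exactly those labeled by P above.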
Yet another operation on the natural numbers is needed. It is the binary XOR operation
(a, b) ↦ a ⊕ b,
³ Using this recurrent formula to calculate more values $v_n$, we find out that $v_3 = 8$, $v_4 = 21$, which suggests a hypothesis about a connection with the Fibonacci sequence in the form $v_n = F_{2n}$. This can be proved easily by induction.
11 We are presenting the theory which was developed in combinatorial game theory independently by R. P. Sprague in 1935 and P. M. Grundy in 1939.
For a fixed spanning tree of the fan of order n, let k be the greatest integer of the set {l,...,n — 1} such that the spanning tree contains all edges of the path (0,1,2,3,..., k). This spanning tree cannot contain the edges {0,2},..., {0, k}, {k, k + 1}; therefore, there are the same number of spanning trees for a fixed k as in the fan of order n — k with vertices 0,k + l,k + 2,...,n, i. e. Dn_i. Further, we must count one spanning tree for fc = n and those spanning trees which do not contain the edge {0,1} (thus they must contain the edge {1,2}) - they are obtained from fans of order ti — 1 on vertices 0, 2,..., ?i. We have thus obtained the wanted recurrence
$$v_n = v_{n-1} + v_{n-1} + v_{n-2} + \dots + v_0 + 1.$$
Now, we have the general formula
$$v_n = v_{n-1} + \sum_{k=0}^{n-1} v_k + 1 - [n = 0],$$
whence the usual procedure for finding the generating function $V(x)$ of this sequence yields
$$V(x) = xV(x) + \sum_{n=0}^{\infty}\sum_{k<n} v_k x^n + \frac{1}{1-x} - 1 = xV(x) + \sum_{k=0}^{\infty}\sum_{n>k} v_k x^n + \frac{x}{1-x}$$
$$= xV(x) + \sum_{k=0}^{\infty} v_k x^k \cdot \frac{x}{1-x} + \frac{x}{1-x} = xV(x) + \frac{x}{1-x}V(x) + \frac{x}{1-x}.$$
The solution of the equation $V(x) = xV(x) + \frac{x}{1-x}V(x) + \frac{x}{1-x}$ is
$$V(x) = \frac{x}{1 - 3x + x^2},$$
whence using the standard method (partial fraction decomposition) or the previous problem leads to the result $v_n = F_{2n}$.
□
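The result can be checked numerically. The following Python sketch iterates the recurrence $v_n = v_{n-1} + \sum_{k<n} v_k + 1 - [n = 0]$ (with $v_{-1} = 0$) and compares the values with the Fibonacci numbers $F_{2n}$. The function names are chosen for this illustration only.

```python
def fan_spanning_trees(count):
    """First `count` values v_0, v_1, ... computed from the recurrence."""
    v = []
    for n in range(count):
        previous = v[n - 1] if n >= 1 else 0
        # sum(v) equals v_0 + ... + v_{n-1} at this point.
        v.append(previous + sum(v) + 1 - (1 if n == 0 else 0))
    return v


def fibonacci(count):
    """First `count` Fibonacci numbers F_0 = 0, F_1 = 1, ..."""
    f = [0, 1]
    while len(f) < count:
        f.append(f[-1] + f[-2])
    return f[:count]


v = fan_spanning_trees(10)
F = fibonacci(20)
```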
Recursively connected sequences. Sometimes, we are able to express the wanted number of ways or events only in terms of more mutually connected sequences.
13.1.13. In how many ways can we cover a $3 \times n$ rectangle with $1 \times 2$ domino pieces? Evaluate this value for $n = 20$.
Solution. We can easily find out that $c_1 = 0$, $c_2 = 3$, $c_3 = 0$, and it is reasonable to set $c_0 = 1$ (this is not merely a convention; there is indeed a unique empty covering).
performing the exclusive-or operation bitwise on the binary expansions of $a$ and $b$. This operation can be considered from the following point of view: consider the binary expansions of $a$ and $b$ to be vectors in the vector space $(\mathbb{Z}_2)^k$ over $\mathbb{Z}_2$ (for a sufficiently large $k$), and add them there. The resulting vector is the binary expansion of $a \oplus b$. Now the main result can be formulated:
Sprague-Grundy theorem
13.2.17. Theorem. Consider a directed acyclic graph $G = (V, E)$. Its vertices $v$ are labeled by P if and only if $g(v) = 0$, where $g$ is the Sprague-Grundy function.
For any two directed acyclic graphs $G_1 = (V_1, E_1)$, $G_2 = (V_2, E_2)$ and their Sprague-Grundy functions $g_1$, $g_2$, the Sprague-Grundy function $g$ of their sum is given by
$$g(v_1 v_2) = g_1(v_1) \oplus g_2(v_2).$$
Proof. The first proposition of the theorem follows directly by induction from the definition of the Sprague-Grundy function g.
The proof of the other part is more complicated. Let $v_1 v_2$ be a position of the game $G_1 + G_2 = (V, E)$, and consider any $a \in \mathbb{N}_0$ such that $a < g_1(v_1) \oplus g_2(v_2)$. We claim that there exists a state $x_1 x_2$ of the game $G_1 + G_2$ such that $g_1(x_1) \oplus g_2(x_2) = a$ and $(v_1 v_2, x_1 x_2) \in E$, and, at the same time, there is no edge $(v_1 v_2, y_1 y_2) \in E$ such that
$$g_1(y_1) \oplus g_2(y_2) = g_1(v_1) \oplus g_2(v_2).$$
This justifies the recursive definition of the Sprague-Grundy function and proves the rest of the theorem. To show why, first find a vertex $x_1 x_2$ with a given value $a < g_1(v_1) \oplus g_2(v_2)$ of the Sprague-Grundy function.
Consider the integer $b := a \oplus g_1(v_1) \oplus g_2(v_2)$. Refer to the bit of value $2^i$ as the $i$-th bit of an integer. Clearly, $b \neq 0$. Let $k$ be the position of the highest one in the binary expansion of $b$, i.e. $2^k \le b < 2^{k+1}$. This means that the $k$-th bit of exactly one of the integers $a$, $g_1(v_1) \oplus g_2(v_2)$ is one and that these integers do not differ in higher bits. It follows from the assumption $a < g_1(v_1) \oplus g_2(v_2)$ that it is the integer $g_1(v_1) \oplus g_2(v_2)$ whose $k$-th bit is one.
Therefore, the $k$-th bit of exactly one of the integers $g_1(v_1)$, $g_2(v_2)$ is one. Assume without loss of generality that it is the integer $g_1(v_1)$. Further, consider the integer $c := g_1(v_1) \oplus b$. Recall that the highest one of $b$ is at position $k$, so the integers $c$, $g_1(v_1)$ do not differ in higher bits and the $k$-th bit of $c$ is zero. Therefore, $c < g_1(v_1)$. Then, by the definition of the function value $g_1(v_1)$, there must exist a state $w_1$ of the game $G_1$ such that $(v_1, w_1) \in E_1$ and $g_1(w_1) = c$. Now, $(v_1 v_2, w_1 v_2) \in E$ and
$$g_1(w_1) \oplus g_2(v_2) = c \oplus g_2(v_2) = g_1(v_1) \oplus b \oplus g_2(v_2) = g_1(v_1) \oplus a \oplus g_1(v_1) \oplus g_2(v_2) \oplus g_2(v_2) = a.$$
This fulfills the first part of our plan.
We are looking for a recursive formula. Discussing the behavior "on the edge", we find out that $c_n = 2r_{n-1} + c_{n-2}$, $r_n = c_{n-1} + r_{n-2}$, $r_0 = 0$, $r_1 = 1$, where $r_n$ is the number of coverings of the $3 \times n$ rectangle without one of the corner squares.
The values of $c_n$ and $r_n$ for the first few non-negative integers $n$ are:
n      0   1   2   3   4    5    6    7
c_n    1   0   3   0   11   0    41   0
r_n    0   1   0   4   0    15   0    56
• Step 1: $c_n = 2r_{n-1} + c_{n-2} + [n = 0]$, $r_n = c_{n-1} + r_{n-2}$.
• Step 2: $C(x) = 2xR(x) + x^2 C(x) + 1$, $R(x) = xC(x) + x^2 R(x)$.
• Step 3: $$C(x) = \frac{1 - x^2}{1 - 4x^2 + x^4}, \qquad R(x) = \frac{x}{1 - 4x^2 + x^4}.$$
• Step 4: We can see that both denominators depend only on $x^2$. We can thus save much work if we consider the function $D(z) = 1/(1 - 4z + z^2)$. Then, we have $C(x) = (1 - x^2)D(x^2)$, i.e., $[x^{2n}]C(x) = [x^{2n}](1 - x^2)D(x^2) = [x^n](1 - x)D(x)$, so $c_{2n} = d_n - d_{n-1}$.
The roots of $1 - 4x + x^2$ are $2 + \sqrt{3}$ and $2 - \sqrt{3}$, whence the standard procedure yields
$$c_{2n} = \frac{(2 + \sqrt{3})^n}{3 - \sqrt{3}} + \frac{(2 - \sqrt{3})^n}{3 + \sqrt{3}}.$$
Just like with the Fibonacci sequence, the second term is negligible for large values of $n$ and is always between 0 and 1. Therefore,
$$c_{2n} = \left\lceil \frac{(2 + \sqrt{3})^n}{3 - \sqrt{3}} \right\rceil.$$
For instance, $c_{20} = 413403$.
□
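The two coupled recurrences are easy to iterate directly, which gives an independent check of the table and of the value $c_{20}$. The following Python sketch does this and also evaluates the rounded closed formula in floating point arithmetic. The function names are chosen for this illustration only.

```python
import math


def domino_tilings_3xn(n):
    """c_n from c_n = 2 r_{n-1} + c_{n-2}, r_n = c_{n-1} + r_{n-2}."""
    c = [1, 0]  # c_0, c_1
    r = [0, 1]  # r_0, r_1
    for m in range(2, n + 1):
        c.append(2 * r[m - 1] + c[m - 2])
        r.append(c[m - 1] + r[m - 2])
    return c[n]


def closed_form(half):
    """c_{2n} for n = half via the ceiling formula (floating point, small n only)."""
    s = math.sqrt(3)
    return math.ceil((2 + s) ** half / (3 - s))
```

For large arguments the floating point evaluation loses precision, so the integer recurrence is the safer of the two.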
13.1.14. Using generating functions, find the expected number of ones in a random bit string of length $n$.
Solution. Let $B$ be the set of bit strings, and for $b \in B$, let $|b|$ denote the length of $b$ and $j(b)$ the number of ones in it. The generating function for the number of bit strings is of the form
$$B(x) = \sum_{b \in B} x^{|b|} = \sum_{n \ge 0} 2^n x^n = \frac{1}{1 - 2x}.$$
The generating function for the number of ones is
$$C(x) = \sum_{b \in B} j(b)\, x^{|b|}.$$
Further, consider any edge $(v_1 v_2, y_1 y_2) \in E$ in $G$, where $(v_1, y_1) \in E_1$, and hence $v_2 = y_2$. Suppose that $g_1(y_1) \oplus g_2(y_2) = g_1(v_1) \oplus g_2(v_2)$. Then, $g_1(y_1) \oplus g_2(v_2) = g_1(v_1) \oplus g_2(v_2)$. Clearly, the terms $g_2(v_2)$ can be canceled (it is an operation in a vector space), leading to $g_1(y_1) = g_1(v_1)$. This contradicts the properties of the Sprague-Grundy function $g_1$ of the game $G_1$. This proves the second part of the theorem.
□
The following useful result is a direct corollary of this theorem:
Corollary. A vertex $v_1 v_2$ in the sum of games is labeled by P if and only if $g_1(v_1) = g_2(v_2)$.
For example, if three piles of tokens are combined in the simplified Nim game (it is always allowed to take only one or two tokens), the first player always wins if all three piles have the same number of tokens, not divisible by three. The individual functions $g_i(k)$ for the individual piles equal the remainder after dividing $k$ by 3. It follows that, when summing the first two Nim pile games, the value $g(v) = 0$ is obtained for the initial position. Summing again with another pile game gives $g(v) \neq 0$.
In the original game, the individual piles are described by $g(k) = k$ (any number of tokens can be taken, hence the function $g$ grows in this way). The losing positions are those where the binary sum of the numbers of tokens is zero. For example, if two of the three initial piles are of equal size, then a simple winning strategy is to remove the third one completely and always make the remaining two equal after the opponent's move.
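The two examples can be reproduced by a few lines of Python. The sketch below computes the value of a sum of pile games as the XOR of the individual values and, for the original Nim game, looks for a move into a position of value zero. The function names and the optional `grundy` parameter are assumptions made for this illustration.

```python
from functools import reduce
from operator import xor


def nim_value(piles, grundy=lambda k: k):
    """XOR of the Sprague-Grundy values of the individual piles."""
    return reduce(xor, (grundy(k) for k in piles), 0)


def winning_move(piles):
    """For original Nim: (pile index, new pile size) reaching value 0, or None."""
    total = nim_value(piles)
    if total == 0:
        return None  # P-position: every move leads to a nonzero value
    for i, k in enumerate(piles):
        target = k ^ total
        if target < k:  # a legal move only removes tokens
            return i, target
    return None  # not reached when total != 0
```

For instance, with the piles (3, 3, 7) the suggested move removes the third pile completely, as in the strategy described above.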
Remark. Further details are omitted in this text. It can be proved that every finite directed acyclic graph is isomorphic to a finite sum of suitably generalized games of Nim.
In particular, the analysis of the simple game and construction of the function g basically (at least implicitly) gives a complete analysis of all impartial games.
3. Remarks on Computational Geometry
A large number of practical problems consist in constructing or analyzing some finite geometrical objects in Euclidean spaces, mainly in 2D or 3D. This is a very busy area of both applications and research. At the same time, most of the algorithms and their complexity analysis are based on graph theoretical and further combinatorial concepts. We provide several glimpses into this beautiful topic.12 We discuss convex hulls, triangulations, and Voronoi diagrams and focus on a few basic approaches only.
12 The beautiful book Computational Geometry: Algorithms and Applications by M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars, published by Springer (1997), can be warmly recommended, http://www.springer.com/us/book/9783540779735
A string $b$ can be obtained from the one bit shorter string $b'$ by adding either a zero or a one, i.e., each $b'$ contributes $j(b')$ ones and $j(b') + 1$ ones. Therefore,
$$C(x) = \sum_{b' \in B} (1 + 2j(b'))\, x^{|b'|+1} = \sum_{b' \in B} x^{|b'|+1} + 2\sum_{b' \in B} j(b')\, x^{|b'|+1} = xB(x) + 2xC(x).$$
Hence
$$C(x) = \frac{xB(x)}{1 - 2x} = \frac{x}{(1 - 2x)^2},$$
and the $n$-th coefficient is $c_n = 2^{n-1}\binom{n}{1} = n 2^{n-1}$. This number gives the total number of ones in all strings of length $n$, and there are $b_n = 2^n$ such strings. Therefore, the expected number of ones in such a string is $c_n / b_n = n/2$, which is, of course, what we have anticipated. □
13.1.15. Find the generating function and an explicit formula for the $n$-th term of the sequence $\{a_n\}$ defined by the recurrence
$$a_0 = 1, \quad a_1 = 2, \qquad a_n = 4a_{n-1} - 3a_{n-2} + 1 \text{ for } n \ge 2.$$
Solution. The universal formula which holds for all $n \ge 0$ (with $a_n = 0$ for $n < 0$) is
$$a_n = 4a_{n-1} - 3a_{n-2} + 1 - 3[n = 1].$$
13.3.1. Convex hulls. We start with a simple and practical problem. In the plane $\mathbb{R}^2$, suppose $n$ points $X = \{v_1, \dots, v_n\}$ are given and the task is to find their convex hull CH(X). As learned in Chapter 4, CH(X) is given by a convex polygon, and it is desired to find it efficiently.
First we have to decide how CH(X) should be encoded as a data structure. Choose the connected list of edges. There is the cyclic list of the vertices $v_i$ in the polygon, sorted in the counterclockwise order, together with pointers towards the oriented segments between the consecutive vertices (the edges). Moreover, there is the list of edges pointing to their tail and head vertices.
There is a simple way to get CH(X). Namely, create the oriented edges $e_{ij}$ for all pairs $(v_i, v_j)$ of the points in $X$, and decide whether $e_{ij}$ belongs to CH(X) by testing whether all the other points of $X$ are on the left of $e_{ij}$ (in the obvious sense). It is known already from Chapter 1 that this is tested in constant time by means of the determinant. Clearly, $e_{ij}$ belongs to CH(X) if and only if all the latter tests are positive. In the end, the order in which to sort the edges and vertices in the output is found.
This does not look like a good algorithm, since $O(n^2)$ edges need to be tested against $O(n)$ points. Hence, cubic time complexity is expected. But there is a strong and simple improvement available. Consider the lexicographic order on the points $v_i$ with respect to their coordinates. Then build the convex hull consecutively and run through the tests only for the edges having the last added vertex as their tail.
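The constant-time determinant test and the cubic brute-force procedure built on it can be sketched in Python as follows. The function names and the representation of points as coordinate pairs are assumptions of this illustration, and the sketch assumes distinct points with no three of them collinear.

```python
from itertools import permutations


def orientation(p, q, r):
    """Determinant test: positive if r lies to the left of the oriented edge pq."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def brute_force_hull_edges(points):
    """Oriented edges (p, q) with all other points strictly on their left.

    Tests O(n^2) candidate edges against O(n) points, hence O(n^3) time.
    """
    edges = []
    for p, q in permutations(points, 2):
        if all(orientation(p, q, r) > 0 for r in points if r != p and r != q):
            edges.append((p, q))
    return edges
```

The edges are returned in no particular order; sorting them into a cycle is the final step mentioned in the text.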
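Multiplying by $x^n$ and summing over all $n$, we get the following equation for the generating function $A(x) = \sum_{n=0}^{\infty} a_n x^n$:
$$A(x) = 4xA(x) - 3x^2 A(x) + \frac{1}{1 - x} - 3x.$$
Hence, we can express
$$A(x) = \frac{3x^2 - 3x + 1}{(1 - x)^2 (1 - 3x)} = \frac{3}{4} \cdot \frac{1}{1 - x} - \frac{1}{2} \cdot \frac{1}{(1 - x)^2} + \frac{3}{4} \cdot \frac{1}{1 - 3x}.$$
Therefore, the coefficient at $x^n$ is
$$a_n = \frac{3}{4} - \frac{1}{2}(n + 1) + \frac{3}{4}\, 3^n = \frac{1 - 2n + 3^{n+1}}{4}.$$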
□
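A quick numerical check of this exercise: the Python sketch below iterates the recurrence $a_n = 4a_{n-1} - 3a_{n-2} + 1$ with $a_0 = 1$, $a_1 = 2$ and compares the terms with the explicit formula $(3^{n+1} - 2n + 1)/4$. The function names are chosen for this illustration only.

```python
def recurrence_terms(count):
    """First `count` terms of a_n = 4 a_{n-1} - 3 a_{n-2} + 1, a_0 = 1, a_1 = 2."""
    a = [1, 2]
    while len(a) < count:
        a.append(4 * a[-1] - 3 * a[-2] + 1)
    return a[:count]


def explicit_term(n):
    """(3^(n+1) - 2n + 1) / 4; the numerator is always divisible by 4."""
    return (3 ** (n + 1) - 2 * n + 1) // 4
```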
13.1.16. Solve the following recurrence using generating functions:
$$a_0 = 1, \quad a_1 = 2, \qquad a_n = 5a_{n-1} - 4a_{n-2} \text{ for } n \ge 2.$$
Solution. The universal formula is of the form
$$a_n = 5a_{n-1} - 4a_{n-2} - 3[n = 1] + [n = 0].$$
Multiplying both sides by $x^n$ and summing over all $n$, we obtain
$$A(x) = 5xA(x) - 4x^2 A(x) - 3x + 1.$$
Gift Wrapping Convex Hull Algorithm
Input: A set of points $X = \{v_1, \dots, v_n\}$ in the plane.
Output: The requested edge list for CH(X).
(1) Initialization. Take the smallest vertex $v_0$ in the lexicographic order with respect to the coordinates, and set $v_{\text{active}} = v_0$.
(2) Main cycle.
• Test edges with tail $v_{\text{active}}$, until an edge $e$ belonging to CH(X) is found.
• Add $e$ to CH(X) and set its head to be the new $v_{\text{active}}$.
• If $v_{\text{active}} \neq v_0$, then repeat the cycle.
Obviously, the leftmost and lowest vertex $v_0$ in $X$ is in CH(X). Since CH(X) is a cycle (as a directed graph), the algorithm works correctly. It is necessary to be careful about possible collinear edges in CH(X) and the lack of robustness of the test for those nearly collinear.
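A minimal Python sketch of the gift wrapping procedure follows. Instead of testing each candidate edge against all points separately, it keeps a running candidate and replaces it whenever a point appears on its right, which is a common way of organizing the same search. It returns the hull vertices in counterclockwise order rather than a full edge list, and for collinear points it keeps the farthest one; both choices, as well as the function names, are assumptions of this illustration.

```python
def orientation(p, q, r):
    """Positive if r lies to the left of the oriented edge pq."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def gift_wrapping(points):
    """Vertices of CH(X), counterclockwise, starting at the lexicographic minimum."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    hull = []
    active = pts[0]
    while True:
        hull.append(active)
        candidate = pts[0] if pts[0] != active else pts[1]
        for r in pts:
            if r == active:
                continue
            turn = orientation(active, candidate, r)
            farther = squared_distance(active, r) > squared_distance(active, candidate)
            # r on the right of the candidate edge means the candidate is not a hull edge.
            if turn < 0 or (turn == 0 and farther):
                candidate = r
        active = candidate
        if active == pts[0]:
            return hull
```

Each pass of the inner loop costs $O(n)$ and one pass is made per hull vertex, so the running time is proportional to $n$ times the size of the output.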
Hence
$$A(x) = \frac{1 - 3x}{(1 - 4x)(1 - x)} = \frac{2}{3} \cdot \frac{1}{1 - x} + \frac{1}{3} \cdot \frac{1}{1 - 4x}$$
and
$$a_n = \frac{2}{3} + \frac{1}{3}\, 4^n = \frac{4^n + 2}{3}.$$
□
13.1.17. A cash dispenser can provide us with banknotes of values 200, 500, and 1,000 crowns. In how many ways can we pick 7,000 crowns? Use generating functions to find the solution.
Solution. The problem can be reformulated as looking for the number of integer solutions of the equation
$$2a + 5b + 10c = 70; \quad a, b, c \ge 0.$$
This number is equal to the coefficient at $x^{70}$ in the function
$$G(x) = (1 + x^2 + x^4 + \cdots)(1 + x^5 + x^{10} + \cdots)(1 + x^{10} + x^{20} + \cdots).$$
This function is equal to
$$G(x) = \frac{1}{1 - x^2} \cdot \frac{1}{1 - x^5} \cdot \frac{1}{1 - x^{10}},$$
and since
$$\frac{1 - x^{10}}{1 - x^5} = 1 + x^5 \quad \text{and} \quad \frac{1 - x^{10}}{1 - x^2} = 1 + x^2 + x^4 + x^6 + x^8,$$
we can transform it into the form
$$G(x) = \frac{(1 + x^2 + x^4 + x^6 + x^8)(1 + x^5)}{(1 - x^{10})^3}.$$
By the generalized binomial theorem, we have
$$(1 - x^{10})^{-3} = \sum_{k=0}^{\infty} (-1)^k \binom{-3}{k} x^{10k}.$$
Therefore, $G(x)$ equals
$$(1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^{11} + x^{13}) \sum_{k=0}^{\infty} (-1)^k \binom{-3}{k} x^{10k}.$$
The term $x^{70}$ can be obtained only as $7 \cdot 10 + 0$, i.e., the coefficient at $x^{70}$ is equal to
$$[x^{70}]G(x) = \binom{3 + 7 - 1}{7} = \frac{9 \cdot 8}{2} = 36.$$
□
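The coefficient can also be obtained by multiplying the three geometric series numerically. The following Python sketch (the function name is an assumption of this illustration) multiplies in one factor $1/(1 - x^v)$ at a time and reads off the coefficient at $x^{70}$.

```python
def count_ways(total, values):
    """Coefficient of x^total in the product of 1/(1 - x^v) over v in values."""
    coeff = [1] + [0] * total  # the series of the empty product is 1
    for v in values:
        # Multiplying by 1/(1 - x^v) turns coeff[t] into coeff[t] + coeff[t - v] + ...
        for t in range(v, total + 1):
            coeff[t] += coeff[t - v]
    return coeff[total]


ways = count_ways(70, [2, 5, 10])  # banknotes counted in hundreds of crowns
```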
This simple improvement reduces the worst-case running time of the algorithm to $O(n^2)$. The worst case occurs if all the points $v_i$ lie on one circle and, unluckily, the right next point is always found as the very last one in the partial tests. The actual running time is often much better, namely at most $O(ns)$, where $s$ is the size of CH(X).
For example, if the points are distributed randomly in the plane with a normal distribution (see Chapter 10 for what this means), then the expected size of CH(X) is known to be at most logarithmic in $n$.
At the same time, finding CH(X) for $X$ distributed on a circle is equivalent to sorting the points along the circle. So the worst-case running time cannot be better than $O(n \log n)$ for any algorithm (cf. 13.1.17).
13.3.2. The sweep line paradigm. We illustrate several main approaches to computational geometry algorithms for the same convex hull problem.
The latter algorithm is close to the idea of having a special object $L$ running through all the objects in the input $X$ consecutively, taking care of constructing the relevant partial output of the algorithm on the run. Two structures are involved: the event structure describing all the events needing consideration, and the sweep line structure carrying all the information needed to deal with the individual events. The procedure is similar to the search over a graph discussed earlier. In short, this may reduce the dimension of the problem (e.g. from 2D to 1D) at the cost of implementing dynamical structures.
To start, initialize the queue of the events and begin dealing with them. At each step, there are the still sleeping events (those further in the queue not yet treated), the current active events (those under consideration) and the already processed ones.
With the CH(X), this idea can be implemented as follows. Initialize the lexicographically ordered points $v_i$ in $X$. Notice that the first and last ones necessarily appear in CH(X). This way, CH(X) splits into two disjoint chains of edges between them. We call them the upper and the lower convex hulls. Hence the entire construction can be split into the construction of the upper convex hull and of the lower convex hull.
As in the diagram, moving through the events one by one, it is only necessary to check whether the edge joining the last but one active vertex with the current one makes a left or right turn (as usual, right means the clockwise orientation). If right, then add the edge to the list; if left, omit the recent edges one by one, until a right turn is obtained.
Sweep Line Upper Convex Hull
Input: A set of points $X = \{v_1, \dots, v_n\}$ in the plane.
Output: The directed path UCH(X).
(1) Initialization.
• Set the event structure to be the lexicographically ordered list of points $v_{\text{first}}, \dots, v_{\text{last}}$. There is no special sweep line structure apart from the indicator distinguishing the stage of the event.
• Set the active event to be $v_{\text{active}} = v_{\text{first}}$ and initiate UCH(X) as the trivial path with the single vertex $v_{\text{first}}$ (this is the current last vertex of the path under construction).
(2) Main cycle.
• Set the active event $v_{\text{active}}$ to the next point in the queue, and consider the potential edge $e$ having $v_{\text{active}}$ as the tail and the last vertex in UCH(X) as head.
• Check whether UCH(X) is to the left of $e$ (check it only against the last edge in the current UCH(X)). If so, add $e$ and $v_{\text{active}}$ to UCH(X). If not, remove edges in UCH(X) one by one, until the test turns positive.
• Repeat the cycle until the last event $v_{\text{last}}$ has been processed.
It is easy to check that the algorithm is correct. Exactly $n$ events need to be considered, and at each of them up to $O(n)$ vertices can be removed from the current UCH(X). This suggests $O(n^2)$ time, but in fact none of the vertices is added again to UCH(X) after removal. It follows that the asymptotic estimate for the main cycle is $O(n)$ in total, and it is the ordering in the initialization that dominates with its $O(n \log n)$ time. Clearly, linear $O(n)$ memory is sufficient, and so the optimal solution is achieved again for the convex hull problem.
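The sweep for the upper convex hull can be sketched in a few lines of Python. In this sketch the path is stored from left to right as a list of vertices, and a vertex is removed whenever the last two vertices together with the new point fail to make a clockwise turn. The function names and the orientation convention are assumptions of this illustration.

```python
def orientation(p, q, r):
    """Positive for a counterclockwise turn p -> q -> r, negative for clockwise."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def upper_hull(points):
    """Vertices of the upper convex hull, ordered lexicographically."""
    events = sorted(set(points))  # the event queue, O(n log n)
    chain = []
    for p in events:
        # Remove vertices until the last edge and p make a right (clockwise) turn.
        while len(chain) >= 2 and orientation(chain[-2], chain[-1], p) >= 0:
            chain.pop()
        chain.append(p)
    return chain
```

Every point is appended once and removed at most once, which is the reason for the linear bound on the main cycle.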
13.3.3. The divide and conquer paradigm. Another very standard idea is to divide the entire problem into pieces, apply recursively the same procedure on them, and merge the partial results together. These are the two phases of the divide and conquer approach. This paradigm is common in many areas, cf. 13.1.10.
With convex hulls, adopt the gift wrapping approach for the conquer phase. The idea is to split recursively the task producing disjoint "left CH(X)" and "right CH(X)" and to merge them by finding the upper and lower "tangent edge" of those two parts.
Divide and Conquer Convex Hull
Input: Set of points $X = \{v_1, \dots, v_n\}$ in the plane, ordered lexicographically.
Output: The directed path CH(X).
(1) Divide. If $n < 3$, return CH(X). Otherwise, split $X = X_1 \cup X_2$ into two subsets of (roughly) the same size, respecting the order (i.e. all vertices in $X_1$ are smaller than those in $X_2$).
(2) Merge.
• Start with the edge joining the largest point in $CH(X_1)$ and the smallest in $CH(X_2)$, and iteratively balance it to the lower tangent segment $e_l$ to $CH(X_1)$ and $CH(X_2)$.
• Proceed similarly to get the upper tangent $e_u$.
• Merge the relevant parts of $CH(X_1)$ and $CH(X_2)$ with the help of $e_l$ and $e_u$.
Perhaps the merge step requires some more explanation. The situation is illustrated in the diagram. For the upper tangent, first fix the right-hand vertex of the initial edge joining the two convex polygons. Then find the tangent to the left polygon from this vertex. Then fix the head of the moving edge and find the right-hand touch point of the potential tangent. After a finite number of exchanges like this, the edge stabilizes. This is the upper tangent edge $e_u$. Observe that during the balancing, we move only clockwise on the right-hand polygon and counterclockwise on the other one. Notice also that it is the smart divide strategy which prevents any of the points of the input $X_1$ from appearing inside $CH(X_2)$ and vice versa.
Again the analysis of the algorithm is easy. The typical merge time is asymptotically linear, and the recursive calls yield $O(\log n)$ levels of runs of the procedure. The total estimated time is again $O(n \log n)$. Notice that there is no initialization in the procedure itself; we just assume that the points in $X$ are already ordered. Hence another $O(n \log n)$ time must be added to prepare for the very first call. The memory necessary for running the algorithm is estimated by $O(n)$ if the recursion is implemented properly.
13.3.4. The incremental paradigm. This approach consists in taking the input objects one by one and consecutively building the required resulting structure. This is particularly useful if the application requires it, not having all the points available at the beginning. Imagine incrementally building the convex hull of shots into a target, as they occur.
Another good use is in randomized algorithms, where all the data is known at the beginning, but treated in a random order. Typically the expected running time is then very good, while there might be much less efficient, but improbable, worst-case runs.
The former case is easy to illustrate on the convex hull problem. In each step, employ the merge step of a very degenerate version of the divide and conquer algorithm, merging $CH(X_1)$ of the previously known points $X_1$ with $X_2 = CH(X_2) = \{v_k\}$, which is just the new point. But an extra step is needed to check whether $v_k$ is inside $CH(X_1)$ or not. If it is, then skip the new point and wait for the next one. If not, then merge. The worst-case time of this algorithm is $O(n^2)$, but as with the gift wrapping method, it depends on the actual size of the output as well as on the quality of the algorithm checking whether $v_k$ is inside $CH(X_1)$.
We illustrate the second case with a more elaborate convex hull algorithm. The main idea is to keep track of the position of all points in X with respect to the convex hull CH(Xk) of the first k of them (in the fixed order chosen at the beginning).
With this goal in mind, keep the dynamical structure of a bipartite graph $G$ whose first group of vertices consists of those points which have not been processed yet, while the other group contains all the faces of the current convex polygon $S = CH(X_k)$ (call them faces, not to be confused with the edges of the graph $G$). Remember the faces in $S$ are oriented. Such a face $e$ is in conflict with the point $v$ if the face is "visible" from $v$, i.e. $v$ is in the right-hand halfplane determined by $e$. Keep all points joined to each of their faces in conflict in the bipartite graph. Call $G$ the graph of conflicts. The algorithm can now be formulated:
Randomized Incremental Convex Hull Algorithm
Input: A non-empty set of points $X = \{v_1, \dots, v_n\}$, $n \ge 3$, in the plane.
Output: The edge list R of the convex hull CH(X).
(1) Initialization. Fix a random order on X. Choose the first three points as X0, create the list of conflicts for the edge list R = CH(X0) (i.e. state which of the three faces are seen from which points) and remove the three points from X.
(2) Main cycle. Repeat until the list X is empty:
• choose the first point $v \in X$;
• if there are some conflicts of v in G, then
- remove all the faces in conflict with v from both R and G,
- find the two new faces (the upper and lower tangents from the new point $v$ to the existing CH(X); they are easily found by taking care of the "vertices without missing edges"),
- add the two new faces to both R and G and find all their conflicts;
• remove the point v from the list X, and from the graph G.
The complete analysis of this algorithm is omitted. Notice that finding the newly emerging conflicts is easy since it is only necessary to check the potential conflicts of points which are in conflict with the faces incident to those two vertices in G before the update, to which the two new faces are attached.
It can be proved that the expected time for this algorithm is $O(n \log n)$, while the worst-case time is $O(n^2)$. The complete framework for the analysis of randomized algorithms is nicely explained in the book mentioned at the very beginning of this part, see page 926.
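The update step of the incremental construction can be sketched in Python as follows. The sketch is deliberately simplified: it does not maintain the graph of conflicts, but finds the faces visible from the new point by scanning the whole current polygon, so it only illustrates the removal of the faces in conflict and the insertion of the two new faces, not the expected $O(n \log n)$ running time. The function names, the `seed` parameter, and the assumption that no three input points are collinear belong to this illustration.

```python
import random


def orientation(p, q, r):
    """Positive if r lies to the left of the oriented edge pq."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def add_point(hull, p):
    """Insert p into a convex polygon given as a counterclockwise vertex list."""
    n = len(hull)
    # Face i joins hull[i] to hull[i + 1]; it is in conflict with p if p is on its right.
    visible = [orientation(hull[i], hull[(i + 1) % n], p) < 0 for i in range(n)]
    if not any(visible):
        return hull  # p is inside the polygon or on its boundary
    start = next(i for i in range(n) if visible[i] and not visible[i - 1])
    end = start
    while visible[end % n]:
        end += 1
    # Keep the vertices from the head of the last visible face round to the
    # tail of the first one, then close the polygon through the new point.
    result = []
    i = end % n
    while True:
        result.append(hull[i])
        if i == start:
            break
        i = (i + 1) % n
    result.append(p)
    return result


def incremental_hull(points, seed=0):
    """Hull vertices in counterclockwise order, points taken in a random order."""
    pts = list(points)
    random.Random(seed).shuffle(pts)
    a, b, c = pts[:3]
    hull = [a, b, c] if orientation(a, b, c) > 0 else [a, c, b]
    for p in pts[3:]:
        hull = add_point(hull, p)
    return hull
```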
13.3.5. Convex hull in 3D. Many applications need convex hulls of finite sets of points in higher dimensions, in particular in 3D. There are several ways of adjusting 2D algorithms to their 3D versions.
First, it needs to be stated what the right structure is for the CH(X). As seen in 13.1.21, the convex polyhedra in $\mathbb{R}^3$ can be perfectly described by planar graphs. In order to modify the algorithms into 3D, some good encoding for them is needed. We want to be able to find all vertices, edges, or faces which are incident or neighbouring in time proportional to the size of the output.
This is nicely achieved by the double connected edge lists.
Double connected edge list - DCEL
Let $G = (V, E, S)$ be a planar graph. The double connected edge list is the list $E$ such that each edge $e$ is equipped with the pointers
• V1, V2 to the tail and head of $e$
• F1, F2 to the two incident faces (the left one with respect to the directed edge $e$ first)
• P1 and P2 to the edges following along the face F1 and along the face F2, respectively (in the counterclockwise directions).
At the same time, keep the list of vertices and the list of faces, always with just one pointer towards some of the incident edges.
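One possible encoding of this structure in Python is sketched below. The class and field names mirror the pointers V1, V2, F1, F2, P1, P2 from the definition, while the helper functions `boundary` and `triangle` and the particular way P2 is wired for the outer face are assumptions of this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Vertex:
    point: tuple
    edge: Optional["Edge"] = None  # one of the incident edges


@dataclass(eq=False)
class Face:
    edge: Optional["Edge"] = None  # one of the incident edges


@dataclass(eq=False)
class Edge:
    v1: Vertex                   # tail
    v2: Vertex                   # head
    f1: Optional[Face] = None    # face on the left of the directed edge
    f2: Optional[Face] = None    # face on the right
    p1: Optional["Edge"] = None  # next edge along f1
    p2: Optional["Edge"] = None  # next edge along f2


def boundary(face):
    """Edges around a face, in time proportional to their number."""
    edges = []
    e = face.edge
    while True:
        edges.append(e)
        e = e.p1 if e.f1 is face else e.p2
        if e is face.edge:
            return edges


def triangle(a, b, c):
    """DCEL of a single counterclockwise triangle with its inner and outer face."""
    va, vb, vc = Vertex(a), Vertex(b), Vertex(c)
    inner, outer = Face(), Face()
    ab, bc, ca = (Edge(va, vb, inner, outer), Edge(vb, vc, inner, outer),
                  Edge(vc, va, inner, outer))
    ab.p1, bc.p1, ca.p1 = bc, ca, ab
    ab.p2, ca.p2, bc.p2 = ca, bc, ab
    va.edge, vb.edge, vc.edge = ab, bc, ca
    inner.edge = outer.edge = ab
    return inner, outer, (ab, bc, ca)
```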
Next, look at the incremental algorithm above and try to imagine what needs changing there to get it to work in 3D. First, identify the real faces $S$ of the DCEL of the convex hull and, instead of their boundary vertices, deal with boundary edges. Again, all the faces in conflict with the point just processed have to be removed. This leads to a directed cycle of edges with one of the pointers F1, F2 missing (call them "unsaturated edges"). Finally, instead of adding two new faces as in 2D, the tangent cone of faces joining the point $v$ with the unsaturated edges must be added. Of course, the graph of conflicts must also be updated.
Randomized Incremental 3D Convex Hull Algorithm
Input: A non-empty set of points $X = \{v_1, \dots, v_n\}$, $n \ge 4$, in the space $\mathbb{R}^3$.
Output: The DCEL R for the convex hull CH(X).
(1) Initialization. Fix a random order on $X$. Choose the first four points as $X_0$, create the list of conflicts $G$ and the DCEL $R$ for $CH(X_0)$ (i.e. tell which of the four faces are seen from which points) and remove the four points from $X$.
(2) Main cycle. Until the list X is empty, repeat:
• take the first point $v \in X$;
• if there are some conflicts of v in G, then
- remove all the faces of R in conflict with v (from both R and G), and take care of the edges e in R left without one of the incident faces,
- build the "tangent cone" from the new point v to the current R by connecting v to the latter "unsaturated" edges,
- add the new faces to both R and G and find all their conflicts (again, note that the check for new conflicts can be restricted to the points which were in conflict with the faces incident to those edges where the cone has been attached to the previous R).
• remove the point v from the list X and from the graph G.
A detailed analysis is omitted. As in the 2D case, the expected running time for this algorithm is $O(n \log n)$. Thanks to the DCEL data structure, which is very well adapted to the convex hull, it is a very good algorithm.
The divide and conquer algorithm from the 2D case can be easily adapted, too. Skipping details, the initial lexicographic ordering of the input points allows us to call recursively the same procedure, producing two DCELs of disjoint convex polytopes. This allows us to apply a more sophisticated "gift wrapping" approach when merging the results. A sort of "tubular collar" wrapping of the two polytopes to create their convex hull is desired. Imagine rotating planes, similarly as with the lines in 2D, in order to get the first edge in the tubular set of faces to be added. Then the first plane containing one of the missing new faces is obtained. Continue breaking the plane along the new edges, until both directed cycles along which the collar is attached to the previous two polytopes are closed. All that is done by bending the planes by the smallest possible angle in each step and checking that we arrive at the right position. Of course, the DCEL structure is essential to update all the data properly in time proportional to the size of the changes.
With a reasonably smart implementation, this algorithm achieves the optimal $O(n \log n)$ running time.
Both of the latter algorithms can be generalized to all higher dimensions, too.
13.3.6. Voronoi diagrams. The next stop is at one of the most popular and useful planar divisions (and searching in them). For a finite set of points $X = \{v_1, \dots, v_n\}$ in the plane $\mathbb{R}^2$, there is the obvious equivalence relation $\sim$ on those points of $\mathbb{R}^2$ which have a unique closest point in $X$. Define
$x \sim y$ if and only if they share the uniquely given closest point $v \in X$. Write $VR(v_i)$ for the equivalence class corresponding to $v_i$. Define:
VORONOI DIAGRAM
For a given set of points $X = \{v_1, \dots, v_n\}$ (not all collinear), the Voronoi regions are
$$VR(v_i) = \{x \in \mathbb{R}^2;\ \|x - v_i\| < \|x - v_k\| \text{ for all } v_k \in X,\ k \neq i\}.$$
This is an intersection of n — 1 open half-planes bounded by lines, so it is an open convex set. Its boundary is a convex polygon.
The Voronoi diagram VD(X) is the planar graph whose faces are the open regions VR(vi), while the boundaries of VR(vi) yield the edges and vertices.
Care is needed about collinearity, since if all the points $v_i$ are on the same line in $\mathbb{R}^2$, then their Voronoi regions are strips in the plane bounded by parallel lines. Under all other circumstances, the planar graph VD(X) from the latter definition is well defined and connected.
By definition, the vertices $p$ of VD(X) are the points in the plane such that at least three points $v, w, u \in X$ are at the same distance from $p$ and no more points of $X$ are inside the circle through $v, w, u$. If there are no more points of $X$ on the latter circle, then the degree of this vertex $p$ is 3.
The most degenerate situation occurs if all the points of $X$ are on one circle. Then, obviously, the Voronoi regions are delimited by two half-lines, all emanating from the center of the circle and cutting the angles properly. The construction of VD(X) is then equivalent to ordering the points by angles. At least $O(n \log n)$ time is needed for the worst-case estimate in any algorithm building Voronoi diagrams.
Some of the Voronoi regions are unbounded, others bounded. If just two points $v$ and $w$ are considered, then the axis of the segment $vw$ (its perpendicular bisector) is the boundary for the regions $VR(v)$ and $VR(w)$. In particular, the region $VR(v)$ must be bounded for each $v$ in the interior of the convex hull of $X$. On the contrary, consider an edge in CH(X) with incident vertices $v$ and $w$, and the "outer" half-axis of the segment $vw$. If one considers any interior point $u$ in CH(X), then it is in the other halfplane with respect to the segment $vw$. Sooner or later the points in the latter half-axis are closer to $v$ and $w$ than to $u$. It follows that both $VR(v)$ and $VR(w)$ are unbounded. Summarizing:
Lemma. Each Voronoi region $VR(v)$ of the Voronoi diagram VD(X) is an open convex polygonal region. It is unbounded if and only if $v$ lies on the boundary of the convex hull of $X$.
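The definition can be tested directly, without constructing the diagram. The following Python sketch decides by brute force which Voronoi region contains a given point and computes the center of the circle through three points of $X$, which is a candidate for a vertex of VD(X). The function names are assumptions of this illustration, and the linear scan over all sites is of course not the efficient search structure discussed in this section.

```python
def squared_distance(x, v):
    return (x[0] - v[0]) ** 2 + (x[1] - v[1]) ** 2


def voronoi_region_index(x, sites):
    """Index i with x in VR(v_i), or None if the closest site is not unique."""
    d = [squared_distance(x, v) for v in sites]
    best = min(d)
    winners = [i for i, di in enumerate(d) if di == best]
    return winners[0] if len(winners) == 1 else None


def circumcenter(a, b, c):
    """Center of the circle through three non-collinear points."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    sa, sb, sc = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return (ux, uy)
```

A point on an edge or at a vertex of the diagram has several closest sites, which is why the first function returns `None` there.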
13.3.7. An incremental algorithm. The general idea of how to update the DCEL of the given VD(X) if a new point $p$ should be added to $X$ is easy. First find which $VR(v)$ the point $p$ hits. Then, with $v$ being the center of this region, split the region $VR(v)$ by the relevant part of the axis of the segment $pv$. Add this new edge $e$ into the updated VD(X), simultaneously creating the two new faces and removing the old $VR(v)$. The new edge $e$ hits the boundary of the current $VR(v)$ in either two points or one point (if the new edge is unbounded). These hits show which region of the updated diagram is to be split next. "Walk" further with the new hit at the boundary playing the role of $p$ above. Ultimately this walk consecutively splits the visited old regions and creates the new directed cycle of edges bounding the new region, or it produces an unbounded path of boundary edges if the new region is unbounded. See the diagram for an illustration.
If the new point p is on the boundary, i.e. hitting one of the edges or vertices in VD(X), then the same algorithm works. Just start with one of the incident regions.
So far this looks easy, but how does one find the relevant region hit by the new point? An efficient structure to search for it during the run of the algorithm is desired. Build an acyclic directed graph $G$ for that purpose. The vertices in $G$ are all the temporary Voronoi regions as they were created in the individual incremental steps. Whenever a region is split by the above procedure, a new leaf in $G$ is created. Draw edges towards this leaf from all the earlier regions which have some nontrivial overlap with it. Of course, care must be taken of how the old regions overlap with the new ones, but this is not difficult. We illustrate the procedure on the diagram, updating from one point to four points.
Incremental Voronoi Diagram
Input: The set of points $X = \{v_1, \dots, v_n\}$ in the plane, not all collinear.
Output: The DCEL of VD(X) and the search graph G.
(1) Initialization. Consider the first two points $X_0 = \{v_1, v_2\}$ and create the DCEL for $VD(X_0)$ with two regions. Create the acyclic directed graph G (just root and two leaves).
(2) Main cycle. Repeat until there are no new points $z \in X$:
• localize the VR(y) hit by z (by the search in G)
• perform the path walk finding the boundary of the new region VR(z) in VD(X)
• update the DCEL for VD(X) and the acyclic directed search graph G.
This algorithm is easy to implement. It directly produces a search structure for finding the proper Voronoi regions of given points. Unfortunately, it is very far from optimal in both aspects: the worst-case running time is $O(n^2)$, and the worst-case depth of the acyclic search graph is $O(n)$. If this is treated as a randomized incremental algorithm, the expected values are better, but not optimal either. Below is a useful modification via triangulations.
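The localization step can be illustrated without any search structure: a point p hits the region VR(y) exactly when y is a nearest point of X. The following minimal sketch uses a plain linear scan, which is a stand-in and not the search graph G of the algorithm; the function name `locate_region` and the sample data are hypothetical.

```python
import math

def locate_region(sites, p):
    """Index of the site whose Voronoi region contains p (ties: lowest index)."""
    # p lies in VR(y) exactly when y is a nearest site, so a linear scan suffices.
    return min(range(len(sites)), key=lambda i: math.dist(sites[i], p))

# Hypothetical sample data: four sites at the corners of a square.
sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
hit = locate_region(sites, (3.0, 0.5))  # the nearest site is (4.0, 0.0)
```

Each query costs time linear in the number of sites; the graph G is meant to replace exactly this scan by a walk from the root to a leaf.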
13.3.8. Delaunay triangulation. One remarkable feature of the Voronoi diagrams should not remain unnoticed. Right after the definition of the Voronoi diagram, an important fact was mentioned. The vertices of the planar graph VD(X) are centers of circles passing through at least three points of X, and no other points of X are inside of the circle. If the dual graph to VD(X) (see 13.1.22 for the definition) is considered, then this is again a tessellation of the plane into convex regions. It is called the Delaunay tessellation DT(X). In the generic case, the degrees of all vertices in VD(X) are 3 (i.e. no four points of X lie on one circle), and DT(X) is then a triangulation, called the Delaunay triangulation.13 Notice that it is easy to turn any Delaunay tessellation into a triangulation by adding the necessary edges to triangulate the convex regions with
13 Although the name sounds French, Boris Nikolaevich Delone (1890 - 1980) was a Russian mathematician using the French transcription of his name in his publications. His name is associated with the triangulation because of his important work on this from 1934.
more than three edges. Any of these refined tessellations is called a Delaunay triangulation associated to VD(X).
In general, a planar graph T is called a triangulation of its vertices $X \subset \mathbb{R}^2$, $|X| = n$, if all its bounded faces have just 3 vertices. It is easy to see that each triangulation T has $\tau = 2n - 2 - k$ triangles and $\nu = 3n - 3 - k$ edges, where k is the number of vertices in CH(X). Indeed, by the Euler formula (13.1.20), $n - \nu + \tau + 1 = 2$ (there is an unbounded face on top of all the triangles). Now, every triangle has 3 edges, while there are k edges around the unbounded face. It follows that $3\tau + k = 2\nu$. It remains to solve the two linear equations for $\tau$ and $\nu$.
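The elimination left to the reader can be written out as follows, writing $\tau$ for the number of triangles, $\nu$ for the number of edges, n for the number of vertices and k for the number of vertices on the convex hull. The closing example of a square is an illustration added here, not taken from the text.

```latex
\begin{align*}
n - \nu + \tau + 1 = 2 \quad &\Longrightarrow \quad \nu = n + \tau - 1, \\
3\tau + k = 2\nu = 2n + 2\tau - 2 \quad &\Longrightarrow \quad \tau = 2n - 2 - k, \\
&\phantom{\Longrightarrow} \quad\ \nu = n + (2n - 2 - k) - 1 = 3n - 3 - k.
\end{align*}
% Example: four points in convex position (n = 4, k = 4)
% give \tau = 2 triangles and \nu = 5 edges (four sides and one diagonal).
```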
The triangulations are extremely useful in numerical mathematics and in computer graphics as the typical background mesh for processing of approximate values of functions. Of course, there are many triangulations on a given set and one of the qualitative requests is to aim at triangles as close to the equilateral triangles as possible. This could be phrased as the goal to maximize the minimal angles inside the triangles.
A practical way to do this is to write the angle vector of the triangulation
$$A(T) = (\gamma_1, \gamma_2, \dots, \gamma_{3\tau}),$$
where $\tau$ is the number of triangles and $\gamma_j$ are the angles of all the triangles in T sorted by their value, $\gamma_j \le \gamma_k$ for all $j < k$.
A triangulation T on X is said to be angle optimal, if $A(T) \ge A(T')$ for all triangulations $T'$ on the same set of vertices X, in the lexicographic ordering. In particular, an angle optimal triangulation achieves the maximum over the minimal angles of the triangles.
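To make the lexicographic comparison concrete, the following sketch computes the sorted angle vectors of the two possible triangulations of one quadrilateral. The coordinates and the helper names `triangle_angles` and `angle_vector` are assumptions made for this illustration only.

```python
import math

def triangle_angles(a, b, c):
    """Interior angles (in radians) of the triangle a, b, c."""
    def angle_at(p, q, r):
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding
    return [angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b)]

def angle_vector(triangles):
    """All angles of all triangles, sorted increasingly."""
    return sorted(g for t in triangles for g in triangle_angles(*t))

# A kite-shaped quadrilateral a, b, c, d (hypothetical sample data).
a, b, c, d = (0.0, 0.0), (2.0, -1.0), (6.0, 0.0), (2.0, 1.0)
long_diagonal = [(a, b, c), (a, c, d)]    # triangulation using the diagonal ac
short_diagonal = [(a, b, d), (b, c, d)]   # triangulation using the diagonal bd
better = angle_vector(short_diagonal) > angle_vector(long_diagonal)
```

Python compares lists lexicographically, so `>` on the two sorted lists is the ordering used in the definition. Here the short diagonal wins because its smallest angle is about 28 degrees, against about 14 degrees for the long one.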
Surprisingly, there is a very simple (though not very effective) procedure to produce (one of) the angle optimal triangulations. Consider any two adjacent triangles and check the six angle sequences of their interior angles. If the current position of the diagonal edge provides the worse sequence, flip it. See the diagram.
The flip is necessary if and only if one of the vertices outside the diagonal is inside the circle drawn through the remaining three vertices.
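The circle criterion is usually evaluated by the sign of a 3x3 determinant rather than by constructing the circle. The sketch below is one standard way to do so; the names `orientation` and `in_circle` are hypothetical, and with floating point input the sign may be unreliable for nearly cocircular points.

```python
def orientation(a, b, c):
    """Twice the signed area of the triangle a, b, c (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circle(a, b, c, d):
    """True if d lies strictly inside the circle through a, b, c."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    det = (a1 * (b2 * c3 - b3 * c2)
           - a2 * (b1 * c3 - b3 * c1)
           + a3 * (b1 * c2 - b2 * c1))
    # Multiplying by the orientation makes the test independent of the vertex order.
    return det * orientation(a, b, c) > 0

inside = in_circle((0, 0), (1, 0), (0, 1), (0.5, 0.5))
outside = in_circle((0, 0), (1, 0), (0, 1), (2, 2))
```

For two adjacent triangles abc and acd sharing the diagonal ac, the flip is called for when `in_circle(a, b, c, d)` holds. Four cocircular points, such as the corners of a square, give determinant zero and hence no flip.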
Since each such flipping of an edge inside of a triangulation T strictly increases the angle vector, and there are only finitely many triangulations of X, the following algorithm must stop and achieve an angle optimal triangulation:
Edge Flipping Delaunay
Input: Any triangulation T of points X in the plane.
Output: An angle optimal triangulation T of the same set X.
(1) Main cycle. Repeat until there are no edges to flip:
• Find an edge which should be flipped and flip it.
Theorem. A triangulation T on a set of points X in the plane R2 is angle optimal if and only if it is a Delaunay triangulation associated to the Voronoi diagram VD(X).
Proof. Consider any Delaunay triangulation T associated to VD(X) and one of the vertices p of VD(X). Let $v_1, \dots, v_k$ be all the points of X lying on the circle determining p. Fix an edge with two neighbouring endpoints on a circle. All triangles with the third vertex on the circle above the edge share the same angle. A simple check now verifies that different ways of triangulating the same region of VD(X) with more than 3 boundary edges lead always to the same angle vector.
In particular, there are no flips at all necessary in the above algorithm if one starts with the Delaunay triangulation T. Hence the angle optimal triangulation is arrived at.
In order to prove the other implication, recall the comments on the diagram above. All triangles in the angle optimal triangulations T have the following two properties: (1) the circle drawn through their three vertices does not include any other point in its interior; (2) the circle having any of their edges as diameter does not have any other point in its interior.
Consider the dual graph G of T and consider its realization in the plane by drawing the vertices as the centers of the circles drawn through the vertices of individual triangles, while the edges are the segments joining them. If there are not more than 3 points on any of those circles, then G = VD(X) is obtained. In the degenerate situations, all the triangles sharing the circle produce the same vertex in the plane and some of the relevant edges degenerate completely. Identify those collapsing elements in G, to get the right VD(X). □
13.3.9. Incremental Delaunay. We return to the general idea for the Voronoi diagram, namely to design an algorithm which constructs both VD(X) and DT(X) and which behaves very well in its randomized implementation.
The idea is straightforward. Use the incremental approach as with the Voronoi diagrams for refining the consecutive Delaunay triangulations, employing the flipping of edges method.
By looking at the diagram, the Voronoi algorithm is easily modified. Care must be taken of three different cases for the new points - hitting the unbounded face, hitting one of the internal triangles, hitting one of the edges.
Incremental Delaunay Triangulation
Input: The set of points $X = \{v_1, \dots, v_n\}$ in the plane, not all collinear.
Output: Both the DCEL of DT(X) and the search graph G for this triangulation.
(1) Initialization. Consider the first three points $X_0 = \{v_1, v_2, v_3\}$. Create the DCEL for $DT(X_0)$ with two regions, and create the $CH(X_0)$ (the connected edge list). Create the acyclic directed graph G (just root and two leaves).
(2) Main cycle. Repeat until there are no new points $z \in X$:
• localize the face $\Delta$ in $DT(X_k)$ hit by z (by the search in G)
• if z is in the unbounded face, then
- add the new triangles $\Delta_1, \dots, \Delta_\ell$ to $DT(X_k)$ by joining z to the visible edges in $CH(X_k)$
- update the $CH(X_k)$.
• if z hits a (bounded) triangle $\Delta$, then split it into the three new triangles $\Delta_1, \Delta_2, \Delta_3$.
• if z hits an edge e, then split the adjacent bounded triangles into $\Delta_1, \dots, \Delta_4$ (only two, if an edge in $CH(X_k)$ is hit).
• create a queue Q of not yet checked modified triangles and repeat as long as Q is not empty:
- take the first $\Delta$ from the queue Q, look for its neighbour not in Q and not yet checked, and flip the edge if necessary;
- if an edge is flipped, put the newly modified triangles into Q.
Detailed analysis of the algorithm is omitted. It is almost obvious that the algorithm is correct. It is only necessary to prove that the proposed version of the edge flipping update ensures that after each addition of the new point z in the k-th step, the correct Delaunay triangulation of $X_k$ arises. Once an edge is flipped, it is not necessary to consider it any time later.
Finally, if the Voronoi diagram is needed instead, it can be obtained from the DT(X) in linear time. Obviously the search structures can be used directly.
Surprisingly enough, it turns out that the expected number of total flips necessary over all the run is of size
$O(n \log n)$. Hence the algorithm achieves the optimal expected time $O(n \log n)$. A detailed analysis of this beautiful example of results in computational geometry can be found in section 9.4 of the book by Berg et al., cf. the link on page 925.
13.3.10. The beach line Voronoi. The Voronoi algorithm provides a perfect example for the sweep line paradigm, where the extra structure to keep track of the events has to be quite smart.14
Imagine the horizontal line L (parallel to the x axis) flows in the top-down direction and meets the points in $X = \{v_1, \dots, v_n\}$. Of course, the part of VD(X) above the current position of L cannot be drawn completely, since it depends also on the points below the line. It is better to look at the part $R_L$ in the plane,
$$R_L = \{p \in \mathbb{R}^2;\ \operatorname{dist}(p, L) \ge \operatorname{dist}(p, v_i),\ v_i \in X \text{ above } L\}.$$
This is exactly the part of $\mathbb{R}^2$ which can be tessellated into the Voronoi diagram with the information collected at the current position of L. Obviously, $R_L$ is bounded by a continuous curve consisting of parts of parabolas, since for one point $v_i$ this is the case, and the intersection of the conditions is relevant.
Call the boundary of $R_L$ the beach line $B_L$. The vertices on $B_L$ draw the VD(X) when L is moving. Since the Voronoi diagram consists of line segments, we need not even compute the parabolas; it suffices to take care of the arrangement of the still active parts of parabolas in the beach line, as determined by the individual points. New parts of the beach line arise when the line L meets one of the points. Add all the points to an ordered list in the obvious lexicographic order. Call them the point events. The active arc in the beach line disappears if the line L meets the bottom of the circle drawn through three points determining a vertex in the Voronoi diagram. Such an event is called a circle event.
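Although the algorithm never needs the parabolas explicitly, it may help to see them once. The sketch below assumes the sweep line is the horizontal line y = ell and that a site (a, b) lies above it; the points equidistant from the site and the line then form the parabola computed by `parabola_y`, and the beach line is the lower envelope of these parabolas. Both function names are hypothetical.

```python
import math

def parabola_y(site, ell, x):
    """Height at x of the parabola of points equidistant from site and the line y = ell."""
    a, b = site
    # From (x - a)^2 + (y - b)^2 = (y - ell)^2, solved for y (requires b > ell).
    return ((x - a) ** 2 + b * b - ell * ell) / (2.0 * (b - ell))

def beach_line_y(sites, ell, x):
    """Height of the beach line at x: the lowest parabola over the sites above the line."""
    return min(parabola_y(s, ell, x) for s in sites if s[1] > ell)

y = parabola_y((0.0, 2.0), 0.0, 3.0)
# The site below the sweep line (third entry) does not contribute yet.
front = beach_line_y([(0.0, 2.0), (4.0, 3.0), (1.0, -1.0)], 0.0, 0.0)
```

At x = 3 the point (3, 3.25) has distance 3.25 both from the site (0, 2) and from the line y = 0, as the definition requires.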
Both types of the events are illustrated in the diagram above. There is a striking difference between them.
The point events always initiate a new arc and start "drawing" two edges of the Voronoi diagram. They initiate previously unknown circle events.
14 This is mostly called the Fortune Voronoi algorithm. Not because this is such a lucky construction, but rather because the algorithm was published by Steven Fortune of the Bell Laboratories in 1986.
The circle events might disappear without creating a genuine vertex in VD(X). Look at the diagram at the s point event. The new s,r,q circle event is encountered there. But this would not create a vertex in the diagram if there was the next point u somewhere close enough to the indicated vertex. One could find it out as soon as such a point event u is met. Such "ineffectively disappearing" circle events are called false alarms. On the contrary the p, q, r circle event shown in the diagram gives rise to the indicated vertex.
Summarizing, the emerged circle events must be inserted properly into the ordered queue of events and handled properly at each of the point events.
Further details are not considered here. When implemented properly, this algorithm runs in the optimal $O(n \log n)$ time and $O(n)$ storage. See again the above mentioned book by Berg et al. (section 7) for details.
13.3.11. Geometric transformations. Various geometric transformations of the basic ambient space can often help to transform one problem into another one. This is illustrated in a beautiful construction relating the convex hulls and the Voronoi diagrams. Of course transformations which behave well on lines and planes and preserve incidences should be well thought of. The affine and projective transformations behave well in this respect, as seen in the fourth chapter. Introduce a more interesting one, the spherical inversion. In the plane $\mathbb{R}^2$, consider the unit circle $x^2 + y^2 = 1$. For arbitrary $v = (x, y) \ne (0,0)$ define
$$\varphi(v) = \frac{1}{\|v\|^2}\, v.$$
Clearly $\varphi$ is a bijection on $\mathbb{R}^2 \setminus \{(0,0)\}$. The geometric meaning of such a transform is clear from the formula, see the diagram. A general point v is sent to a point on the same line through the origin, but with reciprocal size. The unit circle is the set of all fixed points.
The same principle works for all dimensions, so we may equally well (and more interestingly) consider $v \in \mathbb{R}^3$ in the sequel.
Next follows the crucial property of $\varphi$.
Lemma. The mapping $\varphi$ maps the spheres and planes in $\mathbb{R}^3$ onto spheres and planes. The image of a sphere is a plane if and only if the sphere contains the origin.
Proof. Consider a sphere C with the center c and radius r. The equation for its general points p reads
$$\|p - c\|^2 = r^2.$$
By drawing a few images as in the diagram above, it is easily guessed that the image will be a sphere with the center $s = \frac{1}{\|c\|^2 - r^2}\, c$ (i.e. again on the same line through the origin). Now consider $q = \varphi(p)$ and compute (using just $2\, p \cdot c = \|p\|^2 + \|c\|^2 - r^2$ from the latter equation)
$$\begin{aligned}
\|q - s\|^2 &= \Bigl\| \frac{p}{\|p\|^2} - \frac{c}{\|c\|^2 - r^2} \Bigr\|^2 \\
&= \frac{1}{\|p\|^2} + \frac{\|c\|^2}{(\|c\|^2 - r^2)^2} - \frac{2\, p \cdot c}{(\|c\|^2 - r^2)\,\|p\|^2} \\
&= \frac{\|c\|^2}{(\|c\|^2 - r^2)^2} - \frac{1}{\|c\|^2 - r^2} = \Bigl(\frac{r}{\|c\|^2 - r^2}\Bigr)^2.
\end{aligned}$$
The latter computation assumes $\|c\| \ne r$. Fix the center c and let the radius r approach $\|c\|$ from below or above. Then the image is a sphere with its center s on the fixed line through the origin and c, and with a fast growing radius. In the limit position, the plane is obtained as requested. (Check the computation directly yourself, if in any doubt.) □
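The outcome of the proof can be checked numerically: for a sphere with center c and radius r not passing through the origin, the inversion v ↦ v/‖v‖² should send every point of the sphere to distance r/|‖c‖² − r²| from the point c/(‖c‖² − r²). The sample sphere and the helper name `invert` below are assumptions chosen for this sketch.

```python
import math

def invert(v):
    """Spherical inversion v -> v / |v|^2 (v must not be the origin)."""
    n2 = sum(x * x for x in v)
    return tuple(x / n2 for x in v)

c, r = (2.0, 1.0, 0.0), 1.0               # hypothetical sample sphere
d = sum(x * x for x in c) - r * r         # |c|^2 - r^2 = 4
s = tuple(x / d for x in c)               # predicted center of the image
expected_radius = r / abs(d)              # predicted radius of the image

# A few unit directions; c + r*u are then points of the original sphere.
directions = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0),
              (0, 0, 1), (0, 0, -1), (0.6, 0.0, 0.8)]
radii = [math.dist(invert(tuple(ci + r * ui for ci, ui in zip(c, u))), s)
         for u in directions]
```

All entries of `radii` agree with `expected_radius` up to rounding, which is what the computation in the proof predicts.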
The continuity of $\varphi$ has important consequences. Consider a general plane $\rho$ (not containing the origin). The inversion $\varphi$ maps one of the half-spaces determined by $\rho$ to the interior of the image sphere. The other half-space maps to the unbounded complement of the sphere. The latter is of course the half-space containing the origin.
The efficient link between the Voronoi diagrams and convex hulls can now be explained. Assume a set of points $X = \{v_1, \dots, v_n\}$ in the plane is given. View them as the points in the plane z = 1 in $\mathbb{R}^3$, i.e. add the same coordinate z = 1 to all the points (x, y) in X. For simplicity, assume that no three of them are collinear and no four of them lie on the same circle.
The spherical inversion $\varphi$ maps the entire plane z = 1 to the sphere S with center c = (0, 0, 1/2) and radius 1/2. Write $w_1, \dots, w_n$ for the images $w_i = \varphi(v_i)$.
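The lifting step is easy to try out. The sketch below places a few arbitrary sample points in the plane z = 1, applies the inversion v ↦ v/‖v‖², and measures the distance of each image from (0, 0, 1/2); the data and the name `invert` are assumptions of this illustration.

```python
import math

def invert(v):
    """Spherical inversion v -> v / |v|^2 (v must not be the origin)."""
    n2 = sum(x * x for x in v)
    return tuple(x / n2 for x in v)

X = [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5), (10.0, -7.0)]   # sample planar points
Y = [invert((x, y, 1.0)) for x, y in X]                    # lifted to z = 1, then inverted
center = (0.0, 0.0, 0.5)
dists = [math.dist(w, center) for w in Y]
```

Every value in `dists` equals 1/2 up to rounding, so the images indeed lie on the sphere S. The point (0, 0, 1) above the origin of the plane is fixed by the inversion.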
Now, consider CH(Y) for the set of the images Y = {w1,... ,wn}. This is a convex polytope with all vertices on the sphere S. All its faces represent planes not containing the origin (this is due to the assumption that no three points of X are collinear).
Split the faces of CH(Y) into those "visible" from the origin and the "invisible" ones. In the latter case, all points of Y are on the same side of the plane $\rho$ generated by the face as the origin. This implies that all the other points of X are outside of the image sphere $S_\rho = \varphi(\rho)$. In particular, there are no points of X inside of the intersection of $S_\rho$ with the plane z = 1. This is the defining condition for obtaining one of the vertices of the Voronoi diagram. Since the map $\varphi$ preserves incidences, the entire DCEL for VD(X) is easily reconstructed from the DCEL of CH(Y) and vice versa.
This resembles the construction of the dual graph, i.e. the Delaunay triangulation DT(X) from the Voronoi diagram, with further geometric transformation in the back.
Last, but not least, the faces of CH(Y) visible from the origin are worth mentioning too. For the same reason as above, all the points of Y appear on the other side from the origin and so, all the points in X are inside the image sphere. This means the diagram of furthest points instead of the Voronoi diagram of the closest ones is obtained.
This is a very useful tool in several areas of mathematics, see some of the exercises (??) for further illustration.
4. Remarks on more advanced combinatorial calculations
13.4.1. Generating functions. The worlds of discrete and continuous mathematics meet all the time. There are already many instances of useful interactions. With some slight exaggeration, we can claim that all results in analysis were achieved by an appropriate reduction of the continuous tasks to some combinatorial problem (for instance, integration of rational functions is reduced to partial fraction decomposition). In the opposite direction, we demonstrate how handy continuous methods can be.
We begin with a simple combinatorial question: There are four 1-crown coins, five 2-crown coins, and three 5-crown coins at our disposal. Suppose we want to buy a bottle of coke which costs 22 crowns. In how many ways can we pay the exact amount of money with the given coins?
We are looking for integers i, j, k such that i + j + k = 22 and
$$i \in \{0,1,2,3,4\}, \quad j \in \{0,2,4,6,8,10\}, \quad k \in \{0,5,10,15\}.$$
Consider the product of polynomials (over the real numbers, for instance)
$$(x^0 + x^1 + x^2 + x^3 + x^4)(x^0 + x^2 + x^4 + x^6 + x^8 + x^{10})(x^0 + x^5 + x^{10} + x^{15}).$$
It should be clear that the number of solutions equals the coefficient at $x^{22}$ in the resulting polynomial. This corresponds to the four possibilities of choosing the values i, j, k: 3·5 + 3·2 + 1·1, 3·5 + 2·2 + 3·1, 2·5 + 5·2 + 2·1, and 2·5 + 4·2 + 4·1.
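The coefficient can also be obtained mechanically. In the sketch below a polynomial is stored as the list of its coefficients (index = exponent); the helper names `poly_mul` and `coin_poly` are hypothetical.

```python
def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    result = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            result[i + j] += a * b
    return result

def coin_poly(value, count):
    """Polynomial 1 + x^value + x^(2*value) + ... + x^(count*value)."""
    p = [0] * (value * count + 1)
    for m in range(count + 1):
        p[m * value] = 1
    return p

# Four 1-crown coins, five 2-crown coins, three 5-crown coins.
product = poly_mul(poly_mul(coin_poly(1, 4), coin_poly(2, 5)), coin_poly(5, 3))
ways = product[22]   # coefficient at x^22
```

The value of `ways` is 4, in agreement with the four payments listed above. The other entries of `product` answer the same question for every other price at once.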
This simple example deserves more attention.
The coefficients of the particular polynomials represent sequences of numbers, referring to how many times we can achieve the given value with one type of coins only. Work with an infinite sequence to avoid an a priori bound on how many available values there can be. Encode the possibilities in infinite sequences
(1,1,1,1,1,0,0,...) 1-crowns
(1, 0,1,0,1,0,1,0,1,0,1, 0,0,...) 2-crowns
(1, 0,0,0,0,1, 0,0,0,0,1, 0,0,0,0,1, 0,0,...) 5-crowns.
Each such sequence with only finitely many non-zero terms can be assigned a polynomial. The solution of the problem is given by the product of these polynomials, as noted before.
This is an instance of a general procedure for handling sequences effectively.
Generating function of a sequence
Definition. An (ordinary) generating function for an infinite sequence $a = (a_0, a_1, a_2, \dots)$ is a (formal) power series
$$a(x) = a_0 + a_1 x + a_2 x^2 + \dots = \sum_{i=0}^{\infty} a_i x^i.$$
The values $a_i$ are considered in some fixed field K, normally the rational numbers, real numbers, or complex numbers.
In practice, there are several standard ways for defining and using generating functions:
- to find an explicit formula for the n-th term of a sequence;
- to derive new recurrent relations between values (although generating functions are often based on recurrent formulae themselves);
- for calculation of means or other statistical dependencies (for instance, the average time complexity of an algorithm);
- to prove miscellaneous combinatorial identities;
- to find an approximate formula or the asymptotic behaviour when the exact formula is too hard to get.
Illustrations of some of these follow.
13.4.2. Operations with generating functions. Several basic operations with sequences correspond to simple operations over power series (which can be easily proved by performing the relevant operation with the power series):
• Componentwise, the sum $(a_i + b_i)$ of the sequences corresponds to the sum a(x) + b(x) of the generating functions.
• Multiplication $(\alpha \cdot a_i)$ of all terms by a given scalar $\alpha$ corresponds to the same multiplication $\alpha \cdot a(x)$ of the generating function.
• Multiplication of the generating function a(x) by a monomial $x^k$ corresponds to shifting the sequence k places to the right and filling the first k places with zeros.
• In order to shift the sequence k places to the left (i.e. omit the first k terms), subtract the polynomial $b_k(x)$ corresponding to the sequence $(a_0, \dots, a_{k-1}, 0, \dots)$ from a(x), and then divide the generating function by the expression $x^k$.
• Substitution of a polynomial f(x) for x leads to a specific combination of the terms of the original sequence. They can be expressed easily for $f(x) = \alpha x$, which corresponds to multiplication of the k-th term of the sequence by the scalar value $\alpha^k$. The substitution $f(x) = x^n$ inserts n - 1 zeros between each pair of adjacent terms.
The first and second rules express the fact that the assignment of the generating function to a sequence is an isomorphism of the two vector spaces (over the field in question).
There are other important operations which often appear when working with generating functions:
• Differentiation with respect to x: the function a'(x) generates the sequence $(a_1, 2a_2, 3a_3, \dots)$; the term at index k is $(k+1)a_{k+1}$ (i.e. the power series is differentiated term by term).
• Integration: the function $\int_0^x a(t)\,dt$ generates the sequence $(0, a_0, \frac12 a_1, \frac13 a_2, \frac14 a_3, \dots)$; for $k \ge 1$, the term at index k is equal to $\frac1k a_{k-1}$ (clearly, differentiation of the corresponding power series term by term leads to the original function a(x)).
• Product of power series: the product a(x)b(x) is the generating function for the sequence $(c_0, c_1, c_2, \dots)$, where
$$c_k = \sum_{i+j=k} a_i b_j,$$
i.e. the terms of the product up to $c_k$ are the same as in the product $(a_0 + a_1 x + a_2 x^2 + \dots + a_k x^k)(b_0 + b_1 x + b_2 x^2 + \dots + b_k x^k)$. The sequence $(c_n)$ is also called the convolution of the sequences $(a_n)$, $(b_n)$.
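On truncated sequences these operations become one-liners. The sketch below works with the first few coefficients only, so every result is valid just up to the truncation order; the function names are hypothetical.

```python
def shift_right(a, k):
    """Coefficients of x^k * a(x), truncated to the original length."""
    return [0] * k + a[:len(a) - k]

def derivative(a):
    """Coefficients of a'(x): the term at index k is (k + 1) * a[k + 1]."""
    return [(k + 1) * a[k + 1] for k in range(len(a) - 1)]

def convolve(a, b):
    """Coefficients c_k = sum of a_i * b_j over i + j = k, for k below the shorter length."""
    n = min(len(a), len(b))
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(n)]

ones = [1] * 6                      # the constant sequence (1, 1, 1, ...)
squared = convolve(ones, ones)      # convolution of the constant sequence with itself
```

Here `squared` equals `[1, 2, 3, 4, 5, 6]`, the partial sums of the constant sequence.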
13.4.3. More links to continuous analysis. There are useful examples of generating functions. Most of them are seen when working with power series in the third part of chapter six.
Perhaps the reader recognizes the generating function given by the geometric series:
$$a(x) = \frac{1}{1 - x} = 1 + x + x^2 + \dots,$$
which corresponds to the constant sequence (1, 1, 1, ...). From the sixth chapter, this power series converges for $x \in (-1, 1)$ and equals 1/(1 - x).
It works the other way round as well: Expand this function into its Taylor series at the point 0. The original series is obtained. This "encoding" of a sequence into a function and then decoding it back is the key idea in both the theory and practice of generating functions.
Generally, consider any sequence $a_i$ with $\sqrt[i]{|a_i|}$ bounded. Then there is a neighbourhood of zero on which its generating function converges (see i on page 341). For example, an easy check shows that this happens whenever $|a_n| = O(n^k)$ with a constant exponent $k \ge 0$. On this neighbourhood, the generating functions can be worked with as with ordinary functions. In particular, one can add, multiply, compose, differentiate, and integrate them. All the equalities obtained carry over to the relevant sequences.
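Recall several very useful basic power series and their sums:
$$\frac{1}{1-x} = \sum_{n \ge 0} x^n, \qquad \ln(1+x) = \sum_{n \ge 1} (-1)^{n+1} \frac{x^n}{n}, \qquad \ln\frac{1}{1-x} = \sum_{n \ge 1} \frac{x^n}{n},$$
$$e^x = \sum_{n \ge 0} \frac{x^n}{n!}, \qquad \sin x = \sum_{n \ge 0} (-1)^n \frac{x^{2n+1}}{(2n+1)!}, \qquad \cos x = \sum_{n \ge 0} (-1)^n \frac{x^{2n}}{(2n)!}.$$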
13.4.4. Binomial theorem. Recall the standard finite binomial formula
$$(a + b)^r = a^r (1 + c)^r = a^r \sum_{n=0}^{r} \binom{r}{n} c^n,$$
where $r \in \mathbb{N}$, $0 \ne a$, $b \in \mathbb{C}$, and $c = b/a$. Even if the power r is not a natural number, the Taylor series of $(1 + x)^r$ can still be computed. This yields the following generalization:
Generalized binomial theorem
Theorem. For any $r \in \mathbb{R}$, $k \in \mathbb{N}$, write
$$\binom{r}{k} = \frac{r(r-1)(r-2)\cdots(r-k+1)}{k!}$$
(in particular $\binom{r}{0} = 1$, having empty product divided by 1 in the latter formula). The power series expansion
$$(1 + x)^r = \sum_{k \ge 0} \binom{r}{k} x^k$$
converges on a neighbourhood of zero, for each $r \in \mathbb{R}$. The latter formula is called the generalized binomial theorem.
In particular, the function $\frac{1}{(1-x)^n}$, $n \in \mathbb{N}$, can be expanded into the series
$$\frac{1}{(1-x)^n} = \binom{n-1}{n-1} + \binom{1+n-1}{n-1} x + \dots + \binom{k+n-1}{n-1} x^k + \dots.$$
Proof. The theorem is obvious if $r \in \mathbb{N}$ since it is then the finite binomial formula. So assume r is not a natural number and thus zero is never obtained when evaluating $\binom{r}{k}$.
First, differentiate the function $a(x) = (1 + x)^r$ repeatedly and evaluate all the derivatives at x = 0. Obviously
$$a^{(k)}(0) = r(r-1)\cdots(r-k+1)\,(1+x)^{r-k}\big|_{x=0} = r(r-1)\cdots(r-k+1),$$
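which provides the coefficients $a_k = \frac{a^{(k)}(0)}{k!} = \binom{r}{k}$ of the series. In 5.4.5, there are several simple tests to decide about convergence of a number series. The ratio test helps here:
$$\left| \frac{a_{k+1} x^{k+1}}{a_k x^k} \right| = \left| \frac{r(r-1)\cdots(r-k)}{(k+1)!} \cdot \frac{k!}{r(r-1)\cdots(r-k+1)} \right| |x| = \left| \frac{r-k}{k+1} \right| |x| \xrightarrow{\;k \to \infty\;} |x|.$$
By the ratio test, the radius of convergence is 1 for all $r \notin \mathbb{N}$.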
The generalized binomial formula for negative integers is a straightforward consequence. Substituting —x for the argument just kills the signs appearing in the generalized binomial coefficients. □
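The generalized binomial coefficients are easy to evaluate exactly with rational arithmetic. The sketch below implements the defining product and checks the sign rule for negative integers on one case; the function name `gen_binom` is hypothetical.

```python
from fractions import Fraction
from math import comb

def gen_binom(r, k):
    """Generalized binomial coefficient r(r-1)...(r-k+1)/k! as an exact fraction."""
    result = Fraction(1)          # empty product for k = 0
    for i in range(k):
        result *= Fraction(r) - i
        result /= i + 1
    return result

half_over_two = gen_binom(Fraction(1, 2), 2)            # (1/2)(-1/2)/2
# Coefficients of 1/(1 - x)^3 = (1 + (-x))^(-3): the signs cancel.
cube = [gen_binom(-3, k) * (-1) ** k for k in range(6)]
```

The list `cube` agrees with the ordinary binomial coefficients `comb(k + 2, 2)`, which is the case n = 3 of the expansion of 1/(1 - x)^n in the theorem.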
13.4.5. Examples. The formulae with r as a negative integer are very useful in practice. The simplest one is the geometric series with r = -1. Write down two more of them.
$$\frac{1}{(1-x)^2} = \sum_{n \ge 0} (n+1)\, x^n, \qquad \frac{1}{(1-x)^3} = \sum_{n \ge 0} \binom{n+2}{2} x^n.$$
The same results can be obtained by consecutive convolutions. Indeed, for the generating function a(x) of a sequence $(a_0, a_1, a_2, \dots)$, $\frac{1}{1-x}\, a(x)$ is the generating function for the sequence of all the partial sums $(a_0, a_0 + a_1, a_0 + a_1 + a_2, \dots)$. For instance,
$$\frac{1}{1-x} \ln \frac{1}{1-x}$$
is the generating function of the harmonic numbers
$$H_n = 1 + \frac{1}{2} + \dots + \frac{1}{n}.$$
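Multiplication by 1/(1 - x) is just the formation of running totals, which the standard library provides as `itertools.accumulate`. The sketch below applies it to the constant sequence and to the coefficients 0, 1, 1/2, 1/3, ... of ln(1/(1 - x)); the truncation length N is an arbitrary choice.

```python
from fractions import Fraction
from itertools import accumulate

N = 6
ones = [1] * N
once = list(accumulate(ones))    # partial sums of (1, 1, 1, ...)
twice = list(accumulate(once))   # partial sums taken a second time

# Coefficients of ln(1/(1 - x)): 0, 1, 1/2, 1/3, ...
log_coeffs = [Fraction(0)] + [Fraction(1, n) for n in range(1, N)]
harmonic = list(accumulate(log_coeffs))   # H_0 = 0, H_1, H_2, ...
```

`once` and `twice` reproduce the coefficients n + 1 and (n + 2)(n + 1)/2 of 1/(1 - x)^2 and 1/(1 - x)^3, and `harmonic` lists the harmonic numbers as exact fractions.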
13.4.6. Difference equations. Typically, the generating functions can be very useful if the sequences are defined by relations between their terms. An instructive example of such an application is the complete discussion of the solutions of linear difference equations with constant coefficients. This is examined in the second part of chapter one, see 1.2.4. Back there, a formula is derived for first-order equations, and the uniqueness and existence of the solution is justified, after only "guessing" the solution. Now, it can be truly derived.
First, sort out the well-known example of the Fibonacci sequence, given by the recurrence
$$F_{n+2} = F_n + F_{n+1}, \quad F_0 = 0, \quad F_1 = 1,$$
and write F(x) for the (yet unknown) generating function of this sequence. We want to compute F(x) and so obtain an explicit expression for the nth Fibonacci number.
The defining equality can be expressed in terms of F(x) if we use our operations for shifting the terms of the sequence. Indeed, xF(x) corresponds to the sequence $(0, F_0, F_1, F_2, \dots)$, and $x^2 F(x)$ to $(0, 0, F_0, F_1, \dots)$. Therefore, the generating function
G(x) = F(x) -xF(x) - x2F(x)
represents the sequence
$$(F_0, F_1 - F_0, 0, 0, \dots, 0, \dots).$$
Substitute in the values F0 = 0, F1 = 1 (the initial condition). Obviously G(x) = x and hence
(1 - x - x2)F(x) = x.
F(x) is a rational function. It can be rewritten as a linear combination of simple rational functions. This is helpful, since a linear combination of generating functions corresponds to the same combination of the sequences.
Rational functions can be decomposed into partial fractions, see 6.2.7. Using this procedure, we obtain the power series for $x/(1 - x - x^2)$. Namely, write
$$F(x) = \frac{x}{1 - x - x^2} = \frac{A}{x - x_1} + \frac{B}{x - x_2} = \frac{a}{1 - \lambda_1 x} + \frac{b}{1 - \lambda_2 x},$$
where A, B are suitable (generally) complex constants, and $x_1$, $x_2$ are the roots of the polynomial in the denominator. The ultimate constants a, b, $\lambda_1$, and $\lambda_2$ can be obtained by a simple rearrangement of the particular fractions. This leads to the general solution for the generating function
$$F(x) = \sum_{n=0}^{\infty} (a \lambda_1^n + b \lambda_2^n)\, x^n,$$
and so the general solution of the recurrence is known as well.
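In the present case, the roots of the quadratic polynomial are $x_{1,2} = \frac{-1 \pm \sqrt{5}}{2}$. Hence $\lambda_{1,2} = \frac{1 \pm \sqrt{5}}{2}$. The partial fraction decomposition equality gives
$$x = a \Bigl(1 - \frac{1 - \sqrt{5}}{2}\, x\Bigr) + b \Bigl(1 - \frac{1 + \sqrt{5}}{2}\, x\Bigr),$$
and so $a = -b = \frac{1}{\sqrt{5}}$. Finally the requested solution
$$F_n = \frac{1}{\sqrt{5}} \Bigl(\frac{1 + \sqrt{5}}{2}\Bigr)^n - \frac{1}{\sqrt{5}} \Bigl(\frac{1 - \sqrt{5}}{2}\Bigr)^n$$
is obtained. Compare this procedure to the approach in 3.2.2 and 3.B.1. This expression, full of irrational numbers, is an integer. The base of the power in the second summand is $(1 - \sqrt{5})/2 \approx -0.618$, so the value of this summand is negligible for large n. Hence $F_n$ can be computed by evaluating just the first summand and rounding to the nearest integer.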
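The rounding recipe is easy to test against the recurrence. The sketch below uses double precision floating point, so it is only trustworthy for moderate n (the check stops at n = 40); the function names are hypothetical.

```python
from math import sqrt

def fib_closed(n):
    """First summand of the closed formula, rounded to the nearest integer."""
    golden = (1 + sqrt(5)) / 2
    return round(golden ** n / sqrt(5))

def fib_recurrence(count):
    """The first `count` Fibonacci numbers from F_0 = 0, F_1 = 1."""
    values = [0, 1]
    while len(values) < count:
        values.append(values[-1] + values[-2])
    return values[:count]

agree = fib_recurrence(41) == [fib_closed(n) for n in range(41)]
```

For much larger n the floating point error eventually exceeds 1/2 and the rounding fails; exact integer arithmetic on the recurrence is then the safer choice.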
Of course, the same procedure can be applied for general k-th order homogeneous linear difference equations. Consider the recurrence
$$F_{n+k} = \alpha_0 F_n + \dots + \alpha_{k-1} F_{n+k-1}.$$
The generating function for the resulting sequence is
$$F(x) = \frac{q(x)}{1 - \alpha_0 x^k - \dots - \alpha_{k-1} x},$$
where the polynomial q(x) of degree at most k - 1 is determined by the chosen initial conditions.
Using partial fraction decomposition, the general result follows as in subsection 3.2.4.
13.4.7. The general method. Power series are a much stronger tool for solving recurrences. The point is that one is not restricted to linearity and homogeneity. Using the following general approach, recurrences that seem intractable at first sight can quite often be managed. The first steps are just algorithmic, while the final solution of the equation on the generating function may need very diverse approaches.
In order to be able to write down the necessary equations efficiently, adopt the convention of the logical predicate [S(n)] which is attached before the expression it should govern. Simply multiply by the coefficient 1 if S (n) is true, and by zero otherwise. For instance, the equation
$$F_n = F_{n-2} + F_{n-1} + [n = 1]\,1 + [n = 0]\,1$$
defines the above Fibonacci recurrence with initial conditions
$F_0 = 1$ and $F_1 = 2$.
Method to resolve recurrences
Recurrent definitions of sequences $(a_0, a_1, \dots)$ may be solved in the following 4 steps:
(1) Write the complete dependence between the terms in the sequence as a single equation expressing $a_n$ in terms of terms with smaller indices. This universal formula must hold for all $n \in \mathbb{N}$ (supposing $a_{-1} = a_{-2} = \dots = 0$).
(2) Multiply both sides of the equation by $x^n$. Then sum the resulting expressions over all $n \in \mathbb{N}$. One of the summands is $\sum_{n \ge 0} a_n x^n$, which is the generating function A(x) for the sequence. Rearrange the other summands so that they contain only the terms A(x) and some other polynomial expressions.
(3) Solve the resulting equation with respect to A(x) explicitly.
(4) Expand the function A(x) into the power series. Its coefficients at $x^n$ are the requested values of $a_n$.
As an example, consider a second order linear difference equation with constant coefficients, but with a non-zero right hand side depending on n (i.e. a non-homogeneous equation).
The recurrence is $a_n = 5a_{n-1} - 6a_{n-2} - n$ with initial conditions $a_0 = 0$, $a_1 = 1$. The individual steps in the latter procedure are as follows:
Step 1. The universal equation is clear, up to the initial conditions. First check n = 0, which yields no extra term, but then n = 1 enforces the extra value 2 to be added. Hence,
$$a_n = 5a_{n-1} - 6a_{n-2} - n + [n = 1]\,2.$$
Step 2.
$$\sum_{n \ge 0} a_n x^n = 5x \sum_{n \ge 0} a_{n-1} x^{n-1} - 6x^2 \sum_{n \ge 0} a_{n-2} x^{n-2} - \sum_{n \ge 0} n x^n + 2x.$$
Next, one of the terms is nearly the power series for $(1-x)^{-2}$. Thus factor out one $x$ there, since $\sum_{n\ge0} nx^n = x(1-x)^{-2}$, in order to get the equality on $A(x)$ in the form as required (ignore the negative values of indices since all $a_{-1}, a_{-2}, \dots$ vanish by assumption):
$$A(x) = 5xA(x) - 6x^2A(x) - \frac{x}{(1-x)^2} + 2x.$$
Step 3. Factor the polynomial $1 - 5x + 6x^2 = (1-2x)(1-3x)$; the numbers 2 and 3 are the roots of the characteristic polynomial $\lambda^2 - 5\lambda + 6$ of the recurrence. An elementary calculation yields
$$A(x) = \frac{2x^3 - 4x^2 + x}{(1-2x)(1-3x)(1-x)^2}.$$
Step 4. Partial fraction decomposition directly leads to the result
$$A(x) = -\frac14\cdot\frac{1}{1-3x} + \frac{2}{1-2x} - \frac12\cdot\frac{1}{(1-x)^2} - \frac54\cdot\frac{1}{1-x}.$$
This corresponds to the solution
$$a_n = -\frac14\,3^n + 2^{n+1} - \frac{n}{2} - \frac74.$$
The first eight terms in the sequence are 0, 1, 3, 6, 8, -1, -59, -296.
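The worked example can be checked numerically. The following is a minimal sketch in Python; the function names `a_recurrence` and `a_closed` are illustrative and not part of the text. It iterates the universal equation from Step 1 and compares the result with the closed form $a_n = -\frac14 3^n + 2^{n+1} - \frac n2 - \frac74$.

```python
from fractions import Fraction

def a_recurrence(count):
    """First `count` terms of a_n = 5a_{n-1} - 6a_{n-2} - n + [n = 1]*2,
    with the convention a_{-1} = a_{-2} = 0."""
    terms = []
    for n in range(count):
        prev1 = terms[n - 1] if n >= 1 else 0
        prev2 = terms[n - 2] if n >= 2 else 0
        terms.append(5 * prev1 - 6 * prev2 - n + (2 if n == 1 else 0))
    return terms

def a_closed(n):
    """Closed form read off from the partial fraction decomposition."""
    return -Fraction(3 ** n, 4) + 2 ** (n + 1) - Fraction(n, 2) - Fraction(7, 4)

print(a_recurrence(8))  # [0, 1, 3, 6, 8, -1, -59, -296]
```

Exact rational arithmetic (`Fraction`) is used so that the comparison with the closed form involves no rounding.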
13.4.8. Plane binary trees and Catalan numbers. The next application of the generating functions answers the question about the number $b_n$ of non-isomorphic plane binary trees on $n$ vertices (cf. 13.1.18 for plane trees). Treat these trees in the form of the root (of a subtree) with a pair [the left binary subtree, the right binary subtree].
Examine the initial values of $n$, namely
$$b_0 = 1,\quad b_1 = 1,\quad b_2 = 2,\quad b_3 = 5.$$
It is more or less obvious that for $n \ge 1$, the sequence $b_n$ satisfies the recurrent formula
$$b_n = b_0b_{n-1} + b_1b_{n-2} + \cdots + b_{n-1}b_0,$$
and this is actually close to a convolution of two equal sequences. Rearrange the expression so that it holds for all $n \in \mathbb{N}_0$:
$$b_n = \sum_{k=0}^{n-1} b_kb_{n-k-1} + [n = 0]\,1.$$
This finishes step 1 of the procedure.
In step 2, multiply both sides by $x^n$ and add it all together. Write $B(x)$ for the generating function of the sequence $b_n$. Then
$$B(x) = \sum_{n,k} b_kb_{n-k-1}x^n + \sum_n [n = 0]\,x^n = \sum_k b_kx^k\Bigl(\sum_n b_{n-k-1}x^{n-k}\Bigr) + 1 = \sum_k b_kx^k\,\bigl(xB(x)\bigr) + 1 = B(x)\cdot xB(x) + 1.$$
Notice that the convolution $b_n = b_0b_{n-1} + b_1b_{n-2} + \cdots + b_{n-1}b_0$ is replaced by
$$b_n = b_0b_{n-1} + \cdots + b_{n-1}b_0 + b_nb_{-1} + b_{n+1}b_{-2} + \cdots.$$
This is no problem by the convention. It helps with processing the sums (it is much easier to work with infinite sums here than to keep an eye on the bounds all the time).
In step 3, the quadratic equation $B(x) = xB(x)^2 + 1$ must be solved for $B(x)$. So
$$B(x) = \frac{1 \pm \sqrt{1-4x}}{2x}.$$
Although there are two solutions, only one of them is valid for our problem. The sign $+$ in the formula is impossible since then the limit of $B(x)$ for $x \to 0^+$ is $\infty$, but the generating function for our sequence must have the value $b_0 = 1$ at 0.
In the last step, expand $B(x)$ into a power series. The expansion can be obtained using the generalized binomial theorem
$$\sqrt{1-4x} = (1-4x)^{1/2} = 1 + \sum_{k\ge1}\binom{1/2}{k}(-4x)^k.$$
Dividing $1 - \sqrt{1-4x}$ by the expression $2x$ leads to
$$B(x) = \sum_{k\ge1} -\frac12\binom{1/2}{k}(-4)^kx^{k-1} = \sum_{n\ge0}\binom{-1/2}{n}\frac{(-4x)^n}{n+1}.$$
Expand the $(-4)^n$ multiple of $\binom{-1/2}{n}$ using the definition of the generalized binomial numbers. A straightforward check shows $(-4)^n\binom{-1/2}{n} = \binom{2n}{n}$, which yields a final, much neater, formula for the coefficients. We conclude that the number of plane binary trees on $n$ vertices equals
$$b_n = \frac{1}{n+1}\binom{2n}{n}.$$
These are known as the Catalan numbers. They occur surprisingly often:
• the number of well-parenthesized words of length $2n$, i.e. words consisting of $n$ opening and $n$ closing parentheses so that no prefix of the word contains more closing parentheses than opening ones;
• this also corresponds to the number of ways an unsupplied vending machine can accept $n$ 5-crown coins and $n$ 10-crown coins for 5-crown orders so that it can always give the change (hence the probability that a random ordering is satisfactory can also be found);
• the number of monotonic paths from $[0, 0]$ to $[n, n]$ along the sides of the unit squares of the grid such that the path does not cross the diagonal;
• the number of triangulations of a convex $(n + 2)$-gon.
The intuitive reasoning for this is that they come from the expansion of the square root within $B(x)$, and quadratic equalities appear often in the real world.
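The agreement between the convolution recurrence and the closed formula is easy to test. The following Python sketch (the names `plane_binary_trees` and `catalan` are chosen here for illustration) computes $b_n$ from $b_n = \sum_{k=0}^{n-1} b_kb_{n-k-1}$ with $b_0 = 1$ and compares it with $\frac{1}{n+1}\binom{2n}{n}$.

```python
from math import comb

def plane_binary_trees(count):
    """b_0..b_{count-1} from the convolution recurrence with b_0 = 1."""
    b = []
    for n in range(count):
        if n == 0:
            b.append(1)
        else:
            # root plus a pair (left subtree on k vertices, right on n-1-k)
            b.append(sum(b[k] * b[n - 1 - k] for k in range(n)))
    return b

def catalan(n):
    """Closed formula: binom(2n, n) / (n + 1), always an integer."""
    return comb(2 * n, n) // (n + 1)

print(plane_binary_trees(6))  # [1, 1, 2, 5, 14, 42]
```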
13.4.9. Quicksort analysis. The next task is to determine the expected number of comparisons made by the Quicksort, a well-known algorithm for sorting a (finite) sequence of elements. This is the following divide and conquer type of algorithm:
Procedure Qsort
Input: A (non-sorted) list of elements L = (L[0], ..., L[n-1]).
Output: The sorted list L with the same elements.
(1) If L is empty, then return the empty list ().
(2) Divide phase. Create a sublist L1 by going through L and keeping only the elements x with L[0] > x, while putting the other elements (except L[0] itself) into the list L2.
(3) Conquer phase. Combine the lists
L = Qsort(L1) + (L[0]) + Qsort(L2)
and return the list L.
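The procedure translates almost literally into code. Below is a minimal Python sketch, assuming distinct elements; the one-element list `counter` is an illustrative device (not part of the procedure) that accumulates the number of comparisons with the pivot L[0].

```python
def qsort(items, counter):
    """Sort a list of distinct elements as in Procedure Qsort.
    counter[0] is increased by one for every comparison with the pivot."""
    if not items:
        return []
    pivot, rest = items[0], items[1:]
    smaller, larger = [], []
    for x in rest:                      # divide phase: len(items) - 1 comparisons
        counter[0] += 1
        (smaller if pivot > x else larger).append(x)
    # conquer phase: sort both sublists and glue them around the pivot
    return qsort(smaller, counter) + [pivot] + qsort(larger, counter)

comparisons = [0]
print(qsort([3, 1, 2], comparisons), comparisons[0])  # [1, 2, 3] 3
```

On an already sorted list of $n$ elements the counter ends at $n(n-1)/2$, since the list of smaller elements is always empty.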
We analyze how many comparisons are needed. Assume that all possible orderings of the list L to be sorted are distributed uniformly. The following parameters are crucial:
• The number of comparisons in the divide phase is $n - 1$.
• The uniformity assumption ensures that the probability of L[0] being the $k$-th greatest element of the sequence is $\frac1n$.
• The sizes of the sublists to be sorted in the conquer phase are $k - 1$ and $n - k$.
There is the following recurrent formula for the expected number of comparisons $C_n$:
$$(1)\qquad C_n = n - 1 + \sum_{k=1}^n\frac1n\,(C_{k-1} + C_{n-k}).$$
One could work through the steps of the general method directly, but the symmetry of the two summands allows us to rewrite (1), multiplying by $n$ at the same time:
$$(2)\qquad nC_n = n(n-1) + 2\sum_{k=1}^n C_{k-1}.$$
In the first step, care is needed concerning $n = 0$. In the defining recurrence (1), $n = 0$ is not treated at all (since the equation does not make sense). So the convention must be extended to include $C_0 = 0$ in the computation. Then the equation (2) defines $C_1 = 0$ properly. It is not necessary to add any terms in view of the initial conditions.
Next, multiply both sides by $x^n$ and add:
$$\sum_{n\ge0} nC_nx^n = \sum_{n\ge0} n(n-1)x^n + 2\sum_{n\ge0}\sum_{k=1}^n C_{k-1}x^n.$$
All the terms look familiar. The left hand side is the derivative of the generating function $C(x) = \sum_{n\ge0} C_nx^n$, up to one extra factor $x$. The first term on the right is the series for $(1-x)^{-3}$, up to a constant and a shift of powers by 2. Finally, the last term is the convolution with $(1-x)^{-1}$, up to one $x$ and the coefficient 2. Hence the equation
$$xC'(x) = \frac{2x^2}{(1-x)^3} + \frac{2x}{1-x}\,C(x).$$
The third step is straightforward, see 8.3.3. Divide by $x$ to obtain
$$C'(x) = \frac{2}{1-x}\,C(x) + \frac{2x}{(1-x)^3}.$$
The corresponding integrating factor is $e^{-\int\frac{2}{1-x}\,dx} = (1-x)^2$. Hence
$$\bigl((1-x)^2C(x)\bigr)' = \frac{2x}{1-x},$$
and finally
$$C(x) = 2\left(\frac{1}{(1-x)^2}\ln\frac{1}{1-x} - \frac{x}{(1-x)^2}\right).$$
The first term in the bracket corresponds to the convolution of two known sequences, so it contributes to $C_n$ by
$$\sum_{k=1}^n\frac1k\,(n-k+1) = (n+1)\sum_{k=1}^n\frac1k - n = (n+1)H_n - n = (n+1)(H_{n+1} - 1),$$
where $H_n$ are the harmonic numbers. The second term in the bracket contributes $-n$. The result is
$$C_n = 2(n+1)(H_{n+1} - 1) - 2n.$$
Notice in 13.1.10, the very same recurrence is solved by different (more direct and simpler) tricks, without any differential equations involved.
Since the harmonic numbers $H_n$ are easily approximated by $\ln n = \int_1^n\frac1x\,dx$, the analysis shows that the expected time cost of Quicksort is $O(n\log n)$. But it is easy to see that the worst case time is $O(n^2)$ (in this version it happens if the list was already ordered properly: then $L_1$ is always empty and the depth of the recursion is linear).
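The closed formula for the expected number of comparisons can be compared with the recurrence $nC_n = n(n-1) + 2\sum_{k=1}^n C_{k-1}$, $C_0 = 0$. The following Python sketch uses exact fractions; the function names are illustrative only.

```python
from fractions import Fraction

def expected_comparisons(limit):
    """C_0..C_limit from n*C_n = n(n-1) + 2*(C_0 + ... + C_{n-1}), C_0 = 0."""
    c = [Fraction(0)]
    running = Fraction(0)               # C_0 + ... + C_{n-1}
    for n in range(1, limit + 1):
        running += c[n - 1]
        c.append((n * (n - 1) + 2 * running) / n)
    return c

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n (and H_0 = 0)."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))

def closed_form(n):
    """C_n = 2(n+1)(H_{n+1} - 1) - 2n."""
    return 2 * (n + 1) * (harmonic(n + 1) - 1) - 2 * n

print(expected_comparisons(4))  # C_0..C_4 = 0, 0, 1, 8/3, 29/6
```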
13.4.10. Exponential generating functions. Another approach to generating functions is to take the exponential $e^x = \sum_{n\ge0}\frac{1}{n!}x^n$ as the power series corresponding to the constant sequence $(1, 1, \dots)$. In general, this leads to the exponential generating function
$$A(x) = \sum_{n\ge0} a_n\frac{x^n}{n!}.$$
Here are a few elementary examples:
$$e^x \overset{\text{e.g.f.}}{\longleftrightarrow} (1, 1, 1, \dots),$$
$$\frac{1}{1-x} \overset{\text{e.g.f.}}{\longleftrightarrow} (1, 1, 2, 6, 24, \dots),$$
$$\ln\frac{1}{1-x} \overset{\text{e.g.f.}}{\longleftrightarrow} (0, 1, 1, 2, 6, 24, \dots).$$
The slight modification of the definition (just the additional coefficient $\frac{1}{n!}$) is responsible for a very different behaviour, compared to the ordinary generating functions. The elementary operations are:
• Multiplication of $A(x)$ by $x$ yields the sequence with terms $na_{n-1}$.
• Differentiation of $A(x)$ shifts the sequence to the left.
• Integration of $A(x)$ shifts the sequence to the right.
• The product of functions $A(x)$ and $B(x)$ corresponds to the sequence with terms $h_n = \sum_k\binom{n}{k}a_kb_{n-k}$, the binomial convolution of $a_n$ and $b_n$.
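As a small illustration of the last rule, here is a Python sketch of the binomial convolution (the helper name `binomial_convolution` is made up for this example). Since $e^x\cdot e^x = e^{2x}$, convolving the constant sequence $(1, 1, \dots)$ with itself must give the powers of 2.

```python
from math import comb

def binomial_convolution(a, b):
    """Terms h_n = sum_k binom(n, k) * a_k * b_{n-k} for the common prefix of a and b."""
    n_terms = min(len(a), len(b))
    return [
        sum(comb(n, k) * a[k] * b[n - k] for k in range(n + 1))
        for n in range(n_terms)
    ]

ones = [1] * 6
print(binomial_convolution(ones, ones))  # [1, 2, 4, 8, 16, 32]
```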
As before, the exponential generating functions might become useful when resolving recurrences. Here is a simple example. Define the sequence by the initial conditions $g_0 = 0$, $g_1 = 1$ and the formula
$$g_n = -2ng_{n-1} + \sum_{k\ge0}\binom{n}{k}g_kg_{n-k}.$$
At the first glance, seeing the binomial convolution suggests trying the exponential version.
Write G(x) for the corresponding power series and proceed in the usual four steps again.
Step 1. Complete the formula to accommodate the initial conditions:
$$g_n = -2ng_{n-1} + \sum_{k\ge0}\binom{n}{k}g_kg_{n-k} + [n = 1].$$
There seems to be a subtle point about $g_0$ here, because the equation gives $g_0 = g_0^2$, with two solutions 0 and 1. The proper choice of $g_0$ now yields the correct value for $g_1$, but the right solution $G$ is chosen later.
Step 2. Multiply by $\frac{x^n}{n!}$ and add over all $n$, to obtain
$$G(x) = -2xG(x) + G(x)^2 + x.$$
Step 3. Now, solve the easy quadratic equation, arriving at
$$G(x) = \frac12\left(1 + 2x \pm \sqrt{1+4x^2}\right).$$
The evaluation at zero provides $g_0$. Hence the right choice for $g_0 = 0$ is the minus sign, and
$$G(x) = \frac{1 + 2x - \sqrt{1+4x^2}}{2}.$$
Step 4. Apply the generalized binomial theorem to expand $G(x)$ into a power series, see 13.4.8:
$$\sqrt{1+4x^2} = 1 + \sum_{k\ge1}\frac1k\binom{2k-2}{k-1}\cdot(-1)^{k-1}\cdot2\cdot x^{2k}.$$
Further, since
$$G(x) = \sum_{n\ge0} g_n\frac{x^n}{n!} = \frac{1 + 2x - \sqrt{1+4x^2}}{2},$$
we obtain $g_1 = 1$, $g_{2k+1} = 0$ for $k \ge 1$, and
$$g_{2k} = (-1)^k\cdot\frac1k\binom{2k-2}{k-1}\cdot(2k)! = (-1)^k\cdot(2k)!\cdot C_{k-1}\quad\text{for } k \ge 1,$$
where $C_n$ is the $n$-th Catalan number.
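The result can be checked directly against the recurrence. In the following Python sketch (function names are illustrative) the terms with $k = 0$ and $k = n$ are omitted from the binomial sum because $g_0 = 0$; the closed form used is $g_{2k} = (-1)^k (2k)!\,\frac1k\binom{2k-2}{k-1}$, together with $g_1 = 1$ and zero for the other odd indices.

```python
from math import comb, factorial

def g_recurrence(limit):
    """g_0..g_limit from g_n = -2n*g_{n-1} + sum_k binom(n,k) g_k g_{n-k} + [n = 1]."""
    g = [0]
    for n in range(1, limit + 1):
        # k = 0 and k = n contribute nothing since g_0 = 0
        inner = sum(comb(n, k) * g[k] * g[n - k] for k in range(1, n))
        g.append(-2 * n * g[n - 1] + inner + (1 if n == 1 else 0))
    return g

def g_closed(n):
    """Values read off from the expansion of G(x)."""
    if n == 1:
        return 1
    if n == 0 or n % 2 == 1:
        return 0
    k = n // 2
    return (-1) ** k * factorial(n) * comb(2 * k - 2, k - 1) // k

print(g_recurrence(6))  # [0, 1, -2, 0, 24, 0, -1440]
```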
13.4.11. Cayley's formula. We conclude this chapter by a more complicated example.
Cayley's formula computes the number of trees (i.e. graphs with unique paths between all pairs of vertices) on $n$ given vertices,
$$\kappa(K_n) = n^{n-2}.$$
The notation refers to the equivalent formulation: find the number of all spanning trees in the complete graph $K_n$. Equivalently, in how many ways can a tree be realized on $n$ labeled vertices? For example, already the path $P_n$ can be realized in $n!/2$ ways (for $n \ge 2$), so there must be very many of them. This result is proved with the help of the exponential generating functions.
Write $T_n = \kappa(K_n)$ for the unknown values. It is easily shown that $T_1 = T_2 = 1$, $T_3 = 3$, $T_4 = 16$. For instance, consider trees on 4 vertices. Out of the $\binom63 = 20$ potential graphs with exactly three edges, those where the edges form a triangle must not be counted. There are $\binom43 = 4$ of them. In the diagram, there are four different possibilities, and each of them can be rotated into another three, hence the solution is 16.
The recurrent formula can be obtained by fixing one of the vertices and adding together the possibilities for all available degrees of this vertex. This suggests looking rather at the number $R_n$ of the rooted trees. It is clear that $R_n = nT_n$ because there are $n$ possibilities to place the root in each of the trees. Also, one can work with one fixed ordering of the vertices in $K_n$ and multiply the result by $n!$ in the end. In this way, go through the possible degrees $m$ of the first vertex and, for each $m$, find the different possibilities for the sizes $k_1, \dots, k_m$ of the corresponding subtrees. Obviously $k_1 + \cdots + k_m = n - 1$, all $k_i > 0$, and since the labeling of all vertices is fixed, all the orders of the subtrees must be considered as equivalent. Multiply the contribution by $\frac{1}{m!}$, and similarly for each of the possibilities of the subtrees. The recurrent formula is
$$\frac{R_n}{n!} = \sum_{m\ge0}\frac{1}{m!}\sum_{k_1+\cdots+k_m=n-1}\frac{R_{k_1}}{k_1!}\cdots\frac{R_{k_m}}{k_m!}.$$
Of course, $R_0 = 0$, $R_1 = 1$ and, already using the formula, $R_2 = 2R_1 = 2$. Next, $R_3 = 3R_2 + 3R_1^2 = 9$, $R_4 = 4R_3 + 12R_1R_2 + 4R_1^3 = 64$, all as expected. The first step of the standard procedure is accomplished.
Next, write $R(x) = \sum_{n\ge0} R_n\frac{1}{n!}x^n$. The inner sum is the coefficient at $x^{n-1}$ in the $m$-th power of the series $R(x)$. Therefore,
$$\frac{R_n}{n!} = [x^{n-1}]\sum_{m\ge0}\frac{1}{m!}R(x)^m = [x^{n-1}]\,e^{R(x)},$$
and hence have the required equation on $R$: $R(x) = xe^{R(x)}$. There are several ways of solving such functional equations. Here is one such tool, without proof.
Theorem (Lagrange inverse formula). Consider an analytic function $f$ with $f(0) = 0$ and $f'(0) \ne 0$. Then there (locally) is the analytic inverse of $f$, i.e. $w = g(z) = \sum_{n\ge1} g_n\frac{z^n}{n!}$ and $z = f(g(z))$. Moreover, for all $n > 0$,
$$g_n = \lim_{w\to0}\frac{d^{n-1}}{dw^{n-1}}\left(\frac{w}{f(w)}\right)^n.$$
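In this case, solve the equation $x = \frac{R}{e^R}$, so that we may apply the latter theorem with $g = R$ and $f(w) = \frac{w}{e^w}$. It follows that
$$[x^n]R(x) = \frac1n\,[w^{n-1}]\left(\frac{w}{w/e^w}\right)^n = \frac1n\,[w^{n-1}]\,e^{nw} = \frac1n\cdot\frac{n^{n-1}}{(n-1)!}.$$
In particular, $R_n = n^{n-1}$ and so,
$$T_n = \frac{R_n}{n} = n^{n-2}.$$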
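The functional equation $R(x) = xe^{R(x)}$ also determines the numbers $R_n$ coefficient by coefficient, which gives an independent check of $R_n = n^{n-1}$. The following Python sketch is one possible way to do this (it is not the method of the text): writing $E(x) = e^{R(x)}$, the identity $E' = R'E$ yields $n\,e_n = \sum_{k=1}^n k\,r_k\,e_{n-k}$ for the coefficients, and $R = xE$ gives $r_{n+1} = e_n$. The function name is illustrative.

```python
from fractions import Fraction
from math import factorial

def rooted_tree_counts(limit):
    """R_1..R_limit computed from R(x) = x*exp(R(x)) by comparing coefficients."""
    r = [Fraction(0), Fraction(1)]      # r[n] = R_n / n!
    e = [Fraction(1)]                   # e[n] = coefficient at x^n in exp(R(x))
    for n in range(1, limit):
        # E' = R' * E gives n*e_n = sum_{k=1..n} k * r_k * e_{n-k}
        e_n = sum(k * r[k] * e[n - k] for k in range(1, n + 1)) / n
        e.append(e_n)
        r.append(e_n)                   # R = x*E gives r[n+1] = e[n]
    return [int(r[n] * factorial(n)) for n in range(1, limit + 1)]

print(rooted_tree_counts(5))  # [1, 2, 9, 64, 625]
```

Dividing the $n$-th value by $n$ gives the number $T_n = n^{n-2}$ of labeled trees.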
J. Additional exercises to the whole chapter
13.J.1. Determine the number of edges that must be added into
i) the cycle graph Cn on n vertices,
ii) the complete bipartite graph $K_{m,n}$
in order to obtain a complete graph. O
13.J.2. Let the vertices of $K_6$ be labeled 1, 2, ..., 6 and let every edge $\{i, j\}$ be assigned the integer $[(i + j) \bmod 3] + 1$. How many maximum spanning trees are there in this graph? O
13.J.3. Let the vertices of $K_7$ be labeled 1, 2, ..., 7 and let every edge $\{i, j\}$ be assigned the integer $[(i + j) \bmod 3] + 1$. How many maximum spanning trees are there in this graph? O
13.J.4. Let the vertices of $K_5$ be labeled 1, 2, ..., 5 and let every edge $\{i, j\}$ be assigned the integer: 1 if $i + j$ is odd; 2 if $i + j$ is even. How many maximum spanning trees are there in this graph? O
13.J.5. Let the vertices of $K_5$ be labeled 1, 2, ..., 5 and let every edge $\{i, j\}$ be assigned the integer: 1 if $i + j$ is odd; 2 if $i + j$ is even. How many minimum spanning trees are there in this graph? O
13.J.6. Let the vertices of $K_6$ be labeled 1, 2, ..., 6 and let every edge $\{i, j\}$ be assigned the integer: 1 if $i + j$ leaves remainder 1 upon division by 3; 2 if $i + j$ leaves remainder 2 upon division by 3; 3 if $i + j$ is divisible by 3. How many minimum spanning trees are there in this graph? O
13.J.7. Let the vertices of $K_6$ be labeled 1, 2, ..., 6 and let every edge $\{i, j\}$ be assigned the integer: 1 if $i + j$ leaves remainder 1 upon division by 3; 2 if $i + j$ leaves remainder 2 upon division by 3; 3 if $i + j$ is divisible by 3. How many maximum spanning trees are there in this graph? O
13.J.8. Icosian Game - find a Hamiltonian cycle in the graph consisting of the vertices and edges of the regular dodecahedron. O
Solution. See Wikipedia^4. □
13.J.9. Does there exist a Hamiltonian cycle in the Petersen graph? O
Solution. No (however, when any one of the vertices is removed, the resulting graph is already Hamiltonian). This can be shown by enumerating all 3-regular Hamiltonian graphs on 10 vertices and finding a cycle of length less than 5 in each of them. □
13.J.10. If $G = (V, E)$ is Hamiltonian and $\emptyset \ne W \subset V$, then $G \setminus W$ has at most $|W|$ connected components.
Give an example of a graph where the converse does not hold. O
13.J.11. Find a maximum flow and the corresponding minimum cut in the following weighted directed graph:
[Figure: weighted directed graph for exercise 13.J.11; not recoverable here.]
^4 Wikipedia, Icosian game, http://en.wikipedia.org/wiki/Icosian_game (as of Aug. 8, 2013, 13:24 GMT).
13.J.15. Find the generating functions of the following sequences:
i) (1;2;1;4;1;8;1;16;...)
ii) (1;1;0;1;1;0;1;1;...)
iii) (1; -1; 2; -2; 3; -3; 4; -4; ...)
O
Solution.
i) (1; 2; 1; 4; 1; 8; 1; 16; ...) = (1; 0; 1; 0; ...) + (0; 2; 0; 4; 0; 8; 0; 16; ...). Thus, we find the generating functions for each sequence separately. As for the first one, consider the sequence (1, 1, 1, 1, 1, ...). It is generated by the function $\frac{1}{1-x}$. The zeros can be inserted by substituting $x^2$ for $x$. As for the second sequence, we proceed similarly, starting with (1; 2; 4; 8; 16; ...), then multiplying by two, inserting zeros, and finally shifting to the right by multiplying by $x$.
ii) (1; 1; 0; 1; 1; 0; 1; 1; ...) = (1; 0; 0; 1; 0; 0; 1; ...) + (0; 1; 0; 0; 1; 0; 0; 1; ...).
The resulting generating functions are:
i) $\frac{1}{1-x^2} + \frac{2x}{1-2x^2}$
ii) $\frac{1}{1-x^3} + \frac{x}{1-x^3}$
iii) $\frac{1}{(1-x^2)^2} + \frac{-x}{(1-x^2)^2}$
□
13.J.16. Find the coefficient at $x^{17}$ in $(x^3 + x^4 + x^5 + \cdots)^3$. O
Solution. $(x^3 + x^4 + x^5 + \cdots)^3 = \left(\frac{x^3}{1-x}\right)^3 = x^9\cdot\frac{1}{(1-x)^3}$. We are thus looking for the coefficient at $x^8$ in $\frac{1}{(1-x)^3}$. This is equal to $\binom{10}{2}$, i.e. 45. □
13.J.17. There are 30 red, 40 blue, and 50 white balls in a box (balls of the same color are indistinguishable). In how many ways can we pick up 70 balls from the box? O
Solution. Clearly, the number of possibilities is equal to the coefficient at $x^{70}$ in the expression
$$(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}).$$
Mere rearrangements lead to
$$(1 + x + \cdots + x^{30})(1 + x + \cdots + x^{40})(1 + x + \cdots + x^{50}) = \frac{1}{(1-x)^3}\,(1-x^{31})(1-x^{41})(1-x^{51}).$$
Applying the generalized binomial theorem, we obtain the solution $\binom{72}{2} - \binom{41}{2} - \binom{31}{2} - \binom{21}{2}$. □
13.J.18. What is the probability that a roll of 12 dice results in the sum of 30?
Hint: Express the number of possibilities when the sum is 30. Consider $(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12}$. O
Solution. The resulting probability is the ratio of the number of favorable cases to the number of all cases. Clearly, the latter is $6^{12}$. Now, let us compute the number of favorable cases. Consider the expression $(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12}$. Then, the number of favorable cases is the coefficient at $x^{30}$. We have:
$$(x + x^2 + x^3 + x^4 + x^5 + x^6)^{12} = \left(\frac{x(1-x^6)}{1-x}\right)^{12} = x^{12}\left(\frac{1-x^6}{1-x}\right)^{12}.$$
Therefore, we are interested in the coefficient at $x^{18}$ in
$$\left(\frac{1-x^6}{1-x}\right)^{12} = (1 - 12x^6 + 66x^{12} - 220x^{18} + \cdots)\cdot\frac{1}{(1-x)^{12}}.$$
It follows from the generalized binomial theorem that the number of favorable cases is
$$\binom{29}{11} - 12\binom{23}{11} + 66\binom{17}{11} - 220\binom{11}{11}.$$
□
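The coefficient extraction in 13.J.18 can be cross-checked by multiplying the polynomials out. The following Python sketch (with illustrative function names) expands $(x + \cdots + x^6)^{12}$ by repeated multiplication and compares the coefficient at $x^{30}$ with the binomial expression obtained from the coefficient at $x^{18}$ of $(1 - 12x^6 + 66x^{12} - 220x^{18} + \cdots)(1-x)^{-12}$.

```python
from math import comb

def count_by_expansion(dice, target):
    """Coefficient at x^target in (x + x^2 + ... + x^6)^dice."""
    poly = [1]                          # poly[s] = number of ways to reach the sum s
    for _ in range(dice):
        new = [0] * (len(poly) + 6)
        for s, ways in enumerate(poly):
            for face in range(1, 7):
                new[s + face] += ways
        poly = new
    if target >= len(poly):
        return 0
    return poly[target]

def count_by_formula():
    """Coefficient at x^18, using [x^j](1-x)^(-12) = binom(j + 11, 11)."""
    return comb(29, 11) - 12 * comb(23, 11) + 66 * comb(17, 11) - 220 * comb(11, 11)

favorable = count_by_expansion(12, 30)
print(favorable == count_by_formula(), favorable / 6 ** 12)
```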
13.J.19. A fruit grower wants to plant 25 new trees, having four species at his disposal. However, his wife insists that there be at most 1 walnut, at most 10 apples, at least 6 cherries, and at least 8 plums. In how many ways can he fulfill his beloved's wishes?
Hint: We are interested in the coefficient at $x^{25}$ in the expression $(1 + x)(1 + x + \cdots + x^{10})(x^6 + x^7 + \cdots)(x^8 + x^9 + \cdots)$. O
Solution.
$$(1 + x)(1 + x + \cdots + x^{10})(x^6 + x^7 + \cdots)(x^8 + x^9 + \cdots) = \frac{x^{14}(1-x^2)(1-x^{11})}{(1-x)^4}.$$
Therefore, we are looking for the coefficient at $x^{11}$ in $(1 - x^2 - x^{11} + \cdots)\cdot\frac{1}{(1-x)^4}$, which is equal to $\binom{14}{3} - \binom{12}{3} - \binom{3}{3}$. □
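For such a small problem the answer to 13.J.19 can also be confirmed by brute force. The following Python sketch (the function name `count_plantings` is illustrative) enumerates all admissible numbers of walnuts, apples and cherries and checks whether the remaining trees can be plums.

```python
from math import comb

def count_plantings(total=25):
    """Number of ways to plant `total` trees with at most 1 walnut,
    at most 10 apples, at least 6 cherries and at least 8 plums."""
    count = 0
    for walnuts in range(0, 2):
        for apples in range(0, 11):
            for cherries in range(6, total + 1):
                plums = total - walnuts - apples - cherries
                if plums >= 8:
                    count += 1
    return count

print(count_plantings(), comb(14, 3) - comb(12, 3) - comb(3, 3))  # 143 143
```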
13.J.20. Express the general term of the sequences defined by the following recurrences:
i) $a_1 = 3$, $a_2 = 5$, $a_{n+2} = 4a_{n+1} - 3a_n$ for $n = 1, 2, 3, \dots$
ii) $a_0 = 0$, $a_1 = 1$, $a_{n+2} = 2a_{n+1} - 4a_n$ for $n = 0, 1, 2, 3, \dots$
O
Solution.
i) $a_n = 2 + 3^{n-1}$.
ii) $a_n = \frac{1}{2\sqrt{-3}}\left((1 + \sqrt{-3})^n - (1 - \sqrt{-3})^n\right)$.
□
13.J.21. Solve the recurrence where each term of the sequence $(a_0, a_1, a_2, \dots)$ is equal to the arithmetic mean of the preceding two terms. O
Solution. $a_n = k\left(-\frac12\right)^n + l$, where the constants $k$, $l$ are determined by $a_0$ and $a_1$. □
13.J.22. Solve the recurrence $a_{n+2} = \sqrt{a_{n+1}a_n}$ with the initial conditions $a_0 = 2$, $a_1 = 8$.
Hint: Create a new sequence $b_n = \log_2 a_n$. O
13.J.23. Solve the recurrence given by
En\ak .
fc>0
Hint: Multiply both sides by $\frac{x^n}{n!}$ and sum it up. Note that $\sum_{n\ge0} a_n\frac{x^n}{n!}$ is the exponential generating function for the sequence $(a_n)$. O
13.J.24. Find the number of triangulations of a convex n-gon.
Hint: Select any diagonal that goes through a fixed vertex, this splits the polygon in two. O
Solution. $t_n = C_{n-2}$, where $C_n$ denotes the $n$-th Catalan number. □
13.J.25. Find the number of walks in a square grid of size $n \times n$ from the lower left-hand corner $A$ to the upper right-hand corner $B$ which go only upwards or rightwards and intersect the diagonal $AB$ at exactly one point (besides $A$ and $B$).
Hint: Catalan numbers. O
13.J.26. Prove that the Fibonacci numbers satisfy:
i) $F_2 + F_4 + \cdots + F_{2n} = F_{2n+1} - 1$
ii) $F_1 + F_3 + \cdots + F_{2n-1} = F_{2n}$
o
13.J.27. Recall the well-known puzzle Tower of Hanoi and let Hn denote the minimum number of steps necessary to move a tower consisting of n disks from one rod to another one. Find a recurrent formula for Hn as well as its general solution. O
Solution. $H_{n+1} = 2H_n + 1$, $H_n = 2^n - 1$. □
At the very end of the book, we present one problem from practice.
13.J.28. A volleyball team (with a libero, i.e. 7 people) are sitting in a pub, drinking their favorite and well-deserved beer. However, there are only 7 beer mugs available. What is the probability that in the next round,
i) exactly one volleyball player is not given the same mug as the last time,
ii) no one is given the same mug as the last time,
iii) exactly three players are given the same mug.
Solution.
i) If six of the seven people are given the same mug, then so must the last one. Therefore, the probability is zero.
ii) Let $M$ be the set of all orders of the 7 players and let $A_i$ be the event of orders where the $i$-th player is given his mug. We want to calculate $|M \setminus \bigcup_i A_i|$. We get $7!\sum_{k=0}^{7}\frac{(-1)^k}{k!} = 1854$, so the probability is $\frac{1854}{5040} = \frac{103}{280} \doteq 0.37$.
iii) There are $\binom73 = 35$ ways to select the three people who are to get the same mug. The remaining four people must be given different mugs. Again, we can apply the formula from above, i.e., there are $4!\sum_{k=0}^{4}\frac{(-1)^k}{k!} = 9$ possibilities. Altogether, there are $9\cdot35 = 315$ favorable cases, so the probability is $\frac{315}{5040} = \frac{1}{16}$.
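The counts used in the solution can be verified by a short computation. The following Python sketch (function names are illustrative) evaluates the inclusion-exclusion formula $n!\sum_{k=0}^n\frac{(-1)^k}{k!}$ in integers and compares it with a brute-force enumeration of all assignments of mugs.

```python
from itertools import permutations
from math import comb, factorial

def derangements(n):
    """n! * sum_{k=0..n} (-1)^k / k!, computed in integers (n!/k! is an integer)."""
    return sum((-1) ** k * (factorial(n) // factorial(k)) for k in range(n + 1))

def derangements_brute_force(n):
    """Count the permutations of n mugs in which nobody keeps his own mug."""
    return sum(
        1
        for p in permutations(range(n))
        if all(p[i] != i for i in range(n))
    )

print(derangements(7), comb(7, 3) * derangements(4))  # 1854 315
```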
Key to the exercises
13.A.7. The cut vertices are 0, 1, 9, 10; the cut edges are (0,1), (0,12), (9,10).
13.B.3. (3,1), (3,2), (3,4), (3,5), (3,6), (6,1), (6,2), (6,4), (6,5), (5,1), (5,2), (5,4), (4,1), (4,2), (2,1).
13.B.4. (3,1), (3,2), (3,4), (3,5), (3,6), (1,2), (1,4), (1,5), (1,6), (2,4), (2,5), ... (5,6).
13.B.12. It can be shown using the Havel-Hakimi procedure that such a graph indeed exists. However, it cannot be planar: $|V| = 10$, $|E| = 35$, but if it were planar, we would have $3|V| - 6 \ge |E|$, i.e., $24 \ge 35$.
13.B.15.
i) Yes. This follows immediately from the Kuratowski theorem ($K_5$ has 10 edges and $K_{3,3}$ has 9).
ii) No. Consider $K_5$ or $K_{3,3}$.
iii) No. There are many counterexamples, for instance $K_{3,3}$ with another vertex and an edge leading to it.
iv) No. Consider $K_5$.
v) No. Consider $K_{3,3}$.
vi) The same as (ii).
vii) No. Consider $C_n$.
viii) No. Consider $K_5$.
ix) No. Consider $C_n$.
13.B.17. The first code does not represent a tree (it has a proper prefix with the same number of zeros and ones). There is a tree corresponding to the second code.
13.C.4. The procedure is incorrect. As a counterexample, consider a cycle graph with one edge of length two and all other edges of length one.
13.C.5. Applying any algorithm for finding a minimum spanning tree, we find out that the wanted length is 12154 (the spanning tree consists of edges LPe, LP, LNY, PeT, MCNY).
13.D.5. We find a maximum flow of size 15 and the cut [1,6], [1, 3], [2,4], [2,3] of the same capacity.
13.D.7. We know from the theory and the result of the above exercise that the minimum capacity of a cut is 9. There are more maximum flows in the network. For instance, we can set $f(a) = 2$, $f(b) = 4$, $f(c) = 1$, $f(h) = 1$, $f(j) = 4$, $f(f) = 2$, $f(i) = 7$, and $f(v) = 0$ for all other edges $v$ of the graph.
13.E.7.
_ 4! -4! _ 27 "T-35
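13.E.8. |f.
13.I.4.
i) We know from the exercise of subsection 13.4.3 that the generating function of the sequence (1, 2, 3, 4, ...) is $\frac{1}{(1-x)^2}$.
ii) Since we have (by the previous exercise as well)
$$\frac{x}{(1-x)^2} \overset{\text{o.g.f.}}{\longleftrightarrow} (0, 1, 2, 3, \dots),$$
we have for the derivative of this function
$$\left(\frac{x}{(1-x)^2}\right)' = \frac{1+x}{(1-x)^3} \overset{\text{o.g.f.}}{\longleftrightarrow} (1\cdot1, 2\cdot2, 3\cdot3, \dots).$$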
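Let us emphasize that this problem could also be solved using the fact that $\frac{1}{(1-x)^3} \overset{\text{o.g.f.}}{\longleftrightarrow} \binom{n+2}{2}$.
iii) We have
$$\frac{1}{1-x} \overset{\text{o.g.f.}}{\longleftrightarrow} (1, 1, 1, 1, \dots),$$
$$\frac{1}{1-2x} \overset{\text{o.g.f.}}{\longleftrightarrow} (1, 2, 4, 8, \dots),$$
$$\frac{1}{1-2x^2} \overset{\text{o.g.f.}}{\longleftrightarrow} (1, 0, 2, 0, 4, 0, \dots),$$
$$\frac{x}{1-2x^2} \overset{\text{o.g.f.}}{\longleftrightarrow} (0, 1, 0, 2, 0, 4, \dots),$$
whence we get the result
$$\frac{1+x}{1-2x^2} \overset{\text{o.g.f.}}{\longleftrightarrow} (1, 1, 2, 2, 4, 4, 8, 8, \dots).$$
iv) We know from the above that $f(x) = \frac{1+x}{(1-x)^3} \overset{\text{o.g.f.}}{\longleftrightarrow} (1^2, 2^2, 3^2, \dots)$, hence
$$\frac{f(x) - (1 + 4x)}{x^2} \overset{\text{o.g.f.}}{\longleftrightarrow} (3^2, 4^2, 5^2, \dots).$$
Substituting $2x^3$ for $x$, we obtain
$$\frac{f(2x^3) - (1 + 8x^3)}{4x^6} \overset{\text{o.g.f.}}{\longleftrightarrow} (9, 0, 0, 2\cdot16, 0, 0, 4\cdot25, \dots).$$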
v) If we denote the result of the previous problem as F(x), then the result of this one is
F(x)-x2F(x) + T^.
13.I.11. $x/(1 - 3x + x^2)$
13.J.1.
i) The complete graph on $n$ vertices has $\frac{n(n-1)}{2}$ edges, the cycle graph on $n$ vertices has $n$ edges. Therefore, $\frac{n(n-1)}{2} - n$ edges must be added to the cycle graph.
ii) Similarly as above, we get the result $\frac{(m+n)(m+n-1)}{2} - m\cdot n$.
13.J.2. There are five edges whose value is 3: four of them lie on the cycle 23562 and the remaining one is the edge 14. Therefore, they form a disconnected subgraph of the complete graph, so the spanning tree must contain at least one edge of value 2. Thus, the total weight of a maximum spanning tree is at most $4\cdot3 + 2 = 14$. And indeed, there exist spanning trees with this weight. We select all the edges of value 3 except for one that lies on the mentioned cycle and connect the resulting components 2356 and 14 with any edge of value 2. There are four such edges. Altogether, there are $4\cdot4 = 16$ maximum spanning trees.
13.J.3. The edges of value 1 form a subgraph with two connected components, namely {1, 2, 4, 5, 7} and {3, 6}. Further, there are six edges of value 2 that lead between these two components. Therefore, the total weight of a minimum spanning tree is $5\cdot1 + 2 = 7$ (five edges of value 1 and one edge of value 2). Moreover, there are exactly three cycles in the former component, each of length 4, and each of the 6 edges of this component belongs to exactly two of the three cycles. In order to obtain a tree from this component, we must omit two edges, which can be done in $6\cdot4/2$ ways. Altogether, we get $12\cdot6 = 72$ minimum spanning trees.
13.J.4. 18.
13.J.5. 12.
13.J.6. 16.
13.J.7. 16.
13.J.11. The minimum cut is given by the set {Z, A, E}. Its value is 32.
13.J.12. The minimum cut is given by the set {B, D, S}. Its value is 40.
13.J.13. The minimum cut is given by the set {F, S, D}. Its value is 29.
13.J.14. The minimum cut is given by the set {F, S}. Its value is 39.
Based on the earlier textbook:
Matematika drsně a svižně
Jan Slovák, Martin Panák, Michal Bulant et al.
published by Masarykova univerzita in 2013
1. edition, 2013
500 copies
Typography, LaTeX and more: Tomáš Janoušek
Print: Tiskárna Knopp, Černčice 24, 549 01 Nové Město nad Metují